Azure Marketplace Lakehouse Optimizer Installation: Getting Started

This guide walks you through the Lakehouse Optimizer install process using Azure Marketplace. It is designed to get at least one of your workspaces fully configured to the point where you see reporting in Lakehouse Optimizer (LHO) on your Databricks entities such as Jobs, Workflows, All-Purpose Clusters, Pools, and more. We will also help you set up the parent subscription to load consumption data from Azure, so that cost analysis data appears in reports and in the overview graphs and charts.

Video walkthrough: https://www.youtube.com/watch?v=f8jGrQqRkqQ

Requirements

To ensure you have all the permissions required to do everything in this guide, please check out our Azure Marketplace Installation Readiness Checklist.

Azure Marketplace Installation

Go to the Lakehouse Optimizer for Azure Databricks page in Azure Marketplace. Select Free Trial and click the Create button.

The first page will ask for some basic information about where to install the resources and what to name the resource groups.

The Application name and Managed Resource Group name must be unique. For convenience, the Managed Resource Group name is pre-populated with a number based on the date.

On the next page you will need to specify some more basic information.

Again, the Application, Virtual Machine, and Storage Account names must all be unique, though you can make this easy by appending a suffix to the resource group name you picked on the last page, as shown above.

Admin Email Address

NOTE: Be sure to use a valid email address in the “Admin Email Address” field. When the SSL certificate eventually expires, Let's Encrypt sends its renewal notification to this address.

The email address and password will be your login credentials after installation, so be sure to write them down somewhere; there is no way to retrieve the provisioning user's username/password if you forget them.

SSH Public Key

The SSH key is a generated access key that allows you to connect directly to the virtual machine created by this installation and customize its resources. Note that SSH access is only granted to connections coming from Azure Cloud Shell, so connecting from your local PC terminal won't work even if you have the correct SSH key.

If left blank, a username and password will be generated for SSH access, and you can retrieve them later if access is needed.

On the last page simply review what you entered and accept the terms and conditions. If everything looks good, go ahead and click Create.

The installation process should take about 10 minutes to complete.

Login as Provisioning User

Once installation is complete you can click Go to resource to open the managed application resource, and then click the managed resource group link in the upper right hand side of the page to view all the installed Azure resources.

Click on the Virtual Machine resource, and you should see a DNS name in the upper right side of the page.

This is the URL of your new installation of Lakehouse Optimizer. Click the copy-to-clipboard button and paste the link into your browser's address bar. You should see a prompt to log in as a provisioning user.

Enter the username and password you chose during the installation process, and you should be taken to the licensing page.

The free trial period lasts for 30 days starting from the Azure Marketplace installation date. If you would like to extend it, contact us at support@bpcs.com. For now, click User Settings in the left-hand navigation and we will continue setting up LHO.

Most of the functionality in Lakehouse Optimizer depends on Active Directory permissions in Azure and Databricks. If a user has permission to view an asset in Azure or Databricks, they will automatically be able to see its information in Lakehouse Optimizer as well. This keeps all user management in one place: Azure Active Directory. First, though, we need to connect this application to the Active Directory account so everyone can log in with their Active Directory Single Sign-On IDs. Click the Configure button in the section titled Active Directory Status.

To configure Active Directory in LHO we will need four pieces of information from the App Registration we will make in the next step:

  1. Tenant Id - the Directory (tenant) ID (can be copied easily from the App Registration overview)

  2. Client Id - the Application (client) ID from the App Registration

  3. Secret - the Value of a client secret in your App Registration

  4. Object Id - the Object ID of the managed application associated with the App Registration

Create an App Registration

Azure Marketplace does not allow this specific resource to be created during the install process for security reasons, so we will create an app registration now.

Start by opening up Azure Portal, and type “Entra” into the search bar. Select “Microsoft Entra ID” when it appears. If you have a role of Cloud Application Administrator or higher, you can click “App Registrations” in the left hand side navigation under Manage, and then click “New registration” in the upper left hand corner of the page that opens. 

Choose a name for the registration; a simple approach is to reuse the application name we used before and add something like “AR” to the end.

The Supported Account Types will depend on your organization's IT structure and whether your company governs more than one Azure tenant.

If in doubt, consult your IT department on what to choose here; for this example we will select a single-tenant setup.

For the redirect URI select “Web” from the dropdown. For the URI field, construct the value from the URL of your deployed application:

https://[your-dns-name-url]/login/oauth2/code/azure

For example, if your DNS name were lho-demo.westus.cloudapp.azure.com, the redirect URI would be https://lho-demo.westus.cloudapp.azure.com/login/oauth2/code/azure.

Click Register.

Implicit grant and hybrid flows

Next, we need to enable ID tokens so that LHO can sign users in. Go to Authentication in the left-hand navigation and scroll down until you see the section “Implicit grant and hybrid flows”. Check the box next to ID tokens (used for implicit and hybrid flows).

Client Secret creation

Go to Certificates and Secrets in the left-hand navigation and click New client secret. Give it a description (it doesn't have to be unique) and pick an expiration period.

Be sure to copy the Value field from the client secret, as you will need it in the next section; after you leave this page you will no longer be able to access the full value.

Enter the info in LHO

Enter Secret

The secret you need is the Value from the client secret you created in the section above. Paste that value in the Secret field in LHO.

Enter Tenant Id and Client Id

These are both found right next to each other in the App Registration overview.

Copy each value into their respective fields in LHO.

Enter Object Id

To get the Object ID, first click the link for the managed application inside your App Registration. You will find the link on the right-hand side of the App Registration's overview page.

Once on that page, you should see the Object ID in the overview header section.

Copy the value into the LHO field and click Save. Once everything applies, you should see a big green check mark that says “Configured”.
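
If the configuration does not apply cleanly, you can sanity-check the four values outside of LHO. Here is a minimal sketch using the Azure SDK for Python (azure-identity); the IDs and secret are placeholders for the values you just collected, and the snippet is only a verification aid, not part of LHO itself:

    # pip install azure-identity
    from azure.identity import ClientSecretCredential

    # Placeholders: substitute the values from your App Registration.
    TENANT_ID = "<directory-tenant-id>"
    CLIENT_ID = "<application-client-id>"
    CLIENT_SECRET = "<client-secret-value>"

    credential = ClientSecretCredential(
        tenant_id=TENANT_ID,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
    )

    # A wrong tenant ID, client ID, or secret raises ClientAuthenticationError
    # here instead of returning a token.
    token = credential.get_token("https://management.azure.com/.default")
    print("Token acquired; expires at", token.expires_on)

If the token is issued, the Tenant Id, Client Id, and Secret you entered in LHO are valid.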

Log in using Active Directory SSO

Now that Active Directory is configured you can log out, and you should now see the option to sign in using Single Sign On (SSO).

Click “Login With Azure Active Directory” and the application will ask you to accept the permissions requested by LHO.

Click Accept and you will be taken to the overview page, but there will be no data yet because no workspaces are configured.

Configure Workspaces and Subscriptions in LHO

Configure Azure Subscription

Step 1. Grant Access to Consumption Data

Navigate to the Settings panel and grant the Service Principal used by LHO access to the consumption (cost) data.

In order for Lakehouse Optimizer (LHO) to read consumption data from Azure, LHO's application identity must be granted the BILLING_READER (Billing Reader) role on this Azure subscription.

Once this step is complete, you will see the following green check mark.

LHO can also function without consumption (cost) data access, but it will not be able to report on your actual costs.

You can read more about access configuration here: https://blueprinttechnologies.atlassian.net/wiki/spaces/BLMPD/pages/2532605979/Security+Requirements+for+VM+runtime#Phase-3)-Access-roles-configuration
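
If you prefer to script the role grant rather than use the portal, the sketch below shows the equivalent ARM REST call from Python. It assumes you have rights to create role assignments on the subscription (e.g., Owner or User Access Administrator); the subscription ID and service principal object ID are placeholders, and the role-definition GUID is the built-in Billing Reader role, which you should verify in your tenant:

    # pip install azure-identity requests
    import uuid

    import requests
    from azure.identity import DefaultAzureCredential

    SUBSCRIPTION_ID = "<subscription-id>"                # placeholder
    LHO_SP_OBJECT_ID = "<service-principal-object-id>"   # placeholder
    # Built-in "Billing Reader" role definition GUID (verify in your tenant).
    BILLING_READER = "fa23ad8b-c56e-40d8-ac0c-ce449e1d2c64"

    token = DefaultAzureCredential().get_token(
        "https://management.azure.com/.default").token
    scope = f"/subscriptions/{SUBSCRIPTION_ID}"
    url = (f"https://management.azure.com{scope}/providers"
           f"/Microsoft.Authorization/roleAssignments/{uuid.uuid4()}"
           "?api-version=2022-04-01")

    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json={"properties": {
            "roleDefinitionId": (f"{scope}/providers/Microsoft.Authorization"
                                 f"/roleDefinitions/{BILLING_READER}"),
            "principalId": LHO_SP_OBJECT_ID,
        }},
    )
    resp.raise_for_status()  # 201 Created on success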

Configure Databricks Workspace

The following actions are required in order to enable Lakehouse Optimizer to gather cost and telemetry data:

  • Grant Access to Service Principal

  • Enable LHO Collector Agent

  • Enable Global Init Scripts

  • Create Secret Scope

Step 1. Enable Service Principal
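
Clicking Enable in the LHO UI registers the service principal in the selected workspace for you. For reference, the manual equivalent is a call to the Databricks SCIM API, sketched below with placeholder values (the exact payload LHO sends may differ):

    # pip install requests
    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    DATABRICKS_TOKEN = "<personal-access-token>"  # placeholder
    APP_CLIENT_ID = "<application-client-id>"     # from the App Registration

    # Register the service principal in the workspace via SCIM.
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/ServicePrincipals",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
            "applicationId": APP_CLIENT_ID,
            "displayName": "lakehouse-optimizer",  # hypothetical display name
            "active": True,
        },
    )
    resp.raise_for_status()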

 

Step 2. Enable LHO Collector Agent

  • Upload the .jar library responsible for collecting telemetry data, along with the initialization scripts, into the selected workspace's DBFS. Note that this is automatic: just click Upload and the process runs on the server. You do not need to provide or upload a file from your local machine.
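
For the curious, the server-side upload is conceptually equivalent to a DBFS API call like the sketch below; the file name and DBFS path are hypothetical, and you never need to run this yourself:

    # pip install requests
    import base64

    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    DATABRICKS_TOKEN = "<personal-access-token>"  # placeholder

    # Read and base64-encode the library (hypothetical file name).
    with open("collector-agent.jar", "rb") as f:
        contents = base64.b64encode(f.read()).decode()

    # Inline DBFS uploads are limited to ~1 MB; larger files use the
    # streaming create/add-block/close calls instead.
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "path": "dbfs:/FileStore/lho/collector-agent.jar",  # hypothetical path
            "contents": contents,
            "overwrite": True,
        },
    )
    resp.raise_for_status()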

Step 3. Enable Global Init Scripts
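
LHO enables its init script with one click; under the hood this corresponds to the Databricks Global Init Scripts API. A hedged sketch follows, with a stand-in script body rather than LHO's actual agent bootstrap:

    # pip install requests
    import base64

    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    DATABRICKS_TOKEN = "<personal-access-token>"  # placeholder

    script = "#!/bin/bash\necho 'agent bootstrap placeholder'"  # stand-in body

    # Create the script already enabled so new clusters pick it up.
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/global-init-scripts",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "name": "lho-collector-init",  # hypothetical name
            "script": base64.b64encode(script.encode()).decode(),
            "enabled": True,
            "position": 0,
        },
    )
    resp.raise_for_status()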

 

Step 4. Create Secret Scope

Click Create.

Select Databricks under Secret Scope Backend Type. The scope name must be unique within this Databricks workspace.
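
If you ever need to create the scope outside the LHO UI, the equivalent Databricks Secrets API call looks like this (the scope name is a placeholder):

    # pip install requests
    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    DATABRICKS_TOKEN = "<personal-access-token>"  # placeholder

    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/secrets/scopes/create",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "scope": "lho-secrets",               # hypothetical scope name
            "initial_manage_principal": "users",  # who may manage the scope
        },
    )
    resp.raise_for_status()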

Step 5. Configuration Complete Confirmation

Once these steps are done, you should see a green banner reading “Complete Configuration”.

This setup is the quickest way to get your Databricks workspace monitored. There are also other configuration options for LHO, for example enabling monitoring on assets one-by-one. For more configuration options, please contact Blueprint or follow the more advanced topics in the documentation material.

Load Consumption Data

Step 1. Navigate to the Consumption Data panel.

This page is available only to users with the Billing Admin role.

Step 2. Load consumption data

LHO supports loading consumption (cost) data from your Azure subscription either on demand or on a scheduled basis.

For the purposes of this tutorial, select a date only a month or two in the past. The process loads data from that date up to the current date. Depending on the size of your Azure subscription this can take a long time, so we recommend loading a smaller date interval; the goal is to see cost and telemetry data in LHO as soon as possible.

Loading consumption data for the past 12 months on a large subscription can take 12 hours or more.
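
LHO runs the load for you, but if you want a feel for the data volume behind these timings, you can page through the same usage details with the Azure SDK for Python. A rough sketch with a placeholder subscription ID:

    # pip install azure-identity azure-mgmt-consumption
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.consumption import ConsumptionManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"  # placeholder

    client = ConsumptionManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Count usage-detail records; large subscriptions return many pages,
    # which is why a 12-month load can take hours.
    count = 0
    for _record in client.usage_details.list(scope=f"/subscriptions/{SUBSCRIPTION_ID}"):
        count += 1
        if count >= 1000:  # stop early; this is just a peek
            break
    print(f"Fetched {count} usage-detail records")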

Step 3. Scheduled load consumption data

Most likely, Databricks resources are used daily in your infrastructure, so we recommend creating a scheduled daily consumption data load so that LHO reports updated costs every day.

Recommended schedule configuration:

  • load data: incrementally

  • frequency: daily

You can configure multiple schedules based on your particular needs.

 


Explore Cost and Telemetry Insights

Once all previous steps are completed, your LHO instance is ready to monitor your cloud infrastructure.

Select Reports, then choose the Azure Subscription and Databricks Workspace you just configured.