[WIP] Monitoring LHM

To monitor the Lakehouse Monitor (LHM), we use existing Azure monitoring solutions: Application Insights and Alerts.

On the VM hosting your LHM, you will prepare a Python environment and schedule a Python script with cron. The script checks every Subscription and Workspace available to the Monitor and reports any problems retrieving information for them.

The script can be downloaded here <TODO create link>

wget link to download

Once the script is in place, it’s time to create the cronjob that runs it.

crontab -e

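We recommend running the script every 30 or 60 minutes so that problems are noticed quickly. The entry below runs it every 30 minutes, at minute 0 and minute 30 of each hour. Cron normally starts jobs in the user's home directory, so use the full path to bplm-check.py if the script lives elsewhere.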

0,30 * * * * python3 bplm-check.py

The script and adjacent resources

Files present in the archive

bplm-check.py - main script
.env - environment file
requirements.txt - text file used to set up the Python environment

Setting up Application Insights

The script in the archive uses REST requests to query your LHM for information about: subscriptions, workspaces, etc.
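As a rough illustration of this kind of check, here is a minimal sketch. It is not the contents of bplm-check.py: the `/api/subscriptions` and `/api/subscriptions/<id>/workspaces` paths and the `id` field are placeholders assumed for the example, since the LHM REST API is not described on this page.

```python
import json
import logging
import urllib.request

log = logging.getLogger("lhm-check")


def fetch_json(url):
    """GET a URL and decode the JSON body (standard library only)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def check_lhm(base_url, fetch=fetch_json):
    """Return a list of problems found; each one is also logged as an error."""
    problems = []
    try:
        # Hypothetical endpoint: replace with the real LHM path.
        subscriptions = fetch(f"{base_url}/api/subscriptions")
    except Exception as exc:
        problems.append(f"cannot list subscriptions: {exc}")
        subscriptions = []

    for sub in subscriptions:
        sub_id = sub["id"]  # hypothetical field name
        try:
            workspaces = fetch(f"{base_url}/api/subscriptions/{sub_id}/workspaces")
        except Exception as exc:
            problems.append(f"subscription {sub_id}: cannot list workspaces: {exc}")
            continue
        if not workspaces:
            problems.append(f"subscription {sub_id}: no workspaces found")

    for problem in problems:
        log.error(problem)
    return problems
```

Passing the `fetch` function as a parameter keeps the check logic testable without a running LHM instance.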

The script logs everything it does, including any issues it finds (no workspaces, no clusters, failures connecting to list them), and sends those logs to Azure. For this you need to create an Application Insights resource in the Azure portal:

In the portal, go to Monitor
In the left sidebar, go to Insights → Applications
Create an Application Insights resource for your LHM deployment (in the same resource group as the other resources, so it is easy to locate later)

Once you have created the resource, set its Instrumentation Key in the .env file or pass it as a command-line argument. The script uses this key to send logs to Azure.
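One possible way to resolve the key, with the command line taking precedence over the .env file, is sketched below. The flag name `--instrumentation-key` and the variable name `INSTRUMENTATION_KEY` are assumptions made for this example; check the script and the shipped .env file for the names they actually use.

```python
import argparse
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_instrumentation_key(argv=None, env_path=".env"):
    """Return the key from the CLI if given, otherwise from the .env file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--instrumentation-key", default=None)  # hypothetical flag
    args = parser.parse_args(argv)
    key = args.instrumentation_key or read_env_file(env_path).get("INSTRUMENTATION_KEY")
    if not key:
        raise SystemExit("Instrumentation Key not set in CLI args or .env")
    return key
```

Failing early when the key is missing avoids a cron job that runs quietly but never reports anything to Azure.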

Setting up the alert

With the information successfully being sent to Azure, you can now set up an Alert.

In the Application Insights resource, go to the Logs section (left sidebar). All logs land in the traces table, which is the one to query. Any error the script finds appears there with severityLevel 3.

You can use the following Kusto query to identify such issues:
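For orientation, Application Insights numbers trace severities from 0 to 4 (Verbose, Information, Warning, Error, Critical), so severityLevel 3 corresponds to an error-level log record. The sketch below shows how Python logging levels line up with those numbers. The `attach_azure_handler` helper assumes logs are shipped with the `opencensus-ext-azure` package; whether bplm-check.py uses that package is an assumption, so check requirements.txt.

```python
import logging

# Application Insights trace severity levels.
SEVERITY = {"Verbose": 0, "Information": 1, "Warning": 2, "Error": 3, "Critical": 4}


def severity_level(levelno):
    """Map a Python logging level number to an Application Insights severityLevel."""
    if levelno >= logging.CRITICAL:
        return SEVERITY["Critical"]
    if levelno >= logging.ERROR:
        return SEVERITY["Error"]
    if levelno >= logging.WARNING:
        return SEVERITY["Warning"]
    if levelno >= logging.INFO:
        return SEVERITY["Information"]
    return SEVERITY["Verbose"]


def attach_azure_handler(logger, instrumentation_key):
    """Send the logger's records to Application Insights.

    Assumption: opencensus-ext-azure is installed; imported lazily so the
    mapping above works without it.
    """
    from opencensus.ext.azure.log_exporter import AzureLogHandler

    handler = AzureLogHandler(
        connection_string=f"InstrumentationKey={instrumentation_key}"
    )
    logger.addHandler(handler)
```

With such a setup, a call to `logger.error(...)` is what ends up as a severityLevel 3 row in traces, while `logger.info(...)` ends up as severityLevel 1.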

traces
| where severityLevel == 3

The results of this query can be counted afterwards to create an alert. The alert query becomes:

traces
| where severityLevel == 3
| count

At this point you need to click on the New Alert Rule button in the top bar to start creating your rule.

Set the alert logic to trigger when the value is greater than 0, evaluated every 5 minutes.
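The condition amounts to counting error traces in each evaluation window and comparing the count with the threshold. The sketch below only illustrates that logic. Azure evaluates the rule on its side, the record layout is made up for the example, and the 5-minute lookback window is an assumption that matches the evaluation frequency.

```python
from datetime import datetime, timedelta


def alert_fires(traces, now, window=timedelta(minutes=5), threshold=0):
    """Return True when more than `threshold` severityLevel-3 traces
    fall inside the window that ends at `now`."""
    start = now - window
    count = sum(
        1
        for trace in traces
        if trace["severityLevel"] == 3 and start < trace["timestamp"] <= now
    )
    return count > threshold
```

For example, a single error trace logged two minutes before the evaluation makes the rule fire, while the same trace no longer counts once it falls outside the window.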

Go to the next section “Actions”.

You will have to create an Action group here. Click Create action group and fill in the necessary information. In the Notifications view, define the notifications you want sent when an alert fires.

The Actions view lets you define actions to run when the alert fires.

When the Action group is created, continue with the Details page for your rule. Set the severity of the alert; by default it is 3 - Informational. Give your rule a name and any additional details Azure requires. In the Advanced section, consider enabling “Automatically resolve alerts”. With this option, the alert resolves on its own once the condition is no longer met, which weeds out transient issues.
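To see what automatic resolution changes, consider this simplified model of the alert state across consecutive evaluations. It is an illustration only: Azure's own resolution timing may differ, for example by waiting for more than one healthy evaluation.

```python
def alert_states(error_counts, auto_resolve=True):
    """Return 'fired' or 'ok' for each evaluation, given the number of
    error traces seen in each one."""
    fired = False
    states = []
    for count in error_counts:
        if count > 0:
            fired = True
        elif auto_resolve:
            # Condition no longer met: the alert clears by itself.
            fired = False
        states.append("fired" if fired else "ok")
    return states
```

With `auto_resolve=True`, a one-off failure shows up as a single fired evaluation and then clears. With `auto_resolve=False`, the alert stays fired until someone closes it by hand.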

Once the rule is created and enabled, you’re all set up.
