Lakehouse Optimizer stores the collected metrics directly in cloud storage, where you can freely access and analyze the data.

The Lakehouse Optimizer interface serves as both an Admin Console and a canned-report view. We surface a dozen metrics, along with recommendations for action, in the interface, while persisting 100+ telemetry data points behind the scenes.

This page is a starting point for understanding the data: how it relates, what the metrics are, and what they mean, so that you can build your own custom dashboards and reports and integrate with other systems as needed.

Applies to version 1.3.0. Future versions might collect even more telemetry data points.


🧭 Where is the data

Storage path

The cloud storage path is configured during installation; you can find it in the settings panel of the web interface.

Azure cloud storage

Navigate to the Azure Portal, open the Storage accounts panel, and select the Azure Storage Account that you configured for the Lakehouse Monitor.

Next, click Data storage → Containers and open the Azure Blob container that you have assigned.

Explore metrics files

Once you have identified the storage account and blob container where your data is stored, you can use Azure Storage Explorer to browse the raw metrics; a scripted alternative is sketched after the folder structure below.

You should expect this folder structure:

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> json files
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files
|> bplm-config/ (lakehouse monitor configurations)
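
If you prefer scripting over Azure Storage Explorer, the same layout can be listed programmatically. Below is a minimal sketch using the azure-storage-blob Python SDK; the connection string, container name, and ids are placeholders you would replace with your own.

from azure.storage.blob import BlobServiceClient

# Placeholders; substitute your own connection string and container name.
service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("<your-container>")

# Walk one folder level at a time; the key=value folders appear as prefixes.
prefix = "<subscription-id>/<workspace-id>/driverSparkMetrics/"
for item in container.walk_blobs(name_starts_with=prefix, delimiter="/"):
    print(item.name)  # e.g. .../driverSparkMetrics/date=20220502/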

💽 What data is collected

Lakehouse Monitor collects raw metrics in the following structure:

- Driver Spark Metrics
- Driver OS Metrics
- Executor Spark Metrics
- Executor OS Metrics
- Task Metrics
- Consumption data

Aggregated data is saved in:

- Job Run Analysis
- Notebook Analysis

Driver Spark Metrics

note

Information at the Spark driver level

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
              / date=20220502 
                    / clusterId=0502-123457-fkasde121
                        / clusterId-0502-123457-fkasde121-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverOsMetrics

File sample:

File name structure

Columns

For more details regarding each column, see the next section and follow the Spark documentation links below.

Spark documentation

HiveExternalCatalog: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/hive/HiveExternalCatalog.html

Code generator: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-CodeGenerator.html

DAGScheduler: https://books.japila.pl/apache-spark-internals/scheduler/DAGScheduler/

LiveListenerBus: https://books.japila.pl/apache-spark-internals/scheduler/LiveListenerBus/

All Spark internal components for driver: https://spark.apache.org/docs/latest/monitoring.html#component-instance--driver
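
To get these metrics into a DataFrame, you can read the partitioned CSV folders directly with PySpark. A minimal sketch, assuming a Databricks notebook (where spark is predefined) and a placeholder abfss path; date and clusterId are picked up as partition columns from the key=value folder names:

# Placeholder path; substitute your container, account, subscription id,
# and workspace id as shown in the folder structure above.
base = ("abfss://<container>@<account>.dfs.core.windows.net/"
        "<subscription-id>/<workspace-id>/")

# date and clusterId become partition columns via the key=value folder names.
driver_spark = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(base + "driverSparkMetrics/"))

driver_spark.filter("date = 20220502").show()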

 

Driver OS Metrics

note

Information at the operating system level

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
              / date=20220502 
                    / clusterId=0502-123457-fkasde121
                        / clusterId-0502-123457-fkasde121-app-20220502135427-0000.csv

File sample:

Columns

These values are reported by the Java Virtual Machine running on the Spark driver's machine.

Driver OS columns:

For CPU & memory, we collect the following metrics:
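
Once the relevant column names are known from the table above, quick aggregations are straightforward. A sketch that averages a CPU metric per cluster; cpuUsage is a hypothetical column name and the path is a placeholder:

from pyspark.sql import functions as F

base = ("abfss://<container>@<account>.dfs.core.windows.net/"
        "<subscription-id>/<workspace-id>/")

driver_os = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv(base + "driverOsMetrics/"))

# "cpuUsage" is a hypothetical column name; substitute the real one from
# the columns table above. clusterId is a partition column.
driver_os.groupBy("clusterId").agg(F.avg("cpuUsage").alias("avg_cpu")).show()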

Executor OS Metrics

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics
              / date=20220502 
                    / clusterId=0726-085022-s52b56hp
                        / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics

File sample:

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorOsMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

These are operating system (OS) metrics of the VM where the Spark executor runs.

We collect CPU and memory metrics from Spark executors using the Spark plugin framework.

This mechanism lets users plug in custom code at the driver and executors; essentially, it offers a hook through which messages can be sent from the executors to the driver. This functionality is supported starting with Spark 3.0.

Executors report to the driver every 10 seconds (configurable) using a custom sink, which reads CPU and memory metrics and sends them to the driver via the plugin hook.

On the driver, these metrics are accumulated from all executors and written to a CSV file every 10 seconds (configurable).
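
For reference, plugins of this kind are registered through the spark.plugins configuration introduced in Spark 3.0. The class name below is hypothetical and only illustrates where such a plugin hooks in; the Lakehouse Monitor wires up its own plugin for you.

from pyspark.sql import SparkSession

# com.example.ExecutorOsMetricsPlugin is a hypothetical class name, shown
# only to illustrate how a Spark 3.0+ plugin (driver + executor side) is
# registered; the Lakehouse Monitor configures its own plugin.
spark = (SparkSession.builder
         .config("spark.plugins", "com.example.ExecutorOsMetricsPlugin")
         .getOrCreate())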

Columns

Executor OS columns:

For CPU & memory, we collect the following metrics:

Spark documentation

All Spark internal components for executor: https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle: https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read

Executor Spark Metrics

These are metrics collected at the level of the Spark executor, as provided by the Spark API.

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
              / date=20220502 
                    / clusterId=0726-085022-s52b56hp
                        / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> executorOsMetrics
        |> driverSparkMetrics
        |> driverOsMetrics

File sample:

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorSparkMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

Columns

Spark documentation

All Spark internal components for executor: https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle: https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read

Task Metrics

File path:

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics
        |> taskMetrics
              / date=20220502 
                    / dbJobId=1174
                        / dbJobRunId=564070
                            / dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / 511420607229897 / taskMetrics / date=20220726 / dbJobId=1174 / dbJobRunId=564070

File name:

dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv

File sample

Columns

Spark documentation
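
Because the files are partitioned by date, dbJobId, and dbJobRunId, a read can be pruned down to a single job run. A sketch with a placeholder path, using the sample ids from above:

base = ("abfss://<container>@<account>.dfs.core.windows.net/"
        "<subscription-id>/<workspace-id>/")

task_metrics = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(base + "taskMetrics/"))

# date, dbJobId, and dbJobRunId are partition columns, so this filter only
# scans the folders for that specific run.
one_run = task_metrics.filter("dbJobId = 1174 AND dbJobRunId = 564070")
one_run.show()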

Consumption data

Consumption data, i.e. cost and usage information, is kept separate from the Databricks workspace folders.

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / consumptions / json / workspaceId=511420607229897 / date=20220721

File name: part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json

Sample file

File format:

Columns
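
Since the consumption data is duplicated in Parquet, that copy is usually the more convenient one for analysis. A sketch of a daily roll-up; note that the consumptions folder sits at the subscription level, and the cost column name is hypothetical:

from pyspark.sql import functions as F

# consumptions lives next to the workspace folders, under the subscription id.
sub = "abfss://<container>@<account>.dfs.core.windows.net/<subscription-id>/"

consumptions = spark.read.parquet(sub + "consumptions/parquet/")

# workspaceId and date are partition columns; "cost" is a hypothetical
# column name, so substitute the real one from the columns table above.
(consumptions
 .groupBy("workspaceId", "date")
 .agg(F.sum("cost").alias("daily_cost"))
 .orderBy("date")
 .show())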

💠 Enhanced data

Aggregated data is saved in:

- Job Run Analysis
- Notebook Analysis

Job Run Analysis

Job Run Analysis represents an aggregation, by db-job-id, db-run-id, and cluster-id, of the Task Metrics joined with the Cluster OS Metrics.
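
To make the aggregation concrete, the sketch below mimics it on the raw folders. It is conceptual only: the join key and metric columns (clusterId, durationMs, cpuUsage) are hypothetical stand-ins, and the monitor produces the real jobRunAnalysis output for you.

from pyspark.sql import functions as F

base = ("abfss://<container>@<account>.dfs.core.windows.net/"
        "<subscription-id>/<workspace-id>/")

tasks = (spark.read.option("header", "true")
         .option("inferSchema", "true").csv(base + "taskMetrics/"))
exec_os = (spark.read.option("header", "true")
           .option("inferSchema", "true").csv(base + "executorOsMetrics/"))

# Conceptual only: "clusterId" as a join key and "durationMs"/"cpuUsage" as
# metric columns are hypothetical stand-ins for the real column names.
job_run = (tasks.join(exec_os, on="clusterId")
           .groupBy("dbJobId", "dbJobRunId", "clusterId")
           .agg(F.sum("durationMs").alias("total_task_time_ms"),
                F.avg("cpuUsage").alias("avg_cluster_cpu")))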


Columns

Notebook Analysis

Notebook Analysis represents aggregated telemetry data for Databricks notebook runs.

Columns