Lakehouse Monitor stores the collected metrics directly in your cloud storage, where you can freely access and analyze the data.

The Lakehouse Monitor interface serves as both an Admin Console and a canned-report view. It shows a dozen metrics as well as recommendations for action. Behind the scenes, we persist 100+ telemetry data points.

This page is a starting point for understanding the data: how it relates, what the metrics are and what they mean, so that you can build your own custom dashboards and reports and integrate with other systems as needed.

Applies to version 1.3.0. Future versions might collect even more telemetry data points.

🧭 Where is the data

Storage path

The cloud storage path is configured during installation. You can find it in the settings panel of the web interface.

Azure cloud storage

Navigate to the Azure Portal, open the Storage accounts panel, and open the Azure Storage Account that you configured for Lakehouse Monitor to use.

Next, click the Data Storage → Containers menu and open the Azure Blob Container that you assigned.

Explore metrics files

Once you have identified the storage account and blob container where your data is stored, you can use Azure Storage Explorer to browse the raw metrics.

You should expect this folder structure:

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> json files
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files
|> bplm-config/ (lakehouse monitor configurations)
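
The folder layout above can be turned into a path prefix programmatically. The following is a minimal sketch, not part of Lakehouse Monitor: the helper names `consumption_prefix` and `parse_partitions` are hypothetical, and the code only assumes the consumptions layout shown in the tree (subscription id, then `consumptions`, format, `workspaceId=` and `date=` folders).

```python
from datetime import date, datetime


def consumption_prefix(subscription_id: str, workspace_id: str, day: date, fmt: str = "json") -> str:
    """Build the folder prefix that holds one day of consumption files."""
    return (
        f"{subscription_id}/consumptions/{fmt}/"
        f"workspaceId={workspace_id}/date={day:%Y%m%d}/"
    )


def parse_partitions(path: str) -> dict:
    """Collect the key=value folder names (for example date=20220511) from a path."""
    parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    if "date" in parts:
        # Partition dates are written as YYYYMMDD.
        parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    return parts


prefix = consumption_prefix(
    "0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "2149xxxxxxxxxxx", date(2022, 5, 11)
)
print(prefix)
print(parse_partitions(prefix))
```

The same `parse_partitions` helper also works on the raw metrics folders, because they use the same key=value folder naming.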

💽 What data is collected

Lakehouse Monitor collects raw metrics in the following structure:

driverSparkMetrics - metrics of the Spark driver
driverOsMetrics - operating system (OS) metrics of the VM where the Spark driver runs
executorSparkMetrics - metrics of the Spark executor
executorOsMetrics - operating system (OS) metrics of the VM where the Spark executor runs
taskMetrics - metrics collected from Spark tasks

Aggregated data is saved in:

jobRunAnalysis - aggregated information regarding telemetry data of Databricks jobs runs
notebookAnalysis - aggregated information regarding telemetry data of Databricks notebooks runs

Consumption data

consumptions - consumption (cost and usage) metrics

Driver Spark Metrics

Information at the Spark driver level

https://spark.apache.org/docs/latest/monitoring.html#component-instance--driver
driverSparkMetrics - metrics of the Spark driver

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
              / date=20220502 
                    / clusterId=0502-123457-fkasde121
                        / clusterId-0502-123457-fkasde121-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverOsMetrics

File name structure

<azure_subscription_id>/<databricks_workspace_id>/driverSparkMetrics/date=<yearmonthday>/clusterId=<cluster_id>/clusterName-<cluster_or_jobname>-run-<run_number>-clusterId-<cluster_id>-app-<spark-application>.csv
- 0326cb34-f0d6-41b9-b4e3-0cc6c24dd240/5721838913406123/driverSparkMetrics/date=20220504/clusterId=0514-133143-8x6wad4r/clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r-app-20220504133348-0000.csv
5721838913406123 - workspace id
date=20220504 - day when the metrics have been generated/saved
job-26-run-88036 - ephemeral cluster name, otherwise this would have been the all-purpose cluster name
clusterId-0514-133143-8x6wad4r - clusterID - uniquely identifies the cluster
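
If you need these components in code, the example path above can be split with a regular expression. This is a sketch under assumptions: the pattern follows the `clusterName-...-clusterId-...-app-...` order of the example on this page only, the function name `parse_metrics_path` is hypothetical, and a cluster name that itself contains `-clusterId-` or `-app-` would be split incorrectly.

```python
import re

# Matches the file name order used in the example above.
FILE_NAME = re.compile(
    r"^clusterName-(?P<cluster_name>.+)"
    r"-clusterId-(?P<cluster_id>.+)"
    r"-app-(?P<app>.+)\.csv$"
)


def parse_metrics_path(path: str) -> dict:
    """Split a raw metrics file path into its documented components."""
    segments = path.split("/")
    match = FILE_NAME.match(segments[-1])
    if match is None:
        raise ValueError(f"unexpected file name: {segments[-1]}")
    info = match.groupdict()
    # The first three folders are subscription id, workspace id and metric type.
    info["subscription_id"], info["workspace_id"], info["metric"] = segments[:3]
    # The fourth folder is the date partition, e.g. date=20220504.
    info["date"] = segments[3].split("=", 1)[1]
    return info


example = (
    "0326cb34-f0d6-41b9-b4e3-0cc6c24dd240/5721838913406123/driverSparkMetrics/"
    "date=20220504/clusterId=0514-133143-8x6wad4r/"
    "clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r-app-20220504133348-0000.csv"
)
info = parse_metrics_path(example)
print(info["cluster_name"], info["cluster_id"], info["app"])
```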

Columns

timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - data collection date formatted YYYYMMDD
BlockManager.memory.diskSpaceUsed_MB - Amount of disk space used
BlockManager.memory.maxMem_MB - Max memory limit
BlockManager.memory.maxOffHeapMem_MB - Max off-heap memory limit
BlockManager.memory.maxOnHeapMem_MB - Max on-heap memory limit
BlockManager.memory.memUsed_MB - Memory used
BlockManager.memory.offHeapMemUsed_MB - Off-heap memory used
BlockManager.memory.onHeapMemUsed_MB - On-heap memory used
BlockManager.memory.remainingMem_MB - Remaining memory
BlockManager.memory.remainingOffHeapMem_MB - Off-heap memory remaining
BlockManager.memory.remainingOnHeapMem_MB - On-heap memory remaining
HiveExternalCatalog.fileCacheHits - count how many times the process had to go to file cache
HiveExternalCatalog.filesDiscovered - how many files were found at the location indicated
HiveExternalCatalog.hiveClientCalls - Spark internal Metric associated with the HiveExternalCatalog (for detailed documentation, see links below)
HiveExternalCatalog.parallelListingJobCount - Spark internal Metric associated with the HiveExternalCatalog (for detailed documentation, see links below)
HiveExternalCatalog.partitionsFetched - count how many partitions fetched
CodeGenerator.compilationTime.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.sourceCodeSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.generatedClassSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.generatedMethodSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
DAGScheduler.job.activeJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.job.allJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.messageProcessingTime.meanRate - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.failedStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.runningStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.waitingStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
LiveListenerBus.numEventsPosted - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage).
ExecutorMetrics.OffHeapExecutionMemory - Peak off heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off heap memory (execution and storage)
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool 
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process
JVMCPU.jvmCpuTime - Spark internal metric associated with the Java Virtual Machine CPU (for detailed documentation, see links below)
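
As an example of working with these columns, the BlockManager memory columns can be combined into a utilisation ratio. This is a minimal sketch, assuming the CSV file has a header row with the column names listed above; the sample rows, including the timestamp format, are made up for illustration, and `memory_utilisation` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows; real files contain many more columns.
SAMPLE_CSV = """timestamp,BlockManager.memory.memUsed_MB,BlockManager.memory.maxMem_MB
2022-05-04 13:34:00,512,2048
2022-05-04 13:34:10,1536,2048
"""


def memory_utilisation(csv_text: str) -> list:
    """Return (timestamp, memUsed_MB / maxMem_MB) for every row."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        used = float(row["BlockManager.memory.memUsed_MB"])
        limit = float(row["BlockManager.memory.maxMem_MB"])
        # Guard against a zero limit so the ratio is always defined.
        result.append((row["timestamp"], used / limit if limit else 0.0))
    return result


for timestamp, ratio in memory_utilisation(SAMPLE_CSV):
    print(timestamp, f"{ratio:.0%}")
```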

For more details regarding each column, follow the links in the next section.

Spark documentation

HiveExternalCatalog:

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/hive/HiveExternalCatalog.html

Code generator:

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-CodeGenerator.html

DAGScheduler:

https://books.japila.pl/apache-spark-internals/scheduler/DAGScheduler/

LiveListenerBus:

https://books.japila.pl/apache-spark-internals/scheduler/LiveListenerBus/

All Spark internal components for driver:

https://spark.apache.org/docs/latest/monitoring.html#component-instance--driver

Driver Os Metrics

Information at the operating system level

driverOsMetrics - operating system (OS) metrics of the VM where the Spark driver runs

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
              / date=20220502 
                    / clusterId=0502-123457-fkasde121
                        / clusterId-0502-123457-fkasde121-app-20220502135427-0000.csv

Columns

These values are reported by the Java Virtual Machine running on the Spark driver's host.

Driver OS columns:

timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - Name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the driver
osName - Type of operating system
osArch - OS architecture
osVersion - Version of operating system

For CPU & memory, we collect the following metrics:

availableProcessors → the number of processors available to the Java virtual machine.
systemLoadAverage → the system load average for the last minute.
cpuSystemLoad → the "recent cpu usage" for the whole system.
cpuProcessLoad → the "recent cpu usage" for the Java Virtual Machine process.
cpuProcessTime → the CPU time used by the process on which the Java virtual machine is running in nanoseconds.
committedVirtualMemorySize → amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported
totalPhysicalMemorySize → the total amount of physical memory in bytes
freePhysicalMemorySize → the amount of free physical memory in bytes
freeSwapSpaceSize → the amount of free swap space in bytes
totalSwapSpaceSize → the total amount of swap space in bytes
processAllocatedMemory → the amount of memory allocated by the app, in bytes
processTotalMemory → the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes
processMemoryLoad → the fraction of the memory available to the JVM that is currently being used (between 0 and 1)
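
Two derived indicators are often useful on top of these raw values: the share of physical memory in use and the load average per processor. The sketch below assumes a row has already been read into a dictionary keyed by the column names above; the numbers are invented and `os_health` is a hypothetical helper name.

```python
def os_health(row: dict) -> dict:
    """Derive memory usage and per-core load from one OS metrics row."""
    total = float(row["totalPhysicalMemorySize"])
    free = float(row["freePhysicalMemorySize"])
    cores = int(row["availableProcessors"])
    return {
        # Share of physical memory that is not free (0 to 1).
        "physicalMemoryUsed": 1 - free / total,
        # A value above 1 suggests more runnable work than processors.
        "loadPerCore": float(row["systemLoadAverage"]) / cores,
    }


# Invented sample row; CSV values arrive as strings.
sample_row = {
    "totalPhysicalMemorySize": "16000000000",
    "freePhysicalMemorySize": "4000000000",
    "availableProcessors": "4",
    "systemLoadAverage": "6.0",
}
health = os_health(sample_row)
print(health)
```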

Executor OS Metrics

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics
              / date=20220502 
                    / clusterId=0726-085022-s52b56hp
                        / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics

File sample:

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorOsMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

executorOsMetrics - operating system (OS) metrics of the VM where the Spark executor runs

We collect CPU and memory metrics from Spark executors using the Spark plugin framework.

This mechanism lets users plug in custom code at the driver and the executors. It offers a hook through which messages can be sent from the executors to the driver. This functionality is supported starting with Spark 3.0.

Executors report to the driver every 10 seconds (configurable), using a custom sink, which reads CPU & memory metrics, and sends them to the driver via the plugin hook.

On the driver, these metrics accumulate from all executors and are reported in a CSV file every 10 seconds (configurable).

Columns

Executor OS columns:

timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the executor
executorId - Spark executor ID
osName - Type of operating system
osArch - OS architecture
osVersion - Version of operating system

For CPU & memory, we collect the following metrics:

availableProcessors -> the number of processors available to the Java virtual machine.
systemLoadAverage -> the system load average for the last minute.
cpuSystemLoad -> the "recent cpu usage" for the whole system.
cpuProcessLoad -> the "recent cpu usage" for the Java Virtual Machine process.
cpuProcessTime -> the CPU time used by the process on which the Java virtual machine is running in nanoseconds.
committedVirtualMemorySize -> amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported
totalPhysicalMemorySize -> the total amount of physical memory in bytes
freePhysicalMemorySize -> the amount of free physical memory in bytes
freeSwapSpaceSize -> the amount of free swap space in bytes
totalSwapSpaceSize -> the total amount of swap space in bytes
processAllocatedMemory -> the amount of memory allocated by the app, in bytes
processTotalMemory -> the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes
processMemoryLoad -> the amount of memory available to the JVM that is currently being used (between 0 and 1)
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool 
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.OffHeapExecutionMemory - Peak off heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off heap memory (execution and storage)
ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage)
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, see the External shuffle link under Spark documentation below
ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, see the External shuffle link under Spark documentation below
executor.bytesRead - How many bytes read
executor.bytesWritten - How many bytes written
executor.cpuTime - CPU time in nanoseconds
executor.deserializeCPUTime - CPU time used to deserialize objects, in nanoseconds
executor.deserializeTime - Elapsed time used to deserialize objects, in nanoseconds
executor.jvmGCTime - How many nanoseconds used for garbage collection
executor.memoryBytesSpilled - How many bytes spilled from memory to disk
executor.recordsRead - How many records read
executor.recordsWritten - How many records written
executor.resultSerializationTime - How many nanoseconds to serialize the result
executor.resultSize - How large is the result in bytes
executor.shuffleBytesWritten - How many bytes written for shuffle operations
executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data
executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage
executor.shuffleLocalBytesRead - How many bytes read from local storage
executor.shuffleRecordsRead - How many records read from shuffled data
executor.shuffleRecordsWritten - How many records written from shuffled data
executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors
executor.shuffleRemoteBytesRead - How many bytes read from remote executors
executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk
executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote)
executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data
executor.succeededTasks - How many tasks succeeded
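
Because the driver accumulates samples from all executors into one file, a common first step is to split the rows by executorId. The following is a sketch rather than a reference implementation: it assumes rows are dictionaries keyed by the column names above, uses invented values, and `summarise_executors` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def summarise_executors(rows: list) -> dict:
    """Group samples by executorId and summarise CPU and memory load."""
    by_executor = defaultdict(list)
    for row in rows:
        by_executor[row["executorId"]].append(row)
    return {
        executor_id: {
            "samples": len(samples),
            "meanCpuProcessLoad": mean(float(r["cpuProcessLoad"]) for r in samples),
            "peakProcessMemoryLoad": max(float(r["processMemoryLoad"]) for r in samples),
        }
        for executor_id, samples in by_executor.items()
    }


# Invented samples from two executors.
SAMPLE_ROWS = [
    {"executorId": "0", "cpuProcessLoad": "0.25", "processMemoryLoad": "0.5"},
    {"executorId": "0", "cpuProcessLoad": "0.75", "processMemoryLoad": "0.625"},
    {"executorId": "1", "cpuProcessLoad": "0.5", "processMemoryLoad": "0.25"},
]
summary = summarise_executors(SAMPLE_ROWS)
for executor_id, stats in summary.items():
    print(executor_id, stats)
```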

Spark documentation

All Spark internal components for executor:

https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle:

https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read

Executor Spark Metrics

executorSparkMetrics - metrics collected at the level of the Spark executor, provided by the Spark API

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
              / date=20220502 
                    / clusterId=0726-085022-s52b56hp
                        / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> executorOsMetrics
        |> driverSparkMetrics
        |> driverOsMetrics

File sample:

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorSparkMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

Columns

timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - Name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the executor
executorId - Spark executor ID
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool 
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.OffHeapExecutionMemory - Peak off-heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off-heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off-heap memory (execution and storage)
ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage)
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, see the External shuffle link under Spark documentation below
ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, see the External shuffle link under Spark documentation below
executor.bytesRead - How many bytes read
executor.bytesWritten - How many bytes written
executor.cpuTime - CPU time in nanoseconds
executor.deserializeCPUTime - CPU time used to deserialize objects, in nanoseconds
executor.deserializeTime - Elapsed time used to deserialize objects, in nanoseconds
executor.jvmGCTime - How many nanoseconds used for garbage collection
executor.memoryBytesSpilled - How many bytes spilled from memory to disk
executor.recordsRead - How many records read
executor.recordsWritten - How many records written
executor.resultSerializationTime - How many nanoseconds to serialize the result
executor.resultSize - How large is the result in bytes
executor.shuffleBytesWritten - How many bytes written for shuffle operations
executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data
executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage
executor.shuffleLocalBytesRead - How many bytes read from local storage
executor.shuffleRecordsRead - How many records read from shuffled data
executor.shuffleRecordsWritten - How many records written from shuffled data
executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors
executor.shuffleRemoteBytesRead - How many bytes read from remote executors
executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk
executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote)
executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data
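
One ratio that can be derived from these columns is the share of shuffle data that an executor had to read from other executors. This is a sketch under two assumptions: the counters are cumulative, so the latest row of an executor describes its totals so far, and rows are dictionaries keyed by the column names above. The helper name `shuffle_remote_fraction` and the sample values are hypothetical.

```python
def shuffle_remote_fraction(row: dict) -> float:
    """Share of shuffle bytes read from remote executors (0 to 1)."""
    total = float(row["executor.shuffleTotalBytesRead"])
    remote = float(row["executor.shuffleRemoteBytesRead"])
    # No shuffle reads yet means there is nothing remote either.
    return remote / total if total else 0.0


# Invented latest rows for two executors.
busy = {"executor.shuffleTotalBytesRead": "4000", "executor.shuffleRemoteBytesRead": "1000"}
idle = {"executor.shuffleTotalBytesRead": "0", "executor.shuffleRemoteBytesRead": "0"}

print(shuffle_remote_fraction(busy))
print(shuffle_remote_fraction(idle))
```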

Spark documentation

All Spark internal components for executor:

https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle:

https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read

Task Metrics

File path:

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics
        |> taskMetrics
              / date=20220502 
                    / dbJobId=1174
                        / dbJobRunId=564070
                            / dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / 511420607229897 / taskMetrics / date=20220726 / dbJobId=1174 / dbJobRunId=564070

File name:

dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv

Columns

timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
dbJobId - Id of job
dbJobName - Name of job
dbJobRunId - Run Id of job
dbTaskRunId - Run Id of task
notebookId - Id of notebook that task is run on
notebook_path - Path of notebook that task is run on
sparkJobId - Id for individual job
sparkJobGroupId - An id spark assigns to a group of spark jobs
userId - User that is running the task
stageId - Stage Id for stage within the given task
stageAttemptId - Attempt number (were retries necessary?)
taskType - What type of task was performed (e.g. resultTask, shuffleMapTask)
taskStatus - State of the task (succeeded or failed)
taskIndex - Task number starting from 0
taskId - Task identifier
attemptNumber - Attempt number of run
launchTime - Start time of task in timestamp format
finishTime - Finish time of task in timestamp format
duration - How long did the task take in nanoseconds
schedulerDelay - How long did the scheduler take to begin in nanoseconds
executorId - Identifier of the executor that kicked off the given task
host - Ip address for the executor that kicked off the task
taskLocality - Data locality level of the task (e.g. process_local)
speculative - Speculative execution of tasks was enabled (true) or disabled (false)
gettingResultTime - How long did the task take to get the result
successful - Was the task successful?
executorRunTime - How long did the executor run the task in nanoseconds
executorCpuTime - How long did the executor use the cpu, in nanoseconds
executorDeserializeTime - How long did the executor take to deserialize data, in nanoseconds
executorDeserializeCpuTime - How long did the executor process data in cpu to deserialize data, in nanoseconds
resultSerializationTime - How long did it take to serialize the results, in nanoseconds
jvmGCTime - How long did the java virtual machine take in garbage collection, in nanoseconds
resultSize - How large was the result, in bytes
numUpdatedBlockStatuses - Storage statuses of any blocks that were updated as a result of the indicated task
diskBytesSpilled - How many on-disk bytes were spilled by the task
memoryBytesSpilled - How many in-memory bytes were spilled by the task
peakExecutionMemory - What was the most memory consumed in execution, in bytes
recordsRead - How many records read
bytesRead - Number of bytes read
recordsWritten - Number of records written
bytesWritten - bytes written by the Spark Job
shuffleFetchWaitTime - How long did the task wait for shuffled data in nanoseconds
shuffleTotalBytesRead - How many bytes of shuffled data were read
shuffleTotalBlocksFetched - How many blocks were fetched for shuffled data
shuffleLocalBlocksFetched - How many shuffle blocks were fetched from local storage
shuffleRemoteBlocksFetched - How many shuffle blocks were fetched from remote executors
shuffleWriteTime - How long did it take to write shuffle data in nanoseconds
shuffleBytesWritten - Number of bytes written for shuffle operations
shuffleRecordsWritten - Number of records written for shuffle operations
errorMessage - Message displayed if an error was detected
sparkJobStartTime - start time of the SparkJob provided through the SparkJobListener API, as measured by Spark
sparkJobEndTime - end time of the SparkJob provided through the SparkJobListener API, as measured by Spark
sparkJobDuration - duration of the SparkJob provided through the SparkJobListener API, as measured by Spark
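
Task-level rows make it possible to spot skewed stages, where one task runs much longer than its peers. The sketch below compares the slowest task with the median task per stageId. It assumes all rows come from a single file (one Spark application, so stage ids do not collide) and are dictionaries keyed by the column names above; `stage_skew` and the sample durations are hypothetical.

```python
from collections import defaultdict
from statistics import median


def stage_skew(rows: list) -> dict:
    """Return max(duration) / median(duration) for every stageId."""
    durations = defaultdict(list)
    for row in rows:
        durations[row["stageId"]].append(float(row["duration"]))
    skew = {}
    for stage_id, values in durations.items():
        typical = median(values)
        # A zero median would make the ratio undefined, so report None.
        skew[stage_id] = max(values) / typical if typical else None
    return skew


# Invented task rows: stage 3 has one straggler, stage 4 is balanced.
SAMPLE_TASKS = [
    {"stageId": "3", "duration": "100"},
    {"stageId": "3", "duration": "100"},
    {"stageId": "3", "duration": "400"},
    {"stageId": "4", "duration": "200"},
    {"stageId": "4", "duration": "200"},
]
skew = stage_skew(SAMPLE_TASKS)
print(skew)
```

A ratio close to 1 means the tasks of a stage took similar time; a large ratio points at a straggler worth inspecting.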

Spark documentation

https://spark.apache.org/docs/latest/monitoring.html#executor-task-metrics

Consumption data

Consumption data, i.e. cost and usage information, is kept separate from the Databricks workspace folders.

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/(azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / consumptions / json / workspaceId=511420607229897 / date=20220721

File name: part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json

File format:

The file contains one JSON object per line.
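
A file in this one-object-per-line layout can be read line by line with the standard library. This is a minimal sketch: the two sample records are invented, contain only a few of the columns, and `read_json_lines` is a hypothetical helper name.

```python
import io
import json


def read_json_lines(stream):
    """Yield one dictionary per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample content standing in for a consumption file.
SAMPLE_FILE = io.StringIO(
    '{"product": "example product", "quantity": 2.0, "cost": 0.5}\n'
    '{"product": "example product", "quantity": 4.0, "cost": 1.0}\n'
)
records = list(read_json_lines(SAMPLE_FILE))
print(len(records), records[0]["cost"])
```

With a real file, pass an open file object instead of the in-memory sample.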

Columns

id - Partial row identifier, contains subscription id, billing period Id, and a usage detail guid. Can appear multiple times for the same billing period start and end date.
name - a guid
billingAccountId - Id of billing account
billingAccountName - Name of billing account
billingPeriodStartDate - Date the metered usage began (in DateTime format, and marked UTC), but only the date is populated
billingPeriodEndDate - Date the metered usage ended (in DateTime format, and marked UTC), but only the date is populated
billingProfileName - Subscription name
accountOwnerId - Account owner Id
accountName - Account name
subscriptionId - The subscription id of the referenced Azure subscription
subscriptionName - Name of Azure subscription
workspaceName - When filled, it refers to the Databricks instance associated with the billing meter
product - Product name for the consumed service or purchase (not available for Marketplace)
quantity - Quantity of metered usage (units vary with the resource used: storage, network, etc.)
effectivePrice - What you pay (discounts applied)
cost - Quantity * effectivePrice
unitPrice - What you pay per unit of quantity
billingCurrency - Currency code (USD)
resourceLocation - Region where resource is deployed
consumedService - Parent category of resource type
resourceType - More detailed description of resource used (storageAccount, publicIPAddresses, virtualMachines, etc.)
resourceId - Full “path” name for resource consumed. Includes subscription, resource group, Consumed Service, and GUID of resource consumed
additionalInfo - Additional details of usage
costCenter - The cost center of the department if it is a department and a cost center is provided (typically null)
resourceGroup - Resource group of consumed resource
publisherName - “Microsoft”, but often appears as null
publisherType - Most often ‘Azure’
chargeType - Most often ‘Usage’
frequency - Frequency of the charge, most often ‘usageBased’
payGPrice - Retail price for the resource
pricingModel - Identifier that indicates how the meter is priced
dbJobId - Associated job for the resource consumed
dbJobRunName - Associated job name for the resource consumed
dbJobRunId - Associated job run id for the resource consumed
dbEnv - Environment of resource consumed
dbClusterId - Associated cluster ID for the resource consumed
dbClusterName - Associated cluster name for the resource consumed
dbCreator - Individual responsible for resource consumed
dbLakehouseMonitorEnabled - “Enabled” if LakehouseMonitor was enabled for the specified resource
dbCreatedDate - A tag set by Databricks, intended to indicate when the cluster was created (always null)
tags - Any KVP tags associated with the resource consumed
meterDetails - Collection of additional information for the resource consumed
usageInnerType - Usually ‘legacy’
workspaceResolution - Debugging field which indicates how workspace information is resolved
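
A common use of these columns is to attribute cost to Databricks clusters. The sketch below sums cost per dbClusterId and checks the documented relation cost = quantity * effectivePrice. It is an illustration only: cost_by_cluster and cost_matches are hypothetical helpers, the rows are invented, and rows without a cluster id are grouped under None.

```python
import math
from collections import defaultdict


def cost_by_cluster(records):
    """Sum the cost column per dbClusterId; rows without a cluster go under None."""
    totals = defaultdict(float)
    for row in records:
        totals[row.get("dbClusterId")] += float(row.get("cost") or 0.0)
    return dict(totals)


def cost_matches(row, rel_tol=1e-6):
    """Check cost == quantity * effectivePrice for one row, within a tolerance."""
    expected = row["quantity"] * row["effectivePrice"]
    return math.isclose(row["cost"], expected, rel_tol=rel_tol)


# Invented rows (not real billing data).
rows = [
    {"dbClusterId": "cluster-a", "quantity": 2.0, "effectivePrice": 0.5, "cost": 1.0},
    {"dbClusterId": "cluster-a", "quantity": 4.0, "effectivePrice": 0.25, "cost": 1.0},
    {"dbClusterId": None, "quantity": 1.0, "effectivePrice": 3.0, "cost": 3.0},
]
totals = cost_by_cluster(rows)
print(totals)
```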

💠 Enhanced data

Aggregated data is saved in:

jobRunAnalysis - aggregated information regarding telemetry data of Databricks job runs
notebookAnalysis - aggregated information regarding telemetry data of Databricks notebook runs

Job Run Analysis

Job Run Analysis represents an aggregation by db-job-id, db-run-id and cluster-id of the Task Metrics joined with Cluster Os Metrics.

We employ:

- aggregations recommended by the data science team
- one file → one row
  - each row describes one job run, which might have multiple tasks
  - some columns are aggregated with max or avg
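
To make the "one job run → one row" idea concrete, the following sketch groups task-level rows by job id, run id and cluster id and reduces them with max and avg. It is not the actual Lakehouse Monitor implementation: aggregate_job_runs is a hypothetical helper, and the shape of the input rows (one numeric cpu and shuffle value per task) is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_job_runs(task_rows):
    """Collapse task-level rows into one row per (dbJobId, dbJobRunId, clusterId)."""
    groups = defaultdict(list)
    for row in task_rows:
        groups[(row["dbJobId"], row["dbJobRunId"], row["clusterId"])].append(row)

    result = []
    for (job_id, run_id, cluster_id), rows in groups.items():
        cpu = [r["cpu"] for r in rows]
        result.append({
            "dbJobId": job_id,
            "dbJobRunId": run_id,
            "clusterId": cluster_id,
            "shuffle": max(r["shuffle"] for r in rows),  # max over tasks
            "cpu": {"avg": mean(cpu), "max": max(cpu)},  # avg and max over tasks
        })
    return result


# Invented task rows: two tasks for run r1, one task for run r2.
tasks = [
    {"dbJobId": "j1", "dbJobRunId": "r1", "clusterId": "c1", "cpu": 20.0, "shuffle": 100},
    {"dbJobId": "j1", "dbJobRunId": "r1", "clusterId": "c1", "cpu": 40.0, "shuffle": 300},
    {"dbJobId": "j1", "dbJobRunId": "r2", "clusterId": "c1", "cpu": 10.0, "shuffle": 50},
]
job_runs = aggregate_job_runs(tasks)
print(job_runs)
```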

Columns

dbTaskRunId - Run Id of task
clusterId - Id of assigned cluster
duration - Duration of job in nanoseconds
skew - Amount of skew in the job
shuffle - Max shuffle total
spill - Max spill in disk and memory
status - Status of job (success or failed)
dbDuration - JSON object containing setup, execution, cleanup and total time in nanoseconds
cpu - Average and max cpu usage
memory - Average and max memory usage
clusterName - Name of cluster
dbJobName - Name of job
userId - User who scheduled the job
cost - Cost of job
dbJobId - Job id
date - Date of job run
dbJobRunId - Run Id of job
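
Since duration and dbDuration are expressed in nanoseconds, they usually need converting before they are shown in a report. The following is a minimal sketch; the key names inside dbDuration (setup, execution, cleanup, total) and its storage as a JSON string are assumptions based on the column description above, and the values are invented.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def nanos_to_seconds(nanos: int) -> float:
    """Convert a nanosecond duration to seconds."""
    return nanos / NANOS_PER_SECOND


def db_duration_seconds(db_duration_json: str) -> dict:
    """Parse a dbDuration JSON string and convert every value to seconds."""
    raw = json.loads(db_duration_json)
    return {key: nanos_to_seconds(value) for key, value in raw.items()}


# Invented example; key names are assumed, not confirmed by the file format.
sample = (
    '{"setup": 30000000000, "execution": 120000000000, '
    '"cleanup": 1500000000, "total": 151500000000}'
)
durations = db_duration_seconds(sample)
print(durations)
```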

Notebook Analysis

Notebook Analysis represents aggregated information regarding telemetry data of Databricks notebook runs.

Columns

userId - User who ran the notebook
duration - Time that notebook ran
startTime - Start time of notebook
endTime - End time of notebook
skew - Amount of skew in the data
shuffle - Max shuffle total
spill - Max spill (disk and memory)
cpu - Average and max of cpu usage
memory - Average and max of memory usage
clusterName - Name of cluster
notebookPath - Path where notebook is located
clusterId - Associated cluster Id
date - Date when the notebook was run
notebookId - Notebook id
