Data Dictionary (AWS)

 

Lakehouse Optimizer stores the collected metrics directly into cloud storage from where you can freely access the data and analyze it.

The Lakehouse Optimizer interface serves as both an admin console and a canned-report view. It surfaces about a dozen metrics along with recommended actions, while more than 100 telemetry data points are persisted behind the scenes.

This is a starting place for you to understand the data, how it relates, what the metrics are and what they mean, so that you can build your own custom dashboards and reports and integrate with other systems as needed.

Applies to version 1.3.0. Future versions might collect even more telemetry data points.

 


🧭 Where is the data

Storage path

The cloud storage path is configured during installation. You can find it in the settings panel of the web interface.

Lakehouse Monitor settings panel

Azure cloud storage

Navigate to Azure Portal, then open Storage accounts panel and open the Azure Storage Account that you configured for the Lakehouse Monitor to use.

Azure storage account

Next click on the Data Storage → Containers menu and open the Azure Blob Container that you have assigned.

Explore metrics files

Once you have identified the storage account and blob container where your data is stored, you can use Azure Storage Explorer to browse the raw metrics.

You should expect this folder structure:

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> json files
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files
    |> bplm-config/ (lakehouse monitor configurations)
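The `workspaceId=...` and `date=...` folder names carry partition values in `key=value` form. As a minimal sketch (the helper names `parse_partitions` and `partition_date` are made up for this example and are not part of the product), those values could be recovered from a path like this:

```python
from datetime import date, datetime


def parse_partitions(path: str) -> dict:
    """Collect the key=value folder names found in a storage path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def partition_date(path: str) -> date:
    """Return the date=YYYYMMDD partition of a path as a date object."""
    return datetime.strptime(parse_partitions(path)["date"], "%Y%m%d").date()


# Path segments taken from the folder structure shown above.
example = "consumptions/json/workspaceId=2149xxxxxxxxxxx/date=20220511"
print(parse_partitions(example))
print(partition_date(example))
```

Parsing the date into a `date` object makes it easier to filter folders by a time range before downloading any files.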

💽 What data is collected

Lakehouse Monitor collects raw metrics in the following structure:

  • driverSparkMetrics - metrics of the Spark driver

  • driverOsMetrics - operating system (OS) metrics of the VM where the Spark driver runs

  • executorSparkMetrics - metrics of the Spark executor

  • executorOsMetrics - operating system (OS) metrics of the VM where the Spark executor runs

  • taskMetrics - metrics collected from Spark tasks

Aggregated data is saved in:

  • jobRunAnalysis - aggregated information regarding telemetry data of Databricks jobs runs

  • notebookAnalysis - aggregated information regarding telemetry data of Databricks notebooks runs

Consumption data

  • consumptions - consumption (cost and usage) metrics

 

Driver Spark Metrics

Information at the Spark driver level

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics / date=20220502 / clusterId=0502-123457-fkasde121 / clusterId-0502-123457-fkasde121-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverOsMetrics

File sample:

File name structure

  • <azure_subscription_id>/<databricks_workspace_id>/driverSparkMetrics/date=<yearmonthday>/clusterId=<cluster_id>/clusterName-<cluster_or_jobname>-run-<run_number>-clusterId-<cluster_id>-app-<spark-application>.csv

    • 0326cb34-f0d6-41b9-b4e3-0cc6c24dd240/5721838913406123/driverSparkMetrics/date=20220504/clusterId=0514-133143-8x6wad4r/clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r-app-20220504133348-0000.csv

  • 5721838913406123 - workspace id

  • date=20220504 - day when the metrics were generated/saved

  • job-26-run-88036 - ephemeral cluster name; for an all-purpose cluster this would be the all-purpose cluster name

  • clusterId-0514-133143-8x6wad4r - cluster ID, which uniquely identifies the cluster
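The parts of a file name can be pulled apart with a regular expression. The sketch below covers only the `clusterName-...-clusterId-...-app-...` ordering from the example above. The group names and the function `parse_metrics_file_name` are hypothetical, and file names that use a different ordering raise an error instead of being guessed at.

```python
import re

# Pattern follows the file name structure described above; the group names
# (cluster_name, cluster_id, app) are labels chosen for this example.
FILE_NAME = re.compile(
    r"clusterName-(?P<cluster_name>.+?)"
    r"-clusterId-(?P<cluster_id>.+?)"
    r"-(?P<app>app-.+)\.csv"
)


def parse_metrics_file_name(file_name: str) -> dict:
    """Split a metrics file name into cluster name, cluster ID and Spark app."""
    match = FILE_NAME.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.groupdict()


parts = parse_metrics_file_name(
    "clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r"
    "-app-20220504133348-0000.csv"
)
print(parts)
```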

Columns

  • timestamp - time of data collection

  • app - a unique identifier for the Spark application

  • clusterId - ID of assigned Databricks cluster

  • clusterName - name of cluster

  • orgId - alias of workspaceId

  • workspaceId - ID of the Databricks workspace

  • workspaceName - name of the Databricks workspace

  • date - data collection date formatted YYYYMMDD

  • BlockManager.memory.diskSpaceUsed_MB - Amount of disk space used 

  • BlockManager.memory.maxMem_MB - Max memory limit 

  • BlockManager.memory.maxOffHeapMem_MB - Max off-heap memory limit

  • BlockManager.memory.maxOnHeapMem_MB - Max on-heap memory limit

  • BlockManager.memory.memUsed_MB - Memory used 

  • BlockManager.memory.offHeapMemUsed_MB - Off heap memory used 

  • BlockManager.memory.onHeapMemUsed_MB - On-heap memory used 

  • BlockManager.memory.remainingMem_MB - Remaining memory 

  • BlockManager.memory.remainingOffHeapMem_MB - Off-heap memory remaining 

  • BlockManager.memory.remainingOnHeapMem_MB - On-heap memory remaining 

  • HiveExternalCatalog.fileCacheHits - count how many times the process had to go to file cache

  • HiveExternalCatalog.filesDiscovered - how many files were found at the location indicated

  • HiveExternalCatalog.hiveClientCalls - Spark internal Metric associated with the HiveExternalCatalog (for detailed documentation, see links below)  

  • HiveExternalCatalog.parallelListingJobCount - Spark internal Metric associated with the HiveExternalCatalog (for detailed documentation, see links below)  

  • HiveExternalCatalog.partitionsFetched - count how many partitions fetched

  • CodeGenerator.compilationTime.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)  

  • CodeGenerator.sourceCodeSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below) 

  • CodeGenerator.generatedClassSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below) 

  • CodeGenerator.generatedMethodSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below) 

  • DAGScheduler.job.activeJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • DAGScheduler.job.allJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • DAGScheduler.messageProcessingTime.meanRate - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • DAGScheduler.stage.failedStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • DAGScheduler.stage.runningStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • DAGScheduler.stage.waitingStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below) 

  • LiveListenerBus.numEventsPosted - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.appStatus.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.appStatus.numDroppedEvents - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.appStatus.size - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.executorManagement.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.executorManagement.numDroppedEvents - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.executorManagement.size - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.streams.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.streams.numDroppedEvents - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.streams.size - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.shared.numDroppedEvents - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.shared.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • LiveListenerBus.queue.shared.size - Spark internal metric associated with the LiveListenerBus  (for detailed documentation, see links below) 

  • ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation 

  • ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM 

  • ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes 

  • ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes 

  • ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage). 

  • ExecutorMetrics.OffHeapExecutionMemory - Peak off heap execution memory in use, in bytes 

  • ExecutorMetrics.OffHeapStorageMemory - Peak off heap storage memory in use, in bytes 

  • ExecutorMetrics.OffHeapUnifiedMemory - Peak off heap memory (execution and storage)

  • ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool  

  • ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool 

  • ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count 

  • ExecutorMetrics.MajorGCCount - Total major GC count 

  • ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC 

  • ExecutorMetrics.MajorGCTime - Elapsed time for total major GC 

  • ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes 

  • ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory 

  • ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes 

  • ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python 

  • ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes 

  • ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process

  • JVMCPU.jvmCpuTime - Spark internal metric associated with the Java Virtual Machine CPU (for detailed documentation, see links below) 
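As one illustration of how these columns can be combined, the sketch below derives storage memory utilization from `BlockManager.memory.memUsed_MB` and `BlockManager.memory.maxMem_MB`. It assumes the CSV file has a header row with the column names listed above. The sample rows and the function name `storage_memory_utilization` are made up for the example.

```python
import csv
import io

# Hypothetical two-row sample; real files contain many more columns.
SAMPLE = """timestamp,BlockManager.memory.memUsed_MB,BlockManager.memory.maxMem_MB
1651500000000,512,4096
1651500010000,1024,4096
"""


def storage_memory_utilization(csv_text: str) -> list:
    """Return memUsed_MB / maxMem_MB for every row with a non-zero limit."""
    ratios = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        max_mem = float(row["BlockManager.memory.maxMem_MB"])
        if max_mem > 0:
            used = float(row["BlockManager.memory.memUsed_MB"])
            ratios.append(used / max_mem)
    return ratios


print(storage_memory_utilization(SAMPLE))
```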

For more details regarding each column, see the Spark documentation referenced in the next section.

Spark documentation

HiveExternalCatalog: 

Code generator: 

DAGScheduler

https://books.japila.pl/apache-spark-internals/scheduler/DAGScheduler/  

LiveListenerBus: 

All Spark internal components for driver:

 

Driver OS Metrics

  • driverOsMetrics - operating system (OS) metrics of the VM where the Spark driver runs

File path

storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics / date=20220502 / clusterId=0502-123457-fkasde121 / clusterId-0502-123457-fkasde121-app-20220502135427-0000.csv

File sample:

Columns

These values are reported by the Java Virtual Machine running on the Spark driver's host.

Driver OS columns:

  • timestamp - time of data collection

  • app - a unique identifier for the Spark application

  • clusterId - ID of assigned Databricks cluster

  • clusterName - Name of assigned cluster 

  • orgId - alias of workspaceId

  • workspaceId - ID of the Databricks workspace

  • workspaceName - name of the Databricks workspace

  • date - calendar date formatted YYYYMMDD

  • nodeIp - IP address of the driver 

  • osName - Type of operating system 

  • osArch - OS architecture

  • osVersion - Version of operating system

For CPU & memory, we collect the following metrics:

  • availableProcessors → the number of processors available to the Java virtual machine.

  • systemLoadAverage → the system load average for the last minute.

  • cpuSystemLoad → the "recent cpu usage" for the whole system.

  • cpuProcessLoad → the "recent cpu usage" for the Java Virtual Machine process.

  • cpuProcessTime → the CPU time used by the process on which the Java virtual machine is running in nanoseconds.

  • committedVirtualMemorySize → amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported

  • totalPhysicalMemorySize →  the total amount of physical memory in bytes

  • freePhysicalMemorySize → the amount of free physical memory in bytes

  • freeSwapSpaceSize → the amount of free swap space in bytes

  • totalSwapSpaceSize → the total amount of swap space in bytes

  • processAllocatedMemory → the amount of memory allocated by the app, in bytes

  • processTotalMemory → the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes

  • processMemoryLoad → the fraction of the memory available to the JVM that is currently in use (between 0 and 1)
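Because `cpuProcessTime` is described as the CPU time used by the process, it can be turned into a utilization figure by comparing two consecutive samples. The sketch below assumes the value is cumulative and that `timestamp` is in epoch milliseconds. Neither assumption is confirmed here, so check a real file before relying on it. The function name and the sample values are hypothetical.

```python
def cpu_utilization(prev: dict, curr: dict) -> float:
    """Fraction of available CPU capacity the JVM used between two samples.

    Assumes `timestamp` is in epoch milliseconds (an assumption) and
    `cpuProcessTime` is cumulative nanoseconds.
    """
    elapsed_ns = (curr["timestamp"] - prev["timestamp"]) * 1_000_000
    if elapsed_ns <= 0:
        raise ValueError("samples must be in increasing time order")
    used_ns = curr["cpuProcessTime"] - prev["cpuProcessTime"]
    return used_ns / (elapsed_ns * curr["availableProcessors"])


# Hypothetical samples taken 10 seconds apart on a 4-core driver.
first = {
    "timestamp": 1_651_500_000_000,
    "cpuProcessTime": 20_000_000_000,
    "availableProcessors": 4,
}
second = {
    "timestamp": 1_651_500_010_000,
    "cpuProcessTime": 40_000_000_000,
    "availableProcessors": 4,
}
print(cpu_utilization(first, second))
```

Dividing by `availableProcessors` normalizes the result to the 0 to 1 range, which makes it comparable with `cpuProcessLoad`.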

Executor OS Metrics

File path

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorOsMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

File sample:

We collect CPU and memory metrics from Spark executors using the Spark plugin framework.

This mechanism lets users plug in custom code at the driver and the executors. It offers a hook through which executors can send messages to the driver. This functionality is supported starting with Spark 3.0.

Executors report to the driver every 10 seconds (configurable), using a custom sink, which reads CPU & memory metrics, and sends them to the driver via the plugin hook.

On the driver, these metrics accumulate from all executors and are reported in a CSV file every 10 seconds (configurable).
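Because the driver writes rows from all executors into the same file, a per-executor view needs a grouping step. The following is a minimal sketch, assuming the rows have already been read into dictionaries keyed by the column names listed in this section (`executorId`, `cpuProcessLoad`). The function name and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def average_by_executor(rows: list, column: str) -> dict:
    """Average one numeric column per executorId."""
    values = defaultdict(list)
    for row in rows:
        values[row["executorId"]].append(float(row[column]))
    return {executor: mean(samples) for executor, samples in values.items()}


# Hypothetical rows, as they might look after reading the CSV file.
rows = [
    {"executorId": "0", "cpuProcessLoad": "0.25"},
    {"executorId": "1", "cpuProcessLoad": "0.75"},
    {"executorId": "0", "cpuProcessLoad": "0.75"},
]
print(average_by_executor(rows, "cpuProcessLoad"))
```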

Columns

Executor OS columns:

  • timestamp - time of data collection

  • app - a unique identifier for the Spark application

  • clusterId - ID of assigned Databricks cluster

  • clusterName - name of assigned cluster

  • orgId - alias of workspaceId

  • workspaceId - ID of the Databricks workspace

  • workspaceName - name of the Databricks workspace

  • date - calendar date formatted YYYYMMDD

  • nodeIp - IP address of the executor

  • executorId - Spark executor ID

  • osName - Type of operating system 

  • osArch -  OS architecture 

  • osVersion - Version of operating system 

For CPU & memory, we collect the following metrics:

  • availableProcessors → the number of processors available to the Java virtual machine.

  • systemLoadAverage → the system load average for the last minute.

  • cpuSystemLoad → the "recent cpu usage" for the whole system.

  • cpuProcessLoad → the "recent cpu usage" for the Java Virtual Machine process.

  • cpuProcessTime → the CPU time used by the process on which the Java virtual machine is running in nanoseconds.

  • committedVirtualMemorySize → amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported

  • totalPhysicalMemorySize → the total amount of physical memory in bytes

  • freePhysicalMemorySize → the amount of free physical memory in bytes

  • freeSwapSpaceSize → the amount of free swap space in bytes

  • totalSwapSpaceSize → the total amount of swap space in bytes

  • processAllocatedMemory → the amount of memory allocated by the app, in bytes

  • processTotalMemory → the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes

  • processMemoryLoad → the fraction of the memory available to the JVM that is currently in use (between 0 and 1)

  • ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool  

  • ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation 

  • ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM 

  • ExecutorMetrics.MajorGCCount - Total major GC count 

  • ExecutorMetrics.MajorGCTime - Elapsed time for total major GC 

  • ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool 

  • ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count 

  • ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC 

  • ExecutorMetrics.OffHeapExecutionMemory - Peak off heap execution memory in use, in bytes 

  • ExecutorMetrics.OffHeapStorageMemory - Peak off heap storage memory in use, in bytes

  • ExecutorMetrics.OffHeapUnifiedMemory - Peak off heap memory (execution and storage)

  • ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes 

  • ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes 

  • ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage)

  • ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory 

  • ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes 

  • ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process

  • ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes 

  • ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python

  • ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes 

  • ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, please see Additional Documentation below 

  • ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, please see Additional Documentation below 

  • executor.bytesRead - How many bytes read 

  • executor.bytesWritten - How many bytes written 

  • executor.cpuTime - CPU time in nanoseconds

  • executor.deserializeCPUTime - How many nanoseconds of CPU time used to deserialize objects

  • executor.deserializeTime - How many nanoseconds used to deserialize objects

  • executor.jvmGCTime - How many nanoseconds used for garbage collection 

  • executor.memoryBytesSpilled - How many bytes spilled from memory to disk 

  • executor.recordsRead - How many records read 

  • executor.recordsWritten - How many records written 

  • executor.resultSerializationTime - How many nanoseconds to serialize the result 

  • executor.resultSize - How large is the result in bytes 

  • executor.shuffleBytesWritten - How many bytes written for shuffle operations

  • executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data 

  • executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage 

  • executor.shuffleLocalBytesRead - How many bytes read from local storage 

  • executor.shuffleRecordsRead - How many records read from shuffled data 

  • executor.shuffleRecordsWritten - How many records written from shuffled data 

  • executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors 

  • executor.shuffleRemoteBytesRead - How many bytes read from remote executors 

  • executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk 

  • executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote) 

  • executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data 

  • executor.succeededTasks - How many tasks succeeded 

Spark documentation

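All Spark internal components for executor:

https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle:

External shuffle service in Apache Spark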

 

Executor Spark Metrics

File path

Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorSparkMetrics / date=20220726 / clusterId=0726-085022-s52b56hp

File sample:

Columns

  • timestamp - time of data collection

  • app - a unique identifier for the Spark application

  • clusterId - ID of assigned Databricks cluster

  • clusterName - Name of assigned cluster

  • orgId - alias of workspaceId

  • workspaceId - ID of the Databricks workspace

  • workspaceName - name of the Databricks workspace

  • date - calendar date formatted YYYYMMDD

  • nodeIp - IP address of the executor

  • executorId - Spark executor ID

  • ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool  

  • ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation 

  • ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM 

  • ExecutorMetrics.MajorGCCount - Total major GC count 

  • ExecutorMetrics.MajorGCTime - Elapsed time for total major GC 

  • ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool 

  • ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count 

  • ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC 

  • ExecutorMetrics.OffHeapExecutionMemory - Peak off-heap execution memory in use, in bytes 

  • ExecutorMetrics.OffHeapStorageMemory - Peak off-heap storage memory in use, in bytes

  • ExecutorMetrics.OffHeapUnifiedMemory - Peak off-heap memory (execution and storage)

  • ExecutorMetrics.OnHeapExecutionMemory - Peak on heap execution memory in use, in bytes 

  • ExecutorMetrics.OnHeapStorageMemory - Peak on heap storage memory in use, in bytes 

  • ExecutorMetrics.OnHeapUnifiedMemory - Peak on heap memory (execution and storage)

  • ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory 

  • ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes 

  • ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kind of process

  • ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kind of process in bytes 

  • ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python

  • ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes 

  • ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, please see Additional Documentation below 

  • ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, please see Additional Documentation below 

  • executor.bytesRead - How many bytes read 

  • executor.bytesWritten - How many bytes written 

  • executor.cpuTime - CPU time in nanoseconds

  • executor.deserializeCPUTime - How many nanoseconds of CPU time used to deserialize objects

  • executor.deserializeTime - How many nanoseconds used to deserialize objects

  • executor.jvmGCTime - How many nanoseconds used for garbage collection 

  • executor.memoryBytesSpilled - How many bytes spilled from memory to disk 

  • executor.recordsRead - How many records read 

  • executor.recordsWritten - How many records written 

  • executor.resultSerializationTime - How many nanoseconds to serialize the result 

  • executor.resultSize - How large is the result in bytes 

  • executor.shuffleBytesWritten - How many bytes written for shuffle operations

  • executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data 

  • executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage 

  • executor.shuffleLocalBytesRead - How many bytes read from local storage 

  • executor.shuffleRecordsRead - How many records read from shuffled data 

  • executor.shuffleRecordsWritten - How many records written from shuffled data 

  • executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors 

  • executor.shuffleRemoteBytesRead - How many bytes read from remote executors 

  • executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk 

  • executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote) 

  • executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data 
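The shuffle columns can be combined into simple ratios. As a sketch, the share of shuffle data read from remote executors follows from `executor.shuffleRemoteBytesRead` and `executor.shuffleTotalBytesRead`. The function name and the sample row are hypothetical, and the values are assumed to arrive as strings from a CSV reader.

```python
def remote_shuffle_fraction(row: dict) -> float:
    """Share of shuffle bytes that were read from remote executors."""
    total = float(row["executor.shuffleTotalBytesRead"])
    if total == 0:
        return 0.0
    return float(row["executor.shuffleRemoteBytesRead"]) / total


# Hypothetical row: 1 MiB read locally, 3 MiB read remotely.
row = {
    "executor.shuffleLocalBytesRead": "1048576",
    "executor.shuffleRemoteBytesRead": "3145728",
    "executor.shuffleTotalBytesRead": "4194304",
}
print(remote_shuffle_fraction(row))
```

A high remote fraction suggests that most shuffle data crosses the network, which can be worth checking when a job is slower than expected.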

Spark documentation

All Spark internal components for executor:

https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

External shuffle:

External shuffle service in Apache Spark

 

Task Metrics

File path:

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / 511420607229897 / taskMetrics / date=20220726 / dbJobId=1174 / dbJobRunId=564070

File name:

dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv

File sample

Columns

  • timestamp - time of data collection

  • app - a unique identifier for the Spark application

  • clusterId - ID of assigned Databricks cluster

  • clusterName - name of assigned cluster

  • orgId - alias of workspaceId

  • workspaceId - ID of the Databricks workspace

  • workspaceName - name of the Databricks workspace

  • dbJobId - Id of job

  • dbJobName - Name of job

  • dbJobRunId - Run Id of job

  • dbTaskRunId - Run Id of task

  • notebookId - Id of notebook that task is run on

  • notebook_path - Path of notebook that task is run on

  • sparkJobId - Id for individual job

  • sparkJobGroupId - An id spark assigns to a group of spark jobs

  • userId - User that is running the task

  • stageId - Stage Id for stage within the given task

  • stageAttemptId - Attempt number (were retries necessary?)

  • taskType - What type of task was performed (e.g. resultTask, shuffleMapTask)

  • taskStatus - State of the task (succeeded or failed)

  • taskIndex - Task number starting from 0

  • taskId - Task identifier

  • attemptNumber - Attempt number of run

  • launchTime - Start time of task in timestamp format

  • finishTime - Finish time of task in timestamp format

  • duration - How long did the task take in nanoseconds

  • schedulerDelay - How long did the scheduler take to begin in nanoseconds

  • executorId - Identifier of the executor that kicked off the given task

  • host - Ip address for the executor that kicked off the task

  • taskLocality - Data locality level of the task (e.g. process_local)

  • speculative - Whether the task attempt was launched by speculative execution (true) or not (false)

  • gettingResultTime - How long did the task take to get the result 

  • successful - Was the task successful? 

  • executorRunTime - How long did the executor run the task in nanoseconds 

  • executorCpuTime - How long did the executor use the cpu, in nanoseconds 

  • executorDeserializeTime - How long did the executor take to deserialize data, in nanoseconds 

  • executorDeserializeCpuTime - How long did the executor process data in cpu to deserialize data, in nanoseconds 

  • resultSerializationTime - How long did it take to serialize the results, in nanoseconds 

  • jvmGCTime - How long did the java virtual machine take in garbage collection, in nanoseconds 

  • resultSize - How large was the result, in bytes 

  • numUpdatedBlockStatuses - Number of blocks whose storage status was updated as a result of the indicated task

  • diskBytesSpilled - How many bytes were written to disk when data was spilled

  • memoryBytesSpilled - How many in-memory bytes were spilled to disk

  • peakExecutionMemory - What was the most memory consumed in execution, in bytes 

  • recordsRead - How many records read 

  • bytesRead - Number of bytes read 

  • recordsWritten - Number of records written 

  • bytesWritten - bytes written by the Spark Job

  • shuffleFetchWaitTime - How long did the task wait for shuffled data in nanoseconds 

  • shuffleTotalBytesRead - How many bytes of shuffled data were read

  • shuffleTotalBlocksFetched - How many blocks of shuffled data were fetched

  • shuffleLocalBlocksFetched - How many shuffle blocks were fetched from local storage

  • shuffleRemoteBlocksFetched - How many shuffle blocks were fetched from remote executors

  • shuffleWriteTime - How long did it take to write shuffle data in nanoseconds

  • shuffleBytesWritten - Number of shuffle bytes written

  • shuffleRecordsWritten - Number of shuffle records written

  • errorMessage - Message displayed if an error was detected 

  • sparkJobStartTime - start time of the SparkJob provided through the SparkJobListener API, as measured by Spark

  • sparkJobEndTime - end time of the SparkJob provided through the SparkJobListener API, as measured by Spark

  • sparkJobDuration - duration of the SparkJob provided through the SparkJobListener API, as measured by Spark
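Task-level rows make it possible to look for uneven work distribution within a stage. The sketch below uses one common heuristic, the ratio of the slowest task to the median task per `stageId`. This is an illustrative measure only and is not necessarily how the `skew` column of Job Run Analysis is computed. The function name and the sample rows are made up.

```python
from collections import defaultdict
from statistics import median


def stage_skew(tasks: list) -> dict:
    """Ratio of the slowest task to the median task, per stageId."""
    durations = defaultdict(list)
    for task in tasks:
        durations[task["stageId"]].append(float(task["duration"]))
    return {
        stage: max(values) / median(values)
        for stage, values in durations.items()
        if median(values) > 0
    }


# Hypothetical task rows with only the columns this example needs.
tasks = [
    {"stageId": "3", "duration": "100"},
    {"stageId": "3", "duration": "100"},
    {"stageId": "3", "duration": "400"},
    {"stageId": "4", "duration": "50"},
]
print(stage_skew(tasks))
```

A ratio close to 1 means the tasks of a stage took similar time. Larger values point at a few tasks doing most of the work.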

Spark documentation

Consumption data

Consumption data, i.e. cost and usage information, is kept separate from the Databricks workspace folders.

File path

Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / consumptions / json / workspaceId=511420607229897 / date=20220721

File name: part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json

Sample file

File format:

  • the file contains one json item per line
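Since each line is a separate JSON item, the file cannot be parsed as a single JSON document and has to be read line by line. A minimal sketch, with a made-up two-record sample that uses only a small subset of the columns (`read_json_lines` is a hypothetical helper name):

```python
import io
import json


def read_json_lines(stream) -> list:
    """Parse a text stream holding one JSON object per line."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical two-record sample with a small subset of the columns.
sample = io.StringIO(
    '{"dbClusterId": "0726-085022-s52b56hp", "cost": 1.5}\n'
    '{"dbClusterId": "0726-085022-s52b56hp", "cost": 0.25}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["cost"])
```

For a file on disk, the same function can be called with an open file object instead of the in-memory sample.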

Columns

  • id - Partial row identifier, contains subscription id, billing period Id, and a usage detail guid. Can appear multiple times for the same billing period start and end date. 

  • name - a guid 

  • billingAccountId - Id of billing account 

  • billingAccountName - Name of billing account 

  • billingPeriodStartDate - Date the metered usage began (in DateTime format, and marked UTC), but only the date is populated 

  • billingPeriodEndDate - Date the metered usage ended (in DateTime format, and marked UTC), but only the date is populated 

  • billingProfileName - Subscription name 

  • accountOwnerId - Account owner Id 

  • accountName - Account name 

  • subscriptionId - The subscription id of the referenced Azure subscription 

  • subscriptionName - Name of Azure subscription 

  • workspaceName - When filled, it refers to the Databricks instance associated with the billing meter 

  • product - Product name for the consumed service or purchase (not available for Marketplace) 

  • quantity - Quantity of metered usage (units vary by resource: storage, network, etc.)

  • effectivePrice - What you pay (discounts applied) 

  • cost - Quantity * effectivePrice 

  • unitPrice - What you pay per unit of quantity 

  • billingCurrency - Currency code (USD) 

  • resourceLocation - Region where resource is deployed 

  • consumedService - Parent category of resource type 

  • resourceType - More detailed description of resource used (storageAccount, publicIPAddresses, virtualMachines, etc.)

  • resourceId - Full “path” name for resource consumed. Includes subscription, resource group, Consumed Service, and GUID of resource consumed 

  • additionalInfo - Additional details of usage 

  • costCenter - The cost center of the department if it is a department and a cost center is provided (typically null) 

  • resourceGroup - Resource group of consumed resource 

  • publisherName - “Microsoft”, but often appears as null 

  • publisherType - Most often ‘Azure’ 

  • chargeType - Most often ‘Usage’ 

  • frequency - Billing frequency, most often ‘usageBased’

  • payGPrice - Retail price for the resource 

  • pricingModel - Identifier that indicates how the meter is priced 

  • dbJobId - Associated job for the resource consumed 

  • dbJobRunName - Associated job name for the resource consumed 

  • dbJobRunId - Associated job run id for the resource consumed 

  • dbEnv - Environment of resource consumed 

  • dbClusterId - Associated cluster ID for the resource consumed 

  • dbClusterName - Associated cluster name for the resource consumed 

  • dbCreator - Individual responsible for resource consumed 

  • dbLakehouseMonitorEnabled - “Enabled” if LakehouseMonitor was enabled for the specified resource 

  • dbCreatedDate - A tags metric set by Databricks intended to indicate when the cluster was created (always null) 

  • tags - Any key-value pair tags associated with the resource consumed

  • meterDetails - Collection of additional information for the resource consumed 

  • usageInnerType - Usually ‘legacy’ 

  • workspaceResolution - Debugging field which indicates how workspace information is resolved 

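With the records loaded as dictionaries, cost can be attributed to jobs or clusters through the `db*` columns. The sketch below sums `cost` per key and checks the documented relation cost = quantity * effectivePrice. The function names, the tolerance and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def cost_by(records: list, key: str) -> dict:
    """Sum the cost column per value of `key`, skipping records without it."""
    totals = defaultdict(float)
    for record in records:
        if record.get(key) is not None:
            totals[record[key]] += float(record["cost"])
    return dict(totals)


def cost_matches(record: dict, tolerance: float = 1e-9) -> bool:
    """Check the documented relation cost = quantity * effectivePrice."""
    expected = float(record["quantity"]) * float(record["effectivePrice"])
    return abs(float(record["cost"]) - expected) <= tolerance


# Hypothetical records; the last one is not associated with any job.
records = [
    {"dbJobId": "1174", "quantity": 2.0, "effectivePrice": 0.5, "cost": 1.0},
    {"dbJobId": "1174", "quantity": 4.0, "effectivePrice": 0.25, "cost": 1.0},
    {"dbJobId": None, "quantity": 1.0, "effectivePrice": 0.5, "cost": 0.5},
]
print(cost_by(records, "dbJobId"))
print(all(cost_matches(r) for r in records))
```

Passing `"dbClusterId"` as the key gives the same breakdown per cluster.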

💠 Enhanced data

Aggregated data is saved in:

  • jobRunAnalysis - aggregated information regarding telemetry data of Databricks jobs runs

  • notebookAnalysis - aggregated information regarding telemetry data of Databricks notebooks runs

Job Run Analysis

Job Run Analysis is an aggregation of the Task Metrics, joined with the cluster OS metrics, grouped by db-job-id, db-run-id and cluster-id.

We employ

  • aggregations recommended by our data science team

  • one file → one row

    • each row is one job run, which might have multiple tasks

      • some columns are aggregated with max or avg
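To make the grouping idea concrete, here is a rough sketch of a group-by with max and avg reductions over task rows. It is not the product's actual aggregation. It assumes the grouping keys correspond to the Task Metrics columns `dbJobId`, `dbJobRunId` and `clusterId`, and the output names (`maxShuffleBytes`, `avgDuration`, `taskCount`) are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_job_runs(task_rows: list) -> dict:
    """Group task rows per job run and reduce them with max and avg."""
    groups = defaultdict(list)
    for row in task_rows:
        key = (row["dbJobId"], row["dbJobRunId"], row["clusterId"])
        groups[key].append(row)
    return {
        key: {
            "maxShuffleBytes": max(r["shuffleTotalBytesRead"] for r in rows),
            "avgDuration": mean(r["duration"] for r in rows),
            "taskCount": len(rows),
        }
        for key, rows in groups.items()
    }


# Hypothetical task rows belonging to a single job run.
task_rows = [
    {"dbJobId": 1174, "dbJobRunId": 564070, "clusterId": "c1",
     "shuffleTotalBytesRead": 100, "duration": 10},
    {"dbJobId": 1174, "dbJobRunId": 564070, "clusterId": "c1",
     "shuffleTotalBytesRead": 300, "duration": 30},
]
print(aggregate_job_runs(task_rows))
```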

Columns

  • dbTaskRunId - Run Id of task

  • clusterId - Id of assigned cluster

  • duration - Duration of job in nanoseconds

  • skew - Amount of skew in the job

  • shuffle - Max shuffle total

  • spill - Max spill in disk and memory 

  • status - Status of job (success or failed) 

  • dbDuration - JSON object containing setup, execution, cleanup and total time in nanoseconds

  • cpu - Average and max cpu usage 

  • memory - Average and max memory usage 

  • clusterName - Name of cluster 

  • dbJobName - Name of job 

  • userId - User who scheduled the job  

  • cost - Cost of job 

  • dbJobId - Job id 

  • date - Date of job run

  • dbJobRunId - Run Id of job

 

Notebook Analysis

Notebook analysis represents aggregated information regarding telemetry data of Databricks notebooks runs

Columns

  • userId - User who ran the notebook

  • duration - Time that notebook ran

  • startTime - Start time of notebook

  • endTime - End time of notebook

  • skew - Amount of skew in the data

  • shuffle - Max shuffle total

  • spill - Max spill (disk and memory)

  • cpu - Average and max of cpu usage

  • memory - Average and max of memory usage

  • clusterName - Name of cluster

  • notebookPath - Path where notebook is located

  • clusterId - Associated cluster Id

  • date - Date when notebook was run

  • notebookId - Notebook id