Lakehouse Monitor stores the collected metrics directly in cloud storage, where you can freely access and analyze the data.
The Lakehouse Monitor interface serves as both an admin console and a canned-report view. It surfaces about a dozen metrics, along with recommendations for action, while more than 100 telemetry data points are persisted behind the scenes.
This page is a starting point for understanding the data, how it relates, what the metrics are, and what they mean, so that you can build your own custom dashboards and reports and integrate with other systems as needed.
Applies to version 1.3.0. Future versions might collect even more telemetry data points.
Where is the data
Storage path
The cloud storage path is configured during installation. You can find it in the settings panel of the web interface.
Azure cloud storage
Navigate to the Azure Portal, open the Storage accounts panel, and open the Azure Storage account that you configured for Lakehouse Monitor to use.
Next, click the Data storage > Containers menu and open the Azure Blob container that you assigned.
Explore metrics files
Once you have identified the storage account and blob container where your data is stored, you can use Azure Storage Explorer to browse the raw metrics.
You should expect this folder structure:
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> json files
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files
|> bplm-config/ (lakehouse monitor configurations)
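As a rough illustration of the layout above, the following Python sketch builds the blob prefixes you would list in order to find metrics files. The helper names `workspace_prefix` and `consumption_prefix` are hypothetical, and the sketch assumes that `consumptions` sits directly under the subscription folder, as the Location example in the Consumption data section suggests.

```python
from datetime import date

# Workspace-level folder names, copied from the layout above.
WORKSPACE_FOLDERS = (
    "driverSparkMetrics",
    "driverOsMetrics",
    "executorSparkMetrics",
    "executorOsMetrics",
    "jobRunAnalysis",
    "taskMetrics",
)


def workspace_prefix(subscription_id: str, workspace_id: str, folder: str) -> str:
    """Return the prefix of one workspace-level metrics folder (hypothetical helper)."""
    if folder not in WORKSPACE_FOLDERS:
        raise ValueError(f"unknown metrics folder: {folder}")
    return f"{subscription_id}/{workspace_id}/{folder}/"


def consumption_prefix(subscription_id: str, workspace_id: str, day: date, fmt: str = "json") -> str:
    """Return the prefix of one day of consumption data (fmt is 'json' or 'parquet')."""
    if fmt not in ("json", "parquet"):
        raise ValueError(f"unknown consumption format: {fmt}")
    # The date partition uses the YYYYMMDD form, e.g. date=20220511.
    return f"{subscription_id}/consumptions/{fmt}/workspaceId={workspace_id}/date={day:%Y%m%d}/"


print(workspace_prefix("0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "5721xxxxxxxxxxxx", "taskMetrics"))
print(consumption_prefix("0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "2149xxxxxxxxxxx", date(2022, 5, 11)))
```

The resulting strings can be passed as a name prefix to whichever storage client or explorer you use to list blobs.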
What data is collected
Lakehouse Monitor collects raw metrics in the following structure:
driverSparkMetrics
- metrics of the Spark driver
driverOsMetrics
- operating system (OS) metrics of the VM where the Spark driver runs
executorSparkMetrics
- metrics of the Spark executor
executorOsMetrics
- operating system (OS) metrics of the VM where the Spark executor runs
taskMetrics
- metrics collected from Spark tasks
Aggregated data is saved in:
jobRunAnalysis
- aggregated telemetry data of Databricks job runs
notebookAnalysis
- aggregated telemetry data of Databricks notebook runs
Consumption data
consumptions
- consumption (cost and usage) metrics
Driver Spark Metrics
Information at the Spark driver level
https://spark.apache.org/docs/latest/monitoring.html#component-instance--driver
driverSparkMetrics
- metrics of the Spark driver
File path
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics / date=20220502 / clusterId=0502-123457-fkasde121 / clusterId-0502-123457-fkasde121-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverOsMetrics
File name structure
<azure_subscription_id>/<databricks_workspace_id>/driverSparkMetrics/date=<yearmonthday>/clusterId=<cluster_id>/clusterName-<cluster_or_jobname>-run-<run_number>-clusterId-<cluster_id>-app-<spark-application>.csv
0326cb34-f0d6-41b9-b4e3-0cc6c24dd240/5721838913406123/driverSparkMetrics/date=20220504/clusterId=0514-133143-8x6wad4r/clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r-app-20220504133348-0000.csv
5721838913406123
- workspace ID
date=20220504
- day when the metrics were generated/saved
job-26-run-88036
- ephemeral cluster name; otherwise this would be the all-purpose cluster name
clusterId-0514-133143-8x6wad4r
- cluster ID, which uniquely identifies the cluster
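The naming scheme above can be taken apart programmatically. The following is a minimal Python sketch, assuming file names follow the `clusterName-...-clusterId-...-app-....csv` order of the example path; `parse_metrics_path` and the dictionary keys it returns are hypothetical names chosen for illustration.

```python
import re

# File name layout taken from the example above:
# clusterName-<name>-clusterId-<cluster id>-app-<spark application>.csv
FILE_NAME_RE = re.compile(
    r"clusterName-(?P<cluster_name>.+)"
    r"-clusterId-(?P<cluster_id>.+)"
    r"-app-(?P<app>.+)\.csv"
)


def parse_metrics_path(path: str) -> dict:
    """Split a metrics blob path into its components (hypothetical helper)."""
    parts = path.split("/")
    info = {
        "subscription_id": parts[0],
        "workspace_id": parts[1],
        "folder": parts[2],
    }
    # Partition folders look like key=value, e.g. date=20220504.
    for part in parts[3:-1]:
        key, _, value = part.partition("=")
        info[key] = value
    match = FILE_NAME_RE.fullmatch(parts[-1])
    if match:
        info.update(match.groupdict())
    return info


example = (
    "0326cb34-f0d6-41b9-b4e3-0cc6c24dd240/5721838913406123/driverSparkMetrics/"
    "date=20220504/clusterId=0514-133143-8x6wad4r/"
    "clusterName-job-26-run-88036-clusterId-0514-133143-8x6wad4r-app-20220504133348-0000.csv"
)
print(parse_metrics_path(example))
```

File names that use a different component order are left unparsed by this sketch; only the folder-level fields are returned for them.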
Columns
timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - data collection date formatted YYYYMMDD
BlockManager.memory.diskSpaceUsed_MB - Amount of disk space used
BlockManager.memory.maxMem_MB - Max memory limit
BlockManager.memory.maxOffHeapMem_MB - Max off-heap memory limit
BlockManager.memory.maxOnHeapMem_MB - Max on-heap memory limit
BlockManager.memory.memUsed_MB - Memory used
BlockManager.memory.offHeapMemUsed_MB - Off-heap memory used
BlockManager.memory.onHeapMemUsed_MB - On-heap memory used
BlockManager.memory.remainingMem_MB - Remaining memory
BlockManager.memory.remainingOffHeapMem_MB - Off-heap memory remaining
BlockManager.memory.remainingOnHeapMem_MB - On-heap memory remaining
HiveExternalCatalog.fileCacheHits - Number of file cache hits
HiveExternalCatalog.filesDiscovered - Number of files found at the indicated location
HiveExternalCatalog.hiveClientCalls - Spark internal metric associated with the HiveExternalCatalog (for detailed documentation, see links below)
HiveExternalCatalog.parallelListingJobCount - Spark internal metric associated with the HiveExternalCatalog (for detailed documentation, see links below)
HiveExternalCatalog.partitionsFetched - Number of partitions fetched
CodeGenerator.compilationTime.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.sourceCodeSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.generatedClassSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
CodeGenerator.generatedMethodSize.p75 - Spark internal metric associated with the CodeGenerator (for detailed documentation, see links below)
DAGScheduler.job.activeJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.job.allJobs - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.messageProcessingTime.meanRate - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.failedStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.runningStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
DAGScheduler.stage.waitingStages - Spark internal metric associated with the DAGScheduler (for detailed documentation, see links below)
LiveListenerBus.numEventsPosted - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.appStatus.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.executorManagement.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.streams.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.numDroppedEvents - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.listenerProcessingTime.meanRate - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
LiveListenerBus.queue.shared.size - Spark internal metric associated with the LiveListenerBus (for detailed documentation, see links below)
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.OnHeapExecutionMemory - Peak on-heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on-heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on-heap memory (execution and storage)
ExecutorMetrics.OffHeapExecutionMemory - Peak off-heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off-heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off-heap memory (execution and storage)
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kinds of processes in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kinds of processes
JVMCPU.jvmCpuTime - Spark internal metric associated with the Java Virtual Machine CPU (for detailed documentation, see links below)
For more details regarding each column, see the Spark documentation links below.
Spark documentation
HiveExternalCatalog:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/hive/HiveExternalCatalog.html
Code generator:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-CodeGenerator.html
DAGScheduler:
https://books.japila.pl/apache-spark-internals/scheduler/DAGScheduler/
LiveListenerBus:
https://books.japila.pl/apache-spark-internals/scheduler/LiveListenerBus/
All Spark internal components for driver:
https://spark.apache.org/docs/latest/monitoring.html#component-instance--driver
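To show how the driver columns can be combined, here is a small Python sketch that derives a BlockManager memory usage ratio from `memUsed_MB` and `maxMem_MB`. It assumes that the CSV header uses the column names exactly as listed above. The sample rows, including the timestamp placeholders, are invented for illustration, and `block_manager_usage` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows; only the column names come from the list above.
SAMPLE_CSV = """timestamp,BlockManager.memory.memUsed_MB,BlockManager.memory.maxMem_MB
t0,512,4096
t1,1024,4096
"""


def block_manager_usage(rows):
    """Return memUsed_MB / maxMem_MB for each row (0.0 when maxMem_MB is 0)."""
    usage = []
    for row in rows:
        max_mem = float(row["BlockManager.memory.maxMem_MB"])
        used = float(row["BlockManager.memory.memUsed_MB"])
        usage.append(used / max_mem if max_mem else 0.0)
    return usage


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
usage = block_manager_usage(rows)
print(usage)
```

For a real file, replace the `io.StringIO` sample with a file handle opened on a downloaded driverSparkMetrics CSV.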
Driver OS Metrics
Information at the operating system level
driverOsMetrics
- operating system (OS) metrics of the VM where the Spark driver runs
File path
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics / date=20220502 / clusterId=0502-123457-fkasde121 / clusterId-0502-123457-fkasde121-app-20220502135427-0000.csv
Columns
These values are reported by the Java Virtual Machine running on the Spark driver's machine.
Driver OS columns:
timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the driver
osName - Type of operating system
osArch - OS architecture
osVersion - Version of operating system
For CPU & memory, we collect the following metrics:
availableProcessors - the number of processors available to the Java virtual machine.
systemLoadAverage - the system load average for the last minute.
cpuSystemLoad - the "recent cpu usage" for the whole system.
cpuProcessLoad - the "recent cpu usage" for the Java Virtual Machine process.
cpuProcessTime - the CPU time used by the process on which the Java virtual machine is running in nanoseconds.
committedVirtualMemorySize - amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported
totalPhysicalMemorySize - the total amount of physical memory in bytes
freePhysicalMemorySize - the amount of free physical memory in bytes
freeSwapSpaceSize - the amount of free swap space in bytes
totalSwapSpaceSize - the total amount of swap space in bytes
processAllocatedMemory - the amount of memory allocated by the app, in bytes
processTotalMemory - the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes
processMemoryLoad - the amount of memory available to the JVM that is currently being used (between 0 and 1)
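The raw byte and nanosecond values above are often easier to read once converted. The following Python sketch derives a physical-memory usage fraction and the process CPU time in seconds from one row. The sample row is invented, `os_metrics_summary` is a hypothetical helper, and the values are assumed to arrive as strings, as they would from a CSV reader.

```python
def os_metrics_summary(row: dict) -> dict:
    """Derive a few readable figures from one OS metrics row (hypothetical helper)."""
    total = int(row["totalPhysicalMemorySize"])
    free = int(row["freePhysicalMemorySize"])
    return {
        # Share of physical memory in use on the VM, between 0 and 1.
        "physicalMemoryUsedFraction": (total - free) / total if total else None,
        # cpuProcessTime is documented in nanoseconds; convert to seconds.
        "cpuProcessTimeSeconds": int(row["cpuProcessTime"]) / 1e9,
    }


# Invented sample values for illustration only.
sample_row = {
    "totalPhysicalMemorySize": "16000000000",
    "freePhysicalMemorySize": "4000000000",
    "cpuProcessTime": "2500000000",
}
summary = os_metrics_summary(sample_row)
print(summary)
```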
Executor OS Metrics
File path
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics / date=20220502 / clusterId=0726-085022-s52b56hp / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics
File sample:
Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorOsMetrics / date=20220726 / clusterId=0726-085022-s52b56hp
executorOsMetrics
- operating system (OS) metrics of the VM where the Spark executor runs
We collect CPU and memory metrics from Spark executors using the Spark plugin framework.
This mechanism lets users plug in custom code at the driver and executors. In essence, it offers a hook through which messages can be sent from the executors to the driver. This functionality is supported starting with Spark 3.0.
Executors report to the driver every 10 seconds (configurable), using a custom sink that reads CPU and memory metrics and sends them to the driver via the plugin hook.
On the driver, the metrics from all executors accumulate and are written to a CSV file every 10 seconds (configurable).
Columns
Executor OS columns:
timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the executor
executorId - Spark executor ID
osName - Type of operating system
osArch - OS architecture
osVersion - Version of operating system
For CPU & memory, we collect the following metrics:
availableProcessors -> the number of processors available to the Java virtual machine.
systemLoadAverage -> the system load average for the last minute.
cpuSystemLoad -> the "recent cpu usage" for the whole system.
cpuProcessLoad -> the "recent cpu usage" for the Java Virtual Machine process.
cpuProcessTime -> the CPU time used by the process on which the Java virtual machine is running in nanoseconds.
committedVirtualMemorySize -> amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported
totalPhysicalMemorySize -> the total amount of physical memory in bytes
freePhysicalMemorySize -> the amount of free physical memory in bytes
freeSwapSpaceSize -> the amount of free swap space in bytes
totalSwapSpaceSize -> the total amount of swap space in bytes
processAllocatedMemory -> the amount of memory allocated by the app, in bytes
processTotalMemory -> the maximum amount this JVM will ever get from the operating system (as set by the -Xmx parameter), in bytes
processMemoryLoad -> the amount of memory available to the JVM that is currently being used (between 0 and 1)
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.OffHeapExecutionMemory - Peak off-heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off-heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off-heap memory (execution and storage)
ExecutorMetrics.OnHeapExecutionMemory - Peak on-heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on-heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on-heap memory (execution and storage)
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kinds of processes
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kinds of processes in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, see the External shuffle link under Spark documentation below
ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, see the External shuffle link under Spark documentation below
executor.bytesRead - How many bytes read
executor.bytesWritten - How many bytes written
executor.cpuTime - CPU time in nanoseconds
executor.deserializeCPUTime - How many nanoseconds of CPU time used to deserialize objects
executor.deserializeTime - How many nanoseconds used to deserialize objects
executor.jvmGCTime - How many nanoseconds used for garbage collection
executor.memoryBytesSpilled - How many bytes spilled from memory to disk
executor.recordsRead - How many records read
executor.recordsWritten - How many records written
executor.resultSerializationTime - How many nanoseconds to serialize the result
executor.resultSize - How large is the result in bytes
executor.shuffleBytesWritten - How many bytes written by shuffle operations
executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data
executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage
executor.shuffleLocalBytesRead - How many bytes read from local storage
executor.shuffleRecordsRead - How many records read from shuffled data
executor.shuffleRecordsWritten - How many records written to shuffled data
executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors
executor.shuffleRemoteBytesRead - How many bytes read from remote executors
executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk
executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote)
executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data
executor.succeededTasks - How many tasks succeeded
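Because every row carries an `executorId`, the samples can be grouped per executor. The Python sketch below computes the average `cpuProcessLoad` and the peak `processMemoryLoad` for each executor. It assumes one row per executor per reporting interval, which this page does not state explicitly. The sample rows are invented and `per_executor_load` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def per_executor_load(rows):
    """Group OS metrics rows by executorId and summarize CPU and memory load."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["executorId"]].append(row)
    summary = {}
    for executor_id, samples in grouped.items():
        cpu = [float(s["cpuProcessLoad"]) for s in samples]
        mem = [float(s["processMemoryLoad"]) for s in samples]
        summary[executor_id] = {
            "samples": len(samples),
            "avgCpuProcessLoad": mean(cpu),
            "maxProcessMemoryLoad": max(mem),
        }
    return summary


# Invented sample rows, shaped like the output of csv.DictReader.
sample_rows = [
    {"executorId": "0", "cpuProcessLoad": "0.25", "processMemoryLoad": "0.5"},
    {"executorId": "0", "cpuProcessLoad": "0.75", "processMemoryLoad": "0.625"},
    {"executorId": "1", "cpuProcessLoad": "0.5", "processMemoryLoad": "0.25"},
]
load = per_executor_load(sample_rows)
print(load)
```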
Spark documentation
All Spark internal components for executor:
https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor
External shuffle:
https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read
Executor Spark Metrics
Metrics collected at the Spark executor level, as provided by the Spark API.
File path
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics / date=20220502 / clusterId=0726-085022-s52b56hp / clusterId-0726-085022-s52b56hp-clusterName-job-1174-run-564352-app-20220726085314-0000.csv
        |> executorOsMetrics
        |> driverSparkMetrics
        |> driverOsMetrics
File sample:
Location: lakehouse-monitor / a63c1e51-40ae-1234-1234-bf80e132c05c / 5721xxxxxxxxxxxx / executorSparkMetrics / date=20220726 / clusterId=0726-085022-s52b56hp
Columns
timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - Name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
date - calendar date formatted YYYYMMDD
nodeIp - IP address of the executor
executorId - ID of the executor
ExecutorMetrics.DirectPoolMemory - Peak memory that the JVM is using for direct buffer pool
ExecutorMetrics.JVMHeapMemory - Peak memory usage of the heap that is used for object allocation
ExecutorMetrics.JVMOffHeapMemory - Peak memory usage of non-heap memory that is used by the JVM
ExecutorMetrics.MajorGCCount - Total major GC count
ExecutorMetrics.MajorGCTime - Elapsed time for total major GC
ExecutorMetrics.MappedPoolMemory - Peak memory that the JVM is using for mapped buffer pool
ExecutorMetrics.MinorGCCount - Total minor GC (garbage collector) count
ExecutorMetrics.MinorGCTime - Elapsed time for total minor GC
ExecutorMetrics.OffHeapExecutionMemory - Peak off-heap execution memory in use, in bytes
ExecutorMetrics.OffHeapStorageMemory - Peak off-heap storage memory in use, in bytes
ExecutorMetrics.OffHeapUnifiedMemory - Peak off-heap memory (execution and storage)
ExecutorMetrics.OnHeapExecutionMemory - Peak on-heap execution memory in use, in bytes
ExecutorMetrics.OnHeapStorageMemory - Peak on-heap storage memory in use, in bytes
ExecutorMetrics.OnHeapUnifiedMemory - Peak on-heap memory (execution and storage)
ExecutorMetrics.ProcessTreeJVMRSSMemory - Resident Set Size: number of pages the process has in real memory
ExecutorMetrics.ProcessTreeJVMVMemory - Virtual memory size in bytes
ExecutorMetrics.ProcessTreeOtherRSSMemory - Resident Set Size for other kinds of processes
ExecutorMetrics.ProcessTreeOtherVMemory - Virtual memory size for other kinds of processes in bytes
ExecutorMetrics.ProcessTreePythonRSSMemory - Resident Set Size for Python
ExecutorMetrics.ProcessTreePythonVMemory - Virtual memory size for Python in bytes
ExternalShuffle.shuffle-client.usedDirectMemory - For information on this column, see the External shuffle link under Spark documentation below
ExternalShuffle.shuffle-client.usedHeapMemory - For information on this column, see the External shuffle link under Spark documentation below
executor.bytesRead - How many bytes read
executor.bytesWritten - How many bytes written
executor.cpuTime - CPU time in nanoseconds
executor.deserializeCPUTime - How many nanoseconds of CPU time used to deserialize objects
executor.deserializeTime - How many nanoseconds used to deserialize objects
executor.jvmGCTime - How many nanoseconds used for garbage collection
executor.memoryBytesSpilled - How many bytes spilled from memory to disk
executor.recordsRead - How many records read
executor.recordsWritten - How many records written
executor.resultSerializationTime - How many nanoseconds to serialize the result
executor.resultSize - How large is the result in bytes
executor.shuffleBytesWritten - How many bytes written by shuffle operations
executor.shuffleFetchWaitTime - How many nanoseconds did the executor wait to receive shuffled data
executor.shuffleLocalBlocksFetched - How many blocks fetched from local storage
executor.shuffleLocalBytesRead - How many bytes read from local storage
executor.shuffleRecordsRead - How many records read from shuffled data
executor.shuffleRecordsWritten - How many records written to shuffled data
executor.shuffleRemoteBlocksFetched - How many blocks fetched from other executors
executor.shuffleRemoteBytesRead - How many bytes read from remote executors
executor.shuffleRemoteBytesReadToDisk - How many bytes from other executors written to local disk
executor.shuffleTotalBytesRead - Total bytes read for shuffle operations (local + remote)
executor.shuffleWriteTime - How many nanoseconds spent writing shuffle data
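Since `executor.shuffleTotalBytesRead` is described as local plus remote, the share of shuffle data pulled from other executors can be derived from two columns. The following Python sketch shows the arithmetic on an invented row. `remote_shuffle_fraction` is a hypothetical helper and the sample values are for illustration only.

```python
def remote_shuffle_fraction(row: dict) -> float:
    """Fraction of shuffle bytes read from other executors (0.0 when nothing was read)."""
    total = int(row["executor.shuffleTotalBytesRead"])
    remote = int(row["executor.shuffleRemoteBytesRead"])
    return remote / total if total else 0.0


# Invented sample row for illustration only.
sample_row = {
    "executor.shuffleLocalBytesRead": "100",
    "executor.shuffleRemoteBytesRead": "300",
    "executor.shuffleTotalBytesRead": "400",
}
fraction = remote_shuffle_fraction(sample_row)
print(fraction)
```

A high fraction suggests that most shuffle data crosses the network, which can be worth correlating with `executor.shuffleFetchWaitTime`.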
Spark documentation
All Spark internal components for executor:
https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor
External shuffle:
https://www.waitingforcode.com/apache-spark/external-shuffle-service-apache-spark/read
Task Metrics
File path:
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> executorSparkMetrics
        |> executorOsMetrics
        |> taskMetrics / date=20220502 / dbJobId=1174 / dbJobRunId=564070 / dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv
        |> driverSparkMetrics
        |> driverOsMetrics
Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / 511420607229897 / taskMetrics / date=20220726 / dbJobId=1174 / dbJobRunId=564070
File name:
dbJobId-1174-dbJobRunId-564070-app-20220726085314-0000.csv
Columns
timestamp - time of data collection
app - a unique identifier for the Spark application
clusterId - ID of assigned Databricks cluster
clusterName - name of assigned cluster
orgId - alias of workspaceId
workspaceId - ID of the Databricks workspace
workspaceName - name of the Databricks workspace
dbJobId - Id of job
dbJobName - Name of job
dbJobRunId - Run Id of job
dbTaskRunId - Run Id of task
notebookId - Id of notebook that task is run on
notebook_path - Path of notebook that task is run on
sparkJobId - Id for individual job
sparkJobGroupId - An id spark assigns to a group of spark jobs
userId - User that is running the task
stageId - Stage Id for stage within the given task
stageAttemptId - Attempt number (were retries necessary?)
taskType - What type of task was performed (e.g. resultTask, shuffleMapTask)
taskStatus - State of the task (succeeded or failed)
taskIndex - Task number starting from 0
taskId - Task identifier
attemptNumber - Attempt number of run
launchTime - Start time of task in timestamp format
finishTime - Finish time of task in timestamp format
duration - How long did the task take in nanoseconds
schedulerDelay - How long did the scheduler take to begin in nanoseconds
executorId - Identifier of the executor that kicked off the given task
host - IP address of the executor that kicked off the task
taskLocality - Locality level of the task (e.g. process_local)
speculative - Speculative execution of tasks was enabled (true) or disabled (false)
gettingResultTime - How long did the task take to get the result
successful - Was the task successful?
executorRunTime - How long did the executor run the task in nanoseconds
executorCpuTime - How long did the executor use the CPU, in nanoseconds
executorDeserializeTime - How long did the executor take to deserialize data, in nanoseconds
executorDeserializeCpuTime - How much CPU time did the executor use to deserialize data, in nanoseconds
resultSerializationTime - How long did it take to serialize the results, in nanoseconds
jvmGCTime - How long did the Java virtual machine spend in garbage collection, in nanoseconds
resultSize - How large was the result, in bytes
numUpdatedBlockStatuses - Number of blocks whose storage status was updated as a result of the indicated task
diskBytesSpilled - Size on disk, in bytes, of the data spilled by the task
memoryBytesSpilled - Size in memory, in bytes, of the data spilled by the task
peakExecutionMemory - What was the most memory consumed in execution, in bytes
recordsRead - How many records read
bytesRead - Number of bytes read
recordsWritten - Number of records written
bytesWritten - Number of bytes written by the Spark job
shuffleFetchWaitTime - How long did the task wait for shuffled data in nanoseconds
shuffleTotalBytesRead - How many bytes of shuffled data were read
shuffleTotalBlocksFetched - How many blocks were fetched for shuffled data
shuffleLocalBlocksFetched - How many shuffle blocks were fetched from local storage
shuffleRemoteBlocksFetched - How many shuffle blocks were fetched from remote executors
shuffleWriteTime - How long did it take to write shuffle data in nanoseconds
shuffleBytesWritten - Number of bytes written by the shuffle
shuffleRecordsWritten - Number of records written by the shuffle
errorMessage - Message displayed if an error was detected
sparkJobStartTime - start time of the SparkJob provided through the SparkJobListener API, as measured by Spark
sparkJobEndTime - end time of the SparkJob provided through the SparkJobListener API, as measured by Spark
sparkJobDuration - duration of the SparkJob provided through the SparkJobListener API, as measured by Spark
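Task-level rows make it possible to look for uneven work distribution within a stage. The Python sketch below computes, per `stageId`, the ratio between the slowest task and the median task duration. This ratio is only an illustrative measure chosen for the example and is not necessarily how Lakehouse Monitor derives its own skew figure. The sample rows are invented and `stage_duration_skew` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def stage_duration_skew(rows):
    """Return max/median task duration per stage (stages with zero median are skipped)."""
    durations = defaultdict(list)
    for row in rows:
        durations[row["stageId"]].append(int(row["duration"]))
    skew = {}
    for stage_id, values in durations.items():
        mid = median(values)
        if mid > 0:
            skew[stage_id] = max(values) / mid
    return skew


# Invented sample rows, shaped like the output of csv.DictReader.
sample_rows = [
    {"stageId": "0", "duration": "100"},
    {"stageId": "0", "duration": "100"},
    {"stageId": "0", "duration": "400"},
    {"stageId": "1", "duration": "200"},
    {"stageId": "1", "duration": "200"},
]
skew = stage_duration_skew(sample_rows)
print(skew)
```

Because the result is a ratio, it does not depend on the unit in which `duration` is recorded.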
Consumption data
Consumption data, i.e. cost and usage information, is kept separate from the Databricks workspace folders.
File path
storage bucket
|> 0306cb34-xxxx-xxxx-xxxx-xxxxxxxxxxxx/ (azure subscription id)
    |> 5721xxxxxxxxxxxx/ (databricks workspace id)
        |> driverSparkMetrics
        |> driverOsMetrics
        |> executorSparkMetrics
        |> executorOsMetrics
        |> jobRunAnalysis
        |> taskMetrics
    |> consumptions
        |> json
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json
        |> parquet
            |> workspaceId=2149xxxxxxxxxxx
                |> date=20220511 (YYYYMMDD)
                    |> parquet files
Location: lakehouse-monitor / a63c1e51-40ae-4a34-b230-bf80e132c05c / consumptions / json / workspaceId=511420607229897 / date=20220721
File name: part-00000-9c30baa3-416b-4622-8e7d-9b512aa2e70a.c000.json
File format:
The file contains one JSON object per line.
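A file in this one-object-per-line layout can be read line by line with the standard library. The following Python sketch parses such a stream. The two sample records are invented and use only a few of the columns described in this section, and `read_json_lines` is a hypothetical helper.

```python
import io
import json


def read_json_lines(stream):
    """Yield one dict per non-empty line of a one-JSON-object-per-line stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample content; a real file has many more fields per record.
SAMPLE = (
    '{"subscriptionName": "example", "quantity": 2.0, "effectivePrice": 0.5, "cost": 1.0}\n'
    '{"subscriptionName": "example", "quantity": 4.0, "effectivePrice": 0.25, "cost": 1.0}\n'
)
records = list(read_json_lines(io.StringIO(SAMPLE)))
print(len(records))
```

For a downloaded file, pass a handle from `open(path, encoding="utf-8")` instead of the in-memory sample.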
Columns
id - Partial row identifier; contains the subscription ID, billing period ID, and a usage detail GUID. Can appear multiple times for the same billing period start and end date.
name - A GUID
billingAccountId - ID of billing account
billingAccountName - Name of billing account
billingPeriodStartDate - Date the metered usage began (in DateTime format, and marked UTC), but only the date is populated
billingPeriodEndDate - Date the metered usage ended (in DateTime format, and marked UTC), but only the date is populated
billingProfileName - Subscription name
accountOwnerId - Account owner ID
accountName - Account name
subscriptionId - The subscription ID of the referenced Azure subscription
subscriptionName - Name of Azure subscription
workspaceName - When filled, it refers to the Databricks instance associated with the billing meter
product - Product name for the consumed service or purchase (not available for Marketplace)
quantity - Quantity of metered usage (units vary with the resource used: storage, network, etc.)
effectivePrice - What you pay (discounts applied)
cost - Quantity * effectivePrice
unitPrice - What you pay per unit of quantity
billingCurrency - Currency code (e.g. USD)
resourceLocation - Region where resource is deployed
consumedService - Parent category of resource type
resourceType - More detailed description of resource used (storageAccount, publicIPAddresses, virtualMachines, etc.)
resourceId - Full "path" name for the resource consumed. Includes subscription, resource group, consumed service, and GUID of the resource consumed
additionalInfo - Additional details of usage
costCenter - The cost center of the department if it is a department and a cost center is provided (typically null)
resourceGroup - Resource group of consumed resource
publisherName - "Microsoft", but often appears as null
publisherType - Most often "Azure"
chargeType - Most often "Usage"
frequency - Frequency, most often "usageBased"
payGPrice - Retail price for the resource
pricingModel - Identifier that indicates how the meter is priced
dbJobId - Associated job for the resource consumed
dbJobRunName - Associated job name for the resource consumed
dbJobRunId - Associated job run ID for the resource consumed
dbEnv - Environment of resource consumed
dbClusterId - Associated cluster ID for the resource consumed
dbClusterName - Associated cluster name for the resource consumed
dbCreator - Individual responsible for resource consumed
dbLakehouseMonitorEnabled - "Enabled" if Lakehouse Monitor was enabled for the specified resource
dbCreatedDate - A tag set by Databricks intended to indicate when the cluster was created (always null)
tags - Any key-value pair (KVP) tags associated with the resource consumed
meterDetails - Collection of additional information for the resource consumed
usageInnerType - Usually "legacy"
workspaceResolution - Debugging field which indicates how workspace information is resolved
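As one way to use these columns, the Python sketch below totals `cost` per `dbClusterId` and checks the documented relationship cost = quantity * effectivePrice on each record. The sample records are invented, the "unattributed" bucket is an assumption for records without a cluster ID, and both helper names are hypothetical.

```python
import math
from collections import defaultdict


def cost_by_cluster(records):
    """Sum the cost column per dbClusterId; records without one go to 'unattributed'."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.get("dbClusterId") or "unattributed"] += float(rec["cost"])
    return dict(totals)


def cost_matches(rec, rel_tol=1e-6):
    """Check that cost is approximately quantity * effectivePrice for one record."""
    expected = float(rec["quantity"]) * float(rec["effectivePrice"])
    return math.isclose(float(rec["cost"]), expected, rel_tol=rel_tol)


# Invented sample records for illustration only.
sample_records = [
    {"dbClusterId": "cluster-a", "quantity": 3.0, "effectivePrice": 0.5, "cost": 1.5},
    {"dbClusterId": "cluster-a", "quantity": 1.0, "effectivePrice": 0.5, "cost": 0.5},
    {"dbClusterId": None, "quantity": 4.0, "effectivePrice": 0.5, "cost": 2.0},
]
totals = cost_by_cluster(sample_records)
print(totals)
print(all(cost_matches(r) for r in sample_records))
```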
Enhanced data
Aggregated data is saved in:
jobRunAnalysis
- aggregated telemetry data of Databricks job runs
notebookAnalysis
- aggregated telemetry data of Databricks notebook runs
Job Run Analysis
Job Run Analysis is an aggregation of the Task Metrics, joined with the cluster OS metrics, by dbJobId, dbJobRunId and clusterId.
We employ:
- aggregations recommended by the data science team
- one file, one row: each row describes one job run, which might have multiple tasks
- max and avg aggregations for some columns
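To give a feel for this kind of roll-up, the Python sketch below groups task rows by `dbJobId`, `dbJobRunId` and `clusterId` and applies max and avg to a few Task Metrics columns. The choice of columns and output names is an assumption made for illustration. The aggregations actually used by Lakehouse Monitor are not documented here, and `aggregate_job_runs` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_job_runs(task_rows):
    """Collapse task rows into one summary per (dbJobId, dbJobRunId, clusterId)."""
    groups = defaultdict(list)
    for row in task_rows:
        key = (row["dbJobId"], row["dbJobRunId"], row["clusterId"])
        groups[key].append(row)
    result = {}
    for key, rows in groups.items():
        shuffle = [int(r["shuffleTotalBytesRead"]) for r in rows]
        spill = [int(r["diskBytesSpilled"]) + int(r["memoryBytesSpilled"]) for r in rows]
        result[key] = {
            "taskCount": len(rows),
            "maxShuffle": max(shuffle),
            "avgShuffle": mean(shuffle),
            "maxSpill": max(spill),
        }
    return result


# Invented sample rows, shaped like the output of csv.DictReader.
sample_tasks = [
    {"dbJobId": "1174", "dbJobRunId": "564070", "clusterId": "c1",
     "shuffleTotalBytesRead": "100", "diskBytesSpilled": "0", "memoryBytesSpilled": "10"},
    {"dbJobId": "1174", "dbJobRunId": "564070", "clusterId": "c1",
     "shuffleTotalBytesRead": "300", "diskBytesSpilled": "20", "memoryBytesSpilled": "30"},
]
runs = aggregate_job_runs(sample_tasks)
print(runs)
```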
Columns
dbTaskRunId - Run Id of task
clusterId - Id of assigned cluster
duration - Duration of job in nanoseconds
skew - Amount of skew in the job
shuffle - Max shuffle total
spill - Max spill in disk and memory
status - Status of job (success or failed)
dbDuration - JSON object containing setup, execution, cleanup and total time in nanoseconds
cpu - Average and max CPU usage
memory - Average and max memory usage
clusterName - Name of cluster
dbJobName - Name of job
userId - User who scheduled the job
cost - Cost of job
dbJobId - Job ID
date - Date of job run
dbJobRunId - Run Id of job
Notebook Analysis
Notebook analysis represents aggregated information regarding telemetry data of Databricks notebooks runs
Columns
userId - User who ran the notebook
duration - Time that notebook ran
startTime - Start time of notebook
endTime - End time of notebook
skew - Amount of skew in the data
shuffle - Max shuffle total
spill - Max spill (disk and memory)
cpu - Average and max of cpu usage
memory - Average and max of memory usage
clusterName - Name of cluster
notebookPath - Path where notebook is located
clusterId - Associated cluster Id
date - Date when notebook was run
notebookId - Notebook id