Vendor Consolidation page
Feature overview
Vendor Consolidation provides visibility into third-party vendor data usage and associated costs. It helps you identify redundant or expensive vendor usage so you can reduce expenses and improve performance by consolidating vendors or migrating workloads, for example, to Databricks. Vendor data is available starting from the feature's release in July 2025.
To ensure accurate reporting, workloads must access third-party vendors through Spark. Whenever a workload uses Spark to call out to another vendor, LHO can detect it.
The key metrics in the Vendor Consolidation context are:
Transferred Data: Total amount of data read from or written to external sources using the Spark DataFrame API.
Transfer Cost: Total cost of workloads that read from or write to external sources, including full runtime and associated shared infrastructure costs.
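As a rough illustration (not LHO's actual implementation), the two metrics roll up per vendor along these lines. The record shape below is hypothetical; the point is that Transferred Data sums DataFrame API bytes while Transfer Cost sums the full workload cost, including shared infrastructure:

```python
from collections import defaultdict

# Hypothetical per-workload records: vendor, bytes moved through the
# Spark DataFrame API, and full workload cost (runtime + shared infra).
workloads = [
    {"vendor": "AWS S3",        "bytes": 120 * 2**30, "cost": 41.50},
    {"vendor": "MS SQL Server", "bytes": 15 * 2**30,  "cost": 12.75},
    {"vendor": "AWS S3",        "bytes": 80 * 2**30,  "cost": 30.00},
]

transferred = defaultdict(int)      # Transferred Data per vendor (bytes)
transfer_cost = defaultdict(float)  # Transfer Cost per vendor

for w in workloads:
    transferred[w["vendor"]] += w["bytes"]
    transfer_cost[w["vendor"]] += w["cost"]

for vendor in sorted(transferred):
    print(vendor, transferred[vendor] // 2**30, "GiB",
          round(transfer_cost[vendor], 2))
```

Note that Transfer Cost attributes the whole workload's cost to its vendor usage, so a cheap transfer inside an expensive job still shows a high Transfer Cost.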
Vendor Consolidation (tenant level)
Vendor data usage and associated costs are displayed in a monthly aggregated view, along with cost roll-ups for the selected time period:
The date range defaults to the current year to date and is configurable.
A workload may use data from multiple vendors. The “Multiple Vendors” tooltip shows how often each vendor appears in such workloads. In the example below, 100% of workloads with 2 or more vendors use AWS S3 and MS SQL Server, and 11% of workloads with 2 or more vendors use JDBC:
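The tooltip percentages can be derived as the share of multi-vendor workloads in which each vendor appears. A minimal sketch with hypothetical vendor sets per workload (nine multi-vendor workloads, matching the 100% / 11% figures above):

```python
# Hypothetical vendor sets per workload. The tooltip only considers
# workloads that use two or more vendors.
workloads = (
    [{"AWS S3", "MS SQL Server"}] * 8
    + [{"AWS S3", "MS SQL Server", "JDBC"}]
    + [{"Kafka"}]  # single-vendor workload: excluded from the tooltip
)

multi = [w for w in workloads if len(w) >= 2]

counts = {}
for vendors in multi:
    for v in vendors:
        counts[v] = counts.get(v, 0) + 1

# Percentage of multi-vendor workloads that use each vendor.
pct = {v: round(100 * n / len(multi)) for v, n in counts.items()}
print(pct)  # AWS S3 and MS SQL Server at 100%, JDBC at 11%
```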
Migration Opportunities table
The Migration Opportunities table lists the most expensive jobs that use at least one vendor within the selected date range.
You can filter the table to a specific workspace, job, or vendor.
Table columns:
Dependent Vendor: Vendors used in job runs during the selected date range.
Total Cost: Cost of all job runs for the selected date range. Total Cost is estimated when actual costs are missing for at least one job run in the selected date range, or when the job has at least one task running on an All-Purpose Compute cluster.
Vendor Execution Time: Total time spent by tasks using dependent vendors. Parallel execution is not taken into account, so overlapping task times are summed.
Data Transferred: Total amount of data read from or written to external sources using the Spark DataFrame API. When multiple vendors share the same task and the same Spark job group, the transferred bytes cannot be attributed to individual vendors and are counted for every vendor in that group. In this case, the total Data Transferred can be larger than the total Data Processed.
Wait Time: Time spent waiting for third-party vendors to read data from, or return data to, Databricks. For example, the Wait Time for a job run is the total duration of the Spark jobs in a group that retrieve data from dependent vendors.
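The Total Cost estimation rule above can be sketched as a simple predicate. This is a hedged illustration with a hypothetical record shape, not LHO's actual logic:

```python
def total_cost_is_estimated(runs):
    """Total Cost is flagged as estimated when any run in the selected
    date range lacks an actual cost, or any of its tasks ran on an
    All-Purpose Compute cluster (hypothetical record shape)."""
    for run in runs:
        if run.get("actual_cost") is None:
            return True  # no actual cost for this run
        if any(t["cluster_type"] == "ALL_PURPOSE" for t in run["tasks"]):
            return True  # task ran on All-Purpose Compute
    return False

runs = [
    {"actual_cost": 4.20, "tasks": [{"cluster_type": "JOB_COMPUTE"}]},
    {"actual_cost": None, "tasks": [{"cluster_type": "JOB_COMPUTE"}]},
]
print(total_cost_is_estimated(runs))  # True: second run has no actual cost
```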
Click a job name to open that job in the Workloads >> Job Runs view.
On the job run Tasks page you can review the dependent vendor summary for a run, as well as per task. Additionally, on the Tasks page you can review dependent vendor query references.
Supported Vendors
AWS S3
Azure Blob File System
Cassandra
IBM Db2
Databricks File System (DBFS)
Delta Lake
Elasticsearch
Google Cloud Storage (GCS)
Kafka
MS SQL Server
MySQL
Netezza
Oracle Database
PostgreSQL
Redis
Snowflake
Teradata
Note: LHO filters out Hive and Unity Catalog references from display, since these are higher-level layers on top of GCS, Azure Blob File System, or AWS S3 and would dominate and hide the true vendors.