Vendor Consolidation page

Feature overview

Vendor Consolidation provides visibility into third-party vendor data usage and the associated costs. It helps you identify redundant or expensive vendor usage so that you can reduce expenses and improve performance by consolidating vendors or migrating workloads, for example to Databricks. Vendor data is available starting from the feature's release in July 2025.

To ensure accurate reporting, third-party vendors must be accessed through Spark. Whenever a workload uses Spark to call out to another vendor, LHO can detect that access.

The key metrics in the Vendor Consolidation context are:

  • Transferred Data: Total amount of data read from or written to external sources using the Spark DataFrame API.

  • Transfer Cost: Total cost of workloads that read from or write to external sources, including full runtime and associated shared infrastructure costs.
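The two metrics above can be illustrated with a small sketch. Assuming, hypothetically, that LHO records one entry per external read/write with the bytes moved and the run's cost (the field names below are illustrative, not LHO's actual schema), the per-vendor roll-up would look like:

```python
# Hypothetical per-access records; "vendor", "bytes", and "run_cost_usd"
# are invented field names for illustration only.
accesses = [
    {"vendor": "AWS S3",        "bytes": 40 * 2**30, "run_cost_usd": 12.50},
    {"vendor": "MS SQL Server", "bytes": 5 * 2**30,  "run_cost_usd": 3.10},
    {"vendor": "AWS S3",        "bytes": 10 * 2**30, "run_cost_usd": 4.00},
]

def rollup(accesses):
    """Aggregate Transferred Data (bytes) and Transfer Cost per vendor."""
    totals = {}
    for a in accesses:
        t = totals.setdefault(a["vendor"], {"bytes": 0, "cost": 0.0})
        t["bytes"] += a["bytes"]
        t["cost"] += a["run_cost_usd"]
    return totals

totals = rollup(accesses)
print(totals["AWS S3"]["bytes"] // 2**30)  # 50 (GiB transferred via S3)
```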

 

Vendor Consolidation (tenant level)

Vendor data usage and associated costs are displayed in a monthly aggregated view, along with cost roll-ups for the selected time period:

2025-07-14_14h59_15-20250714-215923.png

By default, the date range is the current year to date; it is configurable.

Workloads may use data from multiple vendors. The “Multiple Vendors” tooltip shows how often each vendor appears in such workloads. In the example below, 100% of the workloads that use two or more vendors use AWS S3 and MS SQL Server, and 11% use JDBC:

2025-07-14_15h30_54-20250714-223054.png
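The tooltip percentages can be reproduced with a short sketch: among workloads that reference two or more vendors, count how often each vendor appears. The workload list below is invented to match the example figures (100% AWS S3, 100% MS SQL Server, 11% JDBC):

```python
# Invented example: nine multi-vendor workloads, each listing the vendors it uses.
workloads = [["AWS S3", "MS SQL Server"] for _ in range(8)]
workloads.append(["AWS S3", "MS SQL Server", "JDBC"])

def tooltip_shares(workloads):
    """Percentage of multi-vendor workloads that use each vendor."""
    multi = [w for w in workloads if len(w) >= 2]
    counts = {}
    for w in multi:
        for vendor in set(w):
            counts[vendor] = counts.get(vendor, 0) + 1
    return {v: round(100 * n / len(multi)) for v, n in counts.items()}

shares = tooltip_shares(workloads)
print(shares["JDBC"])  # 11 (one of nine multi-vendor workloads uses JDBC)
```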

 

Migration Opportunities table

  • The Migration Opportunities table lists the most expensive jobs that use at least one vendor during the selected date range.

  • Filter to a specific workspace, job, or vendor

  • Table columns:

    • Dependent Vendor: Vendors used in job runs during the selected date range.

    • Total Cost: Cost of all job runs for the selected date range. Total Cost is estimated when actual costs are missing for at least one job run in the selected date range, or when the job has at least one task running on an All-Purpose Compute cluster.

    • Vendor Execution Time: Total time spent by tasks that use dependent vendors. Parallel execution is not taken into consideration, so the total can exceed the wall-clock duration of the run.

    • Data Transferred: Total amount of data read from or written to external sources using the Spark DataFrame API. When multiple vendors share the same task and the same Spark group, the values cannot be attributed to individual vendors and are counted for every vendor in the group. In this case, the total Data Transferred can be larger than the total Data Processed.

    • Wait Time: Time spent waiting for third-party vendors to read from or return data to Databricks. For example, the Wait Time for a job run is the total duration of the Spark jobs in a group that retrieve data from dependent vendors.

2025-07-30_20h23_24-20250731-032345.png
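Two of the column semantics above are easy to misread, so here is a sketch with invented task data: Vendor Execution Time is a plain sum of task durations even when tasks overlap, and Data Transferred is credited in full to every vendor sharing a Spark group.

```python
# Invented task records: (start_s, end_s) for tasks that used a dependent vendor.
tasks = [(0, 60), (10, 60), (30, 90)]  # three overlapping tasks

# Vendor Execution Time: durations are summed; overlap is NOT deduplicated,
# so the result (170 s) exceeds the 90 s wall-clock span of the run.
vendor_execution_time = sum(end - start for start, end in tasks)
print(vendor_execution_time)  # 170

# Data Transferred: when vendors share a task and Spark group, the bytes
# cannot be split, so each vendor in the group is credited the full amount.
group_bytes = 8 * 2**30                         # 8 GiB actually processed
vendors_in_group = ["AWS S3", "MS SQL Server"]  # invented example group
credited = {v: group_bytes for v in vendors_in_group}
total_data_transferred = sum(credited.values())
print(total_data_transferred > group_bytes)  # True: can exceed Data Processed
```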
  • Click a job name to open that job in the Workloads >> Job Runs view.

  • On the job run Tasks page you can review the dependent vendor summary for a run, as well as per task:

    2026-01-27_16h54_01-20260128-010453.png
    job run Tasks page

     

    2026-01-27_17h06_05-20260128-010759.png

     

  • Additionally, on the Tasks page you can review dependent vendor query references:

    2026-01-27_17h09_33-20260128-011014.png



    2026-01-27_17h16_34-20260128-011651.png



Supported Vendors

  • AWS S3

  • Azure Blob File System

  • Cassandra

  • IBM Db2

  • Databricks File System (DBFS)

  • Delta Lake

  • Elasticsearch

  • Google Cloud Storage (GCS)

  • Kafka

  • MS SQL Server

  • MySQL

  • Netezza

  • Oracle Database

  • PostgreSQL

  • Redis

  • Snowflake

  • Teradata

Note: LHO filters Hive and Unity Catalog references out of the display, since these are higher-level file systems on top of GCS, Azure Blob File System, or AWS S3 and would dominate the results, hiding the true vendors.
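The filtering described in the note can be sketched as a simple exclusion step. The vendor names are taken from the list above; treating Hive and Unity Catalog as an exclusion set is an assumption about how LHO implements this, made for illustration:

```python
# Catalog layers that sit on top of GCS, Azure Blob File System, or AWS S3;
# displaying them would dominate and hide the true vendors, so they are dropped.
EXCLUDED = {"Hive", "Unity Catalog"}

def displayed_vendors(detected):
    """Keep only references that are not higher-level catalog layers."""
    return [v for v in detected if v not in EXCLUDED]

detected = ["Unity Catalog", "AWS S3", "Hive", "Snowflake"]
print(displayed_vendors(detected))  # ['AWS S3', 'Snowflake']
```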