  • Pool Contention and Idle VMs Reporting guide

    https://youtu.be/sCvjYha54ik?si=tKwUyFIHUEs3ngtZ

    Case Study: How much am I spending on Idle VMs?

    One of the most useful features of the Pools reporting page in Lakehouse Optimizer (LHO) is that it lets us easily investigate how much our company is spending on VMs that sit idle in our environment because of configuration and orchestration choices. The data is available in Databricks, but a user would have to locate it and do the calculations manually to get the answer. LHO shows quickly how much of the monthly budget is being wasted on idle VMs, using clear and comprehensive graph visualizations.

    We have seen customer environments where monthly idle VM costs reach 40-50% of the entire workspace cost. For many of LHO’s customers, idle VMs are by far the most significant source of waste.

    The subscription overview page is a good place to start any investigation into pool contention because it shows whether there were any idle VMs in a given time period. In the example below, we can see not only the total cost of idle VMs in a given month, but also what percentage of the total subscription cost they made up and how much that changed from the previous month.
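    The overview figures come down to simple arithmetic. The sketch below is only an illustration: the helper name `idle_summary` and all dollar amounts are made up, and none of them come from LHO or from a real workspace.

```python
def idle_summary(idle_cost, total_cost, prev_idle_cost):
    """Return (idle share of total cost in %, month-over-month change in %)."""
    share_pct = 100.0 * idle_cost / total_cost
    change_pct = 100.0 * (idle_cost - prev_idle_cost) / prev_idle_cost
    return share_pct, change_pct

# Hypothetical monthly figures, for illustration only.
share, change = idle_summary(idle_cost=4500.0, total_cost=10000.0, prev_idle_cost=3600.0)
print(f"Idle VMs: {share:.1f}% of subscription cost, {change:+.1f}% vs. previous month")
```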

    Pools Reporting Page

    On the Workloads page, the Pools tab is at the far right. It opens on the Activity sub-tab by default.

    Activity - Clusters Activity Tab

    On the left, there will be a listing of all pools found in the selected Databricks Workspace. If none are listed, that workspace does not have any pools defined.

    Case Study Continued: Breaking down wasted cost in June 2023

    In this view, the user can take what they learned in the overview (cost was wasted on idle VMs in June) and select June as the date range. The graph then displays VM usage by jobs and clusters. The higher the spike, the more VMs are in use at once.

    A good place to start would be to get a closer look at a spike. The user can zoom into any part of the graph by clicking and dragging their selection horizontally:

    From there, the user can clearly see which jobs are running when and how many VMs are in use. The legend above the graph shows at a glance which color designates which job. For more specific information on a job, the user need only hover over its color on the graph itself to see all the details:

    Actual idle VM time is marked with a red-and-black zebra pattern. Hovering over it shows the metrics for that idle period. In the example below, 9 VMs were idle for almost 7 minutes.

    This view also shows the user all of the pool's configurable settings from Databricks:

    Example Investigation

    The pool settings explain the idle time seen in the graph examples: up to 11 VMs can run at once, and after finishing a job each VM waits up to 10 minutes for another job before shutting down automatically. Three jobs requiring 3 VMs apiece run at the same time, spinning up 9 VMs, and when they finish no new jobs run until 7 minutes later.
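    As a rough sketch, the idle time in this example can be reproduced from the two pool settings. The keys below use the field names from the Databricks Instance Pools API; that they correspond exactly to the settings shown in the LHO view is an assumption, and the calculation simplifies by counting idle time only until the next job starts.

```python
# Pool settings from the example, keyed by Databricks Instance Pools API field names
# (assumed to match the settings LHO displays).
pool = {
    "max_capacity": 11,
    "idle_instance_autotermination_minutes": 10,
}

jobs = 3
vms_per_job = 3
gap_minutes = 7  # time between the jobs finishing and the next job starting

# The pool can never hand out more VMs than its max capacity.
vms_started = min(jobs * vms_per_job, pool["max_capacity"])

# A VM idles until the next job arrives or the timeout expires, whichever comes first.
idle_minutes_per_vm = min(gap_minutes, pool["idle_instance_autotermination_minutes"])
idle_vm_minutes = vms_started * idle_minutes_per_vm

print(f"{vms_started} VMs idle for {idle_minutes_per_vm} min = {idle_vm_minutes} VM-minutes")
```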

    This could be solved in many ways:

    • If those jobs are not time-critical, the max capacity could be set to 3. Each job would then queue up and run one after another, and 3 VMs would be idle only once per day, after the last queued job finishes.

    • The AutoShutdown Timeout could be reduced to zero. This would mean the later jobs would have to wait for the VMs to boot up before they could begin, but at least there wouldn’t be VMs waiting to do something and costing money for 7 minutes.

    • The jobs could be scheduled closer together so that the VMs are not idle all that time and instead move from one job to the next until complete.
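    The three options can be compared with a back-of-the-envelope calculation. This is a simplified sketch, not how LHO computes idle time: `idle_vm_minutes` is a hypothetical helper, and it assumes the 7-minute gap before the next job stays the same unless the schedule itself is changed.

```python
def idle_vm_minutes(idle_vms, gap_minutes, timeout_minutes):
    """Idle VM-minutes accrued while waiting for the next job or the timeout."""
    return idle_vms * min(gap_minutes, timeout_minutes)

GAP = 7  # minutes until the next job, taken from the example

scenarios = {
    "current (9 VMs, 10 min timeout)": idle_vm_minutes(9, GAP, 10),
    "max capacity 3, jobs queued": idle_vm_minutes(3, GAP, 10),
    "timeout reduced to 0": idle_vm_minutes(9, GAP, 0),
    "jobs scheduled back to back": idle_vm_minutes(9, 0, 10),
}

for name, minutes in scenarios.items():
    print(f"{name}: {minutes} idle VM-minutes")
```

    The numbers ignore the trade-offs mentioned above, such as longer total run time when jobs are queued and VM boot time when the timeout is zero.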

    Activity - Clusters Timeline Tab

    To help make decisions about the orchestration of jobs, there is also the Clusters Timeline view found on the next tab. This view allows the user to see concurrent timelines of when each individual job runs to get a better idea of where the jobs overlap.
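    The same overlap question can be answered outside the UI if job run times are available. The sketch below is a generic sweep over start and end events, with made-up job names and times in minutes. It is not LHO's implementation.

```python
def peak_concurrency(runs):
    """runs: list of (job_name, start_minute, end_minute). Returns (peak, minute)."""
    events = []
    for _, start, end in runs:
        events.append((start, 1))   # a run begins
        events.append((end, -1))    # a run ends
    # Sort by time; at equal times, ends (-1) sort before starts (+1),
    # so a job that starts exactly when another ends is not counted as overlapping.
    events.sort()
    current = peak = 0
    peak_at = None
    for minute, delta in events:
        current += delta
        if current > peak:
            peak, peak_at = current, minute
    return peak, peak_at

# Hypothetical runs, for illustration only.
runs = [("job_a", 0, 20), ("job_b", 5, 25), ("job_c", 10, 15), ("job_d", 25, 40)]
peak, minute = peak_concurrency(runs)
print(f"Peak of {peak} concurrent jobs, first reached at minute {minute}")
```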

    Cost Tab

    The Cost tab is particularly helpful for seeing the big picture of each pool.

    If the user is having trouble telling which pools have the biggest impact on the overall idle VM cost listed in the subscription overview, this page gives a clear breakdown of all costs related to each pool. In this example, the user can see that over 36% of the total cost being paid for “AllPurpose Executors” is spent on idle VMs. That pool would probably be a good starting point on the Activity tab.
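    Choosing a starting pool amounts to ranking pools by the share of their cost that goes to idle VMs. A minimal sketch, with hypothetical pool names and costs that are not taken from the screenshot:

```python
# Hypothetical per-pool costs, for illustration only.
pools = [
    {"name": "pool_a", "total_cost": 1200.0, "idle_cost": 440.0},
    {"name": "pool_b", "total_cost": 800.0, "idle_cost": 40.0},
    {"name": "pool_c", "total_cost": 300.0, "idle_cost": 90.0},
]

def idle_share(pool):
    """Fraction of a pool's total cost that was spent on idle VMs."""
    return pool["idle_cost"] / pool["total_cost"]

# Highest idle share first: the best candidates for a closer look.
ranked = sorted(pools, key=idle_share, reverse=True)
for p in ranked:
    print(f'{p["name"]}: {idle_share(p):.1%} idle (${p["idle_cost"]:.2f})')
```

    Ranking by absolute idle cost instead of share is an equally reasonable choice when a large pool with a modest idle share still accounts for most of the wasted spend.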