Understanding LHO Recommendations for Jobs


LHO continuously evaluates workload telemetry and identifies optimization opportunities.

Types of Recommendations


A. Cluster Right‑Sizing

Goal: Match CPU and memory to actual workload needs.

Common suggestions

  • Reduce memory footprint: When memory percentiles are far below thresholds, move to a lower‑memory node type with a similar core count.

  • Downsize the driver: If driver CPU and memory are near idle for most of the run, pick a smaller/cheaper driver class.

  • Switch to compute‑optimized instances: When executor CPU is high and memory is low, choose more cores with less RAM to accelerate parallel stages at lower cost.

  • Increase core count for parallel workloads: If executors are consistently CPU‑saturated while memory demand remains, keep the memory allocation and add cores (same family or next size up).

  • Photon consideration: Photon may improve performance for some workloads, but not all. Test your job with and without Photon to determine whether it provides measurable benefits.
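
For instance, when testing Photon, keep everything else fixed so the A/B comparison stays clean. A minimal sketch of the two cluster‑spec variants, assuming Databricks Clusters API field names (the runtime version and node type below are placeholders):

```python
# Two otherwise-identical job cluster specs that differ only in the Photon
# setting. Field names follow the Databricks Clusters API; the runtime
# version and node type are placeholders for your environment.
base_cluster = {
    "spark_version": "14.3.x-scala2.12",  # placeholder: use your actual runtime
    "node_type_id": "m5d.2xlarge",        # placeholder: your current node type
    "num_workers": 4,
}

photon_off = {**base_cluster, "runtime_engine": "STANDARD"}
photon_on = {**base_cluster, "runtime_engine": "PHOTON"}
```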

How to implement (safe path)

  1. Change one variable at a time (e.g., instance type or cores).

  2. Hold autoscaling steady (fixed executors) for the test if scaling churn exists.

  3. Compare Duration, Cost/hour, and Cost/TB over several consistent runs.

  4. Keep the change only if KPIs improve or meet your goals/SLA.
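
As a concrete illustration of step 1, the change can be applied through the Databricks Jobs API so that exactly one field differs from the baseline. A sketch, assuming a job that uses a job cluster keyed "main" (host, token, job ID, and node type are placeholders for your environment):

```python
# Sketch: apply a single right-sizing change (node type only) via the
# Databricks Jobs API 2.1 partial-update endpoint. Everything else in
# the cluster spec is held constant so the A/B comparison stays clean.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder
JOB_ID = 123                                             # placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "job_clusters": [{
                "job_cluster_key": "main",        # must match your job's key
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",
                    "node_type_id": "r5d.xlarge", # the ONE variable under test
                    "num_workers": 4,             # held fixed for the test
                },
            }],
        },
    },
    timeout=30,
)
resp.raise_for_status()
```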


B. Autoscaling Optimization

Goal: Maintain elasticity where it provides value while preventing unnecessary scaling churn, executor turnover, and performance overhead.

Autoscaling can be effective for workloads with real variability — but only when scaling decisions closely match workload behavior. When autoscaling reacts too quickly, too slowly, or within too narrow a range, it introduces extra cost and delays without improving performance.

Two distinct scenarios

  • Use a fixed‑size cluster when autoscaling oscillates: If autoscaling frequently flips between executor counts (for example, 2 → 3 → 2) without a meaningful change in the workload, the cluster repeatedly adds and removes executors, paying provisioning latency and re‑distribution/shuffle overhead for no real performance benefit. Fixing the executor count (e.g., “always 2” or “always 3”) stabilizes runtime and often reduces total cost by eliminating unnecessary scaling transitions.

  • Tune autoscaling: If the workload is truly bursty, autoscaling can still be valuable. However, you may need to adjust autoscaling parameters to ensure the cluster scales smoothly (e.g., raise the min executors to avoid repeatedly scaling down too far, widen bands, increase stabilization windows where supported).

How to implement (safe path)

  1. If oscillation exists, set min = max to a stable value for the A/B test.

  2. If workload is truly variable, tune bounds and cooldowns; test again.

  3. Re‑evaluate right‑sizing after scaling is stable (to avoid confounding effects).
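
The two remedies above translate into small cluster‑spec changes. A sketch using Databricks Clusters API field names (the sizes are illustrative):

```python
# Option 1 - oscillating workload: pin the size (effectively min = max),
# removing scaling transitions entirely for the duration of the A/B test.
fixed_size = {"num_workers": 3}

# Option 2 - genuinely bursty workload: keep autoscaling, but raise the
# floor and widen the band so the cluster stops thrashing in a narrow range.
tuned_autoscale = {"autoscale": {"min_workers": 3, "max_workers": 8}}
```

Either fragment replaces the `num_workers`/`autoscale` portion of the `new_cluster` spec shown in the right‑sizing sketch above.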


C. Serverless vs. Classic: When to Use Which

Goal: Select the execution mode that best fits the job’s runtime pattern and operational requirements.

When Serverless (Performance mode) is a good fit

  • Short‑lived jobs where cluster setup time dominates execution time (e.g., jobs that run for ~1 minute but require several minutes to initialize a cluster).

  • Event‑driven or on‑demand tasks that need fast starts and low ops overhead.

When classic (job compute) may be better

  • Network/firewall constraints that block serverless access.

  • Long‑running or infrequent jobs (e.g., daily ETL) where an extra few minutes of cluster initialization has little impact on SLA and cost control is your priority.

  • Workloads requiring precise compute configuration, such as detailed control over node type, instance size, or autoscaling behavior.

How to decide (A/B testing approach)

  • Right‑size the classic baseline first; run a few cycles.

  • Move a representative subset to serverless; run the same cycles.

  • Compare Cost/TB and Cost/hour along with Duration and success rate.
    Whichever mode yields more predictable performance at better or equal Cost/TB typically becomes the recommended mode.
    Note that serverless “standard mode” adds roughly five minutes of setup time, so a classic cluster optimized with LHO may be the better choice in that case.
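
A sketch of the comparison itself, in Python. The run‑record shape here is an assumption — adapt it to however you export run metrics from LHO or your billing data:

```python
# Compare two run sets (classic vs. serverless) on the KPIs above.
from statistics import mean, pstdev

def summarize(runs):
    """runs: dicts with duration_min, cost_usd, tb_processed, succeeded."""
    ok = [r for r in runs if r["succeeded"]]
    return {
        "success_rate": len(ok) / len(runs),
        "avg_duration_min": mean(r["duration_min"] for r in ok),
        "duration_stdev": pstdev([r["duration_min"] for r in ok]),  # predictability
        "cost_per_tb": sum(r["cost_usd"] for r in ok) / sum(r["tb_processed"] for r in ok),
    }

# Placeholder data - substitute your exported run metrics.
classic_runs = [
    {"duration_min": 44, "cost_usd": 5.30, "tb_processed": 1.2, "succeeded": True},
    {"duration_min": 42, "cost_usd": 5.10, "tb_processed": 1.2, "succeeded": True},
]
serverless_runs = [
    {"duration_min": 31, "cost_usd": 5.60, "tb_processed": 1.2, "succeeded": True},
    {"duration_min": 30, "cost_usd": 5.50, "tb_processed": 1.2, "succeeded": True},
]

print(summarize(classic_runs))
print(summarize(serverless_runs))
```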


D. Avoid Using All‑Purpose Compute (APC) for Production Jobs

All‑Purpose Compute (APC) is designed for interactive development, not for scheduled or production pipelines. Jobs running on APC are flagged as high‑priority cost‑saving opportunities.

Why APC should not be used for production jobs

APC carries a “trifecta” of inefficiencies for automated workloads:

  • Idle time: APC clusters often remain active between runs, accumulating unnecessary cost.

  • Higher DBU rates: APC costs more per hour than job clusters or serverless for equivalent compute.

  • Misaligned sizing: APC configurations are usually not optimized for recurring scheduled workloads, leading to waste and poor cost/perf alignment.

What to do

  • Use LHO to identify any remaining APC‑based jobs.

  • Migrate them to job clusters or serverless performance mode, depending on the workload pattern.

  • Validate improvements by comparing Duration, Cost/hour, and Cost/TB trend lines before and after migration.
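
A sketch of the first step — flagging jobs whose tasks reference an existing all‑purpose cluster (`existing_cluster_id`) instead of a job cluster or serverless compute. Host and token are placeholders, and note that `jobs/list` paginates, so a full audit should follow `next_page_token`:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

resp = requests.get(
    f"{HOST}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"expand_tasks": "true"},  # include task definitions in the response
    timeout=30,
)
resp.raise_for_status()

for job in resp.json().get("jobs", []):
    tasks = job.get("settings", {}).get("tasks", [])
    if any("existing_cluster_id" in task for task in tasks):
        print(f'Runs on APC: job {job["job_id"]} ({job["settings"].get("name")})')
```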


E. Job Re‑Orchestration (Split Long Jobs into Stages)

Goal: Align compute shape with the specific needs of each phase of a multi‑hour job.

When to consider re-orchestration

  • Multi‑hour jobs showing distinct utilization phases (e.g., CPU‑intense early stage followed by light post‑processing).

  • Pipelines where a single cluster shape forces compromise (too large for light phases or too small for heavy phases).

What to do

  • Split the job into well‑defined stages (or task groups) and assign compute independently; see the sketch after this list. For example:

    • Stage 1 (heavy/parallel): larger or more cores for faster completion.

    • Stage 2 (light): smaller/cheaper cluster to reduce cost.

  • Ensure artifacts or intermediate outputs are written to storage so downstream stages don’t repeat work.

  • Even with an additional ~5‑minute cluster startup between stages, this approach often improves both runtime and overall cost, especially for daily or long‑duration jobs.

  • Validate improvements by comparing Duration and Cost/TB before and after orchestration changes.
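
A sketch of a staged job definition, assuming the Databricks Jobs API 2.1 schema (node types, sizes, and notebook paths are placeholders):

```python
# Two stages, each pinned to its own cluster shape. Stage 2 depends on
# stage 1 and reads its persisted output rather than recomputing it.
job_payload = {
    "name": "nightly-etl-staged",
    "job_clusters": [
        {
            "job_cluster_key": "heavy",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "c5d.4xlarge",  # compute-optimized, many cores
                "num_workers": 8,
            },
        },
        {
            "job_cluster_key": "light",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "m5d.large",    # small and cheap for post-processing
                "num_workers": 2,
            },
        },
    ],
    "tasks": [
        {
            "task_key": "transform",
            "job_cluster_key": "heavy",
            "notebook_task": {"notebook_path": "/jobs/transform"},  # writes staged output
        },
        {
            "task_key": "publish",
            "job_cluster_key": "light",
            "depends_on": [{"task_key": "transform"}],
            "notebook_task": {"notebook_path": "/jobs/publish"},    # reads staged output
        },
    ],
}
```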


KPIs: What to Track and Why

  • Cost per TB: Normalizes spend by data volume; essential when input size or transformation complexity changes.

  • Cost per Cluster Hour: Reflects the raw rate you pay for the chosen instance(s); useful when switching instance families or driver/worker sizes.

  • Duration: End‑to‑end runtime; helps confirm performance impact and SLA alignment.

  • Data Usage: Confirms apples‑to‑apples comparisons. If data processed increases, higher total cost may still be efficient if Cost/TB holds or improves.

Tip: Always compare changes over several consistent runs and avoid overlapping experiments.


A/B Testing Playbook

Use this lightweight procedure to validate any change:

  1. Stabilize autoscaling first if you observe short segments/oscillation, so scaling churn doesn’t mask the effect of instance changes.

  2. Capture a baseline (several runs):
    Record Duration, Cost/hour, Cost/TB, and Data Usage for a consistent set of executions.

  3. Implement one change:
    Apply a single modification such as adjusting the instance family, core count, Photon setting, or autoscaling mode.

  4. Collect new data (several runs):
    Run the job under the same schedule and conditions as the baseline and capture the same KPIs.

  5. Analyze the job in Trend view and Optimization Review:

    1. If Duration decreases and Cost/TB stays the same or decreases, keep the change.

    2. If Duration stays the same or increases, or Cost/TB rises, revert the change or test another option.
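
The keep/revert rule from step 5, written out as a simple check. This sketch reuses the KPI summaries from the comparison helper shown earlier; the 2% tolerance is an arbitrary noise allowance, not an LHO parameter:

```python
def keep_change(baseline, candidate, tol=0.02):
    """Keep the change only if Duration improves and Cost/TB does not regress."""
    duration_improves = candidate["avg_duration_min"] < baseline["avg_duration_min"]
    cost_holds = candidate["cost_per_tb"] <= baseline["cost_per_tb"] * (1 + tol)
    return duration_improves and cost_holds

# Example with illustrative KPI summaries:
baseline = {"avg_duration_min": 43.0, "cost_per_tb": 4.33}
candidate = {"avg_duration_min": 30.5, "cost_per_tb": 4.60}
print(keep_change(baseline, candidate))  # False: Cost/TB regressed beyond tolerance
```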


Troubleshooting Guide

Symptoms & fixes

  • Job got slower after change

    • Recheck Autoscaling Timeline: any oscillation? Test fixed size.

    • Check for spill/shuffle increases: maybe memory was cut too far.

    • Compare data volume: did input grow? Use Cost/TB to normalize.

  • Costs increased but runtime improved

    • If Cost/TB improved or held steady, higher total may be justified by more data or a tighter SLA. Otherwise, try a cheaper compute‑optimized size.

  • Serverless run blocked

    • Confirm network/firewall allowlists and workspace settings.

    • Stay on classic for that job and optimize there.


Best Practices Checklist

  • Start with highest‑cost and most frequent jobs.
  • Fix autoscaling churn before measuring instance changes.
  • Change one variable at a time and run enough cycles.
  • Downsize drivers that are mostly idle.
  • Use compute‑optimized nodes when CPU is high and memory is low.
  • Consider re‑orchestration for multi‑phase, multi‑hour jobs.
  • Review monthly; re‑validate after major code or data shifts.

Frequently Asked Questions (FAQ)

  1. Q: Photon: should we always enable it?
    A: Not always. Useful in some cases, but effectiveness depends on workload code; test with and without Photon; keep it when KPIs improve.

  2. Q: Our driver looks idle — can we minimize it aggressively?
    A: Yes, if driver CPU/memory stay low across the run. Validate with a few runs after downsizing.

  3. Q: We have strict SLAs. Should we still chase lowest cost?
    A: Balance both. Prefer configs that meet SLA with headroom and good Cost/TB. Don’t trade reliability for marginal savings. LHO provides evidence and options; final decisions should consider SLAs, dependencies, and context that telemetry alone cannot capture.

  4. Q: Are there any performance implications if there is too much memory?
    A: No. The only downside is unnecessary additional cost.