Optimizing Your Databricks Clusters for Cost Efficiency

Databricks is a popular platform for running Apache Spark workloads in the cloud. While Databricks makes it easy to spin up Spark clusters on demand, this flexibility can also lead to higher costs if not managed properly. In this article, we’ll explore some techniques for optimizing your Databricks clusters to reduce costs without sacrificing performance.


Use Spot Instances

The cloud providers Databricks runs on, such as AWS and Azure, offer spot instances that sell off unused compute capacity at steep discounts compared to on-demand instances. The trade-off is that spot instances can be reclaimed with little notice when demand for that capacity rises.

For batch jobs and other fault-tolerant workloads, opting for spot instances can dramatically cut your Databricks bills. Just be sure your jobs are idempotent and can handle unexpected failures. You may also want to periodically checkpoint results to durable storage like S3.
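
To make this concrete, here is a minimal sketch of creating a mostly-spot job cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and instance type are placeholders, and the aws_attributes fields assume an AWS-backed workspace, so verify them against your workspace's API documentation before relying on this.

```python
# Sketch: creating a mostly-spot cluster through the Databricks Clusters REST API.
# Workspace URL, token, and node type are placeholders; the aws_attributes fields
# assume an AWS-backed workspace.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

cluster_spec = {
    "cluster_name": "nightly-batch-spot",
    "spark_version": "13.3.x-scala2.12",       # example runtime; pick one your workspace offers
    "node_type_id": "i3.xlarge",               # example instance type
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on on-demand capacity
        "availability": "SPOT_WITH_FALLBACK",  # use spot, fall back to on-demand if reclaimed
        "spot_bid_price_percent": 100,
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json().get("cluster_id"))
```

Inside the job itself, periodically persisting intermediate results to S3 (for example with df.write or spark.sparkContext.setCheckpointDir) limits how much work has to be redone when a spot node is reclaimed.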


Right-size Your Clusters

It’s tempting to overprovision your clusters to ensure fast performance. But oversized clusters result in waste, since you pay for the full capacity even when it’s not needed. 

Instead, take time to performance test and find the smallest cluster configurations that deliver acceptable performance for your workloads. As a rule of thumb, settle on the number and type of worker nodes first, then fine-tune the size of each node.

Also, scale down or terminate clusters when not in use. Don’t leave large clusters running 24/7 if you only need them a few hours a day.
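
Scaling down or terminating can also be scripted against the Clusters REST API, for example from a scheduled job that runs outside business hours. The sketch below uses placeholder host, token, and cluster ID values, and the endpoint paths assume API version 2.0.

```python
# Sketch: shrinking (or terminating) a cluster outside working hours via the
# Clusters REST API. Host, token, and cluster ID are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
CLUSTER_ID = "<cluster-id>"  # placeholder

def scale_down(num_workers: int = 2) -> None:
    """Resize the cluster to a small footprint instead of leaving it large overnight."""
    requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/resize",
        headers=HEADERS,
        json={"cluster_id": CLUSTER_ID, "num_workers": num_workers},
    ).raise_for_status()

def terminate() -> None:
    """Terminate the cluster entirely when it will not be needed for a while."""
    requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
        headers=HEADERS,
        json={"cluster_id": CLUSTER_ID},
    ).raise_for_status()
```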


Use Auto-scaling and Auto-termination

Use auto-scaling so your clusters automatically grow or shrink to match current demand. Configure minimum and maximum worker counts so that nodes are added when utilization is high or tasks start queueing up, and removed again when load drops.

Likewise, enable auto-termination to automatically shut down clusters after a specified period of inactivity. This ensures you aren't paying for resources that sit idle.
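
In the cluster definition itself, both behaviors come down to a few fields. The snippet below is a sketch using Clusters API 2.0 field names (autoscale, autotermination_minutes); the runtime version and instance type are just examples.

```python
# Sketch: the relevant pieces of a cluster definition with auto-scaling and
# auto-termination enabled (Clusters API 2.0 field names).
cluster_spec = {
    "cluster_name": "adhoc-analytics",
    "spark_version": "13.3.x-scala2.12",   # example runtime
    "node_type_id": "i3.xlarge",            # example instance type
    "autoscale": {
        "min_workers": 2,                   # floor you pay for while the cluster is up
        "max_workers": 12,                  # ceiling, added only when load demands it
    },
    "autotermination_minutes": 30,          # shut down after 30 idle minutes
}
```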


Choose Optimal Instance Types

The cloud providers offer a dizzying array of instance types optimized for different use cases. Picking the right instance type for your workloads can provide better performance per dollar spent. 

For Spark workloads, focus on instances with enough CPU, memory, and network capacity for the job at hand. Instance types backed by local SSD storage typically outperform HDD-backed instances, since Spark spills shuffle and cached data to local disk.
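
If you want to compare the options programmatically, the sketch below lists the node types your workspace exposes along with their cores and memory. The host and token are placeholders, and the response field names are assumptions based on the Clusters API list-node-types endpoint, so check a real response before scripting against them.

```python
# Sketch: enumerating available instance types to compare cores and memory per node.
# Host and token are placeholders; response field names are assumptions.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list-node-types",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

for nt in resp.json().get("node_types", []):
    print(nt["node_type_id"], nt.get("num_cores"), "cores,", nt.get("memory_mb"), "MB")
```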


Use Reservations and Discounts

Cloud providers offer reserved instance discounts and volume discounts that can yield substantial savings. Consider purchasing 1- or 3-year reservations for predictable workloads with steady resource needs.

Also, look into options such as prepaid Databricks usage commitments and AWS Savings Plans to reduce your overall bill.


Continuously Monitor and Optimize

Treat cost optimization as an ongoing process. Continuously monitor your usage and spending to uncover savings opportunities. Analyze logs and performance metrics to right-size clusters. Keep up with new instance types and pricing features from cloud providers. With some diligence, you can achieve big savings on your Databricks bills.
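
As a small example of this kind of housekeeping, the sketch below flags running clusters that have auto-termination disabled, which is a common source of silent spend. The host and token are placeholders, and the field names are assumptions based on the Clusters API list response.

```python
# Sketch: a simple audit that flags running clusters with auto-termination disabled.
# Host and token are placeholders; field names are assumptions from the list response.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

for c in resp.json().get("clusters", []):
    if c.get("state") == "RUNNING" and c.get("autotermination_minutes", 0) == 0:
        print(f"Review: {c.get('cluster_name')} ({c.get('cluster_id')}) "
              "is running with auto-termination disabled")
```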

Written by David Drai

David is CEO and co-founder of Anodot, where he is committed to helping data-driven companies illuminate business blind spots with AI analytics. He previously was CTO at Gett, an app-based transportation service used in hundreds of cities worldwide. Prior to Gett, he co-founded Cotendo, a content delivery network and site acceleration services provider that was acquired by Akamai Technologies, where he also served as CTO. He graduated from Technion - Israel Institute of Technology with a BSc in computer science.
