Databricks has emerged as one of the most powerful solutions for organizations to sort and normalize the data they ingest. But like all cloud providers — from the big three to more specialized infrastructure vendors — it adds complexity to customers’ IT spend.
This week, we announced support for Databricks on the CloudZero Platform. With this new functionality, customers can gain visibility into their Databricks spend, combine it with any other spend that contributes to their bottom line, and get a complete view of business dimensions, such as products and customers.
In this blog, we give an overview of all things Databricks: why it’s innovative, its billing challenges, and how we evolved our platform to work seamlessly with it.
Introducing: Data Lake Houses
Databricks pioneered the concept of a “data lake house.” Aside from sounding like a reality show for data engineers, a data lake house combines two of its technological predecessors: data warehouses and data lakes — which both have key strengths and limitations.
Data warehouses are great for storing and sorting large amounts of structured data, which is fine if you collect only one type of data. The problem is, most companies collect many types of data: images, emails, videos, keystrokes (looking at you, TikTok). Some is structured, some is semi-structured, and some is unstructured, which breaks the data warehouse model. This led to…
Data lakes, or, as one Medium article called them, “data swamps.” Data lakes collect and store all types of data. They get called “swamps” because, with so many data types and formats, they often end up a complete mess.
Databricks sits on top of your data repository (lake), normalizes the format of different types and structures of data, and lets you use it for all kinds of higher-order tasks. Business Intelligence (BI), dynamic analytics, data science, and ML all become possible — and with them, more informed strategic decisions. (For example, an oil company used past data about the status of their rigs to predict when currently healthy rigs might need maintenance.)
Plus, Databricks has a very intuitive UI that doesn’t require a ton of technical expertise to use. Customers love that.
Databricks Billing Challenges
At a basic level, using Databricks means getting another invoice from another cloud provider. Unless you have a sophisticated way to ingest this billing data and integrate it with the rest of your cloud spend, it will take manual effort to combine this invoice with all your others.
Databricks presents a few other more specific challenges:
- No spending guardrails. Minimal cost alerting functions make it easy to overspend on data exploration exercises. It’s not unheard of for customers to spend tens of thousands of dollars (or even hundreds of thousands, according to one of our customers) before realizing it.
- Databricks/EC2. Databricks usage incurs two main charges: the cost to license the platform, and associated EC2 costs to run the platform. This can make it hard to assess the overall cost of Databricks — and, by extension, its ROI.
- COGS vs. R&D. People use Databricks for different reasons — either to explore data and extract insights, or to automate certain production-stage data queries. These costs fall into different categories, but Databricks doesn’t have a great solution for sorting them.
- Unit costs. As with other cloud providers, there’s no easy way to allocate the right portions of Databricks spend to the business units driving that spend.
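To make the two-charge problem concrete, here is a minimal Python sketch of stitching platform charges and EC2 charges back together per cluster. Everything in it is an assumption for illustration: the field names (`cluster_id`, `cluster_id_tag`, `platform_usd`, `cost_usd`), the sample figures, and the idea that EC2 line items carry a tag identifying the cluster they belong to. It is not a description of any vendor's export format or of how CloudZero does this.

```python
from collections import defaultdict

# Hypothetical, simplified billing rows; real exports carry many more fields.
databricks_charges = [
    {"cluster_id": "c-101", "platform_usd": 30.0},
    {"cluster_id": "c-202", "platform_usd": 20.0},
]
ec2_line_items = [
    {"cluster_id_tag": "c-101", "cost_usd": 45.0},
    {"cluster_id_tag": "c-101", "cost_usd": 15.0},
    {"cluster_id_tag": "c-202", "cost_usd": 30.0},
    {"cluster_id_tag": None, "cost_usd": 99.0},  # EC2 spend unrelated to Databricks
]


def total_cost_by_cluster(platform_rows, ec2_rows):
    """Sum platform and EC2 charges per cluster and report a combined total."""
    totals = defaultdict(lambda: {"platform_usd": 0.0, "ec2_usd": 0.0})
    for row in platform_rows:
        totals[row["cluster_id"]]["platform_usd"] += row["platform_usd"]
    for row in ec2_rows:
        cluster_id = row["cluster_id_tag"]
        if cluster_id is None:
            continue  # untagged EC2 spend cannot be attributed to a cluster
        totals[cluster_id]["ec2_usd"] += row["cost_usd"]
    return {
        cluster_id: {**parts, "total_usd": parts["platform_usd"] + parts["ec2_usd"]}
        for cluster_id, parts in totals.items()
    }


result = total_cost_by_cluster(databricks_charges, ec2_line_items)
```

With this sample data, the platform invoice alone shows 30 dollars for `c-101`, while the combined figure is 90 dollars. That gap is why looking at either bill in isolation understates the cost of running the platform.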
How CloudZero Makes Databricks Billing Simple
CloudZero recently developed an adaptor that lets customers unify Databricks spend with the rest of their cloud spend. In addition to automating the process of ingesting and analyzing their Databricks bill, this gives customers:
- Total cost. We show the complete cost to run Databricks, including costs incurred within Databricks and within the EC2 resources.
- Guardrails. We provide up-to-the-minute cost data, giving engineers cost guardrails to prevent $150k+ mistakes.
- Business Dimensions. We allocate Databricks spend according to where the money’s actually going. We can distinguish R&D spend from production spend and automatically trace it to the relevant customers, products, teams, and more.
- Accurate Unit Economics and Cost Per Customer. CloudZero is the only cost solution that enables customers to ingest unit cost telemetry and use it to accurately apportion shared spend, such as multi-tenant costs. This can be applied to Databricks in addition to every other cost source customers bring into the platform.
- Commit with confidence. Like other cloud providers, Databricks offers commitment-based discounts. With precise cost data from CloudZero, Databricks customers can make informed decisions about how much to commit to upfront.
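As a rough illustration of what apportioning shared spend with unit cost telemetry can mean, the sketch below splits one shared bill across customers in proportion to a usage metric. The function name `apportion_shared_cost`, the customer names, and the usage figures are all hypothetical, and proportional allocation is only one simple scheme. It is not presented as CloudZero's actual method.

```python
def apportion_shared_cost(shared_cost_usd, usage_by_customer):
    """Split a shared cost across customers in proportion to their usage."""
    total_usage = sum(usage_by_customer.values())
    if total_usage <= 0:
        raise ValueError("usage must sum to a positive number")
    return {
        customer: shared_cost_usd * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


# Hypothetical telemetry: queries run per customer on a shared, multi-tenant cluster.
telemetry = {"acme": 600, "globex": 300, "initech": 100}
cost_per_customer = apportion_shared_cost(5000.0, telemetry)
```

Here a 5,000 dollar shared bill is split 3,000 / 1,500 / 500 across the three customers. The choice of usage metric (queries, rows processed, compute seconds) drives the result, so it should reflect whatever actually generates the cost.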
We love lake houses as much as the next cloud cost intelligence company. But we especially love them when they run efficiently: when their expenses are trackable and their costs are turned into actionable insights.