We’ll be diving into the data juggernaut Databricks today.
1. Overview and Why Databricks Matters Now
Databricks has emerged as one of the most important companies in enterprise data and AI infrastructure. Founded in 2013 by the creators of Apache Spark, it began as a way for data scientists and engineers to process very large datasets quickly.
Over the past decade, it has grown into a unified data lakehouse platform. It combines the capabilities of a data lake (flexible, inexpensive storage) and a data warehouse (fast, structured queries) in one place. This means customers can store all their raw data in one system, process it at scale, run advanced analytics, and build AI models without moving between different tools.
The company is still private but has scaled to an estimated $3 billion in annual revenue with more than 10,000 customers globally, including over half of the Fortune 500 (company figures, October 2024). It reached free cash flow breakeven in 2024, signaling that it’s no longer in the “burn cash to grow” phase that defines many startups. Its most recent funding round, announced in December 2024, valued it at $62 billion.
Databricks is not just a big player in its own right. It’s a bellwether for a broad category of infra startups. Many emerging companies build tools that plug into, extend, or compete with Databricks. As Databricks grows, it can pull whole segments of the infra market along with it, or crush weaker competitors in overlapping areas.
2. What Databricks Actually Offers
To understand the company’s market power, you need to understand what its product does and why it’s attractive to both engineers and business leaders.
Unified data environment: Historically, companies used different tools for different steps of the data workflow: a raw data store like Hadoop or Amazon S3 for holding files, a separate data warehouse like Snowflake or Teradata for analytics, and perhaps another platform for machine learning. This meant data had to be copied between systems, slowing work and creating a risk of inconsistencies. Databricks’s “lakehouse” model lets teams do all those steps (store, clean, query, analyze, and run AI) in one integrated place.
AI integration: Through tools like MLflow (open-sourced by Databricks) and its 2023 acquisition of MosaicML, Databricks lets customers train, fine-tune, and deploy AI models directly on their own data. With the MosaicML technology, customers can fine-tune and run LLMs inside their own environment, which helps them meet privacy and regulatory requirements. This integration is timely: many enterprises want to harness AI without sending sensitive data to a public API like OpenAI’s.
Open-source foundation: The platform is built on open standards like Apache Parquet (a storage format) and Delta Lake (for transactional reliability in data lakes). These open-source roots make Databricks easier to integrate with other tools and reduce fear of vendor lock-in. Engineers are more willing to commit to it because they can still work with their data outside Databricks if needed.
Multi-cloud compatibility: Unlike cloud provider-native tools that run on only one platform (e.g. Amazon Redshift on AWS), Databricks runs on all three major public clouds (AWS, Azure, Google Cloud). This is important for companies that have multi-cloud strategies or want to avoid being tied too tightly to one provider.
3. The Market Context
Databricks sits at the center of several huge and fast-growing markets. The global big data and analytics market was estimated at $348 billion in 2023 and is projected to grow at over 13% annually through 2030 (Grand View Research, April 2024). Within that, the market for data platforms that unify analytics and AI (the “lakehouse” niche) is newer but expanding faster. This is fueled by enterprises shifting to cloud data storage and adding AI workloads.
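To get a feel for what that growth rate implies, here is a minimal back-of-the-envelope sketch that compounds the 2023 estimate forward. It assumes a flat 13% annual rate (the forecast says "over 13%", so treat the result as a floor) and counts 2023 as the base year. The function name is made up for illustration.

```python
def project_market(base_billion: float, annual_growth: float, years: int) -> float:
    """Compound a base market size forward at a fixed annual growth rate."""
    return base_billion * (1 + annual_growth) ** years

# Assumptions: $348B base in 2023, flat 13% growth, 7 years to reach 2030.
size_2030 = project_market(348.0, 0.13, 2030 - 2023)
print(f"Implied 2030 market size: ${size_2030:,.0f}B")
```

Under those assumptions the market more than doubles by 2030, to a bit over $800 billion.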
Adoption of AI across industries is a major tailwind. Every AI application (from predictive maintenance in manufacturing to personalized recommendations in retail) depends on robust data infrastructure. Databricks directly benefits from that wave: before you can train a good AI model, you need a clean, accessible, and well-structured dataset. And that’s what Databricks enables.
Another growth driver is the move away from legacy, on-premise data warehouses toward cloud-based and hybrid solutions. Companies still running on older systems like Oracle or Teradata are potential customers as they modernize. Databricks’s multi-cloud flexibility makes it appealing for these migration projects.
4. Competition and Positioning
Databricks’s closest high-profile competitor is Snowflake. Both companies want to be the central hub for enterprise data, but they come from different starting points. Snowflake began as a cloud-native data warehouse optimized for structured data and SQL analytics. Databricks began in big data processing and machine learning. Over the past few years, they have been converging:
Snowflake has added machine learning and unstructured data capabilities.
Databricks has improved its SQL support and ease-of-use for business analysts.
Cloud giants like AWS, Azure, and Google Cloud are also competitors, since each offers its own analytics and AI services. But those tend to be more siloed and less open. Databricks wins when a customer wants one environment for both engineering and analytics and doesn’t want to be locked to a single cloud provider.
The competitive dynamic matters for infra startups. If Databricks wins more accounts against Snowflake, it shapes the ecosystem for add-on tools. For example, a startup building a monitoring tool for Databricks pipelines will see a bigger market. But one tightly integrated with Snowflake might have a smaller addressable market if Databricks’s share grows.
5. Financial and Operational Performance
Databricks’s revenue reached roughly $3 billion in the year to October 2024, up from around $1.9 billion in 2023, a growth rate of about 58% (company figures, October 2024). That’s faster than Snowflake, which grew 36% year-over-year in its latest fiscal year.
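The growth figure is easy to check yourself. Here is a minimal sketch using the rounded revenue numbers quoted above (in billions of dollars); the helper name is illustrative, and because the inputs are rounded, the result is approximate.

```python
def yoy_growth(current: float, prior: float) -> float:
    """Year-over-year growth as a fraction of the prior period's revenue."""
    return (current - prior) / prior

# Rounded figures from the text, in billions of dollars.
databricks_growth = yoy_growth(3.0, 1.9)
print(f"Databricks growth: {databricks_growth:.1%}")  # prints "Databricks growth: 57.9%"
```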
Gross margins (the percentage of revenue left after covering the cost of delivering the service) are in the mid-80% range for the core software business. That’s in line with best-in-class SaaS companies and higher than many cloud infra companies whose margins are eroded by heavy compute costs.
Customer retention is extremely strong. Net revenue retention is estimated at around 140%. This means that, on average, the same group of customers spends about 40% more than it did a year earlier, even after accounting for churn and downgrades. This is a sign of both product stickiness and expansion potential within accounts.
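For readers new to the metric, here is a rough sketch of how net revenue retention is typically calculated for a customer cohort. The cohort numbers below are hypothetical, chosen only so the result lands on 140%; they are not Databricks figures, and companies differ in the exact definitions they use.

```python
def net_revenue_retention(starting: float, expansion: float,
                          contraction: float, churn: float) -> float:
    """NRR = revenue from last year's customers today / their revenue a year ago."""
    return (starting + expansion - contraction - churn) / starting

# Hypothetical cohort, in $M of annual recurring revenue (illustrative only).
nrr = net_revenue_retention(starting=100, expansion=48, contraction=3, churn=5)
print(f"NRR: {nrr:.0%}")  # prints "NRR: 140%"
```

The point to notice is that NRR above 100% means expansion from existing accounts more than offsets everything lost to downgrades and cancellations, so revenue grows before a single new customer is signed.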
The company also crossed into free cash flow positive territory in 2024, meaning it’s no longer dependent on outside funding to sustain operations. For potential employees, this is important: it signals stability and reduces the risk of deep cost-cutting in a downturn.
6. Risks and Dependencies
Databricks’s trajectory isn’t guaranteed. There are several risk areas to watch:
Competitive intensity: Snowflake is not standing still. And the cloud hyperscalers are constantly improving their native offerings. If a cloud provider bundles a full-fledged lakehouse-style product at a lower price, Databricks could face pricing pressure.
Macro environment: The company’s usage-based pricing means that if customers cut workloads in a downturn, revenue could slow quickly. Smaller infra startups have seen this effect sharply in past slowdowns. Databricks is more resilient, but not immune.
Complexity barrier: Despite recent improvements, Databricks can still feel daunting for less technical teams. If adoption stalls in the “business analyst” user segment, Snowflake’s simpler interface could win deals.
Security and compliance: Handling sensitive data for large enterprises means any breach or compliance failure could be a major reputational and financial hit.
Dependencies also exist in its growth model: Databricks’s expansion drives (and depends on) the health of complementary infrastructure. It needs a robust partner ecosystem to meet customer demands in areas it doesn’t fully cover itself, such as specialized data ingestion, industry-specific AI models, or compliance automation.
7. Impact on the Infra Startup Ecosystem
Databricks’s growth has a direct ripple effect across the broader infra startup market:
Complementary startups get a bigger pie: Companies building data connectors, orchestration tools, observability platforms, governance layers, or AI deployment systems can integrate with Databricks and ride its expansion. For example, ETL providers like Fivetran benefit as customers feed more data into Databricks.
Adjacent categories may get absorbed: Startups that build features Databricks can easily add risk being outcompeted. For instance, a standalone notebook-based machine learning workflow tool might find customers prefer the built-in MLflow inside Databricks.
Cloud optimization tools gain indirectly: Databricks workloads consume significant compute and storage on AWS, Azure, and GCP. As Databricks’s usage grows, so does demand for startups offering cloud cost optimization, performance tuning, and monitoring.
Higher technical standards in the market: As Databricks sets the bar for scalability, reliability, and openness, startups in adjacent infrastructure categories may need to meet similar standards to be considered enterprise-grade partners.
In short, the winners will be those that complement and extend Databricks’s capabilities, not those that try to replicate them. The losers will be those whose entire product overlaps with a Databricks roadmap item.
8. The 24-Month Outlook
The next two years are likely to include:
Continued strong growth in large enterprise accounts, especially in regulated industries like financial services, healthcare, and government, where Databricks’s security and governance features are a differentiator.
IPO readiness. The company’s scale, growth, and profitability profile make a public listing highly plausible in 2025–2026. Public market valuation will depend on maintaining high growth while demonstrating operating leverage.
Deeper AI integration. Expect Databricks to push hard on enterprise AI features, from fine-tuning LLMs with MosaicML to building tools for deploying AI agents. This will both strengthen its value to existing customers and open doors to new ones.
Ecosystem expansion. More integrations, partner-built apps, and vertical solutions will likely emerge, further entrenching Databricks in enterprise workflows.
For infra startups, this means the clock is ticking to align with the Databricks ecosystem if you want to ride the wave. Being “Databricks-native” could become as valuable in data infra as being “AWS-native” became in cloud infrastructure a decade ago.
9. Bottom Line
Databricks combines strong technology, excellent financial performance, and a favorable market environment. Its unified approach to data and AI positions it well against both pure-play rivals like Snowflake and the native services of cloud giants. The risks (especially competitive pressure and macroeconomic headwinds) are real. But the company’s scale, retention, and cash flow give it resilience.
For investors, the company’s growth is a signal that the unified data + AI platform model is winning. For potential employees, it offers the stability of a profitable, late-stage company with the upside of pre-IPO equity. And for infra startups, it’s a gravitational force in the market: align with it and you may find your market expanding. Compete head-on in its core territory and you’ll face an uphill battle.