Infra Startups

Agents Need Worlds: Building Verifiable Environments From Scratch

Prateek Joshi — Wed, 29 Apr 2026 18:05:28 GMT

If you want to train an AI agent, a prompt is not enough. A prompt gives the agent an instruction. But real work doesn’t happen inside an instruction.

Real work happens inside an environment. There is state. There are tools. There are constraints. There are consequences. You try something, observe what changed, recover from mistakes, and eventually reach a useful outcome.

I wanted to see what it takes to build an environment from scratch. And that’s the premise behind pworlds, a tool I built to explore this simple thesis.

So what is pworlds? It’s a collection of verifiable task environments for agents. It is not a chatbot wrapper, a benchmark suite, or a generic RL framework. It is closer to environment engineering for agent training. The goal is to create small executable worlds where agents can act, receive objective feedback, generate traces, and export training signal.

The Problem With Final Answers

Most AI evaluation still over-focuses on final answers.

Did the model solve the problem?
Did it pass the test?
Did it produce the right output?

Those questions matter, but they miss the messy middle.

In real work, the process is often more informative than the answer. What did the agent try first? What failed? What changed? What feedback did it receive? How did it recover?

That is the kind of signal I wanted pworlds to capture.

A world is not just a question with an answer. A world has state. A world accepts actions. A world changes when those actions are taken. A world can grade whether progress was made. And if designed correctly, a world can record the full path from initial state to successful outcome.

That path is where the valuable data lives.

The Constraint: Keep It Local And Small

From the beginning, I forced the project to stay small.

The first version had to be local-only. No GPU requirement. No model training. No OpenAI integration. No Anthropic integration. No cloud services. No database. No Docker. No plugin-heavy architecture.

The abstractions had to stay clean, but small. The priority was CLI usability, tests, determinism, and extensibility.

Those constraints were not arbitrary. They prevented the project from prematurely becoming a platform.

Before building distributed infra, I wanted to prove the runtime pattern:

Can we make a task executable?
Can we expose state?
Can we accept actions?
Can we compute rewards?
Can we record traces?
Can we replay a trajectory?
Can we export the result as training signal?

That was the core question.

psignal: The Runtime Substrate

The first package I built was psignal.

psignal is the shared substrate underneath pworlds. Its job is to turn executable tasks into training signal.

The loop is straightforward:

task → observe → action → transition → reward → trace → replay → export

Instead of hiding state in a database or service, psignal makes every task a visible local artifact on disk. When a task is created, the system generates files like psignal.yaml, state.json, trace.jsonl, and a README.

The metadata lives in the YAML file. The current state lives in JSON. The trajectory is appended line by line into JSONL.

That file-backed design ended up being one of the most important decisions in the project. It made everything inspectable. You can open the folder and see what happened. It made the system debuggable, portable, git-friendly, and easy to explain.
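To make the file-backed idea concrete, here is a minimal sketch in Python. It is not the actual psignal code: the function names, the metadata fields, and the shape of each trace record are assumptions made for illustration. Only the file names (psignal.yaml, state.json, trace.jsonl) come from the description above.

```python
import json
import tempfile
from pathlib import Path


def create_task(root: Path, name: str, initial_state: dict) -> Path:
    """Create a task folder holding metadata, current state, and an empty trace."""
    task_dir = root / name
    task_dir.mkdir(parents=True)
    # Hypothetical metadata fields, written by hand to avoid a YAML dependency.
    (task_dir / "psignal.yaml").write_text(f"name: {name}\nkind: demo\n")
    (task_dir / "state.json").write_text(json.dumps(initial_state, indent=2))
    (task_dir / "trace.jsonl").write_text("")
    return task_dir


def record_step(task_dir: Path, action: str, new_state: dict, reward: float) -> None:
    """Overwrite the current state and append one line to the trajectory."""
    (task_dir / "state.json").write_text(json.dumps(new_state, indent=2))
    step = {"action": action, "state": new_state, "reward": reward}
    with (task_dir / "trace.jsonl").open("a") as f:
        f.write(json.dumps(step) + "\n")


def load_trace(task_dir: Path) -> list:
    """Read the trajectory back, one JSON object per line."""
    lines = (task_dir / "trace.jsonl").read_text().splitlines()
    return [json.loads(line) for line in lines if line]


with tempfile.TemporaryDirectory() as tmp:
    task = create_task(Path(tmp), "demo-task", {"count": 0})
    record_step(task, "+1", {"count": 1}, reward=-0.1)
    record_step(task, "+1", {"count": 2}, reward=-0.1)
    trace = load_trace(task)
    final_state = json.loads((task / "state.json").read_text())
```

Because every step is one appended line, replaying a trajectory is just a matter of reading the file top to bottom.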

The Smallest Possible World

The first built-in environment was intentionally tiny: a counter.

The task starts at zero. The goal is to reach five. The valid actions are +1, -1, and reset.

If the agent reaches the goal, it receives a positive reward. If it takes a valid but incomplete step, it receives a small penalty. If it takes an invalid action, it receives a larger penalty.

That may sound too simple, but that was exactly why it was useful.

The counter is the smallest possible environment that still demonstrates the full runtime pattern. There is persistent state. There are explicit actions. There are valid and invalid moves. There is an objective success condition. There is a trace. There is replay. There is export.
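The counter world is small enough to sketch in a few lines of Python. The reward magnitudes below (1.0, -0.1, -1.0) are assumptions: the description only fixes their signs and relative sizes. The function names are also hypothetical, not the psignal API.

```python
GOAL = 5
VALID_ACTIONS = {"+1", "-1", "reset"}


def step(count: int, action: str) -> tuple:
    """Apply one action and return (new_count, reward, done)."""
    if action not in VALID_ACTIONS:
        # Invalid action: larger penalty, state unchanged.
        return count, -1.0, False
    if action == "+1":
        count += 1
    elif action == "-1":
        count -= 1
    else:  # "reset"
        count = 0
    if count == GOAL:
        return count, 1.0, True
    # Valid but incomplete step: small penalty.
    return count, -0.1, False


def rollout(actions: list) -> list:
    """Run a sequence of actions from zero and return the trace."""
    count, trace = 0, []
    for action in actions:
        count, reward, done = step(count, action)
        trace.append({"action": action, "count": count, "reward": reward, "done": done})
        if done:
            break
    return trace


trace = rollout(["+1", "jump", "+1", "+1", "+1", "+1"])
```

The invalid "jump" action shows up in the trace with its penalty, which is exactly the kind of intermediate signal a final-answer check would discard.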

Before proving the system could handle complex work, I wanted to prove that the loop itself was clean.

plemma: A Theorem-Proving World

Once psignal worked, the next question was whether this substrate could support something more meaningful than a toy counter.

That led to plemma, the first real world built on top of psignal.

plemma is a theorem-proving world. I chose theorem proving because it is a high-signal domain for objective agent interaction. A proof state is explicit. Actions are discrete. Some moves are valid. Some are invalid. Success is not a matter of taste. Either the proof completes or it does not.

The long-term direction is Lean-style theorem proving, but the first version deliberately avoided full Lean integration. Instead, plemma used a simulated tactic environment and failed gracefully if a real Lean toolchain was missing.

The initial theorem was simple: prove identity. The valid path was intro h, then exact h.
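For reference, this is what that two-step proof looks like in real Lean 4. The exact theorem statement is an assumption (identity over an arbitrary proposition P), and plemma itself used a simulated tactic environment rather than running Lean.

```lean
-- Identity: from a proof of P, produce a proof of P.
theorem identity (P : Prop) : P → P := by
  intro h   -- assume h : P; the goal becomes P
  exact h   -- h is exactly what the goal asks for
```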

That was enough to make the abstraction legible.

In the counter world, actions manipulate integer state. In the theorem world, actions manipulate proof state. Same runtime pattern, different domain.

This was the first important proof point. pworlds had gone from a counter demo to something that could represent symbolic work.

pspec: A Software Engineering World

But theorem proving is still niche, so I wanted a second world that would be more broadly understandable.

That became pspec, a software engineering world.

pspec is a coding-task environment. The first version used a small buggy FizzBuzz function. The task had a source file, tests, metadata, state, trace, and reward function.

The workflow was simple:

Inspect the code.
Edit the file.
Run the tests.
Receive feedback.
Record the step.
Try again.

This is much closer to how real software work happens. You do not jump from broken code to final patch. You inspect, edit, run tests, fail, edit again, and eventually pass.

The intermediate attempts are not noise. They are the process. And the process is the training signal.

The Important Distinction In pspec

Building pspec revealed an important distinction.

In the counter world, the action is typed directly into the CLI. You say +1, and the environment updates the count.

In the theorem world, the action is also typed directly into the CLI. You say intro h, and the environment updates the proof state.

But in the coding world, the meaningful action is not the CLI command. The meaningful action is the file edit.

The CLI action is only run-tests.

That distinction matters because pspec is not a patch-application DSL. It is a local repair environment. A human or agent edits files in the workspace. Then pspec evaluates the current code state by running tests and recording the result.

That model feels much closer to actual agent work.

The Trajectory Is The Data

pspec records more than just whether the tests pass.

It can record the current source snapshot, source hash, test output, reward, completion status, and the diff between evaluated steps.

That diff is especially important. It turns the trajectory from a sequence of states into a sequence of state transitions.
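A rough sketch of how a source hash and a step-to-step diff could be captured, using only the Python standard library. This is not the pspec implementation: the function names and the toy FizzBuzz snippet are made up for illustration.

```python
import difflib
import hashlib


def source_hash(source: str) -> str:
    """Stable fingerprint of a source snapshot."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def diff_steps(previous: str, current: str, filename: str = "fizzbuzz.py") -> str:
    """Unified diff between two evaluated snapshots; empty if nothing changed."""
    lines = difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(lines)


# Toy snapshots: the buggy version returns the wrong string for multiples of 15.
BUGGY = 'def fizzbuzz(n):\n    if n % 15 == 0:\n        return "Fizz"\n    return str(n)\n'
FIXED = BUGGY.replace('"Fizz"', '"FizzBuzz"')

step_record = {
    "source_hash": source_hash(FIXED),
    "diff": diff_steps(BUGGY, FIXED),
}
```

Storing the diff alongside the hash means each trace line describes a transition, not just a snapshot.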

So the output is not merely: “Here is the final fixed file”

The output is: “Here is the buggy starting point. Here is the first attempted change. Here is what failed. Here is the next diff. Here is the test feedback. Here is the moment the task became correct.”

That is a much richer object for agent training. A final answer tells you what worked. A trajectory tells you how the work got done.

From Local Debugging To Training Artifacts

After the initial pspec version, I added support for custom tasks.

You can create a task from a real Python source file and a test file or test directory. pspec copies the source and tests into a task folder, stores a hidden reset template, tracks evaluated edits, captures diffs, and can package the whole thing into a single training_artifact.json.

That packaging step matters operationally.

Without it, handing data to a training team means explaining which files matter, where the trace lives, how the tests relate to the source, and what the final state represents.

With packaging, the output becomes one structured artifact containing metadata, state, source files, test files, traces, and exports.
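A packaging step of this kind could be sketched as follows. Only the file name training_artifact.json comes from the description above. The JSON layout, the folder structure, and the function name are assumptions.

```python
import json
import tempfile
from pathlib import Path

ARTIFACT_NAME = "training_artifact.json"


def package_task(task_dir: Path) -> Path:
    """Bundle every file in a task folder, plus the parsed trace, into one JSON file."""
    files = {}
    for path in sorted(task_dir.rglob("*")):
        if path.is_file() and path.name != ARTIFACT_NAME:
            files[path.relative_to(task_dir).as_posix()] = path.read_text()
    trace_text = files.get("trace.jsonl", "")
    artifact = {
        "files": files,
        "trace": [json.loads(line) for line in trace_text.splitlines() if line],
    }
    out = task_dir / ARTIFACT_NAME
    out.write_text(json.dumps(artifact, indent=2))
    return out


with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp)
    (task_dir / "src").mkdir()
    (task_dir / "tests").mkdir()
    (task_dir / "src" / "fizzbuzz.py").write_text("def fizzbuzz(n):\n    return str(n)\n")
    (task_dir / "tests" / "test_fizzbuzz.py").write_text("# tests would live here\n")
    (task_dir / "trace.jsonl").write_text('{"step": 1, "tests_passed": false}\n')
    artifact = json.loads(package_task(task_dir).read_text())
```

The receiving team then needs to understand one schema instead of a folder convention.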

This is the bridge from local debugging to process-data generation.

What pworlds Is Not

pworlds is not a training platform.

It is not a model-serving platform. It is not a cloud orchestration system. It is not a benchmark leaderboard. It is not a full Lean integration. It is not an autonomous coding agent.

It is an early substrate for verifiable task environments.

But even in this early form, the pattern is visible. The same runtime can support integer control, theorem proving, and code repair.

Those are three very different domains, but they all fit the same loop:

observe → act → transition → reward → trace → replay → export

That is the core abstraction.

The Bigger Point

A lot of the AI world still thinks in prompts.

Better prompts. Longer prompts. More structured prompts. Prompt libraries. Prompt workflows.

Those things are useful, but they are not enough for agents that need to do real work. Real agents need environments.

A coding agent needs broken repos, tests, diffs, intermediate failures, and repair trajectories.

A theorem-proving agent needs proof states, tactic attempts, invalid moves, and verified completions.

A data-center operations agent needs simulated incidents, control actions, safety constraints, and recovery paths.

A chip-design agent needs toolchains, timing reports, constraints, compiler feedback, and objective pass/fail signals.

The domain changes, but the pattern stays the same.

Build the world. Expose the state. Let the agent act. Grade the outcome. Record the trace. Replay the trajectory. Export the signal.

Why This Matters

The frontier labs already have a lot of text.

What they increasingly need is high-quality process data from environments where outcomes are verifiable.

Not just answers, but attempts.
Not just scores, but trajectories.
Not just final patches, but the path from broken to working.

If we can turn high-value work into executable environments, then every attempt becomes data. Every failure becomes signal. Every recovery becomes part of the training distribution.

If you’ve been building or investing in this direction, I’d love to chat with you.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:

Subscribe now

Four Open Source Tools I Built to Dissect AI Infra

Prateek Joshi — Thu, 23 Apr 2026 19:05:46 GMT

A lot of AI infra is still discussed at the level of abstractions.

People talk about agents, reasoning, open models, serving, environments, orchestration, and post-training as if these are clean categories. In practice, they are messy systems problems.

Memory breaks. Runtime assumptions leak. Models behave differently once they are actually served. Fine-tuning is easy to talk about and much harder to operationalize on commodity hardware. Agent environments sound simple until you have to make them persistent, inspectable, and usable by real workflows.

Over the last stretch, I built four open source tools to get closer to the mechanics: pforge, phabitat, pscope, and psplice.

This was not meant to be a tool-building exercise. It was research through construction. My goal was simple:

AI infra is moving too rapidly to reason about it from the outside. So to get an actual pulse, I need to start touching the actual surfaces where things break.

Together, these tools form a kind of personal lab for open models and agents.

1. pforge: shaping and serving open models on your own GPU

pforge began with a basic question: what do we actually learn when we work with open models directly instead of only consuming them through an API?

Most people interact with models as black boxes. You send a prompt and get back text. But that hides the most important questions. How does latency change with model size and serving setup? What happens when you compare base and tuned variants side by side? How sensitive is behavior to reasoning budget, decoding settings, and fine-tuning? What can you inspect while the answer is forming?

pforge is my attempt to make those mechanics more visible.

It is a CLI for shaping and serving open models on a user’s own GPU machine. The emphasis is not just “run a model locally”. The emphasis is to compare, inspect, and experiment. You can chat with a model, compare outputs across variants, adjust reasoning budgets, and examine how behavior changes under different conditions.

What pforge taught me is that open models are not just cheaper substitutes for frontier APIs. They are research objects. Once you control the serving surface, you stop asking only “is this model good?” and start asking “under what constraints does this model stay useful?”

That is a much more infra-native question.

2. phabitat: giving every agent its own computer

If pforge is about the model, phabitat is about the runtime around the model.

The core idea behind phabitat is: every agent should have its own persistent workspace-scoped computer. It shouldn’t just be a stateless API call or a disposable demo session. It should be a real environment with storage, logs, artifacts, and task continuity.

This matters because a lot of agent discourse still assumes that the model is the product. I think that view is incomplete. The useful unit is often the combination of model, runtime, permissions, workspace, memory, and inspection layer.

phabitat is a CLI for spinning up these isolated environments. A user can create a habitat, assign it a plain-English task, watch it work, inspect its outputs, and return later. The agent’s workspace persists. Its artifacts persist. Its event history persists.

Building this pushed me toward a stronger view: agent infra is really environment infra.

The hard part is not calling a model repeatedly. It’s giving the system durable state, bounded permissions, legible artifacts, and enough structure that a human can trust what happened.

Once you see that clearly, the market around “agents” starts to look less like a pure model story and more like a systems story.

3. pscope: understanding what a machine can realistically run

One underrated problem in open model adoption is basic fit.

People want to run open models locally, but they often don’t know what their machine can actually support. They guess. They overestimate. Or they spend hours installing things only to hit resource ceilings later.

pscope is a small tool, but it sits on an important question: what model will run best on this machine?

It scans a system and helps map hardware reality to model feasibility. That sounds operational, but it is also research-relevant. Hardware constraints shape what developers can build, test, and learn. They determine whether open models feel accessible or frustrating. They shape which parts of the ecosystem become broadly usable.

Working on pscope reinforced a simple belief. Infra adoption is often constrained less by raw model quality and more by setup friction plus hardware ambiguity.

In other words, discoverability of fit matters. A lot.

4. psplice: model surgery, steering, and live intervention

psplice is probably the most research-heavy of the four.

The goal is to make model intervention more practical. Load a model once, hold it in memory through a daemon, and then let the user perform operations like chatting, steering, and modifying behavior without reloading everything each time.

This tool sits closer to the layer of “how can I alter model behavior directly?” rather than “how can I wrap a product around it?”

That includes things like activation steering and head-level interventions. Even implementing the ergonomics of this forces you to confront real system details: attention implementations, VRAM persistence, daemon architecture, tensor assumptions, and the gap between a neat conceptual technique and a usable tool.

psplice taught me that model control is still early. Many ideas sound elegant in papers and rough in practice. But this is exactly why building matters. It reveals which interventions are robust, which are brittle, and which might eventually matter for real workflows.

What these tools are really for

On the surface, these are four separate open source projects.

Underneath, they are all attempts to answer the same research question: where is the real control surface in AI infra?

Is it the model weights? The serving stack? The environment around the agent? The hardware fit layer? The intervention interface?

My current view is that the answer is not a single layer. The leverage comes from understanding the handoffs between layers.

That is why I built these. I want to build enough of the stack myself that my research can be grounded in contact with the machinery.

That’s the standard I want for my Infra Startups research column. Research should not just summarize what others built. It should leave evidence that you have wrestled with the systems yourself.


Programmable Reasoning: My Experiments With Qwen

Prateek Joshi — Fri, 27 Mar 2026 16:15:22 GMT

The gap between a model’s theoretical capabilities and what you can actually deploy on constrained hardware is big. And this is where the real engineering happens.

We’re familiar with reasoning, but what does it take to make it programmable? And how do we make it accessible to anyone?

To find out, I recently tinkered with the Qwen model family on a single RTX 4090 (24GB VRAM). My goal was to do everything from scratch and build two specific primitives:

Inspecting the reasoning chain: Can we expose and parse the model’s internal “chain of thought”?
Rewiring the personality: Can we use rapid fine-tuning (flash-tuning) to fundamentally alter the model’s stylistic gravity on the fly?

I wanted to make it easily accessible via command line. Here is a breakdown of the architecture, the model quirks, and the engineering realities of building something like this.

Selecting the model

Qwen 3.5 models are highly capable, but they are multimodal under the hood. We need vLLM to serve these models. And currently, vLLM’s LoRA implementation is broken for multimodal models. This makes dynamic tuning impossible.

The Solution: I landed on Qwen3-1.7B. It’s a pure language model with a standard dense transformer architecture, it’s fully LoRA-compatible, and crucially, it supports a native “thinking mode”. It’s also small enough that you can focus on tinkering rather than worrying about system or memory issues.

Objective 1: Exposing the Chain of Thought

The first goal was to peek inside the model’s reasoning process. To do this, you have to pick the right weights.

To extract the internal logic, I configured vLLM with the --reasoning-parser qwen3 flag. This cleanly intercepts the chain of thought wrapped in the <think>...</think> tokens and exposes it as a distinct reasoning field in the streaming API delta. Instead of a black box, you get a real-time window into the model’s cognitive process before it outputs the final answer.
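Conceptually, a reasoning parser separates the thinking span from the final answer. The sketch below does this on a complete string in plain Python, assuming the model wraps its reasoning in <think> and </think> markers. It is an illustration of the idea only, not vLLM’s implementation, and it ignores the harder streaming case where a marker can arrive split across chunks.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) from raw model output."""
    start = text.find(OPEN_TAG)
    end = text.find(CLOSE_TAG)
    if start == -1 or end == -1 or end < start:
        # No well-formed thinking span: treat everything as the answer.
        return "", text.strip()
    reasoning = text[start + len(OPEN_TAG):end].strip()
    answer = (text[:start] + text[end + len(CLOSE_TAG):]).strip()
    return reasoning, answer


raw = "<think>\nParis is the capital of France.\n</think>\n\nThe capital is Paris."
reasoning, answer = split_reasoning(raw)
```

A CLI can then render the reasoning part dimmed and the answer part normally.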

Objective 2: Rewiring Personality via Flash-Tuning

With the reasoning engine exposed, the next goal was to see if I could rapidly bend the model’s persona to my will. I set up an end-to-end QLoRA pipeline to run a Quentin Tarantino style alignment experiment (and yes, I’m a huge fan of Quentin Tarantino).

The Setup: 5 training examples. 50 steps. Rank 8 adapter.

The Result: The loss plummeted from 4.6 to 0.12. And the model perfectly memorized the prompt formats, delivering highly stylized, rhythmic, and visceral responses for the training concepts.

However, when hit with a zero-shot, unseen prompt (“Describe London”), the LoRA broke down. The base pre-training dominated, and it reverted to a generic encyclopedia response. Five examples simply aren’t enough to generalize a stylistic syntax across the entire latent space.

The Fix: I injected a strong system prompt at inference time alongside the loaded adapter. The response to “Describe London” instantly locked in and shifted into a gritty, sensory scene: “London is a city that doesn’t lie. You walk down a street and someone walks up to you with a...”

The Lesson: A LoRA adapter successfully shifts the model’s default behavior, but the system prompt acts as the anchor, locking in the style for edge cases the adapter has never explicitly seen. You need both to reliably rewire personality.

Output

I ran it from the terminal. For the chain of thought experiment, I asked about Paris but didn’t get a good answer. And then I did what any good Tarantino fan would do. I summoned my inner Jules Winnfield (Pulp Fiction):

The slightly dimmed text at the top shows how the model “thinks” before answering the question. And then you can look at the bottom for the actual answer it outputted. Very interesting to see this live!

And then I wanted to rewire the model’s personality on the fly. This is what came out:

It’s fun to see this live in action.

The Infra Reality

Running both an inference server and a training pipeline on a single 24GB GPU requires strict, defensive orchestration.

A vLLM instance serving in bf16 eats ~18GB, and QLoRA needs another 12-14GB. Concurrent execution is not possible. To manage this, I built a FastAPI orchestrator that acts as a traffic controller:

VRAM Juggling: When a POST /train request hits, the orchestrator gracefully kills the vLLM subprocess, freeing the VRAM.
OOM Blast Shields: The training script (trainer.py) is never imported. It runs strictly as a subprocess. If a memory spike triggers the Linux OOM killer, it only takes down the trainer, leaving the API server alive to report the failure and restart inference.
Hot-Swapping: Once training completes, the orchestrator attempts a dynamic POST /v1/load_lora_adapter to vLLM. If that fails, it falls back to a hard restart with the new modules loaded.
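The isolation pattern in the second bullet can be sketched with the standard library alone. This is a simplified illustration, not the actual orchestrator: the function name and the returned fields are hypothetical, and the child commands below are stand-ins for a real trainer.py.

```python
import subprocess
import sys


def run_isolated(cmd: list, timeout_s: float = 60.0) -> dict:
    """Run a command in a child process and report how it ended.

    The parent survives whatever happens to the child, so it can
    report the failure and restart other work.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"ok": False, "returncode": None, "reason": "timeout"}
    if proc.returncode == 0:
        return {"ok": True, "returncode": 0, "reason": "completed"}
    # On POSIX, a negative return code means the child was killed by a signal
    # (for example -9 for SIGKILL, which is what the Linux OOM killer sends).
    reason = "killed by signal" if proc.returncode < 0 else "exited with error"
    return {"ok": False, "returncode": proc.returncode, "reason": reason}


# Stand-ins for a training run that succeeds and one that fails.
good = run_isolated([sys.executable, "-c", "print('trained')"])
bad = run_isolated([sys.executable, "-c", "import sys; sys.exit(3)"])
```

Importing the trainer into the server process would put both in the same blast radius. Running it as a subprocess keeps the failure contained to the child.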

Building for the Edge

Working with open-source models right now means navigating rapid library deprecations (like TRL silently renaming parameters between versions) and structural cloud limits (routing all virtual environments and pip caches to a persistent volume to avoid disk quota crashes).

But when the pipeline finally hums, it’s amazing to watch. You can watch a model “think” through a problem and then answer you in the exact voice you just injected into its weights a minute prior. Having programmable reasoning running purely on your own stack is incredibly powerful.


Sector Deep Dive #9: VERIFIABLE REASONING

Prateek Joshi — Sat, 10 Jan 2026 16:37:26 GMT

We’ve spent the last decade scaling probabilistic systems optimizing for plausibility. That’s fine for content, but what about correctness?

The moment an LLM transitions from suggesting code to executing mission-critical actions (changing infra code, editing identity management policies, authorizing a payment rail), stochasticity becomes a liability. “Usually correct” is not a shippable spec for autonomous action.

The next wave of infra will be defined by Verifiable Reasoning: the ability to translate intent into a formal specification and receive one of two things:

a mechanically checkable certificate (proof / witness) that a property holds
a counterexample that falsifies the property.

Isn’t this just “better evals”? Not exactly. It’s the reintroduction of invariants and proof artifacts as first-class objects in the software lifecycle.

1. Technical substrate: solvers as commoditized compute, specs as the scarce resource

The primitives are industrial-grade and battle-tested. They come from two lineages:

EDA and hardware verification (Synopsys / Cadence): where verification is not optional because failure is catastrophic and expensive.
Safety-critical software (AdaCore / SPARK): where specifications and proofs are part of how systems ship.

The engines are mature:

SAT/SMT solvers (e.g. Microsoft Z3, cvc5): constraint engines for decidable fragments of logic.
Model checkers (e.g. SPIN): exhaustive state exploration against temporal properties.
Proof assistants (Lean 4, Coq, Isabelle/HOL): interactive theorem proving with machine-checked proof terms.
Symbolic execution / static verification (e.g. KLEE): produce concrete counterexamples and coverage guarantees on program paths.

None of this is new. What’s new is the integration pattern.

Historically, formal verification was blocked by the translation tax. Humans cannot write TLA+, SystemVerilog assertions, or Lean propositions at the speed of modern engineering. And even when they can, proofs are brittle. They break under refactors, they require experts to maintain, and they don’t fit into CI.

The unlock is that LLMs are becoming the interface layer for formal methods. They are becoming the compiler.

They can do auto-formalization: converting natural-language intent into specs, invariants, and proof sketches. They can also accelerate the loop that matters most in practice:

spec → attempt proof → fail → counterexample → refine spec/code → retry

That is exactly how formal methods scale. Not by “proving everything” but by iterating quickly toward the properties you actually care about.

This is the shift from “formal methods as a PhD thesis” to “formal methods as a CI/CD artifact”.

2. Market architecture: three layers collapsing into a single correctness control plane

I view the stack as three investable layers. Each is real, but the compounding value happens when they interlock.

a) The legibility layer (governance + observability)

This is the audit trail: drift detection, attribution, documentation, explainability. Incumbents here include Fiddler, Arthur, IBM Watson OpenScale, and open-source toolkits.

This layer is necessary but insufficient. It tells you what happened. Mission-critical systems need guarantees about what cannot happen. Legibility becomes powerful only when it feeds the guarantee layer with structured constraints (and later consumes the proof artifacts).

b) The guarantee layer (verification)

This is where mathematics touches model behavior. You see two camps:

classical formal methods teams adapting into ML and autonomous systems (e.g. Galois, AbsInt-style approaches),
verification-first ML research stacks (e.g. VerifAI, solver-aided neural verification like Marabou, abstract-interpretation style bounds like DeepPoly).

The alpha is in neuro-symbolic bridges: systems that constrain a neural network’s behavior using bounded, checkable logic.

The canonical shape looks like:

preconditions on inputs (ranges, schemas, safety envelopes),
postconditions on outputs/actions (policy compliance, forbidden states),
and the proof obligation: for every input that satisfies the preconditions, the model’s output satisfies the postconditions.

In practice you don’t get full universality. You get bounded domains, approximations, and certificates with explicit assumptions. That’s fine. The point is that it leads to mechanical accountability.
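The shape described above (precondition, postcondition, bounded domain, counterexample or pass) can be illustrated with a brute-force checker in plain Python. This is a toy stand-in for a real solver such as Z3, and the payment policy, the grid of amounts and balances, and all names are invented for the example. A pass here only means the property holds on the enumerated domain, which is the explicit assumption attached to the result.

```python
from itertools import product


def check_bounded(policy, precondition, postcondition, domain):
    """Return the first counterexample input, or None if the property holds on the domain."""
    for x in domain:
        if precondition(x) and not postcondition(x, policy(x)):
            return x
    return None


# Inputs are (amount, balance) pairs on a small bounded grid.
DOMAIN = list(product(range(0, 201, 50), range(0, 201, 50)))


def precondition(x):
    amount, balance = x
    return amount >= 0 and balance >= 0


def postcondition(x, approved):
    """Forbidden state: an approved payment that exceeds the balance."""
    amount, balance = x
    return (not approved) or amount <= balance


def buggy_policy(x):
    amount, balance = x
    return amount <= balance + 100  # bug: silently allows an overdraft


def fixed_policy(x):
    amount, balance = x
    return amount <= balance


counterexample = check_bounded(buggy_policy, precondition, postcondition, DOMAIN)
verdict = check_bounded(fixed_policy, precondition, postcondition, DOMAIN)
```

The buggy policy yields a concrete failing input that can be fed back into the refine-and-retry loop, while the fixed policy passes on every point in the grid.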

c) The workflow layer (enterprise assurance / control plane)

This is where verification becomes a product, integrated into how software is shipped. Credo AI is an example of packaging governance and assurance into enterprise workflow. The wedge is whether the system can become a gate:

does it run inside GitHub Actions and block a merge?
does it gate deployment in managed ML stacks like Amazon Bedrock?
does it generate artifacts a risk committee or auditor can consume?

The winning products operationalize proofs the way devops operationalized deployments.

3. Competitive Landscape: The “Proof” Stack

a) AI Mathematicians + Automated Reasoning (The “Solver” Layer)

Symbolica — Building “structured intelligence” using category theory (categorical deep learning) rather than just transformers. Their flagship Agentica framework focuses on creating agents with provable correctness guarantees, effectively replacing “approximate” neural reasoning with algebraic structure.
Normal Computing — Building thermodynamic computers and software to solve probabilistic reasoning problems. Their stack focuses on “energy-based models” that can reason about uncertainty and correctness more natively than standard GPUs, targeting high-stakes auditable workflows.
Axiom Math — Building an AI mathematician / superintelligent reasoner as a wedge into formal reasoning + proof.
Harmonic — Building mathematical reasoning systems (Aristotle) to solve Mathematics Olympiad-level problems, serving as a proxy for “reasoning reliability”.
Imandra — “Reasoning-as-a-Service”. One of the few offering a cloud-accessible automated logic engine used by finance (Goldman Sachs) and defense to verify algorithms before they trade or shoot.

b) Formal Methods “Builders” (Verification Tooling)

Galois — The OG deep-tech services firm. Productizing cryptographic verification (SAW) and high-assurance tooling. They essentially operate as the R&D lab for the US government’s hardest verification problems.
Atlas Computing — A newer entrant explicitly focused on “AI-assisted formal verification” for critical infrastructure. Their thesis is using LLMs to write the specs that traditional formal methods tools (like Z3) verify.
Runtime Verification — Commercializes the K Framework. They are unique because they define semantics for languages (C, Java, EVM) to prove code adheres to spec.
TrustInSoft — “Mathematically guaranteed” bug-free code for C/C++. They use formal methods to prove the absence of undefined behaviors, selling to automotive/IoT.
BedRock Systems — Building a formally verified trusted computing base (Hypervisor/OS) to ensure critical systems cannot be bypassed.

c) Formal Verification (The “High Stakes” Sandbox)

Certora — “Proving code works with mathematical certainty”. They provide the Certora Prover, which allows smart contract developers to write invariants (CVL) and auto-check them.
Veridise — Spun out of UT Austin research. They use automated analysis (fuzzing + static analysis + formal verification) specifically for zero-knowledge circuits and smart contracts.
Invariant Labs — Now partnering with Snyk. Focused on agentic security via formal guarantees. They use a “security analyzer” that imposes hard constraints on agent actions, preventing state violations regardless of prompt injections.

d) ProofOps and Agent Assurance (The “Control Plane”)

Lakera — They act as a firewall for LLMs, preventing prompt injections and jailbreaks. Their database (Gandalf) is the industry standard for “how to break an LLM”, giving them a defensive moat.
Protect AI — Building “MLSecOps”. They acquire and aggregate tools (like Laiyer AI) to scan models for vulnerabilities, verify supply chain integrity (signing models), and firewall LLM inputs/outputs.
Gomboc.ai — “Deterministic Infra”. They use deterministic AI to remediate cloud infrastructure violations. Instead of just alerting, they generate the precise code fix that is mathematically guaranteed to solve the policy violation.
Credo AI — The governance layer. Less about mathematical proofs and more about “audit proofs”, i.e. generating the artifacts regulators need to sign off on a model.
HiddenLayer — Security for the model itself. They detect if someone is trying to steal the model weights or tamper with the inference process at runtime.
Robust Intelligence (Acquired by Cisco) — The “Antivirus for AI”. Automated red-teaming to find failure modes before deployment.

e) The “Bridge” Layer (Static Analysis + Neuro-symbolic)

Semgrep — “Policy as Code”. While not purely “formal”, their engine is the standard for deterministic code checks. Their new AI features allow developers to write natural language rules that compile into deterministic, greppable constraints.
Qodo (formerly Codium) — Focuses on “code integrity” for AI generation. They use a mix of static analysis and test generation to verify that the code an LLM writes actually runs and passes assertions.
Cleanlab — “Data Curation as Code”. They use confident learning (a statistical theory) to mathematically prove which labels in a dataset are incorrect, purifying the input layer for AI.

3. Distribution signal: incumbents are buying “permission to deploy”

The most important market signal is M&A by platforms that already own deployment surfaces.

Snowflake → TruEra: data platforms need model reasoning and validation to keep high-value workloads on-platform.
Cisco → Robust Intelligence and F5 → CalypsoAI: infra incumbents are acquiring assurance capability as a layer that ships everywhere their stack ships.
The velocity of HiddenLayer and the emergence of vendors like Lakera show this moving from research to procurement: a “trust/correctness” budget line forming inside enterprise AI rollouts.

This is not “AI safety” per se. This is the same playbook as observability and security. Once the control becomes mandatory, it gets bundled.

4. The investable opportunity: Infra for mathematical correctness

My variant perception is that value will not accrue to the solver. Solvers commoditize. The defensibility is in infra that operationalizes the work of proving correctness: something that embeds specifications and proof artifacts inside software delivery.

The core product primitives look like:

specification authoring + versioning (spec diffs are as important as code diffs)
incremental verification (proofs that survive refactors via modularity / compositional reasoning)
counterexample triage (turn solver outputs into developer-actionable bug reports)
“proof coverage” metrics (what properties are guaranteed, under what assumptions)
deployment gates + regression proofs (the proof becomes part of the release artifact).

If you can make proofs cheap enough to live in the git push loop, entire classes of autonomous software become viable.
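As a rough illustration of the "deployment gates + regression proofs" primitive, here is a minimal sketch. It assumes a proof artifact is just a record that pins a verification result to a hash of the spec and a hash of the code. The names `ProofArtifact`, `digest`, and `release_allowed` are hypothetical and do not correspond to any vendor's API, and the prover itself is assumed to run elsewhere.

```python
import hashlib
from dataclasses import dataclass


def digest(text: str) -> str:
    """Content hash used to pin a proof to exact spec and code versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProofArtifact:
    spec_hash: str   # hash of the specification the proof was checked against
    code_hash: str   # hash of the code the proof was checked against
    verified: bool   # result reported by the (external, assumed) prover


def release_allowed(spec: str, code: str, artifact: ProofArtifact) -> bool:
    """Gate a release: the proof must be valid and match the current spec and code."""
    return (
        artifact.verified
        and artifact.spec_hash == digest(spec)
        and artifact.code_hash == digest(code)
    )


# Hypothetical spec and code, used only to exercise the gate.
spec_v1 = "balance >= 0"
code_v1 = "def withdraw(balance, amount): return max(balance - amount, 0)"
artifact = ProofArtifact(digest(spec_v1), digest(code_v1), verified=True)

print(release_allowed(spec_v1, code_v1, artifact))              # proof matches
print(release_allowed(spec_v1, code_v1 + "  # edit", artifact))  # stale proof
```

The point of the design is that any change to either the spec or the code invalidates the old artifact, so a fresh proof has to ship with every release.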

Where this unlocks mission-critical software:

Infrastructure / agents: the primitive is state-machine verification. If agents are deployed through Kubernetes and toolchains, you need invariants over tool usage (“never delete database X unless condition Y is met”). TLA+ style thinking, but productized.
Finance: the primitive is constraint satisfaction (monotonicity, fairness constraints, bounded risk policies). The killer product auto-generates compliance-ready proof artifacts tied to model and data lineage.
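To make the state-machine primitive concrete, here is a toy sketch of the "never delete database X unless condition Y is met" invariant. It assumes condition Y is "a backup exists" and models the world as two booleans. The checker explores every reachable state exhaustively, which is the TLA+-style idea in miniature. All names here are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, NamedTuple


class State(NamedTuple):
    db_exists: bool
    backed_up: bool


def invariant(s: State) -> bool:
    """The database is never gone unless a backup was taken first."""
    return s.db_exists or s.backed_up


def guarded_actions(s: State) -> Iterable[State]:
    yield s._replace(backed_up=True)          # tool: take a backup
    if s.backed_up:                           # guard on the delete tool
        yield s._replace(db_exists=False)     # tool: delete the database


def unguarded_actions(s: State) -> Iterable[State]:
    yield s._replace(backed_up=True)
    yield s._replace(db_exists=False)         # delete with no precondition


def check(initial: State, step: Callable[[State], Iterable[State]]) -> bool:
    """Exhaustively explore reachable states and test the invariant in each."""
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            return False
        for nxt in step(s):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


start = State(db_exists=True, backed_up=False)
print(check(start, guarded_actions))    # invariant holds in every reachable state
print(check(start, unguarded_actions))  # a violating state is reachable
```

Real agent toolchains have far larger state spaces, which is why the productized version needs abstraction and compositional reasoning rather than brute-force enumeration.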

Conclusion

Mathematics is becoming infrastructure. The opportunity here is building infra that lets enterprises ship autonomous systems with explicit guarantees, auditable assumptions, and mechanically checkable artifacts.

When correctness becomes a deployable artifact (when a proof is something you can diff, version, and regression-test), we can expand the frontier of what software can be trusted to do. That’s where the alpha is.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:

Subscribe now

Sector Deep Dive #8: POST-TRAINING INFRA

Prateek Joshi — Sun, 28 Dec 2025 11:42:44 GMT

1. What exactly is “post-training infra”?

Post-training infra is the tooling layer that helps teams improve model/agent behavior after a foundation model exists. It does so through a mix of supervised fine-tuning (SFT), preference tuning / RLHF-style methods, prompt/tool changes, guardrails, eval suites, and continuous monitoring in production.

As LLMs move into business-critical workflows, the bottleneck is no longer “can we run a model?” but “can we keep it correct, safe, and cost-bounded as the world changes?”. This requires an iterative loop as opposed to a one-off training job.

A useful mental model: pretraining gives you general capabilities whereas post-training infra turns those capabilities into reliable, auditable, domain-specific behavior.

2. The post-training loop: the actual workflow enterprises are building

Across most teams, the loop looks like:

Instrument: capture prompts, tool calls, retrieval context, outputs, latency/cost, and user feedback.
Evaluate: run offline test suites + online canaries, measure task success (not just BLEU-like metrics).
Diagnose: identify failure modes e.g. hallucinations, refusal errors, prompt injection, tool misuse, drift.
Patch quickly: prompt changes, tool routing, guardrails/validators, retrieval fixes.
Escalate selectively: when patches aren’t enough, do targeted fine-tuning / preference tuning on high-value tasks.
Deploy + monitor: watch regressions, cost blowups, safety issues, repeat.
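The Evaluate and Deploy steps of the loop above can be sketched in a few lines. This is a simplified illustration, assuming exact-match grading against a small offline suite. The two agents are stand-in functions rather than real models, and `task_success_rate` and `should_deploy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    expected: str


def task_success_rate(agent: Callable[[str], str], suite: List[Case]) -> float:
    """Evaluate: fraction of cases where the agent's output matches ground truth."""
    passed = sum(1 for c in suite if agent(c.prompt).strip() == c.expected)
    return passed / len(suite)


def should_deploy(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Deploy gate: block any candidate that regresses beyond the tolerance."""
    return candidate >= baseline - tolerance


suite = [Case("2+2", "4"), Case("capital of France", "Paris"), Case("3*3", "9")]


def baseline_agent(prompt: str) -> str:      # stand-in for the current system
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def patched_agent(prompt: str) -> str:       # stand-in for a prompt/tool patch
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "unknown")


base = task_success_rate(baseline_agent, suite)
cand = task_success_rate(patched_agent, suite)
print(should_deploy(cand, base))
```

In practice the suite, the scores, and the deploy decision would all be versioned together, which is the "connected" property the next paragraph argues for.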

Why infra matters: each stage creates data and decision points that need to be versioned, reproducible, and connected. There is a growing trend toward “unified stacks” rather than disconnected tools e.g. TensorZero pitching gateway + observability + eval + optimization in one.

3. Demand drivers over the next 24 months

Three forces look durable through 2026:

i) Enterprise adoption is rising faster than “enterprise hardening”
A large percentage of orgs are regularly using LLMs in at least one business function. But reliability poses a big challenge. This gap (“we deployed something” vs “it’s reliable and governed”) is exactly where post-training infra sells.

ii) The world is moving to task-specific models, which increases tuning + evaluation needs
Gartner predicts that by 2027, more than 50% of AI models enterprises use will be industry/function-specific. Domain specificity does not come for free: it has to be built through data pipelines, eval harnesses, and fine-tuning.

iii) Governance is becoming non-optional
There’s an increasing demand for monitoring, eval evidence, audit trails, and policy enforcement. Classic infra value props.

4. Subsector map: where startups cluster and who’s pulling ahead

There are six clusters with heavy convergence between them:

i) Agent/app orchestration frameworks (the “runtime” layer)

LangChain is the canonical open-source entry point. They recently raised $125M at a $1.25B valuation.
LlamaIndex positions around “knowledge agents” and enterprise data interfaces.

These frameworks become post-training companies when they add: tracing, eval harnesses, prompt/versioning, and feedback loops.

ii) Evals + testing (the “unit tests” for AI behavior)

Braintrust explicitly focuses on evals and “AI software engineering”. It announced a $36M Series A in Oct 2024.

iii) Observability + monitoring (production truth, regressions, drift)

Arize is a leading independent vendor. They announced a $70M Series C on 2025-02-20 focused on evaluation and observability for LLMs/agents.
Datadog launching LLM Observability is important because it signals bundling pressure from “classic observability” into AI stacks.

iv) Guardrails + policy enforcement (safety + reliability controls)

Guardrails AI raised a $7.5M seed and built a hub/wrapper approach.

v) Fine-tuning + preference optimization tooling (make models yours)

Lamini raised $25M for an enterprise AI platform.

vi) Closed-loop optimization stacks (unifying gateway + eval + optimization)

TensorZero announced a $7.3M seed to build an open-source stack unifying gateway/observability/optimization/evals.

This is a strong signal of where the market is going: fewer dashboards, more continuous improvement pipelines.

5. Early-stage venture opportunity: where the market is still “unsolved”

The best pockets are areas where the stack is still missing a reliable primitive. Here are 4 areas where it might work:

Outcome-based evaluation (beyond LLM-as-judge)
Enterprises care about “did the agent complete the workflow correctly?” as opposed to “did it look fluent?”. But the big challenge is instrumenting ground truth from business systems (CRM, ticketing, payments) and then turning it into automated evals. Startups that own this interface can become system-of-record for AI quality.
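A minimal sketch of what "instrumenting ground truth from business systems" could look like, assuming the agent logs which ticket each run touched and the ticketing system later exports a final status. The records, field names, and the rule that "closed" means success are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical agent run log: which ticket each run handled and what it claimed.
agent_runs: List[Dict[str, str]] = [
    {"run_id": "r1", "ticket_id": "T-100", "claimed": "resolved"},
    {"run_id": "r2", "ticket_id": "T-101", "claimed": "resolved"},
    {"run_id": "r3", "ticket_id": "T-102", "claimed": "resolved"},
]

# Hypothetical export from the ticketing system, observed after the fact.
ticket_outcomes: Dict[str, str] = {
    "T-100": "closed",
    "T-101": "reopened",   # customer came back: the agent's fix did not hold
}


def grade(run: Dict[str, str], outcomes: Dict[str, str]) -> Optional[bool]:
    """Return True/False from ground truth, or None when no outcome exists yet."""
    status = outcomes.get(run["ticket_id"])
    if status is None:
        return None
    return status == "closed"


labels = {run["run_id"]: grade(run, ticket_outcomes) for run in agent_runs}
graded = [v for v in labels.values() if v is not None]
success_rate = sum(graded) / len(graded)
print(labels, success_rate)
```

Note that all three runs claimed success. Only the join against the system of record separates the run that held from the one that did not, and leaves the third ungraded until an outcome arrives.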

Continuous learning for agents (safe retraining loops)
A lot of teams want self-improving agents, but they don’t trust the loop. The winning wedge is: gated data collection + audit trails + rollbacks + sandboxed deployments. This could be the next evolution after basic orchestration.

Governance + compliance automation as product
Regulation is pushing companies to document risk controls, testing, and monitoring. The infra opportunity is software that continuously produces compliance evidence (test coverage, incident trails, red-team results) as a byproduct of normal operation.

Data flywheels for post-training (high-quality feedback at scale)
Post-training quality is gated by data. Partnerships like Anthropic’s use of Surge AI’s RLHF platform illustrate the demand for scalable human feedback + QC systems. Startups that productize “feedback ops” (tools, QC, workforce routing, privacy) can be critical picks-and-shovels.

6. Business models and why pricing power is tricky

There is real revenue traction in agent building platforms. A simple derived check on how big a single customer can be under seat pricing:

If a company has 1,600 employees and pays $40–$50 per user per month, that’s:
- Monthly: 1,600 × $40 = $64,000 on the low end and 1,600 × $50 = $80,000 on the high end
- Annual: $768,000 to $960,000
This is attractive ARPA if adoption is broad and renewals hold.
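The same seat-pricing check, written out so the inputs can be swapped for other headcounts or price points. The numbers are the ones used in the text above.

```python
employees = 1_600
price_low, price_high = 40, 50        # USD per user per month, from the text

monthly_low = employees * price_low
monthly_high = employees * price_high
annual_low, annual_high = monthly_low * 12, monthly_high * 12

print(f"Monthly: ${monthly_low:,} to ${monthly_high:,}")
print(f"Annual:  ${annual_low:,} to ${annual_high:,}")
```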

But pricing power faces two structural headwinds:

Bundling by incumbents (Datadog, cloud providers, model providers) squeezes standalone point tools.
Open source defaults (LangChain, Guardrails) force vendors to monetize via enterprise controls: SSO, RBAC, audit logs, data residency, eval governance, and support.

The likely “winning” monetization pattern is: open-core adoption → paid control plane + collaboration + compliance → usage-based expansion on monitoring/optimization.

7. How does it affect other infra subsectors?

Post-training infra doesn’t live in a vacuum. It reshapes the broader AI infra stack. Here are the most important dependencies and second-order effects:

a) Serving + inference infra becomes more valuable when evaluation loops are tight.
If teams are constantly iterating (new prompts, new adapters, new routing), they need fast, cheap experimentation environments. That pulls demand toward inference/serving startups that support canaries, model routing, and cost observability. Correlation: more eval + experimentation → more switching between models → more value in routing + caching + cost controls.
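A small sketch of the routing + cost-control idea, assuming a hash-based canary split so that the same request id always lands on the same variant. The variant names, per-token prices, and token counts are made-up placeholders, not real pricing.

```python
import hashlib
from collections import defaultdict


def route(request_id: str, canary_fraction: float) -> str:
    """Send a stable slice of traffic to the canary by hashing the request id."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"


# Assumed per-1K-token prices for two hypothetical model variants.
PRICE_PER_1K_TOKENS = {"stable": 0.50, "canary": 0.20}
spend = defaultdict(float)

for i in range(1_000):
    variant = route(f"req-{i}", canary_fraction=0.10)
    tokens = 800                                   # assumed tokens per request
    spend[variant] += tokens / 1_000 * PRICE_PER_1K_TOKENS[variant]

print(dict(spend))
```

Hashing rather than random sampling keeps assignments reproducible, which matters when eval results need to be tied back to the exact variant that served each request.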

b) Data infra and security vendors get pulled into the loop
Post-training requires logging prompts and outputs, which often contain sensitive data. That creates direct dependencies on:

data loss prevention / redaction
secure storage + retention policies
access controls and audit trails
synthetic data or privacy-preserving feedback.

Regulatory pressure amplifies this because governance becomes an operational requirement.
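As a rough example of the redaction dependency listed above, here is a minimal sketch that masks email addresses and card-like digit runs before a prompt is written to logs. The two patterns are deliberately simple assumptions. A production DLP layer would cover many more identifier types and validate its matches.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Mask obvious identifiers before a prompt or output is written to logs."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


print(redact("Refund jane.doe@example.com on card 4111 1111 1111 1111 today"))
# prints: Refund [EMAIL] on card [CARD] today
```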

c) Observability incumbents will “tax” the ecosystem
Datadog’s LLM Observability is a bundling signal: classic observability vendors can package AI monitoring into existing procurement, reducing budget for startups unless they are clearly better on model-specific workflows. Risk for startups is that feature parity arrives quickly (basic tracing, prompt logs, cost dashboards). Differentiation must move up the stack with actionable evals, automated fixes, and governance automation.

d) Model providers shape the ceiling
As frontier models improve, some failure modes disappear. But enterprises still need proofs, cost controls, and domain specificity. The base model progress shifts spend from “make it work at all” to “make it work reliably and cheaply”.

e) Consolidation is real (platform gravity)
CoreWeave’s acquisition of Weights & Biases shows infra providers moving upstack to own the developer workflow end-to-end (train/tune/evaluate/deploy). This creates a dependency risk: early-stage tools that don’t become a platform primitive may be acquired, copied, or squeezed.

A practical estimate for “how much of AI infra gets affected”:

if you define AI infra startups as serving one of six layers (compute, model serving, data, orchestration/devtools, observability/safety, and security/governance), then post-training infra directly overlaps orchestration + observability/safety + governance, and partially overlaps serving + data.

That’s 3–5 of 6 layers touched. The exact “portion” depends on how you bucket companies, but the direction is clear: post-training loops become a central integration point that many infra startups either plug into or compete with.

8. What to watch through 2026

Catalysts (positive for the sector)

More agent deployments → more need for continuous improvement. A large chunk of agentic AI projects may be canceled by end of 2027 due to costs/value/risk controls. Ironically, that is a tailwind for post-training infra that reduces those risks.
Regulatory timelines hit operational reality. Procurement starts demanding audit evidence.
Platform consolidation continues. More “W&B-style” moves by clouds, devtool incumbents, and observability platforms.

Failure modes (what breaks the bull case)

Bundling crushes standalone tools before they reach scale (especially basic eval/monitoring features).
“Good enough” models reduce willingness to fine-tune, pushing spend to prompting + retrieval. Fine-tuning platforms must show clear ROI.
Data/legal incidents (leaks, IP disputes, privacy failures) slow deployments and raise compliance friction. This can either stall budgets or redirect them to governance-heavy vendors.

9. What’s the opportunity?

The investable center of gravity is shifting from “training pipelines” to behavior pipelines. These are systems that continuously measure, correct, and harden model/agent behavior in production. The arc is: start with developer adoption, then climb into enterprise workflows by owning the feedback loop.

For early-stage venture, the best opportunities are the primitives that remain hard even as models improve:

outcome-grounded evaluation
safe continuous learning loops
governance evidence automation
feedback/data ops at scale
cost + reliability control planes across many models

The good thing is that the question “does post-training matter?” has been answered. It matters a lot! But who captures the value? Independent startups or bundled incumbents?

The next 24 months will likely reward teams that (1) become deeply embedded in production workflows and (2) generate proprietary signals (eval outcomes, failure taxonomies, policy decisions) that compound into a defensible moat.

Sector Deep Dive #7: AI MATHEMATICIAN

Prateek Joshi — Tue, 09 Dec 2025 16:09:13 GMT

1. Snapshot: What Are “AI Mathematician Products” in 2025?

AI mathematician products sit at the intersection of advanced LLMs, formal methods, and specialized tooling for mathematics, verification, and research. In the last 18–24 months, multiple systems have hit International Mathematical Olympiad (IMO) gold-medal performance, Putnam-level scores, and near-saturation of benchmarks like MATH and MiniF2F. The core thesis:

If a model can reliably do mathematics and formal proofs, it’s a proxy for general trustworthy reasoning.

This ecosystem now spans:

Big Tech “System-2” engines (OpenAI, Google DeepMind, Anthropic, xAI, Microsoft, Meta, Alibaba’s Qwen team).
Formal-verification startups (Harmonic, Axiom Math, Logical Intelligence, Symbolica AI).
Open-source reasoning models (DeepSeek-R1, DeepSeek-Math-V2, DeepSeek-Math, DeepSeek V3, Qwen-2.5-Math in 1.5B/7B/72B sizes, QwQ-32B, NuminaMath, NuminaMath dataset, Llama 3.1).
Infra, tools, and proof languages (Lean, mathlib, Coq, Isabelle, HOL Light, MiniF2F, IMO-ProofBench / ProofBench, HOList, GPT-f).
Consumer and education solvers (Photomath, WolframAlpha, Mathway).
New agentic stacks (Math Inc with its Gauss agent and the strongpnt formalization repo).

The current phase is R&D + early pilots, not scaled revenue. But the tech is clearly real and improving fast.

2. Two Axes: Consumer Solvers vs Formal Verifiers, LLM vs Tools

2.1 Product categories

Consumer / Homework Solvers
- General reasoning models like o1 / o1-pro (OpenAI), Gemini 1.5 / Gemini 2.0 / Gemini Deep Think (Google DeepMind), Claude 3.7 and Claude 3.7 Sonnet (Anthropic), Grok 3 (xAI), and Llama 3.1 (Meta Llama line) now ship “Think” or extended-reasoning modes.
- Camera-first apps: Photomath (Google) and Mathway (Chegg) handle K-12 to early college, often backed by engines like Gemini or other LLMs.
- WolframAlpha remains the OG symbolic engine, increasingly paired with LLM chat front-ends (e.g. ChatGPT + Wolfram).
Research / Formal Verifier Systems
- Harmonic (with its Aristotle model and Mathematical Superintelligence (MSI) vision) outputs Lean 4 proofs.
- Axiom Math targets AI that generates new conjectures and proves them in Lean or Coq.
- Logical Intelligence builds language-free Energy-Based Models (EBMs) and agents like Aleph and Noa to convert code into formal statements/proofs.
- DeepSeek-Math-V2 and DeepSeek-Math (plus DeepSeek-R1 and DeepSeek V3) occupy the open-source reasoning tier with very strong math performance.
- Symbolica AI takes a neuro-symbolic, non-Transformer approach to structured logic.
Open-source reasoning specialists
- DeepSeek-R1 and DeepSeek-Math-V2: RL-trained reasoning models with self-verification and very long test-time “internal monologue”.
- Qwen-2.5-Math family (1.5B, 7B, 72B) and QwQ-32B: Alibaba’s math-specialized suite. QwQ-32B is a 32B “System-2” reasoner. Qwen-2.5-Math-72B is a powerhouse solver. 1.5B and 7B variants are laptop-friendly.
- NuminaMath (models and NuminaMath dataset) emphasize data quality. Competition-grade problems with Chain-of-Thought plus Tool-Integrated Reasoning (TIR) via Python + sympy.
- Benchmarks and training infra rely heavily on MiniF2F, MATH, AIME, AI Math Olympiad (AIMO), IMO-ProofBench / ProofBench, Putnam datasets, and formal repos like mathlib in Lean.
Math Inc and agentic stacks
- Math Inc (math.inc) runs the Gauss agent: a multi-tool reasoning loop over Lean, Python, and external math tools.
- Their open-source strongpnt repo holds a Lean formalization of the strong Prime Number Theorem and acts as a public testbed for AI mathematicians.

2.2 Architectural patterns

System-2 / inference-time compute
Models like DeepSeek-R1, QwQ-32B, o1 / o1-pro, Grok 3, Gemini Deep Think, Claude 3.7 Sonnet (extended thinking) and Llama 3.1 405B run long “hidden chain-of-thought” trajectories, sampling multiple reasoning paths before emitting an answer.
Tool-Integrated Reasoning (TIR)
Systems such as NuminaMath, Qwen-2.5-Math-72B, Qwen-2.5-Math-7B, Qwen-2.5-Math-1.5B, Claude 3.7, Gemini, GPT-4, GPT-4o, GPT-5, Grok 3, ChatGPT and hybrid stacks (ChatGPT+WolframAlpha) explicitly generate Python (sympy, numpy) or call external tools to compute integrals, solve equations, or run simulations.
Formal proof generation
Harmonic’s Aristotle, AlphaProof (Google), Logical Intelligence’s Aleph / Noa, Math Inc’s Gauss, and future offerings from Axiom Math produce Lean 4, Lean, or Coq scripts that are checked by proof assistants like Lean, Coq, Isabelle, HOL Light, built on libraries such as mathlib.
Alternative architectures
Logical Intelligence pushes energy-based models (EBMs) rather than token LLMs. Symbolica AI explores neuro-symbolic non-backprop architectures. Meta’s HOList and DeepMind’s AlphaProof mix search with learned guidance.
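To show what "checked by proof assistants" means in the formal proof generation pattern above, here is a tiny Lean 4 example. It is only an illustration of the artifact type and is not output from any of the systems named. The theorem name is arbitrary.

```lean
-- A claim stated formally; Lean's kernel checks the proof term mechanically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete instance the kernel can check by computation.
example : 2 + 3 = 3 + 2 := rfl
```

If the proof term were wrong, the file would simply fail to compile. That pass/fail property is what lets a generated proof be trusted without trusting the model that wrote it.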

3. Key Companies and Products

3.1 Big Tech engines

Google DeepMind
- Research models: the Gemini suite of models offers strong capabilities.
- Formal backend: AlphaProof (neuro-symbolic Lean prover).
- Owns Photomath and integrates math into Google Workspace, Bard, Google Cloud AI and potentially GCP.
- Historically repurposed AlphaGo, AlphaZero, and AlphaFold techniques. AlphaProof follows that lineage.
OpenAI
- Models: the GPT-5 series of models is capable at reasoning.
- Approach: massive inference-time compute (many parallel trajectories) plus tools and formal integrations.
Anthropic
- Claude 4.5 series, featuring Opus 4.5 (most capable for complex tasks) and Sonnet 4.5 (strong reasoning, efficient for agents).
xAI
- Grok 3 on X/Twitter: reasoning model with “Think” mode plus live access to X data.
Alibaba / Qwen team
- Qwen-2.5-Math (1.5B, 7B, 72B) and QwQ-32B: arguably the most versatile open math model family, spanning laptop-friendly to 72B-scale.

3.2 Formal-verification startups

Harmonic
- Product: Aristotle, marketed as Mathematical Superintelligence (MSI).
- Architecture: trains entirely on synthetic Lean proofs. Outputs Lean 4 code, checked mechanically, targeting zero hallucinations.
- Benchmarks: gold-level IMO, ~90% on MiniF2F, strong scores on IMO-ProofBench / ProofBench.
- API: free Aristotle Lean API to seed adoption. Roadmap towards safety-critical software (aerospace, automotive, trading, crypto).
Axiom Math
- Mission: AI that not only solves problems but proposes new conjectures and proves them (Lean/Coq).
- Backed by a strong team including Carina Hong and Ken Ono, with ex-Meta FAIR folks like François Charton.
- Targets: cryptography, algorithms, physics, finance. Wants a self-improving AI mathematician at AGI scale.
Logical Intelligence
- Products/agents: Aleph (formal proof), Noa (bug finding).
- Architecture: language-free EBMs, reasoning in continuous state space rather than tokens.
- Benchmarks: ~76% on a Putnam benchmark. Pilot work in crypto, national infrastructure, high-assurance systems.
Symbolica AI
- Pitch: neuro-symbolic reasoning without standard backprop. Structured algebraic representations rather than pure token streams.
- Still early/stealth, but positioned as a deep-tech alternative to Transformers.

3.3 Open-source and “people’s champion” models

DeepSeek
- Models: DeepSeek-R1 (RL “System-2”), DeepSeek-Math, DeepSeek-Math-V2, DeepSeek V3, and DeepSeek-Math-V2 Heavy.
- Training: ~500B tokens of math/code/science plus RL methods like Math-Shepherd.
- Benchmarks: gold IMO, near-perfect Putnam (~118/120), top scores on AIME, MATH, and ProofBench.
- Strategy: fully open weights on Hugging Face and GitHub; extremely large (up to 685B params), aiming to be the “open GPT-5 for math”.
NuminaMath
- Assets: NuminaMath dataset (~1M competition-style problems with CoT and TIR annotations) and NuminaMath models (often on DeepSeek/Qwen backbones).
- Strength: Tool-Integrated Reasoning via Python + sympy, explicitly solving symbolic math rather than hallucinating.
Qwen-2.5-Math and QwQ
- Qwen-2.5-Math-72B is a top classical solver. Qwen-2.5-Math-7B and Qwen-2.5-Math-1.5B bring high math quality to commodity hardware.
- QwQ-32B is a medium-sized but very strong reasoning engine for logic puzzles and proofs.
Math Inc
- Agent: Gauss, orchestrating Lean + Python + external toolcalls in multi-step loops. Fits squarely in the agentic TIR camp.
- Repo: strongpnt (GitHub), a Lean formalization of the strong Prime Number Theorem. Acts as a shared testbed for AI mathematicians.
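Tool-Integrated Reasoning, mentioned above for NuminaMath and Gauss, boils down to the model emitting a program and a harness executing it. The sketch below is a toy version under two stated assumptions: the "model" is a hard-coded stand-in, and the standard library's `fractions` module is swapped in for sympy so the example stays dependency-free. `fake_model` and `run_tool` are hypothetical names.

```python
from fractions import Fraction


def fake_model(question: str) -> str:
    """Stand-in for an LLM: returns a program instead of guessing the answer."""
    return "result = sum(Fraction(1, k * (k + 1)) for k in range(1, 100))"


def run_tool(program: str) -> Fraction:
    """Execute the emitted program with only exact-arithmetic tools exposed."""
    namespace = {"__builtins__": {"sum": sum, "range": range}, "Fraction": Fraction}
    exec(program, namespace)
    return namespace["result"]


question = "What is the sum of 1/(k(k+1)) for k = 1..99?"
answer = run_tool(fake_model(question))
print(answer)  # prints: 99/100
```

The answer comes from exact computation rather than token prediction, which is the whole appeal of TIR. Trimming the builtins here is only a convenience and not a security boundary. Real systems run emitted code inside a proper sandbox.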

3.4 Consumer and education tools

Photomath (Google): camera-based solver, now backed by Gemini for better OCR and reasoning.
Mathway (Chegg): algebra/calculus homework assistant.
WolframAlpha: symbolic compute engine. Modern twist is tight integration with LLMs like ChatGPT and Claude, where Wolfram does the mathematics and the LLM handles chat/UX.

These products are where most students and non-experts first see “AI mathematics” in the wild.

4. Product Stack: From Models to Platforms

4.1 Foundation models and proof assistants

The core stack looks like:

LLM / Reasoning engine: e.g. Gemini Deep Think, GPT-5, Claude 4.5, Grok 3, DeepSeek-Math-V2, Qwen-2.5-Math-72B, NuminaMath models, Llama 3.1, Axiom Math’s internal models, Logical Intelligence’s EBMs, Symbolica AI’s models.
Proof assistant: Lean / Lean 4, Coq, Isabelle, HOL Light, plus libraries like mathlib.
Benchmarks: MiniF2F, MATH, AIME, AI Math Olympiad (AIMO), Putnam, IMO-ProofBench / ProofBench, and formalization targets like strongpnt.
Training infrastructure: research environments like HOList, older GPT-f, RL frameworks, and synthetic-data pipelines like Math-Shepherd.

4.2 Platform and integration surface

Most companies aim to evolve from “model API” to a platform:

Harmonic is turning Aristotle into an API + IDE plugin for Lean, with future integration into CI/CD and tooling (GitHub / GitLab, devops pipelines) to automatically verify critical properties.
Logical Intelligence is on track to ship a general model by 2026, targeting vertical deployments in crypto, power grids, defense, and other high-assurance systems.
Axiom Math envisions a research co-pilot that reads textbooks/PDFs, autoformalizes them into Lean or Coq, then explores “what if” conjectures.
Math Inc’s Gauss plus strongpnt is an early example of an agent + formalization loop targeted at a specific branch (analytic number theory).
DeepSeek-Math-V2, Qwen-2.5-Math and NuminaMath are being wrapped by the open-source community into local assistants, IDE extensions, and research tools.

On the infrastructure side, cloud players like AWS, Azure, GCP, and possibly IBM can easily provide specialized “reasoning clouds”. Analogies can be drawn to Synopsys / Cadence (hardware verification) and Adobe for vertical, high-value software.

5. Market, GTM, Monetization, and Unit Economics

5.1 Market framing

Near-term wedges:
- AI code tools (today’s GitHub Copilot, DeepCode, etc.) plus formal verification: ~$26B AI code tools TAM by 2030, with formal verification currently a $400M niche but attached to a $55B software testing/QA market.
- Crypto and DeFi: billions lost in contract bugs → strong ROI for tools like Aleph, Noa, Aristotle, Gauss, DeepSeek-Math-V2, Qwen-2.5-Math-72B, NuminaMath.
- Safety-critical software in aerospace, automotive, defense, power grids, and national infrastructure.
Mid-term:
- AI mathematicians as R&D amplifiers for quant firms (e.g. Renaissance, Two Sigma) and research labs (e.g. analogies to Isomorphic Labs and Insilico Medicine in drug discovery).
Long-term:
- “AI reasoning cloud” as standard infra, akin to OpenCV, TensorFlow, or Stable Diffusion in their domains. Companies like Harmonic, Axiom Math, Logical Intelligence, DeepSeek, Math Inc, Symbolica AI, plus big labs (OpenAI, Google DeepMind, Anthropic, xAI, Meta, Alibaba/Qwen) compete for that role.

5.2 GTM patterns

Harmonic: free Aristotle API for community + top-down pilots in aerospace, automotive, finance, crypto, national security.
Axiom Math: academia-heavy GTM (talks, conferences, publications) plus early design-partner engagements in trading, chip design, cryptography.
Logical Intelligence: crypto audits and government/national-infrastructure pilots; narrative heavily about moving “beyond LLMs” with EBMs.
DeepSeek: open-source adoption via Hugging Face/GitHub. Focus on mindshare rather than immediate revenue.
Math Inc: dev-first reach with Gauss and strongpnt as open assets that others can build on.

5.3 Monetization and unit economics

Likely models:
- Enterprise licenses (on-prem or VPC deployments of Aristotle, Aleph, Axiom models, Gauss-style agents).
- Cloud APIs with consumption-based pricing (per proof, per reasoning hour).
- Consulting / verification-as-a-service (e.g. Logical Intelligence auditing smart contracts with Aleph/Noa, Harmonic verifying autopilot code).
- Government contracts / grants in defense, aerospace, and infrastructure.
Today, economics are compute-heavy and negative margin: 685B-param models (DeepSeek-Math-V2 Heavy) and long test-time reasoning (GPT-5.1, Gemini 3) can cost hundreds or thousands of GPU-hours per hard problem.
Over time, distillation (e.g. from DeepSeek-Math-V2 Heavy to smaller derived models), better search (as in Math-Shepherd, AlphaProof), EBMs (Logical Intelligence), and caching will reduce inference cost and improve unit economics.

6. Moats, Risks, and Scenarios

6.1 Moats

Technical / IP:
- Synthetic proof pipelines (Harmonic’s Aristotle), EBM architectures (Logical Intelligence), neuro-symbolic designs (Symbolica AI), RL methods like Math-Shepherd (DeepSeek), and agent stacks like Gauss are non-trivial to replicate.
Data and corpora:
- Massive internal datasets (DeepSeek’s 500B-token corpus, NuminaMath dataset, Lean mathlib, formalization repos like strongpnt).
Talent:
- Fields-Medalist-caliber mathematicians (Logical Intelligence), high-profile researchers (Axiom Math), and operators like Vlad Tenev, Tudor Achim, Carina Hong.
Community and ecosystem:
- Lean + mathlib, open-source usage of DeepSeek-Math-V2, or shared formalization repos like strongpnt can create community lock-in.
Integration:
- Embedding Aristotle, Aleph, Gauss, Qwen-2.5-Math, or DeepSeek-Math into CI pipelines, IDEs, and cloud stacks (GitHub, GitLab, Azure, GCP, AWS, etc.) raises switching costs.
Trust:
- Formal guarantees (Lean/Coq/Isabelle proofs with Aristotle, AlphaProof, Aleph, Gauss) provide a trust moat vs generic LLMs that cannot certify correctness.

6.2 Risks

Technical ceiling: scaling from contest mathematics to full research-level problems or million-line code verification may prove much harder than IMO/Putnam benchmarks suggest.
Adoption friction: conservative regulators and engineers may delay trusting AI proofs. Mathematicians may resist or limit AI to “assistant” roles.
Big-tech encroachment: if companies like OpenAI, Google DeepMind, Anthropic, xAI, Alibaba/Qwen bundle strong mathematics capabilities into their mainstream offerings, standalone startups must differentiate sharply.
Open-source erosion: DeepSeek-Math-V2, Qwen-2.5-Math, and NuminaMath show how open source can cap proprietary pricing power.
Compute constraints and funding cycles: limited access to GPUs, shifting macro, or valuation resets could stress capital-intensive players like Harmonic, Axiom Math, Logical Intelligence, DeepSeek, Math Inc, Symbolica AI.

6.3 Scenario sketch (24-month horizon)

Bull case:
- Systems like Aristotle, DeepSeek-Math-V2, Axiom’s models, Aleph/Noa, Gauss solve at least one high-profile new result or prevent a major real-world failure (e.g. crypto hack, aerospace bug).
- Harmonic passes $20–30M ARR. Axiom Math and Logical Intelligence reach unicorn valuations. DeepSeek-Math-V2, Qwen-2.5-Math, NuminaMath, Gauss/strongpnt become standard infra for research workflows.
Base case:
- Benchmarks continue to improve (100% on IMO, stronger Putnam performance), tools embed in niche workflows (crypto audits, select aerospace projects, math research labs).
- Harmonic, Axiom Math, Logical Intelligence, Math Inc, Symbolica AI each have a handful of pilots; revenues are in low-single-digit millions, valuations grow modestly.
Bear case:
- Progress plateaus at “Olympiad-level only”. Integration friction plus big-tech bundling compresses room for standalone AI mathematicians.
- One or more startups pivot, merge, or exit cheaply. Open-source models like DeepSeek-Math-V2, Qwen-2.5-Math, NuminaMath, and agent stacks like Gauss dominate the practical usage while formal verification remains niche.

7. Takeaways

The field has clearly crossed a feasibility threshold: Frontier AI labs are shipping AI models that are already competitive with IMO gold medallists and Putnam stars.
The open-source reasoning wave (DeepSeek-R1, DeepSeek-Math-V2, Qwen-2.5-Math, QwQ-32B, NuminaMath, Llama 3.1, strongpnt) ensures this capability won’t be limited to closed labs.
Formal verification stacks (Harmonic, Axiom Math, Logical Intelligence, AlphaProof, Math Inc) are the main bet on trustworthy AI: zero hallucinations via Lean/Coq/Isabelle/HOL Light proofs.
The next 24 months are about converting these technical wins into repeatable workflows and revenue, especially in software verification, crypto, safety-critical systems, and advanced research.

For an investor or builder, this is a classic high-risk, high-optionality subsector: expensive, technically gnarly, but with a real shot at becoming the reasoning layer beneath serious AI systems.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:


Sector Deep Dive #6: AGENT RUNTIME

Prateek Joshi — Mon, 10 Nov 2025 21:42:20 GMT

1. What this space is and why it suddenly matters

Agent Runtime is an environment that lets AI agents actually do work. It’s the control room that plans steps, calls the right tools, remembers context, and keeps logs so humans can see what happened. Agent Sandbox is a related concept: a safe box the agent acts inside to run code, browse, or touch APIs without breaking things. Put together, they’re the missing layer between raw models and real enterprise workflows.

Three forces make this takeoff feel real:

Models leveled up. The newest LLMs plan multi-step tasks, call tools, and follow structured instructions.
Enterprises want cognitive automation. After a decade of RPA and scripts, companies now want automation that can read, reason, and decide.
Enablers arrived. Secure micro-VMs / containers, long-context memory, and open protocols (Anthropic’s MCP, Google’s Agent-to-Agent/A2A) give teams a safer and more interoperable way to wire agents to data and systems.

Think of this as the shift from “power tools” (classic SaaS) to “coworkers” (agents) that execute tasks end-to-end with guardrails. That’s why the sector’s drawing capital and attention: it upgrades software from “assist” to “act”.

2. Market trajectory

The agents market is forecast to grow from roughly $5B in 2024 to about $47B by 2030. Funding has kept pace: more than $8B went into agent startups in the year through late 2024, and seed funding alone in 1H 2025 was on the order of $700M. Analysts expect a third of enterprise software to include agentic capabilities by 2028 (up from almost zero in 2024).

What that means in practical terms:

This isn’t a “single killer app”. It’s a horizontal capability (like cloud or mobile) that seeps into IT, ops, support, finance, and engineering.
Adoption is gradual and risk-managed. Most teams start with human-in-the-loop, then graduate to autonomy for bounded tasks once accuracy and auditability are proven.
It’s not winner-take-all yet. Standards like MCP and A2A reduce lock-in and keep the door open for neutral platforms and open-source tools, not just cloud megasuites.

3. The product stack: six bricks you actually need

When you peel back the marketing, mature agent platforms all converge on the same 6 components:

Secure execution
Agents need isolated places to run code, browse, and call services. That usually means Linux containers or micro-VMs with strict network and filesystem policies. Startups to know:
1. E2B: open-source, Firecracker-style micro-VM isolation; fast cold-starts.
2. Novita AI (Agent Sandbox): per-second billed serverless workers for bursty agent compute.
3. Browserbase: managed, clean browsers for reliable web automation.
Orchestration (the agent loop)
Plan → act (use a tool) → observe → re-plan. You need a consistent way to define this loop, branch on errors, and compose sub-agents. LangChain/LangGraph, CrewAI, CUA (Computer-Use Agent), and Dust are common choices depending on how code-centric or visual you want to be.
Connectors and permissions
Agents get work done by touching APIs, SaaS apps, and internal services with clear scopes and approval rules. Composio has become the “agent connector fabric” many teams reach for. Cloud platforms ship their own registries too.

Memory and state
Short-term scratchpads and long-term project memory, usually backed by vector DBs or filesystems. Plus auto-summaries so context doesn’t blow up costs.
Observability, evaluation, and guardrails
You need step-level traces, cost and latency metrics, red-team tests, and “circuit breakers”. AgentOps (production runs, replay, costs) and Langfuse (open-source traces/evals) are becoming table stakes.

Human interface
Most business agents surface in chat UIs, IT portals, or IDEs. Good products make it easy to toggle autonomy, insert approvals, and explain what just happened.
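
The orchestration brick above (plan → act → observe → re-plan) fits in a few lines of Python. This is a minimal sketch, not the API of any framework named here: run_agent, plan_next_step, and the TOOLS registry are hypothetical names, and the planner is a hard-coded stub standing in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tool registry: tool name -> callable taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_ticket": lambda ticket_id: f"ticket {ticket_id}: password reset requested",
    "reset_password": lambda user: f"password reset for {user}",
}

@dataclass
class Step:
    tool: str
    arg: str

@dataclass
class Trace:
    # Step-level log of (tool, argument, observation) so a human can see what happened.
    events: list[tuple[str, str, str]] = field(default_factory=list)

def plan_next_step(goal: str, trace: Trace) -> Optional[Step]:
    """Stub planner (stands in for an LLM call): choose the next step from what has run so far."""
    done = [tool for tool, _, _ in trace.events]
    if "lookup_ticket" not in done:
        return Step("lookup_ticket", "T-1")
    if "reset_password" not in done:
        return Step("reset_password", "alice")
    return None  # nothing left to do

def run_agent(goal: str, max_steps: int = 5) -> Trace:
    trace = Trace()
    for _ in range(max_steps):                 # step limit acts as a simple circuit breaker
        step = plan_next_step(goal, trace)     # plan / re-plan
        if step is None:
            break
        try:
            observation = TOOLS[step.tool](step.arg)   # act
        except Exception as exc:
            observation = f"error: {exc}"      # record the failure instead of crashing
        trace.events.append((step.tool, step.arg, observation))  # observe
    return trace

if __name__ == "__main__":
    for event in run_agent("close ticket T-1").events:
        print(event)
```

The two things worth noticing are the step limit and the trace. Real runtimes add isolation, permissions, and cost tracking around the same loop.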

4. Who’s competing and how to think about them

Cloud incumbents are shipping full stacks:

OpenAI/Microsoft (AgentKit + Copilots) lean into deep model integration and a vast distribution surface.
AWS (Bedrock AgentCore) emphasizes isolation, identity, observability, and marketplace distribution.
Google pushes open A2A to make multi-vendor agent workflows normal inside Workspace and GCP.
Anthropic focuses on model safety and MCP so tools and models interoperate cleanly.

Independent startups fill critical gaps and keep the space dynamic:

Secure execution: E2B, Novita AI (Agent Sandbox), Browserbase
Agent OS / orchestration: LangChain/LangGraph, CrewAI, CUA, Fixie, Dust
Connectors: Composio
Observability/evals: AgentOps, Langfuse
Dev-env as runtime: Daytona lets agents spin up real developer workspaces with full toolchains.
Marketplaces and hubs: Gumloop and MuleRun explore the app-store model for agents.
Capability showcases: Prime Intellect helped popularize computer-use agents that click and type like a human.

Expect consolidation: some of these become features inside cloud platforms. Others win as neutral layers precisely because big customers want multi-model, multi-cloud flexibility.

5. What buyers actually use this for

Developers and startups use sandboxes/runtimes to ship agentic apps faster: prototyping with OSS, then hardening with better isolation, connectors, and monitoring.

Large enterprises pick a few high-ROI use cases and expand from there. Typical first wins:

IT automation: ordering equipment, provisioning access, resetting accounts, closing tickets.
Customer support: reading tickets, checking entitlements, proposing actions, and (once trusted) executing refunds or returns.
Operations and finance: reconciling invoices, chasing documents, scheduling freight.
Engineering productivity: write → run → test → fix loops inside an isolated code sandbox (pair this with Daytona or E2B).

The adoption pattern is consistent: start with copilot (human approves), track success and cost, then move select workflows to autopilot with timeouts and escalation rules. The runtime matters because it encodes that discipline, not just “let the LLM run”.

6. The economic logic: why this can be cheaper (and when it isn’t)

A single agent task usually triggers many model calls plus tool invocations. Early Auto-GPT experiments were expensive and brittle. Three things flipped that story:

Smarter planning and caching cut token waste.
Isolated code execution moves heavy mathematics or parsing to cheap CPU time instead of expensive tokens.
Model mix-and-match runs cheaper GPT-3.5-class models for easy steps and saves GPT-4/5-class models for hard ones.

When you price it the way buyers do, the question is: cost per completed task vs a human baseline. If an agent can process a support email for $0.10–$0.30 all-in where a human minute costs a few dollars, the cost model works immediately.
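
To make the “cost per completed task” framing concrete, here is a small sketch. The function name and all inputs are illustrative assumptions: $0.20 per attempt sits inside the $0.10–$0.30 range above, while the 80% success rate and the $2.00 human baseline are made-up numbers for the sake of the arithmetic.

```python
def cost_per_completed_task(cost_per_attempt: float, success_rate: float,
                            review_cost_per_attempt: float = 0.0) -> float:
    """Expected spend per successfully completed task.

    Failed attempts still cost money, so the per-attempt cost
    (agent run plus any human review) is divided by the success rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (cost_per_attempt + review_cost_per_attempt) / success_rate

if __name__ == "__main__":
    # Illustrative numbers only (see the assumptions stated above the block).
    agent_cost = cost_per_completed_task(cost_per_attempt=0.20, success_rate=0.80)
    human_cost = 2.00
    print(f"agent: ${agent_cost:.2f} per completed task vs human: ${human_cost:.2f}")
```

The division by success rate is the part that is easy to forget: a cheap agent that fails half the time, with a human reviewing every attempt, can end up costing as much as the human baseline.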

Where it doesn’t work yet: ambiguous tasks with high back-and-forth, long tool chains, or high error penalties. That’s why most teams still insert approvals, limits, and budgets. This is as much an economic guardrail as a safety one.

The trend line is favorable: better models, cheaper inference, and tighter runtimes steadily push cost-per-task down and success rates up. That’s the flywheel to watch.

7. Impact on the broader infra startup landscape

Short answer: this wave will touch most of infra. Over the next 24 months, expect 60–70% of infra startups to be directly or indirectly affected, whether as beneficiaries, suppliers, or competitors. Here’s how it maps:

Direct beneficiaries (20–25%)
Startups whose core product is agent runtime capability: secure sandboxes (E2B, Novita), orchestration (LangChain, CrewAI, CUA, Dust), observability/evals (AgentOps, Langfuse), connectors (Composio), and marketplaces (Gumloop, MuleRun). Their traction rises with each successful enterprise deployment.
Adjacent pull-through (20–25%)
Data infra (vector DBs, feature stores), identity and policy (fine-grained scopes for agents), secrets/key management, audit logging, and cost monitors. Agents create persistent demand for retrieval, permissioning, and explainability. Great for neutral infra vendors. If you’re building vector search, lineage, or IAM, agents are a net tailwind.
Devtool and platform reshaping (15–20%)
Dev environments and CI/CD adapt so agents can participate as “non-human contributors”. Daytona is a clear bridge. Agents spin up real workspaces with compilers, DBs, and test harnesses. Expect git hosts, test frameworks, and build systems to expose agent-friendly APIs and policies. Winners will make “agent + human” pair programming and reviews safe and auditable.
Integration/iPaaS and RPA convergence (10–15%)
Workflows move from rigid scripts to agent-driven flows. RPA and iPaaS vendors will add LLM brains. New neutral runtimes will nibble at classic automation budgets. If you’re building modern integration layers, aligning with MCP/A2A and shipping strong observability can put you on the right side of this shift.
Compute and GPU infra (5–10%)
Agent adoption raises steady inference workloads and bursty sandbox compute. That benefits GPU scheduling, serverless containers, model gateways, and browser automation at scale (hello Browserbase). Efficiency startups (quantization, caching, routing) also see a lift.
Potentially crowded or pressured (10–15%)
Products that are “just an LLM wrapper” around a single workflow will feel pressure as AgentKit/AgentCore and marketplaces ship that workflow as a prefab. The defense is depth: data access, accuracy guarantees, distribution, or owning a compliance-sensitive niche.

Correlation and dependencies.
Think of a dependency chain: models → runtimes → connectors → policy/identity → observability → data. Improvements at any layer (cheaper inference, better planning, richer connectors) ripple to the others. Infra startups that “lock” into one model vendor will carry vendor risk. Those that speak MCP/A2A and multiple models reduce it. Conversely, security incidents or prompt-injection failures at the app layer will generate demand for policy, isolation, and monitoring deeper in the stack. Another pull-through for infra.

8. Key risks and the practical mitigations that matter

Reliability and safety. Agents still make bad calls. Mature teams use retrieval grounding, step limits, timeouts, and human approvals on high-impact actions. Observability and evals move from “nice-to-have” to mandatory.
Security and data privacy. Agents handle credentials and sensitive data. Sandboxes must strictly confine code and network. IAM scopes, secrets rotation, tamper-proof audit logs, and signed tool calls should be part of the design, not a later add-on.
Prompt injection and supply-chain risk. Agents read untrusted content and may be tricked. Defensive patterns (content sanitization, tool call whitelists, trusted data paths) and “kill-switch” policies reduce blast radius.
Regulation and governance. Expect requests for audit trails, decision explanations, and model/agent change control. Vendors with strong explainability and logging will win security and compliance reviews.
Cloud squeeze. Big providers will absorb generic runtime features. Neutral players must compete on openness (multi-model/multi-cloud), UX, cost, or depth in a vertical. Aligning with standards and meeting enterprises in their VPCs are proven ways to keep a seat at the table.
Unit economics drift. A long, meandering agent can burn tokens and money. Teams that enforce budgets, cache aggressively, route models by difficulty, and offload compute to sandboxes will keep cost-per-task in the green.
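
One common way to approach the audit-log requirement above is a hash chain, where each entry commits to the one before it. The sketch below is an assumption-laden illustration (append_entry and verify_chain are hypothetical helper names), and it gives tamper evidence rather than tamper proofing: edits are detectable, but only if the latest hash is also anchored somewhere the agent cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()

def append_entry(log: list[dict], record: dict) -> None:
    """Append a record whose hash covers both its content and the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

if __name__ == "__main__":
    log: list[dict] = []
    append_entry(log, {"tool": "refund", "amount": 20})
    append_entry(log, {"tool": "email", "to": "customer"})
    print(verify_chain(log))             # chain is intact
    log[0]["record"]["amount"] = 2000    # simulate tampering with an earlier entry
    print(verify_chain(log))             # tampering is detected
```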

9. What to watch next

Capability jumps. If the next model wave materially improves tool-use and long-horizon planning, watch success rates rise and human approvals shrink. That opens more workflows to autonomy.

Reference deployments. One marquee case study in banking, logistics, or healthcare (measured in millions saved or hours cut) will unlock follow-on budgets elsewhere.

Standard adoption. Broad support for MCP and A2A would normalize multi-vendor agent meshes inside large companies. That’s a tailwind for neutral infra (connectors, policy, observability) and a constraint on lock-in strategies.

Cost curves. Cheaper inference and faster cold-starts lower the “minimum viable agent”. Keep an eye on platform announcements about long-running sessions, serverless micro-VMs, and per-second billing. These directly change which tasks pencil out.

Distribution channels. Agent marketplaces (e.g. Gumloop, MuleRun) and cloud app stores will matter more as companies move beyond pilots. Templated agents with real connectors and auditable logs will travel fastest through those channels.

Consolidation. Expect acqui-hires and product fold-ins. If you’re building infra, assume your best exit path might be a cloud or enterprise platform that wants your isolation, connectors, or observability baked in.

10. Investment stance and practical takeaways

It’s an infra story as much as a model story. Sandboxes, runtime control planes, connectors, identity, and observability will decide whether agents stay demos or become dependable “digital workers”. That creates room for neutral infra winners, not just model vendors.
Barbell strategy. Pairing one bet aligned with a major platform (for distribution and trust) with one bet that’s open, multi-model, and multi-cloud captures both worlds. In parallel, there will be category enablers: isolation (E2B, Novita), connectors (Composio), eval/ops (AgentOps, Langfuse), dev-env runtimes (Daytona).
Bias to measurable workflows. IT ops, support ops, finance back-office, and code-adjacent tasks produce clean before/after metrics (success rate, handle time, cost-per-task). Those are the proving grounds that compound into wider adoption.
Design for approvals, not just autonomy. The businesses that grow fastest will support a spectrum—suggest → approve → auto-execute—with rock-solid audit trails and budget controls.
Plan for standards. Treat MCP/A2A as inevitabilities and build in that direction. You’ll be easier to buy and harder to rip out.
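
The “suggest → approve → auto-execute” spectrum above can be encoded as a small policy function. This is a hypothetical sketch: the Autonomy enum, the decide function, and the returned labels are illustrative names, not any product’s API.

```python
from enum import Enum

class Autonomy(Enum):
    SUGGEST = "suggest"      # agent proposes, a human carries it out
    APPROVE = "approve"      # agent executes only after human sign-off
    AUTO = "auto-execute"    # agent executes on its own, within budget

def decide(action_cost: float, mode: Autonomy, budget_left: float,
           approved: bool = False) -> str:
    """Return what happens to a proposed action under a given autonomy mode."""
    if action_cost > budget_left:
        return "escalate"            # the budget control overrides every mode
    if mode is Autonomy.SUGGEST:
        return "show-to-human"
    if mode is Autonomy.APPROVE:
        return "execute" if approved else "await-approval"
    return "execute"

if __name__ == "__main__":
    print(decide(1.0, Autonomy.APPROVE, budget_left=10.0))   # waits for sign-off
    print(decide(20.0, Autonomy.AUTO, budget_left=10.0))     # over budget, so it escalates
```

Keeping the mode as data rather than as separate code paths is what makes it cheap to move a workflow from one end of the spectrum to the other once the metrics justify it.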

11. Bottom line for infra founders and investors

Agent sandboxes and runtimes are graduating from experiments to infrastructure. The core idea of software that can read, decide, and act with constraints is now implementable with acceptable risk in many day-to-day workflows. The stack is clarifying, the standards are emerging, and the economics are trending in the right direction.

The effect on the broader infra universe will be wide. Roughly two-thirds of infra startups will feel it: some directly as agent-native platforms, some as upstream suppliers (data, identity, observability), and some via pressure as clouds bundle the basics. The safest places to build and back are the boring necessities of a production agent world: isolation that never breaks, connectors that always work, policies that auditors love, and telemetry that catches issues before the CFO does.

The next 24 months are a prove-out. Watch the success rates, costs per task, and the first wave of big reference customers. If those turn the corner, this sector looks less like a trend and more like a new layer of enterprise software. Quiet, reliable, and everywhere.


Sector Deep Dive #5: SEARCH API PRODUCTS

Prateek Joshi — Fri, 10 Oct 2025 15:33:33 GMT

1. Snapshot

The core bet is that developers will increasingly buy real-time web search as a managed API instead of building it. Why? Because modern apps and AI agents need fresh, machine-readable information and citations on demand.

Prices, performance, and legal access to content are shifting quickly. A handful of independent search indexes (Brave, You.com) and SERP/API specialists (SerpApi, DataForSEO, Serper.dev) are emerging as the infrastructure layer that feeds LLMs, agents, and enterprise apps with web context.

There’s another layer in between. Exa is effectively the developer-first infrastructure layer that sits between the independent-index crowd (Brave, You.com) and the SERP scrapers (SerpApi, DataForSEO). It builds its own continuously refreshed web index (not just scraping Google or Bing).

Microsoft’s price hikes on the official Bing Web Search API in 2023 (e.g. S1 tier from $7 to $25 per 1,000 queries starting May 2023) drove many builders to look for alternatives. Google, meanwhile, still offers its own JSON API at $5 per 1,000 queries but with constrained use, creating an opening for developer-first vendors.

Near-term catalysts include: (a) independent indexes scaling distribution via cloud marketplaces (e.g. Brave Search API on AWS Marketplace) and launching AI-grounding features; (b) OpenAI’s SearchGPT prototype validating developers’ demand for search-plus-answers; (c) high-profile AI answer engines (Perplexity, You.com) opening or expanding APIs, pushing volume into search infra instead of consumer portals.

The biggest risks are: (a) content access and litigation (publishers vs. AI search providers), which may raise COGS and restrict data; (b) platform dependence (Bing or browser defaults); (c) consolidation if hyperscalers bundle “good-enough” search into agent platforms. These are active issues today (e.g. lawsuits and cease-and-desists targeting Perplexity, publisher revenue-share programs emerging in response).

This is an investable infra subsector with asymmetric upside over the next 24 months, especially in independent index APIs and legal-first SERP APIs with enterprise posture. The winners will pair developer ergonomics (clean JSON, fast SLAs), distribution (marketplaces, model/tooling integrations), and credible content access (publisher deals, compliance).

2. Thesis framing: what must be true

Investment question in one line: Can independent search APIs become the default way developers and AI agents ground responses in fresh, verifiable web data (at attractive unit economics) before big platforms make the category a bundled feature?

Thesis pillars (what must be true):

Real-time grounding becomes mandatory for LLMs and agents. OpenAI’s SearchGPT signals that “search + answers + sources” is moving into core AI UX. Third-party APIs that are fast, citable, and cheap will see rising demand from AI builders.
Independent indexes achieve escape velocity. Brave’s index (>30B pages, 100M+ daily updates) and growing distribution (AWS Marketplace, AI-grounding features) show a credible non-Google/Bing path for developer-grade web data. Exa is growing too.
Economics/choice favor specialists. Microsoft’s 3-10x Bing API price hikes (now $25 per 1,000) and Google’s limited, capped JSON API (still $5 per 1,000, up to 10k/day) push developers to alternative providers with predictable pricing and richer outputs.
Legal access matures. Publishers and search APIs converge on revenue-share and licensing. Perplexity’s $42.5M publisher pool is an early sign of viable content economics for AI search.

Disconfirming evidence to track: If platform leaders (OpenAI, Google, Microsoft) give away a full-featured, low-cost search API or effectively embed it into model runtimes, specialist API demand could compress. Also, if publisher litigation materially walls off content without workable licenses, data access costs could swamp API margins.

3. Market structure, size, and geography

Structure. Three layers matter to developers:

Index owners with developer APIs: Microsoft (Bing), Brave, You.com (increasingly an AI answer engine with enterprise tilt), plus regional engines (Baidu in China). Google’s Programmable Search remains limited and quota-capped.
SERP/API specialists that fetch/parse results from many engines and verticals, exposing a clean JSON schema and compliance posture (e.g. SerpApi with legal shield, lower-cost peers like Serper.dev, DataForSEO). Exa’s API exposes structured, relevance-scored results tuned for AI agents, RAG systems, and retrieval pipelines, making it more semantic and programmatic than a traditional search API. Exa also markets itself as “search infrastructure for AI”, letting devs query the web in real time with filters for freshness, domain, and semantic similarity.
Enterprise/site search APIs (Algolia, Elastic, Amazon Kendra) that index a customer’s own content and power product or knowledge-base search. These are adjacent, but often bought by the same teams and now blending with web grounding. Algolia alone powers 1.5T+ queries/year across 10k+ customers.

Size and trajectory. There is no canonical “Search-API TAM”, but demand proxies are strong. Brave reports >1.5B searches/month in recent updates. Perplexity has reached ~22M MAU and processed ~780M queries in May 2025. Algolia is already at trillion-scale enterprise queries. Each datapoint indicates rising programmatic search volume and shifting developer spend from DIY crawl/scrape stacks to APIs.

Penetration and runway. Google still commands ~90% of global search, but it dipped below 90% in late 2024 and has hovered in the high 80s in 2025, a small crack that coincides with AI-native search usage and Bing’s modest desktop gains. For developers, the takeaway is not consumer share per se but willingness to try non-Google data sources when APIs are reliable and priced fairly.

Geography. In China, Baidu leads with ~56–60% share across platforms, with Bing surprisingly strong on desktop. Google is negligible. Practically, China-focused devs rely on Baidu (and 360/Haosou) data and localized APIs, while Western API startups rarely operate behind the Great Firewall. For global products that serve China, provider mix (Baidu + Bing/Brave) and compliance become material.

4. Customers, jobs to be done, and switching costs

Who buys and why. Three clusters:

AI/agent builders who need live facts + citations. Instead of running their own crawler, they call search APIs within tool-use chains to ground model answers (news, pricing, docs). OpenAI validating “search-inside-chat” accelerates this pattern across the stack. Growing Exa usage is another datapoint.
Product teams at e-commerce, SaaS, and content apps who need fast, typo-tolerant, tuned search for their own catalogs and docs (Algolia et al.). At scale, better search converts directly to revenue and support deflection.
SEO/data analytics and research ops teams who need reliable, structured SERPs at volume (rank tracking, market analysis, due diligence). SerpApi’s customer mix is now ~40% AI, ~40% SEO, ~20% other, highlighting the shift from pure SEO into AI infra.

Mission-criticality. If the search step fails, agent answers degrade or hallucinate. If site search fails, revenue drops. That creates a budget line for SLA-backed APIs and motivates redundancy (e.g. Brave primary, Bing or SERP API as fallback). The Bing price shock in 2023 nudged teams to multi-source or switch, a real-world proof of this redundancy mindset.
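
The redundancy mindset above (a primary provider plus a fallback) is simple to express in code. The sketch below is hypothetical: search_with_fallback is an illustrative helper, and the two stub functions stand in for real API clients, whose actual request and response formats are not shown here.

```python
from typing import Callable

# A provider is any callable that maps a query string to a list of result dicts.
SearchFn = Callable[[str], list[dict]]

def search_with_fallback(query: str,
                         providers: list[tuple[str, SearchFn]]) -> tuple[str, list[dict]]:
    """Try providers in order; return the first non-empty result set and who served it."""
    errors = []
    for name, search in providers:
        try:
            results = search(query)
        except Exception as exc:          # timeouts, rate limits, HTTP errors, and so on
            errors.append(f"{name}: {exc}")
            continue
        if results:
            return name, results
        errors.append(f"{name}: empty result set")
    raise RuntimeError("all search providers failed: " + "; ".join(errors))

# Stub providers standing in for real clients (assumptions for illustration only).
def primary_stub(query: str) -> list[dict]:
    raise TimeoutError("primary timed out")

def fallback_stub(query: str) -> list[dict]:
    return [{"title": "Example", "url": "https://example.com", "snippet": query}]

if __name__ == "__main__":
    served_by, results = search_with_fallback(
        "agent runtimes", [("primary", primary_stub), ("fallback", fallback_stub)]
    )
    print(served_by, results[0]["url"])
```

In practice each provider also needs an adapter that normalizes its JSON into one shared shape, which is where much of the switching cost discussed below actually lives.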

Switching costs. Swapping an endpoint is easy. Replicating quality tuning, synonyms, ranking rules, or JSON schemas embedded in pipelines is not. Enterprise search configs (Algolia) and AI toolchains (prompt+parser contracts) generate meaningful friction. Legal/compliance features (e.g. SerpApi’s legal shield) further raise switching costs in regulated environments.

5. Product and roadmap signals

Core modules developers expect:

Query endpoints that return structured results (JSON) for web/news/images/local, with location & language controls, snippet payloads, and schema-enriched data.
Latency and uptime SLAs and “speed tiers” for interactive UX and agent loops.
Compliance and indemnity (publisher respect, legal shield, SOC 2).
AI grounding features (citations, multi-snippet context, MCP/tool adapters), and integrations (LangChain, cloud marketplaces).

Independent index momentum. Brave exposes a web index of 30B+ pages, claims 100M+ daily updates, and recently shipped AI Grounding to anchor LLM outputs in verifiable sources. This positions the API as a turnkey “search-to-source” layer for agents. Availability on AWS Marketplace shortens procurement and signals enterprise focus. Exa has its own index as well.

Answer-engine APIs. Perplexity and You.com aim to synthesize answers with sources, while Exa aims to make web search directly machine-consumable for LLMs. The answer engines’ consumer metrics (Perplexity’s MAU/queries) indicate product-market fit. The open question is whether that capability can be exposed as a developer API at sustainable margins. The legal/publisher front is moving too: Perplexity is pairing growth with a $42.5M publisher pool to defuse access risk.

SERP/API specialists. SerpApi abstracts Google/Bing/vertical SERPs into consistent JSON and offers enterprise-friendly pricing at high volume ($2.75 per 1k reserved searches) plus legal safeguards. This is useful when you need Google-quality outputs with engineering and legal friction removed.

Enterprise/site search keeps evolving. Algolia blends keyword + vector (“neural”) approaches and remains the easiest “drop-in” for app/internal search at massive scale (1.5T+ queries), making it a common complement to web grounding: your data via Algolia + the open web via a search API.

6. Competitive dynamics and pricing

Platform APIs vs. independents.

Microsoft Bing: Official, compliant, but expensive post-2023 (e.g. S1 web search $25/1k). Good reliability. Quality lags Google in some niches.
Google Programmable Search: Cheap ($5/1k) and reliable for custom/site collections, but not a full web API and capped at 10k/day. Many teams therefore layer SERP APIs or independent indexes to get web-wide coverage.
Exa/Brave/You.com: Independence is the differentiator (no dependency on Big Tech indices), plus developer-ready features (index transparency, grounding). Brave’s marketplace and AI-grounding moves specifically target agent stacks.
SERP APIs: SerpApi (premium, legal shield), Serper.dev/DataForSEO (aggressive price points). This tier competes on breadth of engines, JSON quality, anti-bot resilience, and price.

AI search as an encroaching competitor. OpenAI’s SearchGPT is a strategic signal: if the experience ships as a developer API or becomes bundled into model runtimes, it could absorb demand. For now it is limited, but investors should assume bundling risk in the next 24 months.

Consumer share vs. developer demand. Google still holds ~89–90% of global search and Bing ~4%, while Yandex, Yahoo, and DDG trail. The gap doesn’t prevent developer migration if pricing, procurement, or legal terms are better elsewhere. The 2024–2025 dip below 90% is symbolically important: teams are now comfortable experimenting with non-Google sources.

7. Go-to-market, adoption, and metrics to watch

PLG with enterprise overlays. Search APIs skew self-serve: devs test free tiers, wire in JSON, and grow usage. Enterprise deals add SLAs, DPAs, and volume commits. Distribution is improved by cloud marketplaces (easier procurement; draw-down on committed cloud spend) and framework integrations (LangChain/tools). Brave’s AWS listing is a concrete example of marketplace-led enterprise GTM.

Adoption proxies.

Exa: Still young but growing rapidly, with thousands of devs using it. It recently raised an $85M Series B led by Benchmark.
Perplexity: ~22M MAU, ~120M monthly visits (as of July 2025), ~780M queries in May 2025. All indicate rising appetite for AI-answer search that could translate into API usage.
Brave: claims >1.5B searches/month, with index scale and cadence (30B+ pages; 100M+ daily updates) consistent with commercial-grade coverage.
Algolia: 1.5T+ queries/year across 10k+ customers remains the clearest signal that “search-as-an-API” is mainstream within product teams.
SerpApi: enterprise pricing pages and research show scale economics ($2.75/1k reserved searches), and a customer mix that is ~40% AI underscores the category’s pivot from SEO to AI infra.

Reliability and compliance. Expect 99.9%-style SLAs from serious vendors. Enterprise wins will hinge on SOC 2, data protection addenda, and publisher-aware crawling. Watch for visible status histories and legal shields or revenue-share programs. Both mitigate buyer risk and will become standard.

Hiring and focus. Companies like Parallel (founded by former Twitter CEO Parag Agrawal) emphasize agent-grade research APIs. Headcount remains lean and engineering-heavy. Public comms point to millions of “research tasks/day” and benchmark-first positioning, but the bigger signal is product velocity in agent tooling.

8. Monetization and unit economics

Pricing models.

Per-query (CPM-like) is standard for web search and SERP APIs: Bing (≈$25/1k on popular tiers), Brave (public materials emphasize independence & marketplace procurement, list prices vary by tier), SerpApi (enterprise reserved $2.75/1k and speed add-ons), Google Programmable Search ($5/1k, 10k/day).
SaaS/usage for site/enterprise search (Algolia, Elastic) based on operations and records.
Hybrid for answer engines (subs + ads + licensing/publisher share). Perplexity’s $42.5M publisher pool is an early, explicit content-cost line item meant to stabilize supply.
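
To see how the per-query prices above translate into a monthly bill, here is a rough sketch. The prices and the 10k/day cap are the ones quoted in this section; the function names, the 30-day month, and the example volumes are assumptions, and real tiers, minimums, and add-ons will differ.

```python
from typing import Optional

# List prices per 1,000 queries as quoted in this section; actual tiers vary.
PRICE_PER_1K = {
    "Bing (popular tiers)": 25.00,
    "Google Programmable Search": 5.00,
    "SerpApi (enterprise reserved)": 2.75,
}
GOOGLE_DAILY_CAP = 10_000  # queries/day cap quoted above

def monthly_cost(queries_per_day: int, price_per_1k: float, days: int = 30) -> float:
    """Simple per-query billing: volume in thousands times the price per thousand."""
    return queries_per_day * days / 1000 * price_per_1k

def monthly_costs(queries_per_day: int) -> dict[str, Optional[float]]:
    """Cost per provider, or None when the daily volume exceeds that provider's cap."""
    costs: dict[str, Optional[float]] = {}
    for provider, price in PRICE_PER_1K.items():
        over_cap = provider.startswith("Google") and queries_per_day > GOOGLE_DAILY_CAP
        costs[provider] = None if over_cap else monthly_cost(queries_per_day, price)
    return costs

if __name__ == "__main__":
    for volume in (5_000, 20_000):
        print(volume, monthly_costs(volume))
```

At 5,000 queries a day, the same workload costs $3,750 a month at $25/1k and $412.50 at $2.75/1k, which is the kind of gap that pushes teams to multi-source.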

COGS and margins. Running a crawler + index has bandwidth/compute costs but can sustain software-like gross margins at scale. SERP APIs incur proxy/captcha costs but offset them via engineering leverage and high utilization. AI answer engines face inference COGS until they lean on cheaper custom models, hence Perplexity’s and You.com’s investments in their own models and summarization stacks. (Evidence: rapid model/version launches and product cadence across 2024–2025; vendors explicitly pitch “grounding” to reduce model-token burn.)

ARPU and expansion. Usage grows with app traffic and agent loops: as an e-commerce site, a support bot, or an agent platform scales, queries/customer scale too. That creates natural net-revenue expansion without more sales cycles. Enterprise contracts add overage revenue and encourage annual commits for lower unit rates (e.g. SerpApi’s reserved pricing).

Seasonality. Consumer search APIs see event-driven spikes. Enterprise/site search peaks in retail Q4. But usage-based billing smooths revenue. Overages provide upside in peak months. Vendor comms on Brave Search Ads and query growth show seasonal surges.

9. Moat, data advantage, and legal reality

Independent index ≠ nice-to-have. Owning the index (Brave, You.com) is the defensibility wedge against platform policy changes and SERP scraping fragility. It also enables product differentiation like multi-snippet grounding, “goggles” (re-ranking), and fast freshness. For developers, this means fewer brittle dependencies and more consistent JSON across query types.

Workflow lock-in. Embedded ranking rules, synonym maps, analytics, and pipelines (Algolia/Elastic) create real stickiness. On the web side, teams code to specific schemas and rate/latency expectations. Swapping vendors requires regression testing across critical UX. Legal coverage (SerpApi’s U.S. Legal Shield) and enterprise SLAs become part of the moat for high-risk users.

Publisher alignment will define winners. Lawsuits and cease-and-desist letters against AI search providers (Dow Jones/News Corp., BBC, Britannica/Merriam-Webster) demonstrate that content access is not a free good. Startups that turn adversaries into suppliers via revenue-share or licenses will be able to scale volume without existential risk, even if near-term margins are thinner.

Platform bundling risk. If OpenAI/Google ship low-cost, high-quality search endpoints inside the model runtime (or as a standard tool), third-party demand could compress. That said, developers value choice, cost control, transparency, and policy independence, all of which still argue for multi-sourcing web data (primary + fallback).

10. What this means for infra startups

Who gets pulled in. Over the next 24 months, I expect ~25–40% of infrastructure startups to be directly or indirectly affected by the rise of Search APIs. The exposure comes in three ways:

Agent and orchestration stacks (tool-use frameworks, evaluators, guardrails) will standardize on search tools for grounding. When SearchGPT-style UX becomes common, every agent platform needs a search provider, a policy for citations, and often a backup provider. That’s a direct dependency. (Signal: OpenAI’s move with SearchGPT, Exa/Brave/others shipping MCP-style adapters.)
Data infra and retrieval layers (vector DBs, RAG pipelines, ETL) will blend internal corpus with web augmentation. As teams move from static corpora to live answers with verifiable sources, they will route external results through their retrieval/ranking layer. Expect tighter connectors from Pinecone/Weaviate-like stacks into search APIs and more budget reallocation from “more tokens” to “better grounding”.
Compliance, observability, and FinOps startups will see new budgets around content licensing, model+search cost controls, and provenance/attribution telemetry. If you must prove where an answer came from and pay the source, observability products and policy engines become critical.

Positive correlations.

Inference cost declines strengthen search APIs because grounding becomes the obvious way to reduce hallucinations and trim token use (shorter prompts when you pass high-signal snippets). Brave’s “AI Grounding” is literally a productized version of this correlation.
Marketplace distribution (AWS, Azure) lowers friction for enterprises to test and standardize on a search API. This historically accelerates infra adoption curves (database, logging, ML APIs). Brave’s AWS launch is a direct example.
Publisher deals unlock premium sources (finance, health, news), which improves answer quality, driving higher conversion to paid tiers. Perplexity’s pool is the first at scale. Expect others to follow.

Risks and dependencies for infra startups.

Legal and robots.txt compliance: startups embedding search must respect robots.txt and site policies, or risk collateral reputational/legal exposure if their provider is accused of scraping blocked sites. Recent BBC and News Corp actions show this is no longer theoretical. Vet your provider’s crawler compliance and indemnities.
Provider concentration: relying on a single provider (e.g. just Bing) exposes you to pricing shocks (as in 2023) and availability changes. Multi-sourcing (Brave + SERP API + Bing/Google Programmable where allowed) adds resiliency.
Geo constraints: if your users are in China, plan for Baidu/360 integration and localized infrastructure. This may mean separate routing, filtering, and compliance processes from your global stack.
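
The multi-sourcing advice above can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK: the provider callables are hypothetical stand-ins for whatever client code wraps your primary and fallback search APIs, and the stub functions exist only to exercise the logic.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a query and returns a list of result dicts.
# Real integrations would wrap each vendor's HTTP client; these names are hypothetical.
SearchProvider = Callable[[str], list]


def search_with_fallback(query: str, providers: Sequence[tuple]):
    """Try (name, provider) pairs in priority order; return (name, results) from the first success."""
    errors = []
    for name, provider in providers:
        try:
            results = provider(query)
            if results:  # treat an empty result set as a miss and keep going
                return name, results
            errors.append((name, "empty result set"))
        except Exception as exc:  # timeouts, rate limits, policy blocks
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all search providers failed: {errors}")


# Stub providers for demonstration only.
def flaky_primary(query: str) -> list:
    raise TimeoutError("primary timed out")


def stub_fallback(query: str) -> list:
    return [{"title": f"result for {query}", "url": "https://example.com"}]
```

Calling `search_with_fallback("q", [("primary", flaky_primary), ("fallback", stub_fallback)])` returns the fallback's results. Recording which provider answered is the piece worth keeping in a real system, since that is what lets you track cost and quality per source.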

How much budget shifts here? For AI-agent startups, search can easily become 10–30% of monthly variable COGS when agents do multi-hop research (because each answer can trigger tens of queries). For SaaS product teams, external web search spend is smaller. Internal search (Algolia/Elastic) remains the primary cost center, with web grounding added for specific features (e.g. a “Research” tab in a support bot).
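
To make the 10–30% figure concrete, here is a back-of-the-envelope sketch. Only the $2.75 per 1k searches rate comes from this piece (the SerpApi profile). The answer volume, queries per answer, and other variable COGS are hypothetical inputs chosen for illustration.

```python
def search_share_of_cogs(answers_per_month, queries_per_answer,
                         price_per_1k_queries, other_variable_cogs):
    """Return (monthly search spend, search share of total variable COGS)."""
    search_spend = answers_per_month * queries_per_answer * price_per_1k_queries / 1000
    total = search_spend + other_variable_cogs
    return search_spend, search_spend / total


# Hypothetical agent startup: 200k answers/month, 20 queries per answer,
# $2.75 per 1k searches, and $50k/month of other variable COGS (inference, hosting).
spend, share = search_share_of_cogs(200_000, 20, 2.75, 50_000)
print(f"search spend: ${spend:,.0f}/month, share of variable COGS: {share:.0%}")
```

Under these assumptions search lands at roughly 18% of variable COGS. The share is most sensitive to queries per answer, which is why multi-hop research agents sit at the high end of the range.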

Who benefits in venture terms.

Independent index APIs (Exa, Brave, You.com) with marketplace distribution, strong engineering cadence, and publisher alignment.
Legal-first SERP APIs (SerpApi) where enterprises want Google-quality JSON without running a proxy farm or fighting captchas and where legal shield matters.
Hybrid answer-engine APIs (Perplexity) if they can show measurable accuracy lift and lower blended COGS via licensing and in-house models, not just good UX.

Who could compress returns.

OpenAI/Google bundling: if search becomes “free” inside a model runtime, specialists will compete on quality, compliance, and independence (e.g. sources that big models won’t touch without licenses). Developers still like choice. Being the fallback engine is a real, durable niche.

11. Competitive landscape: notable companies to watch

Exa (US) — independent index + agentic search + answer engine

What’s special: Independent index, search features tailored for LLMs, and gaining rapid mindshare among devs.

Brave (US) — independent index + AI grounding + AWS distribution.

What’s special: Independent index (30B+ pages, 100M+ daily updates), AI Grounding features tailored for LLMs, and AWS Marketplace listing. Signals enterprise intent and procurement ease.

SerpApi (US) — legal-aware SERP API at scale.

What’s special: Wide engine coverage, enterprise legal shield, and reserve pricing down to $2.75/1k searches at scale; customer mix now ~40% AI. Often the fastest path to Google-quality JSON for devs.

You.com (US) — AI research/answer engine with enterprise tilt.

What’s special: Fresh $100M Series C at $1.5B valuation (Sep 2025), ongoing shift from “consumer search challenger” to AI research agent for regulated industries. Credible team pedigree (Richard Socher).

Perplexity (US) — answer engine with publisher economics.

What’s special: High user/query growth (22M MAU, ~780M queries in one month), bold GTM moves, and a $42.5M publisher pool amid lawsuits. The key watch-item is API exposure and sustainable COGS.

Parallel (US) — agent-grade deep research API (early).

What’s special: Founder brand (Parag Agrawal), benchmark-driven positioning vs. browsing tools. Still early but tuned for agent workflows.

Baidu (China) — dominant local index.

What’s special: Leads China search (~56–60% share). Essential for China-market apps. Developer-facing access exists within Baidu’s cloud/AI platforms. Global devs must consider geo separation and compliance.

12. Risks, catalysts, and what would change the call

Key risks in the next 24 months.

Content and legal: Publisher suits (BBC, Dow Jones, Britannica/Merriam-Webster) escalate, forcing expensive licensing at scale or curtailing content coverage. Vendors without publisher strategy lose reliability, and their customers inherit risk.
Platform moves: OpenAI or Google ships a cheap, first-party search tool in model runtimes, compressing third-party demand. Or browser defaults remain tightly controlled, limiting distribution for independent engines.
Price shocks: Another Bing-style pricing change pushes up customer COGS, causing churn or multi-sourcing complexity.

Catalysts.

Marketplace and cloud partnerships (AWS/Azure/Bedrock agent ecosystems) that pre-wire search tools for agents. Brave’s AWS launch is a template.
Publisher alignment at scale (Perplexity-like funds replicated), reducing legal friction and unlocking premium data verticals.
Visible accuracy/latency wins on benchmarked agent workloads, showing that independent indexes or SERP APIs deliver better answers per token than bundled tools. Brave’s AI-grounding launch is a signpost.

What would change the call.

Bear case: If Search becomes “free” in LLMs and publishers successfully wall off valuable content without broad licensing, the independent API market could shrink to a niche.
Bull case: If independent indexes become the de-facto grounding layer for agents and publisher economics settle, this sector compounds like payments or auth did a decade ago (quiet infra with huge downstream leverage).

Bottom line for investors

Where to lean in: Independent index APIs with credible distribution (marketplaces, agent frameworks), strong developer experience, and visible publisher strategy. Legal-first SERP APIs that are the enterprise default for Google-quality JSON. Enterprise/site search where vector+keyword convergence is demonstrably improving outcomes.
Portfolio construction: Expect multi-sourcing behavior. Your winners should play nicely as primary or fallback. Emphasize vendors with clear SLAs, observability hooks, and cost controls to survive price shocks and model bundling waves.
Exposure for infra startups: Plan as if 25–40% of infra startups will touch search APIs directly (agents, retrieval, developer tools) or indirectly (compliance, cost, analytics). Build connectors and procurement paths accordingly, and diligence provider legality as carefully as you diligence uptime.

If these companies can convert developer trust + content access into durable distribution before platforms fully bundle search, the next two years favor specialists. If not, expect consolidation and a smaller, compliance-heavy niche. The signals above, such as AWS listings, grounding features, publisher funds, MAU/queries growth, and pricing dispersion, are the leading indicators to watch.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:


RLenvironment.com - Tracking live signals from RL Repos

Prateek Joshi — Mon, 06 Oct 2025 20:54:59 GMT

I built an agent to analyze 49 open source RL repos, and built RLenvironment.com to show the live signals. Check it out and let me know what you think. Here is the latest snapshot:

Portfolio-level signals

Scale and freshness: 240,832 total stars across 49 repos. 24 were active in the last 30 days (1,392 commits), 10 shipped a release in the last 30 days, and 2 are archived.
Power centers: Google/DeepMind, Farama, Nvidia Isaac, OpenDILab, Meta, HF anchor a big share of stars and recent commits.
Where the work is: Mix is Library/Algos > Environments/Simulators > Platforms/Runtimes. Multi-agent + robotics + offline-RL are well represented (some people would debate that offline-RL is not real RL, but that’s for another post).

Leaderboard: Scale and mindshare

By stars (top 10): ray-project/ray, Unity ML-Agents, HF/trl, CARLA, Stable-Baselines3, OpenAI Spinning Up, Google Dopamine, DeepMind MuJoCo, Farama Gymnasium, Tianshou. These are the “safe defaults” for integrations and community reach.

Momentum: Shipping velocity now

Hottest by activity score (commits_30d, PRs_merged_30d, push/release recency): strong showings from huggingface/trl, ray-project/ray, google-deepmind/mujoco, google-deepmind/open_spiel, isaac-sim/IsaacLab, pytorch/rl, PrimeIntellect-ai/verifiers. These are the best near-term partnership/signal taps.

Emerging movers (low stars, high momentum)

“Under-the-radar but shipping” (stars < 1k, momentum ≥ 70): e.g. hud-evals/hud-python (rapid cadence in LLM-RL eval/envs), instadeepai/Mava (JAX MARL), plus a few niche env/runtime repos. These are prime candidates for early collabs, grants, or feature pilots.
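
A scoring and filtering pass like the one described above could look roughly like this. The inputs (commits_30d, PRs merged, push/release recency) and the thresholds (stars < 1k, momentum ≥ 70) come from the text. The weights, saturation caps, and sample repos are assumptions made up for illustration and are not the formula behind RLenvironment.com.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    name: str
    stars: int
    commits_30d: int
    prs_merged_30d: int
    days_since_push: int
    days_since_release: int


def momentum_score(r: RepoStats) -> float:
    """Hypothetical 0-100 score; the weights and caps are illustrative assumptions."""
    commits = min(r.commits_30d / 100, 1.0)            # saturates at 100 commits
    prs = min(r.prs_merged_30d / 50, 1.0)              # saturates at 50 merged PRs
    push = max(0.0, 1 - r.days_since_push / 30)        # 1.0 if pushed today, 0 after 30 days
    release = max(0.0, 1 - r.days_since_release / 90)  # 0 once the last release is 90+ days old
    return 100 * (0.4 * commits + 0.3 * prs + 0.2 * push + 0.1 * release)


def emerging_movers(repos, max_stars=1000, min_momentum=70):
    """Low stars, high momentum: the 'under-the-radar but shipping' filter."""
    return [r.name for r in repos
            if r.stars < max_stars and momentum_score(r) >= min_momentum]


# Made-up repos for illustration; these are not figures from the dataset.
repos = [
    RepoStats("small-but-shipping", 600, 120, 45, 1, 10),
    RepoStats("big-and-busy", 30_000, 300, 200, 0, 5),
    RepoStats("small-and-quiet", 400, 3, 1, 60, 400),
]
print(emerging_movers(repos))
```

Capping each input keeps one very busy month of commits from swamping the recency signals, which matters when comparing a monorepo against a small library.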

Areas heating up

LLM-RL environments and verifiers: PrimeIntellect-ai/verifiers, hud-evals/hud-python show fast iteration → good proxy for RLHF/RLVR ecosystem traction.
Physics and robotics: MuJoCo, IsaacLab, ManiSkill have active pipelines → strong for sim-to-real stories and embodied agents.
Core libraries: TRL, PyTorch RL, Tianshou, SB3 remain the practical workhorses for researchers/teams.

Risk and maintenance flags

Issue backlog hotspots: very large open-issue queues in ray-project/ray, carla-simulator/carla, facebookresearch/habitat-lab → watch for maintainer bandwidth + triage pace.
Staleness: a handful of well-known but stale >90d repos (some with big star counts) → fine for legacy baselines, not great for new dependencies.

Topic coverage: Where the field is leaning

Most frequent tags: reinforcement-learning, gym/gymnasium, robotics, multi-agent, MuJoCo, PPO/SAC/TD3, imitation/offline RL. Clear skew to embodied control + MARL + practical baselines.

If you like this newsletter, consider sharing it with 1 infra-curious friend:


Sector Deep Dive #4: OPEN SOURCE INFRA

Prateek Joshi — Wed, 24 Sep 2025 17:55:40 GMT

I built an agent to analyze 236 open source infra repos. I wanted to extract useful signals that indicate where the world is heading. These projects collectively concentrate around a few durable themes: data infrastructure (38%), AI tooling (30%), databases+search (21%), and smaller but high-signal pockets in observability and streaming.

The set is large and mature. Median project age is around 9 years, yet the projects remain very active: 75% pushed code in the last 30 days and 81% in the last 90 days. Popularity signals are strong: the median number of stars is 12.5k. As seen from the graph below, breaking past the 20k-star barrier is difficult, and beyond 40k stars is rarefied air.

Forks and stars are highly correlated (0.81), which is consistent with communities that not only watch but also remix.

From an investor/operator lens, three patterns stand out:

Foundation-led gravity wells: Apache is the dominant owner in the dataset (22% of repos) with 85% 30-day activity. This suggests healthy stewarding and long-horizon roadmaps. Non-foundation projects skew higher on stars but slightly lower on recency of activity. Classic “breakout vs durability” trade-off.
Permissive licensing is the default: Apache 2.0 accounts for 58% of identified licenses. MIT is a distant second. For commercial builds and cloud packaging, this materially reduces legal friction and widens the potential surface for enterprise adoption and revenue capture.
Databases and data infra are “workhorse hot”: Database/search projects show 88% 30-day activity and robust median stars (13.8k). Streaming is smaller in count but similarly active (83% in 30 days). Observability is tiny by count yet shows near-universal recent activity, signaling fast iteration and a race to product-market-fit in telemetry-heavy AI / infra stacks.

Language and owner landscape

Top languages by count: Python (65), Java (60), C++ (34), Go (18), TypeScript (15). The long tail includes Rust, Scala, C, Ruby, JS, and notebooks. Median stars by language (with sample-size caution) show C++, Rust, and TypeScript/Python clustering in the mid-teens to low-20k range, while Go sits a bit lower by median in this set, even though Go projects like Kubernetes and Ollama are outliers on the high end.

Owner concentration: Apache (53 repos) is the gravitational center. Next are small clusters around facebookresearch, elastic, google, Netflix, h2oai, tidyverse, and a handful of fast-moving, company-backed AI infra owners.

Apache cohort: median stars 5.9k, median open issues 218, 85% pushed in the last 30 days. High maintenance cadence and roadmap continuity.
Non-Apache cohort: higher median stars (13.9k) and higher open issues (364), but lower 30-day activity (73%). More “breakout” but slightly more sporadic recency.

Interpretation: foundation projects optimize for stability and continuity, while non-foundation/company-backed projects skew toward velocity and viral adoption (and carry more product risk but also upside).

Licensing signals and commercialization readiness

Licenses heavily favor permissive terms:

Apache-2.0 ≈ 58% of identified licenses
MIT ≈ 8–9%, BSD variants ≈ 4–5%
Copyleft licenses are a small minority (AGPL/GPL/MPL combined ≈ low-teens count)

Why it matters: for infra investors and founders, permissive licenses simplify cloud packaging, commercial add-ons, and enterprise adoption. A deep Apache-2.0 bench implies fewer legal frictions for hosting and managed services. Copyleft projects can still commercialize, but with more nuanced business models (e.g. dual-license, hosted-only value capture).

Popularity and engagement dynamics

Median stars ≈ 12.5k (mean ≈ 21–22k, long-tail heavy).
Forks–stars correlation is strong (0.81) i.e. projects that attract attention also tend to accumulate derivative work/extensions, which is useful for platform bets.
Open issues correlate moderately with stars/forks (0.34): bigger communities generate more surface area for maintenance and governance, which strengthens moat if maintainers keep pace.
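
For readers who want to reproduce these correlation figures on their own repo list, here is a minimal sketch of the Pearson coefficient. The star, fork, and issue counts below are toy numbers for illustration only and are not drawn from the 236-repo dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy numbers for illustration only.
stars = [1_200, 5_900, 12_500, 20_400, 41_000]
forks = [150, 900, 2_100, 2_600, 9_800]
open_issues = [40, 380, 120, 900, 310]

print(f"stars vs forks:       {pearson(stars, forks):.2f}")
print(f"stars vs open issues: {pearson(stars, open_issues):.2f}")
```

Star counts are long-tailed, so a few very large projects can dominate a plain Pearson coefficient. Checking a rank-based correlation alongside it is a reasonable sanity test.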

Age profile: median age 9 years. Cohorts range from 2008–2014 “classic” projects through 2019–2023 modern entrants.

Despite age, recency is strong: 75% pushed within the last 30 days, 81% within 90 days. This is notable: a materially active base across mature infra implies ongoing fit with today’s workloads (not just legacy shelfware).

Thematic clusters

Heuristic tags show the following splits and signals:

Data infra (38% of repos): ETL/ELT, lakes/warehouses, analytics engines.
- Activity: 80% pushed in 30 days.
- Median stars: 12.7k.
- Takeaway: Large, steady engine of infra demand. Many opportunities for connective tissue (metadata, cost governance, data contracts, lineage, privacy).
AI and LLM tooling (30%): inference servers, evaluation, fine-tuning, agent frameworks.
- Activity: 73% in 30 days.
- Median stars: 14.9k.
- Takeaway: Strong attention, slightly more volatility. Integration layers around model routing, evals, caching, safety, observability look investable if coupled with usage-linked pricing.
Databases + search (21%): transactional, analytical, vector, indexing.
- Activity: 88% in 30 days (highest of the major categories).
- Median stars: 13.8k.
- Takeaway: Sustained build velocity and adoption. Practical moats accrue via operational excellence (HA/backup/recovery), performance on real workloads, and cloud-native operability (autoscaling, storage tiering, predictable cost).
Streaming (5%): Kafka/Pulsar-like patterns, queues, event backbones.
- Activity: 83% in 30 days.
- Median stars: 11.5k.
- Takeaway: Smaller number but sticky demand. New growth likely at the edges (exactly-once, stateful stream processing with low-latency joins, and data contracts bridging OLTP→OLAP).
Observability/telemetry (6%): metrics, tracing, logging.
- Activity: 100% in 30 days (small sample).
- Median stars: 20.4k (skewed by a few big names).
- Takeaway: In the AI era, infra adds non-determinism and cost volatility, which makes telemetry essential. Expect consolidation around OTel-native pipelines + LLM-aware SLOs, test harnesses, and cost-to-quality guardrails.
Security/auth (1–2%): tiny count but fully active.
- Takeaway: Greenfield for policy-as-code, Secrets/Key mgmt for multi-tenant AI, evaluation/test data governance, and RAG pipeline hardening.

What this implies for company building

Pick durability. The highest sustained activity is in databases and data infra, with observability also showing intense iteration. For a newco, tighter loops on reliability, operability, and cost predictability will outrun feature-led me-toos.
License choice is strategic. Given the dominance of Apache-2.0, deviating to copyleft will narrow the distribution surface. If your moat is cloud-operational (SLAs, SRE excellence, compliance), Apache/MIT keeps optionality high and makes enterprise POCs smoother.
Own the boring edges. The correlation between stars and forks tells you where ecosystems are fertile. But investors should underwrite the edges that cause pain in production: schema evolution, data drift, stateful upgrades, backfills, cross-region consistency, on-disk format stability, efficient vector indexes under churn, and observability that ties cost → quality for AI pipelines.
Design for platform effects. Projects that make it easier to extend (plugins, connectors, storage engines, UDFs, operators) accumulate forks and integrations, which compound into moats that are hard to unwind once embedded in workflows.

What this implies for investing

Foundation-anchored assets (Apache et al) are excellent ecosystem barometers and acquisition surfaces: look for teams that commercialize operational excellence around these standards with predictable TCO.
Company-backed AI infra with permissive licenses can scale fast but must prove defensibility beyond model access e.g. data adjacency, compliance, private fine-tunes, eval/safety/observability built in, and cost governance.
Databases/search remain investable where teams demonstrate technical advantage on real customer workloads (tail latency, compaction, tiered storage, multi-AZ replication, crash-safe durability, workload isolation) and a clean path to managed-service margins.
Telemetry/control planes for RAG/agentic systems are under-supplied. The high activity in observability hints at a forming standard: OTel first, LLM-aware signals (prompt/response lineage, caching hits, hallucination/eval scores), and unit-economics dashboards that tie GPU/egress cost to business outcomes.

Concrete opportunities to explore from the patterns

AI Data Reliability and Cost Controls: Tools that watch embedding churn, index compaction, cache efficacy, and prompt/eval drift, and automatically tune for cost-to-quality trade-offs.
Unified Schema and Lineage for Hybrid Workloads: Bridges between OLTP→stream→OLAP with contract enforcement and automated backfills.
Database-as-a-Product Ops Kits: Opinionated ops for top open source databases (bootstrap, HA, backup/restore drills, live-migrations, chaos tests, perf baselines), delivered as operator + runbooks + SRE service.
Security Hardening for RAG/Agents: Policy-as-code across retrievers, tools, model endpoints, with PII/PHI/PCI controls and reproducible evals.
Plugin fabrics where forks flourish: If a repo shows strong forks/stars, there’s room for a market of connectors/operators and “glue” that becomes the default choice.

Risks and watch-outs

Star-driven bias: Stars overweight top-of-funnel excitement. Insist on usage telemetry, self-hosted installs, and enterprise references before over-indexing.
License flips / relicensing: While rare in Apache/MIT, some projects have pivoted to source-available or dual-license. Diligence the governance and contributor agreements.
Maintainer bandwidth: Open issues scale with popularity. Ensure there’s a bus-factor plan e.g. multiple core maintainers, foundation backing, or a commercial steward.

If you like this newsletter, consider sharing it with 1 infra-curious friend:


Sector Deep Dive #3: INFERENCE CLOUD PLATFORMS

Prateek Joshi — Wed, 17 Sep 2025 23:01:43 GMT

1. What inference cloud platforms are and why they matter now

“Inference” is the phase where trained AI models answer requests in the wild: they classify, summarize, code, chat, recommend, or generate images and video. Inference cloud platforms are the providers that run this at scale and expose it as APIs or managed endpoints. The group spans:

Hyperscalers: AWS Bedrock, Azure OpenAI, Google Vertex AI
Model companies with hosted APIs: OpenAI, Anthropic, Cohere, Mistral, xAI
Specialized AI clouds: CoreWeave, Lambda, Together AI, Fireworks AI, Modal, Replicate, Anyscale
Open-source inference engines (NVIDIA TensorRT-LLM, vLLM, Triton, Ollama) that these platforms increasingly use under the hood

Two things are happening at once. First, end-user demand is rising as enterprises shift from pilots to production. Companies are redesigning workflows and are seeing cost reductions in most functions where they actually deploy AI. Second, the supply side is scaling dramatically. Microsoft guided to a record ~$30 billion of capex in the current quarter (Jul 30, 2025), and Alphabet lifted 2025 capex plans to ~$85 billion, largely for AI data centers.

Taken together, that means inference capacity (not model training alone) will drive value creation, especially over the next two years as real usage compounds into steady, billable traffic. AMD’s leadership has been explicit: inference demand is set to outgrow training, with CEO Lisa Su calling out rapid acceleration.

2. Demand outlook: pricing is falling while usage rises

Developers can now buy high-end model outputs for a fraction of last year’s cost. OpenAI’s GPT-4o launched on May 13, 2024 at roughly half the price of GPT-4-Turbo and with higher rate limits. Google followed with large price cuts to Gemini 1.5 Pro (announced Sep 24, 2024; effective Oct 1, 2024) and later rolled newer 2.x models into its lineup with low-cost “Flash” tiers.

Cohere and Mistral publish similarly aggressive pricing for their command-and-reasoning families. On the ultra-low-cost end, DeepSeek’s R1 reasoning API lists input at roughly $0.55 per million tokens and $2.19 for output.
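
Per-million-token prices are easier to reason about once converted to a per-request cost. The sketch below uses the DeepSeek R1 list prices quoted above. The request shape (2,000 tokens in, 500 out) and the volume of one million requests are hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request given per-million-token list prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Prices from the text: roughly $0.55 per million input tokens, $2.19 per million output tokens.
# The request shape and volume below are hypothetical.
per_request = request_cost_usd(2_000, 500, 0.55, 2.19)
per_million_requests = per_request * 1_000_000
print(f"per request: ${per_request:.6f}, per million requests: ${per_million_requests:,.0f}")
```

At these prices a request of that shape costs about a fifth of a cent. Note that output tokens carry about half of the bill despite being a quarter of the input volume, which is why long reasoning traces change the math quickly.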

Lower unit prices haven’t slowed demand. If anything, they invite more usage and new types of applications (voice agents, video generation, background automation). Inference requests are becoming embedded in daily processes rather than sporadic “pilot” bursts.

3. Market structure: who’s selling what

Hyperscalers:
AWS, Microsoft, and Google anchor the managed platform tier, wrapping multiple models, safety filters, observability, and enterprise controls under one bill and SLA. AWS Bedrock and Azure OpenAI have achieved FedRAMP High in their government clouds, and Google secured FedRAMP High for selected components like Vertex AI Vector Search. This matters for regulated demand where compliance can be the gating factor.

Model API companies:
OpenAI, Anthropic, Cohere, Mistral, and xAI expose models directly and are often available through the hyperscalers too. Current public pricing pages let builders compare per-million-token rates and pick “fast-cheap” or “smart-expensive” options.

Specialized AI clouds and serverless GPU providers:
CoreWeave, Together AI, Fireworks AI, Modal, Replicate, and others focus on cost-efficient throughput, burst capacity, and developer experience. CoreWeave’s S-1 (filed Mar 3, 2025) revealed $1.92 billion of 2024 revenue, but also heavy customer concentration (Microsoft at ~62% in 2024 per S-1 analysis and press). It financed expansion with a $7.5 billion debt facility led by Blackstone and Magnetar on May 17, 2024.

Together AI raised a $305 million Series B on Feb 20, 2025 and crossed $100 million annualized revenue around that time, per Bloomberg/Crunchbase reporting. Fireworks AI reported rapid ARR growth in 2025 and is reportedly exploring a raise at ~$4 billion valuation. Modal and Replicate illustrate the “serverless GPU” model that charges per-second or per-GPU-hour for inference runs, with public pricing examples for L-series GPUs.

Open-source inference engines:
Under the hood, many platforms are converging on a few high-performance engines. vLLM (57k+ GitHub stars as of Sep 7, 2025) and NVIDIA’s TensorRT-LLM (actively releasing throughout 2024–2025) are two of the most visible, while NVIDIA Triton remains common as a serving runtime. On the “local” side, Ollama’s explosive adoption (152k+ stars as of Sep 6–7, 2025) signals a strong DIY and edge-inference movement. This is often a precursor to enterprise demand for managed versions.

China and the rest of world:
In China, Baidu, Alibaba (Qwen), ByteDance (Doubao), and startups like DeepSeek are pushing aggressive capability-to-cost curves. Baidu announced model upgrades and price cuts on Apr 24–25, 2025. Alibaba publishes granular Qwen API pricing. ByteDance promotes Doubao access via Volcano Engine. DeepSeek lists low per-million-token pricing for its R1 reasoning model.

4. Economics in plain terms: what drives costs and margins

Inference costs scale with three levers:

compute per request (model size, precision, and decoding strategy)
utilization (keeping GPUs busy with batching and scheduling)
data movement (egress and inter-AZ (availability-zone) / region traffic)

Providers increasingly use FP8/FP4 kernels, paged attention, speculative decoding, and “in-flight batching” to boost throughput, which is exactly what TensorRT-LLM and vLLM are optimized for.
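
The first two levers can be put into one rough formula: serving cost per token is the GPU's hourly price divided by the tokens it actually produces in that hour. The sketch below assumes a single GPU and ignores data movement. The $2.50/hour price and 2,500 tokens/s throughput are hypothetical inputs, not benchmarks of any engine.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second_at_full_load, utilization):
    """Serving cost per 1M output tokens for one GPU, before data-movement charges."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tokens_per_hour = tokens_per_second_at_full_load * 3600 * utilization
    return gpu_hourly_usd / effective_tokens_per_hour * 1_000_000


# Hypothetical inputs: a $2.50/hour GPU that sustains 2,500 tokens/s when fully batched.
for util in (0.2, 0.5, 0.8):
    cost = cost_per_million_tokens(2.50, 2_500, util)
    print(f"utilization {util:.0%}: ${cost:.2f} per 1M tokens")
```

Cost scales inversely with both throughput and utilization, so going from 20% to 80% utilization cuts unit cost by 4x on the same hardware. That is the margin that batching and scheduling work is chasing.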

At the cloud-network level, egress fees and inter-AZ traffic still matter for TCO (total cost of ownership). In 2024, Google removed exit fees for customers migrating off its cloud (Jan 11–12, 2024), and AWS followed in March 2024. But normal egress still applies for day-to-day operations. Typical AWS data transfer out runs roughly $0.09/GB for the first 10 TB/month in many regions, with inter-AZ charges around $0.01/GB.

Put practically: bandwidth adds up if an application streams images or video from inference outputs or moves embeddings between services.
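
A quick worked example shows how that adds up. The two rates are the approximate AWS figures quoted above. The workload (3M requests a month, 2 MB returned per request, 0.5 MB crossing availability zones) is hypothetical, and the sketch deliberately refuses to price egress beyond the first 10 TB tier rather than guess at other rates.

```python
EGRESS_USD_PER_GB = 0.09    # from the text: first 10 TB/month, many regions
INTER_AZ_USD_PER_GB = 0.01  # from the text: approximate inter-AZ rate


def monthly_transfer_cost(requests_per_month, egress_mb_per_request, inter_az_mb_per_request):
    """Rough monthly data-transfer bill; valid only while egress stays within the first 10 TB."""
    egress_gb = requests_per_month * egress_mb_per_request / 1000      # decimal GB
    inter_az_gb = requests_per_month * inter_az_mb_per_request / 1000
    if egress_gb > 10_000:
        raise ValueError("beyond the first 10 TB tier; look up the provider's current tiered rates")
    return egress_gb * EGRESS_USD_PER_GB + inter_az_gb * INTER_AZ_USD_PER_GB


# Hypothetical image-generation app: 3M requests/month, a 2 MB image returned to the user,
# and 0.5 MB of embedding and cache traffic crossing availability zones per request.
print(f"${monthly_transfer_cost(3_000_000, 2.0, 0.5):,.2f} per month")
```

That workload moves 6 TB out and 1.5 TB between zones for about $555 a month, nearly all of it egress. Swap images for video and the same request count blows through the first pricing tier.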

That’s why many inference platforms co-locate vector search, caches, and storage with serving to cut cross-service data charges. It’s also why some startups prefer specialized AI clouds where pricing bundles compute, storage, and networking tightly.

5. Reliability and compliance: what enterprises actually ask for

Large buyers care about uptime SLAs, security attestations, and government cloud options. That said, outages do happen. OpenAI reported notable incidents in 2023–2024 (including a DDoS-related disruption on Nov 8, 2023 and a service impairment on Jun 4, 2024), which buyers often cite when asking about multi-vendor failover and local fallback.

For IT leaders, this translates into a simple rule:

pick at least two model providers (direct or via a hyperscaler) and deploy a fallback path on an open-source engine (vLLM/TensorRT-LLM) where feasible.

This reduces outage risk and allows cost routing as prices change.
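
A minimal sketch of that rule, combining failover with cost routing, might look like this. The route names, prices, and `call` wrappers are hypothetical. In practice each `call` would wrap a provider SDK or a self-hosted engine endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelRoute:
    name: str
    usd_per_million_tokens: float
    call: Callable[[str], str]  # hypothetical wrapper around a provider SDK or local engine
    healthy: bool = True


def route(prompt: str, routes: list) -> tuple:
    """Try healthy routes cheapest-first; mark a route unhealthy when it fails."""
    candidates = sorted((r for r in routes if r.healthy),
                        key=lambda r: r.usd_per_million_tokens)
    for r in candidates:
        try:
            return r.name, r.call(prompt)
        except Exception:
            r.healthy = False  # a real system would re-probe after a cooldown
    raise RuntimeError("no healthy model route available")


# Stubs for demonstration; names and prices are made up.
def down(prompt: str) -> str:
    raise ConnectionError("provider outage")


routes = [
    ModelRoute("provider-a", 1.0, down),
    ModelRoute("provider-b", 3.0, lambda p: f"b:{p}"),
    ModelRoute("local-open-source-engine", 5.0, lambda p: f"local:{p}"),
]
```

With the cheapest provider down, `route("hi", routes)` falls through to `provider-b`, and later calls skip `provider-a` until something marks it healthy again. Updating `usd_per_million_tokens` as vendors change prices is all it takes to shift traffic.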

6. Competitive dynamics: price wars meet platform bundling

Price cuts are not just marketing. They pressure everyone (especially startups) to improve GPU utilization and lower serving costs. Google’s 64%/52% cuts on Gemini 1.5 Pro in late 2024 set a precedent. OpenAI’s GPT-4o (May 13, 2024) cut price and boosted rate limits. xAI’s Grok 4 and DeepSeek R1 added low-cost reasoning options in 2025.

Hyperscalers also bundle: the model call is one API, but buyers are really purchasing governance, observability, private networking, enterprise auth, and in some cases FedRAMP posture. That bundle is hard for startups to match, which is why specialized AI clouds differentiate on raw performance, GPU availability, or easy developer workflows.

CoreWeave is the bellwether for specialized AI clouds. Its S-1 showed explosive revenue growth to $1.92 billion in 2024. These numbers show how capital intensive inference clouds are and how much their fate can hinge on a few anchor customers.

7. How this connects to infrastructure startups and how many are affected

Where the dependency lies: Even if a startup doesn’t sell an “inference API”, much of the AI infra startup stack is downstream of inference demand:

Model-serving and orchestration (BentoML/Ray Serve/Anyscale, vLLM, Triton): revenue aligns with request volumes and concurrency. As inference scales, these grow. Anyscale’s 2025 partnerships signal continued push to managed Ray-based serving.
Data layer (vector databases, feature stores): more inference means more embeddings, more caching, and more retrieval.
Observability/security (guardrails, evals, tracing): production inference requires red-teaming, safety checks, and run-time monitoring.
Networking and acceleration (NICs, smart switches, CUDA/ROCm kernels): token throughput and tail-latency are network-sensitive.
Edge/enterprise local (Ollama-style local serving, on-prem Triton/TensorRT-LLM): regulated workloads and cost control push some inference to customer hardware.

A rough share of affected infra startups:
Public venture reports show AI absorbing an outsized share of VC dollars in 1H-2025 (EY: ~$49.2 billion into gen-AI in H1 2025, already above full-year 2024), with a visible skew toward application-layer deals. On that basis, I estimate that at least half (and plausibly 60–70%) of AI infrastructure startups have their fortunes tied to inference growth (direct revenue or adjacent data/observability spend).

This is an estimate, not a hard count. But it matches what’s seen in fund flows and market maps: most infra projects today pitch either “cheaper, faster inference” or “better pipelines and retrieval for inference.”

Correlation and dependency:
When hyperscalers raise capex and roll out enterprise controls, application teams are more likely to ship production features. And that directly lifts inference calls, which lifts demand for the whole downstream stack. Conversely, if a big buyer slows deployments or centralizes on one provider for cost reasons, adjacent infra (observability, vector DBs, orchestration) may see delayed projects.

8. Risks over the next 24 months and what to watch

(a) Supply chain and power constraints
GPU allocations are still tight and data-center interconnect queues are long in power-constrained regions. BigTech’s capex is surging (Microsoft ~$30 B this quarter, Alphabet ~$85 B for 2025), but a lot of that converts into capacity only when land, power, and cooling are ready. Watch for delays tied to grid interconnects and specialized high-density builds.

(b) Price compression
The fall in per-token pricing is likely to continue, with new “reasoning” models (DeepSeek R1, xAI Grok 4) pushing the price of a given level of performance down further. Platforms must keep utilization high (via batching, caching, or model distillation) to avoid margin squeeze.

(c) Outages and concentration
Incidents at a single model provider can knock out large fractions of traffic for hours. The OpenAI incidents in Nov 2023 and Jun 2024 are a reminder. SRE teams will push for multi-provider routing and local fallbacks.

(d) Compliance and data locality
The good news is that FedRAMP High and similar authorizations are arriving for major services. The challenge is that sensitive workloads still need private networking, key management, and clear data-use policies. Delays in rolling out “enterprise safeguards” can stall big deals.

(e) Macro and investor sentiment
Inference clouds are capital intensive. Public market reception to specialized AI clouds has been mixed. If public comps wobble, late-stage private rounds could slow, impacting the partner ecosystem.

9. How to think about winners

Platforms that control utilization
Winners will squeeze more requests per GPU hour. The technology stack is clear: engines like TensorRT-LLM and vLLM, plus tricks like FP8/FP4 quantization and speculative decoding, drive throughput without hurting quality. Those gains are hard to reverse and compound every quarter.

Platforms that own regulated channels
FedRAMP High and sector certifications unlock budgets that smaller vendors can’t access quickly. AWS, Microsoft, and Google’s moves in 2024–2025 are strategic moats in US public sector and highly regulated industries.

Platforms with balanced customer bases
Revenue concentration is a risk in this subsector. The more diversified the top customers and geographies, the sturdier the cash flows through cycles.

Global low-cost challengers
The China ecosystem (Baidu, Alibaba, DeepSeek, ByteDance) is pushing cost down dramatically. While export controls and data residency limit cross-border usage, the pricing pressure they exert globally will influence buyer expectations.

10. Practical guidance for investors and operators

For venture investors
Ask any infra startup how they (a) keep GPUs hot (utilization), (b) cut token compute per request (quantization, distillation, caching), and (c) minimize bandwidth charges (co-location, compression, RAG locality). You’re looking for companies able to maintain gross margins as price per token keeps sliding.
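To make the utilization question concrete, here is a minimal sketch of how utilization feeds into serving cost per million tokens. The GPU hourly price and peak throughput below are hypothetical placeholders, not figures from any vendor, and the model ignores batching effects, networking, and overhead.

```python
def cost_per_million_tokens(gpu_hour_usd: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Serving cost per million tokens for one GPU at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served in an hour, scaled by how busy the GPU is.
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: $2.50 per GPU-hour, 2,500 tokens/sec at full load.
GPU_HOUR_USD = 2.50
PEAK_TOKENS_PER_SEC = 2500

for u in (0.3, 0.6, 0.9):
    cost = cost_per_million_tokens(GPU_HOUR_USD, PEAK_TOKENS_PER_SEC, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
```

Under these assumptions, cost per token is inversely proportional to utilization, so a platform running at 30% pays three times as much per token as one running at 90% on the same hardware.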

Validate multi-provider integrations. If a startup depends on one model vendor or one cloud region, treat that as concentration risk, much like you would a single large customer.

Finally, watch the “serverless GPU” abstraction layer. Modal and Replicate show that per-second billing and instant scale can beat reserved instances for bursty workloads. Adoption there could shift where “platform” margins accrue (to the servers-on-demand layer).

For corporate buyers and product teams
Lock in at least two model paths (via your cloud of choice plus a direct API) and a local fallback using vLLM or TensorRT-LLM for mission-critical flows. Budget for bandwidth where outputs are large (images, video) and keep RAG stores co-located with serving to avoid inter-AZ/region fees. Remember that clouds may now waive exit fees when you migrate off, but normal egress charges still apply to daily operations.
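A rough sketch of what "two model paths plus a local fallback" can look like in code. The three provider functions are stand-ins that simulate outages; in practice each would wrap a real client (a cloud endpoint, a direct API, a local vLLM or TensorRT-LLM server). All names here are hypothetical.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers (hypothetical). Two simulate failures, one succeeds.
def cloud_hosted(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def direct_api(prompt: str) -> str:
    raise ConnectionError("simulated rate limit")


def local_model(prompt: str) -> str:
    return f"[local] {prompt[:40]}"


used, text = complete_with_fallback(
    "Summarize the incident report",
    [("cloud", cloud_hosted), ("direct", direct_api), ("local", local_model)],
)
print(used, "->", text)
```

A production version would likely add timeouts, retries with backoff, and per-provider health tracking, but the ordering logic is the core of the idea.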

For founders at the application layer
Lean into cheaper “flash” tiers for non-critical tasks and reserve expensive reasoning models for high-value steps. Many teams are carving workload graphs so that 70–90% of tokens go to low-cost models and only the “hard” paths hit premium models. That keeps unit economics sane as your user base grows.
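To see why the 70–90% split matters, here is a small sketch of the blended price per million tokens at different routing mixes. The two price points are hypothetical, chosen only to show the shape of the curve.

```python
def blended_price(cheap_share: float,
                  cheap_price: float,
                  premium_price: float) -> float:
    """Weighted-average price per million tokens for a two-tier routing mix."""
    if not 0 <= cheap_share <= 1:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1 - cheap_share) * premium_price


# Hypothetical prices per million tokens (not taken from any vendor price sheet).
CHEAP_PRICE = 0.30
PREMIUM_PRICE = 10.00

for share in (0.0, 0.7, 0.8, 0.9):
    price = blended_price(share, CHEAP_PRICE, PREMIUM_PRICE)
    print(f"{share:.0%} of tokens on the cheap tier: ${price:.2f} per million tokens")
```

With a large price gap between tiers, the premium share dominates the blended cost, which is why teams focus on shrinking the set of "hard" paths rather than shaving the cheap tier further.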

11. What could change the call (24-month horizon)

Positive surprises

A step-change in throughput (e.g. widespread FP4 adoption or a new serving breakthrough) that halves cost per request again. Suddenly many more use cases become profitable. Keep an eye on TensorRT-LLM and vLLM releases.
Faster regulatory certifications (FedRAMP High/DoD IL-5 for more services) unlocking pent-up demand in government and healthcare.
Public-market validation of specialized AI clouds (smoother IPO outcomes and rising multiples) that lowers the sector’s cost of capital and speeds build-outs.

Negative surprises

Power or interconnect delays slow data-center rollouts, creating capacity gaps during peak demand. The capex is committed, but lead-times can slip.
Major, prolonged outages trigger widespread buyer mandates for on-prem inference, temporarily shifting spend away from managed cloud endpoints.
An extended price war that compresses gross margins faster than utilization improvements can compensate. This would be particularly painful for smaller providers without proprietary hardware access.

12. How many infrastructure startups will this affect and how?

Estimated share affected:
Given H1 2025 venture flows (~$49.2 billion into generative AI in H1 alone) and the clear skew of infra pitches toward serving, orchestration, vector/RAG, and observability, a reasonable estimate is that 60–70% of AI infrastructure startups are directly tied to inference adoption curves either as primary revenue (serving) or adjacent spend (data/observability). That range reflects uncertainty: public sources break out “AI” as a whole, not “inference infra” specifically.

Correlation pathways:

Capex → capacity → price: Hyperscaler capex raises capacity. More capacity tends to push per-token prices down. Lower prices increase app usage. More usage drives infra demand.
Compliance unlock → big-ticket buyers: FedRAMP High or equivalent certifications unlock multi-year contracts. Once signed, these generate steady inference flows that support data and observability partners.
Model competition → routing: As DeepSeek/xAI/Mistral cut costs, app teams start routing workloads by cost/quality. That forces startups to integrate multiple providers and invest in evaluation and guardrails, benefiting orchestration and tooling companies.

Key dependencies:

GPU supply and scheduling tech (TensorRT-LLM, vLLM, advanced schedulers) are foundational. Without them, margins erode.
Network costs determine whether RAG-heavy apps scale profitably. “Exit” fee waivers don’t change daily egress economics.
Customer concentration (CoreWeave-Microsoft) shows platform-level fragility that can cascade to smaller partners.

13. Regional notes: US, Europe, Asia

United States: The US remains the center of gravity for both demand and supply. FedRAMP authorizations and record hyperscaler capex point to continued expansion.

Europe: Regulatory focus on switching costs pushed clouds to waive exit fees, which may encourage multi-cloud inference strategies (EU Data Act and broader scrutiny played a role in the 2024 fee changes).

China: A separate but fast-moving market with intense price competition (Baidu, Alibaba Qwen, DeepSeek, ByteDance). Even if cross-border use is limited, the global effect shows up in buyer expectations about what a “fair” price per million tokens should be.

14. What to monitor (practical checklist for the next 6–24 months)

Capex guidance: Microsoft, Alphabet, Amazon quarterly updates. If spending flattens earlier than expected, expect tighter capacity growth and slower price cuts.
Price changes: OpenAI, Google, Anthropic, Mistral, Cohere, xAI, DeepSeek pricing pages. Track cuts or new “flash/reasoning” tiers.
Engine releases: vLLM and TensorRT-LLM release notes. Watch for features like better batching, quantization, and scheduler upgrades that change GPU economics.
Compliance milestones: New FedRAMP/IL-4/5 authorizations across Bedrock, Vertex AI, Azure OpenAI. These correlate with large RFPs in public sector and highly regulated industries.
Incident history: Model/API outages on vendor status pages and developer forums. See if customers adopt multi-provider routing as standard.
Public comps: Watch CoreWeave and any follow-ons. Stock performance and disclosures can influence late-stage private rounds and M&A appetite across inference tools.

15. Bottom line

Inference cloud platforms are moving from novelty to utility. Prices are trending down, capacity is trending up, and compliance gates are opening. In the next two years, the most value will accrue to platforms that do three things well: (1) keep GPUs highly utilized with advanced serving stacks (vLLM, TensorRT-LLM, Triton-style backends), (2) meet enterprise governance and compliance needs at scale, and (3) diversify customers and geographies to reduce concentration risk.

For venture investors, that creates two attractive pockets: (a) “picks and shovels” that make inference cheaper (serving engines, schedulers, compression, agentic runtimes) and (b) “adjacent infrastructure” that becomes non-optional as inference scales (vector/RAG stores, eval/observability/guardrails, privacy/security layers). For founders and buyers, the operating playbook is: multi-model routing, local fallbacks, and ruthless attention to utilization and bandwidth.

The punchline: the growth of inference will pull a majority of AI infrastructure startups along with it. The exact share is uncertain. But current capex, pricing, and adoption signals make it hard to see a different center of gravity for AI infrastructure in the near term. Keep tracking capex guidance, price sheets, engine releases, and compliance wins. That’s where the next two years of winners will be decided.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:

Sector Deep Dive #2: AI BROWSERS

Prateek Joshi — Thu, 04 Sep 2025 16:03:00 GMT

1. Snapshot and thesis

Browser wars are happening again! This time with AI fueling the frenzy. AI browsers blend a familiar web browser with built-in AI that can summarize pages, answer questions, auto-navigate, draft content, and increasingly act for the user.

Over the next 24 months, this category will test whether AI-first workflows can carve out share in an entrenched browser market, whether they can monetize beyond traditional search ads, and whether they can create new attachments across the cloud and edge stack.

Core thesis: The investable opportunity is attractive if (a) AI browsers can win distribution on mobile and desktop despite defaults, (b) they can lower inference costs using in-browser compute and edge inference, and (c) they can convert intent (and time saved) into measurable revenue (ads, subscriptions, affiliate, or payments).

The upside expands because this category sits on top of and helps drive demand for a number of infra layers: GPUs, WebGPU / ONNX Runtime Web / TensorFlow.js, model APIs (OpenAI, Anthropic, Google), edge inference platforms (Cloudflare, Akamai), and search indexes (Brave Search, Bing, Perplexity’s own).

Why now: Two enabling changes have landed: (1) web-native acceleration (WebGPU, maturing WASM toolchains, early WebNN), so models can run in the browser, and (2) policy/UX shocks that open distribution (EU DMA forcing Apple to allow alternative engines in the EU, Google’s AI Overviews changing search result pages and traffic patterns).

2. What counts as an “AI browser” and who are the players

In this report, “AI browsers” will include any web browser or browser-like app where AI is a native control surface (not just an extension). That spans:

Established browsers adding AI:
Google Chrome: AI Overviews in search; “Help me write,” “Tab Organizer”.
Microsoft Edge: Copilot built-in.
Opera: Aria assistant via Composer, on desktop, Android, and Opera Mini. Neon, an AI-native browser slated for public rollout. MiniPay crypto wallet integrated.
Brave: Leo assistant, independent Brave Search, opt-in Brave Ads with 70% revenue share to users.
Mozilla Firefox: Exploring on-device and private AI assistants. Firefox remains a distribution gate even as AI integrations evolve. Context on iOS engine policy below.
AI-native challengers:
Perplexity: Comet AI browser. Talks with OEMs for preinstalls. Consumer MAU and revenue run-rate scaled markedly in 2025.
BrowserBase: A web browser for AI agents. They recently teamed up with Cloudflare to build an identity layer for AI agents.
Anthropic: They recently launched Claude for Chrome.
The Browser Company (Arc): AI-heavy “Browse for me” style features and agentic flows.
SigmaOS: Workflow browser with AI co-pilot.
Felo/Fellou and Strawberry: Smaller AI browser entrants.
You.com: AI search + browser-style app. Enterprise agents and a web search API.
DuckDuckGo: DuckAssist summaries and private AI chat embedded in its app.
Regional ecosystems:
Baidu is folding ERNIE models into search and companion apps. Yandex upgraded its Alice assistant with YandexGPT 3 across devices. These ecosystems point to AI browsers that are tightly integrated with local search, messaging, and mini-apps.

3. Market size and trajectory (next 24 months)

Installed base: Browsers are the biggest consumer runtime. Statcounter shows Chrome at ~68%, Safari ~16%, Edge ~5%, with others (Brave, Opera, Firefox, Samsung Internet) making up the balance (12-month window ending July 2025). Those shares apply to a global internet user base of ~5.5–5.65 billion in 2024–2025.

Penetration starting point: AI answers already reach users through default surfaces (Google AI Overviews, Bing/Copilot milestones such as 100 million DAU in 2023), while dedicated AI browsers are in early innings with meaningful momentum: Perplexity reported ~30 million MAU and a ~$150 million ARR-like run-rate by mid-2025. Brave discloses ~93.8 million MAU. Opera reported accelerating ad/search growth with AI-powered monetization and an AI-native Neon rollout in 2025.

Sizing the near-term “AI browser” revenue pool. This is not a classic TAM. It’s layers:

Ad/search monetization inside the browser: Sponsored answers, native ad formats, and affiliate/commerce. Google’s AI Overviews shift click-through patterns. Publishers report measurable volatility, which implies spend reallocations toward answer surfaces.
Subscriptions: Perplexity’s premium, Brave Leo Premium, VPN bundles, and potential “agent” fees.
Payments/commerce baked into browsers: Opera MiniPay crossing 8–9 million activated wallets and >200–250 million transactions in 2025.

Back-of-envelope scenario (illustrative): If only 2% of the world’s internet users (~110 million) adopt an AI browser subscription at a blended $4/month in 24 months, that’s ~$5.3 billion annualized. If another 300 million users generate $1/month in incremental ad/affiliate yield via AI answers and shopping flows, that’s ~$3.6 billion annualized. These are assumptions and the purpose is to show order of magnitude. Actual outcomes hinge on distribution and inference cost curves.
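The arithmetic behind that scenario is easy to reproduce and tweak. The inputs below are the same illustrative assumptions as in the paragraph above (using the low end of the internet-user range); the variable names are just for this sketch.

```python
# Illustrative assumptions from the scenario above, not forecasts.
internet_users = 5.5e9        # low end of the ~5.5-5.65 billion range
subscriber_share = 0.02       # 2% adopt an AI browser subscription
subscription_price = 4.0      # blended USD per user per month
ad_users = 300e6              # users monetized through ads/affiliate
ad_yield = 1.0                # incremental USD per user per month

subscribers = internet_users * subscriber_share
subscription_revenue = subscribers * subscription_price * 12
ad_revenue = ad_users * ad_yield * 12
total_revenue = subscription_revenue + ad_revenue

print(f"subscribers:           {subscribers / 1e6:.0f} million")
print(f"subscription revenue:  ${subscription_revenue / 1e9:.2f} billion/year")
print(f"ad/affiliate revenue:  ${ad_revenue / 1e9:.2f} billion/year")
print(f"combined:              ${total_revenue / 1e9:.2f} billion/year")
```

Changing any single input (adoption share, price, or ad yield) scales the result linearly, which is a quick way to sanity-check how sensitive the order of magnitude is to each assumption.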

Growth drivers:
a. Distribution policy changes (EU DMA, choice screens on iOS 17.4 in the EU, and potential antitrust remedies around default search contracts in the US).
b. Lower latency and cost thanks to WebGPU + ONNX Runtime Web/TensorFlow.js and edge inference (Cloudflare, Akamai).
c. New OEM channels for AI browsers on mobile (Perplexity’s Comet preinstall talks).

4. Who uses these and why they care

For consumers, the job to be done is “get answers, not links” plus “summarize this page, plan this trip, buy the thing, and do it fast”. AI browsers like Perplexity and Opera Aria collapse the search-read-copy-paste loop. And in Opera Mini, they do it even on low-end Android devices via lightweight UIs.

For prosumers and teams, the pitch is “do more with fewer tabs”. Arc and SigmaOS focus on workspace-style browsing with AI co-pilots to organize and draft. And You.com pushes enterprise agents and a web search API to wire AI into regulated workflows.

For privacy-sensitive users and publishers, Brave’s independent index and Leo assistant try to keep data local and links credited. DuckDuckGo keeps AI optional and private.

5. Product and technology: how the stack fits together

Client-side acceleration. WebGPU (default in Chrome/Edge since 2023–2024) plus WASM allows meaningful on-device inference for vision, speech, and small-to-mid LLMs. Microsoft’s ONNX Runtime Web added a WebGPU execution provider in Feb 2024. TensorFlow.js continues to expand WebGPU support. Early WebNN bridges to native ML APIs. This reduces cloud spend, improves latency, and eases privacy concerns.

Model options. Opera’s Aria routes to OpenAI and Google through its Composer layer (and can switch models), while Brave’s Leo supports leading closed and open models. Perplexity uses a blend (its own retrieval stack plus partner models) and its Comet browser aims to bring that into the navigation layer.

Open-source tooling (browser-ready).

WebLLM / MLC-LLM (LLMs compiled for WebGPU, in-browser quantization).
Transformers.js (browser-side transformer inference with JS).
llama.cpp (CPU/GPU-friendly inference, with ports to web via WASM/WebGPU).
ONNX Runtime Web and TensorFlow.js (core runtime layers, increasingly WebGPU-accelerated).

Edge inference. Cloudflare’s Workers AI and AI Gateway, and Akamai’s Cloud Inference, push model serving closer to users to cut tail latency and cost. For AI browsers, that shortens round-trips for page-aware actions (summaries, “shopping compare”, “book this”) and creates an infra partnership surface (shared caches, embeddings, and guardrails at the edge).

OS hardware tailwinds (indirect but relevant). Copilot+ PCs ship with NPUs. Apple Intelligence targets A17 Pro and M-series devices. While browsers mainly use the GPU via WebGPU, the hardware trend signals more capable local inference and rising user expectations for on-device AI.

6. Distribution and competition

Defaults still matter. Safari and Chrome together exceed 80% market share worldwide. Moving users off defaults is hard. That’s why Perplexity is negotiating preinstalls and why EU choice screens (iOS 17.4) matter. For private companies, OEM deals and regional bundling can be the difference between niche and mainstream.

Search stacks split the field.

Google: AI Overviews change SERP layouts and upstream supply/demand. The company controls both Chrome and Search.
Microsoft: Edge + Copilot + Bing give an integrated alternative. Bing crossed 100 million DAU when Chat launched in 2023, a psychological threshold that keeps the flywheel turning.
Independents: Brave has its own index (plus Brave Search Ads), a differentiator vs meta-search. Perplexity is building index and retrieval infra and layering an agentic browser on top. BrowserBase is building web browsers for agents and applications.
Regional: Baidu (ERNIE in search), Yandex (Alice/YandexGPT) bundle AI across services.

Notable financial traction and signals.

Perplexity: ~30 million MAU and ~$150 million run-rate as of mid-2025 (press reporting).
BrowserBase: 50 million+ browser sessions in 2025, serves 1,000+ companies, and has 20,000+ developers signed up.
Brave: ~93.8 million MAU (company transparency page, 2025).
Opera: Q2 2025 revenue +30% YoY; ad revenue +44% YoY; MiniPay >9 million activated wallets and >250 million transactions; AI-native Neon moving toward rollout.

7. Monetization, unit economics, and paths to profit

Ad and commerce yield. AI answers can capture high-intent queries (e.g. comparisons, local services, products) and monetize via native ads, affiliate, or merchant lead gen. Google’s AI Overviews already alter click flows. Some publishers report big swings, a signal that spend may reallocate toward answer units where the decision happens. For challengers, the question is whether AI answers can command CPC/CPA pricing comparable to today’s SERP ads at scale.

Subscriptions. Perplexity Plus (and Comet tiers), Brave Leo Premium, and bundles (VPN, talk, search premium) provide diversified ARPU. Subscriptions smooth out the inherently cyclical ad budgets and create a budget for inference.

Revenue sharing and tokens. Brave shares 70% of ad revenue from opt-in Brave Ads with users via BAT. It’s one of the few browsers with a transparent user revenue share. For investors, this is a lever for adoption and loyalty, but it shifts margin from company to user.

Inference cost curve. Cloud-only answer generation is expensive. But two trends are bending the curve:
a. In-browser inference offloads smaller and latency-sensitive tasks (summaries, RAG, speech) using WebGPU and WASM. ONNX Runtime Web’s WebGPU provider launched in Feb 2024.
b. Edge inference (Cloudflare, Akamai) trims tail latency and egress while enabling semantic caching. Fastly’s AI Accelerator (semantic caching) illustrates the caching/gateway layer that can sit in front of expensive LLM calls.
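For readers unfamiliar with semantic caching, here is a toy sketch of the idea: before paying for a model call, check whether a sufficiently similar query has already been answered. The bag-of-words "embedding", the 0.8 threshold, and the `fake_llm` function are all simplifying assumptions for illustration. This is not how Fastly, Cloudflare, or Akamai implement it; real systems use learned embeddings and a vector index.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real cache would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query: str):
        """Return the cached answer of the most similar past query, or None."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


def answer(query: str, cache: SemanticCache, expensive_call):
    """Return (answer, cache_hit). Only calls the model on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    result = expensive_call(query)
    cache.put(query, result)
    return result, False


calls = []  # records which queries actually reached the (fake) model


def fake_llm(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"


cache = SemanticCache()
results = [answer(q, cache, fake_llm) for q in (
    "best budget laptop for students",
    "best budget laptop for students 2025",  # near-duplicate of the first query
    "cheap flights to tokyo",                # unrelated query
)]
for text, hit in results:
    print("HIT " if hit else "MISS", text)
```

The near-duplicate query is served from cache, so only two of the three queries reach the model. The threshold is the key tuning knob: too low and users get stale or wrong answers, too high and the cache rarely hits.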

Unit economics (directional). If an AI browser session involves 1–2 short model calls (RAG + summary) that can be handled locally or at the edge for pennies, and premium tasks go to GPT-4o/Claude/Gemini only when needed, then gross margins can look similar to ad-supported browsers with improved attach on premium. The mix of local/edge/cloud will be the dominant driver of gross margin over the next 24 months.
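A directional sketch of how the local/edge/cloud mix moves gross margin. Every number here is a hypothetical placeholder (per-session costs, session count), apart from the $4/month ARPU borrowed from the earlier back-of-envelope scenario.

```python
def gross_margin(arpu: float,
                 sessions_per_month: int,
                 mix: dict,
                 cost_per_session: dict) -> float:
    """Gross margin as a fraction of ARPU, given the share of sessions served per tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    blended_cost = sum(mix[tier] * cost_per_session[tier] for tier in mix)
    cogs = sessions_per_month * blended_cost
    return (arpu - cogs) / arpu


# Hypothetical inference cost per session in USD, by where the work runs.
COST = {"local": 0.0, "edge": 0.002, "cloud": 0.02}
ARPU = 4.0       # USD per user per month (illustrative)
SESSIONS = 300   # sessions per user per month (illustrative)

cloud_heavy = {"local": 0.1, "edge": 0.2, "cloud": 0.7}
local_heavy = {"local": 0.6, "edge": 0.3, "cloud": 0.1}

margin_cloud_heavy = gross_margin(ARPU, SESSIONS, cloud_heavy, COST)
margin_local_heavy = gross_margin(ARPU, SESSIONS, local_heavy, COST)

print(f"cloud-heavy mix: {margin_cloud_heavy:.1%} gross margin")
print(f"local-heavy mix: {margin_local_heavy:.1%} gross margin")
```

Under these assumptions the same product swings from negative to strongly positive gross margin purely on where inference runs, which is the point of tracking the local/edge/cloud ratio.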

8. Dependencies, hidden connections, and infra correlations

To GPUs (and GPU-like APIs). WebGPU exposes the device GPU to the web. Everything from WebLLM to ONNX Runtime Web depends on it. That ties AI browser performance to Chrome/Edge release cadence (and to Metal/Vulkan/DirectX under the hood), raising the value of teams that can squeeze performance from shader kernels and quantization.

To model vendors. Opera Aria’s Composer taps OpenAI and Google. Edge integrates Copilot (OpenAI-family models). Perplexity blends its retrieval with partner models. Contract pricing, rate limits, and safety policies at OpenAI, Google, and Anthropic directly affect the UX and gross margin of many AI browsers.

To search indexes. Independence matters. Brave’s own index reduces reliance on Bing. Perplexity is investing in its own web data and partnerships (Comet + OEM). Contract changes at Bing or Google could whipsaw smaller players.

To edge networks. Cloudflare’s Workers AI/Gateway and Akamai’s Cloud Inference make agentic browsing feel instantaneous and cheaper. Expect deeper commercial tie-ups (shared semantic caches, RAG stores at the edge, abuse prevention) between AI browsers and these networks.

To mobile OEMs and app stores. Perplexity’s preinstall talks hint at a classic “default wars” playbook. DMA-driven choice screens on iOS in the EU open a wedge for challengers who design great first-run flows and import tools. These are leverage points for venture-backed challengers.

Correlations to watch.

Ad market health <> AI answer RPMs. If SERP budgets migrate into AI answer units, browsers that own the answer surface will capture outsized upside.
WebGPU maturity <> local inference share. As WebGPU and WebNN mature, more workload moves on-device, improving margins and privacy.
Policy changes <> install base churn. DMA-style changes and U.S. remedies on default contracts could shift share faster than organic marketing can.

9. Risks and how to read them early

Distribution lock-in. Chrome and Safari dominate. Even great products struggle to overcome defaults and habit. Early warning: OEM deals failing to convert, short-lived spikes post-PR, and low stickiness after first-run.

Traffic and publisher backlash. AI answers that hoover demand without sending clicks will risk regulatory and ecosystem pushback. There might be lawsuits, policy proposals, and an uptick in paywalled content blocking crawlers. Publishers have already reported volatility post-AI Overviews.

Safety and compliance. Browsers delivering AI answers at scale must handle hallucinations, defamation, and local content controls. Edge caches can repeat bad answers faster. Human-in-the-loop and retrieval quality become key controls.

Inference cost blowouts. If workloads stay cloud-heavy, COGS can swamp subscription ARPU. Watch the ratio of local/edge/cloud calls. Follow releases from ONNX Runtime Web/TensorFlow.js and edge providers that measurably cut costs.

Mobile platform friction. Apple’s EU engine carve-out helps, but outside the EU the WebKit requirement remains. Android OEM deals can be fragile. Track Apple/WebKit changes, iOS adoption of alternative engines (EU-only), and OEM preinstall terms.

Regional complexity. In China and parts of the CIS, local giants (Baidu, Tencent, Yandex) integrate AI into super-apps and search, making it hard for foreign AI browsers to grow. Follow ERNIE, Hunyuan, DeepSeek integrations across search, messaging, and app stores.

10. Company snapshots (what’s differentiated)

Google (Chrome + AI Overviews). Control over the browser and the ad engine with AI Overviews shifting “where decisions get made”. The risk is publisher backlash and regulatory scrutiny if traffic declines persist.

Microsoft (Edge + Copilot). “Assistant in the browser” is clear. Crossing 100 million DAU on Bing in 2023 showed a step-change in engagement once chat arrived. The new Copilot+ PC push raises user expectations for local AI, which can complement WebGPU workloads in Edge.

Opera (Aria, Neon, MiniPay). Clear AI narrative, strong emerging-market exposure via Opera Mini, and a fintech angle through MiniPay (8–9 million wallets, >200–250 million transactions). Q2 2025 had 30% revenue growth, 44% ad growth, and Neon on the horizon. Execution on Neon and MiniPay monetization are the key tells.

Brave (Leo, Search, BAT). A vertically integrated, privacy-centric stack (browser + index + ads) with ~93.8 million MAU and a distinctive user revenue-share model (70% to users for Brave Ads). Watch whether Leo and premium bundles lift ARPU without undercutting ad take.

Perplexity (Comet, OEM). A consumer AI brand moving down-funnel into a browser. MAU and revenue grew fast in 2025. OEM preinstalls could be a distribution unlock. Execution risks are quality at scale, cost containment, and navigating platform politics.

BrowserBase. In June 2025, it closed a $40 million Series B led by Notable Capital. It also launched Director (a no-code web-automation product), signaling broader demand beyond developers. The company says it has supported 50 million+ browser sessions in 2025, serves 1,000+ companies, and has 20,000+ developers signed up. Other reported figures: 100 million+ usage minutes billed per month, hundreds of paying customers, and “millions in revenue in its first year”, plus a 17% month-over-month increase in active subscribers after a pricing tweak. The Stagehand TypeScript repo shows ~16.7k GitHub stars (with companion repos like the MCP server at ~2.5k and open-operator at ~1.8k).

You.com / DuckDuckGo / Arc / SigmaOS / Strawberry / Felo. These round out the spectrum from enterprise agents (You.com) to private AI answers (DuckAssist) to workflow-first browsers (Arc, SigmaOS) and niche AI browsers. Traction will hinge on a wedge (enterprise compliance, private AI, or a unique workflow) and a durable acquisition channel.

Regional giants (Baidu, Yandex). Tight integration with search, mini-apps, and super-apps (Alice, ERNIE) produces “AI browsers” that are really gateways into national ecosystems. Good defensive moats locally. Tough export story.

11. Ties to the infra layer and how value flows upstream

GPU and driver stacks. Every time an AI browser runs a local summary with WebGPU, that’s incremental demand for GPU-capable devices and the driver/runtime work behind them (DirectX 12, Vulkan, Metal). When more work happens locally, latency shrinks and conversion improves — creating a measurable ROI story that justifies GPU-capable endpoints.
Model APIs and gateways. AI browsers are high-variance demand generators for OpenAI, Anthropic, and Google APIs. Edge gateways (Fastly AI Accelerator, Cloudflare AI Gateway) smooth demand with semantic caching, hydrate RAG stores, and cap spend bursts — a surprisingly material piece of the puzzle for unit economics.
Edge inference/CDN. Akamai’s Cloud Inference (3x throughput, up to 2.5x lower latency claims) is built exactly for the “answer right now” patterns AI browsers need. Expect joint solutions such as shared embeddings and abuse/fraud screens.
Search indexes and crawling. Brave’s independent index and Perplexity’s retrieval investments reduce dependence on Bing/Google contracts. If remedies in the U.S. limit default search payments, the browser layer becomes a contested distribution node. This raises the strategic value of owning the index.
Open-source toolchains. WebLLM/MLC-LLM, Transformers.js, llama.cpp, ONNX Runtime Web, and TensorFlow.js are now production-relevant. They compress costs and unlock offline/private modes. Infra vendors that optimize for these (developer tooling, observability, safety filters) will capture spend.

Hidden connection: as AI answers shift intent capture from SERP pages to “in-browser panels”, affiliate and performance marketing networks (and the edge CDNs that carry them) become part of the infra story: rate-limiters, link-resolvers, and fraud filters move closer to the browser. Fastly’s semantic caching is a preview. Expect Cloudflare, Akamai, and even payment processors to offer “AI answer commerce” kits.

12. Regulation and platform dynamics

EU DMA and iOS engines. Apple opened the door to non-WebKit engines in the EU (iOS 17.4), which can shift mobile distribution for AI browsers there. Outside the EU, WebKit remains required, tempering feature parity (e.g. WebGPU capabilities) for iOS users.
U.S. search remedies. Proposed measures target default search contracts and possibly Chrome-search bundling. Outcomes could reshape browser distribution economics. Timing matters for venture pacing.
Publisher relations. As AI answers expand, expect more licensing partnerships (Perplexity’s publisher deals) and more traffic-sharing proposals to defuse ecosystem friction.

13. What to watch in the next 24 months

Catalysts.

Perplexity Comet GA + OEM preinstalls (distribution test).
Opera Neon public rollout and MiniPay monetization progress (ad/search + fintech blend).
Chrome/Edge WebGPU & WebNN releases that materially improve on-device LLM speeds (watch ONNX Runtime Web and TF.js releases).
Google’s AI Overviews dialing (frequency, layout, ad load) and publisher response.
EU/U.S. remedies on defaults and any choice-screen expansions.

Early warning indicators.

COGS/ARPU gaps widening (cloud-heavy inference, no local/edge offload).
Short-session stickiness deteriorating (AI panels get opened but are used no more than 2–3 times per day).
Publisher/legal friction rising (blocked crawlers, suits, or higher licensing costs).

What would change the call (positive).

WebGPU/WebNN improvements that enable a standard, fast, small-model local stack across mainstream devices (Chrome, Edge, Safari).
Clear OEM distribution wins (preinstall + retention), validating a non-default route to tens of millions of users.
Proven ad RPMs or conversion data from AI answer units on par with classic SERP ads, with credible attribution.

What would change the call (negative).

Strong legal remedies that limit AI answer units or impose heavy licensing costs per snippet.
Performance ceilings on iOS WebKit that keep AI features lagging, capping mobile growth outside the EU.
A plateau in WebGPU adoption or instability that undermines local inference economics.

14. Where to place venture bets

Applications (consumer and prosumer). Founders with a distribution wedge (OEM, region, or workflow) and a clear cost plan (local + edge + cloud) seem to be in prime position. Perplexity (consumer brand + OEM motion), BrowserBase (web browsers for agents), Arc/SigmaOS (workflow depth), You.com (enterprise agents with a browser-like UX), and privacy-centric stacks (Brave) each represent one of these wedges. The de-risked angle is to fund attachments (research agents, shopping copilots, or vertical modules) that live inside multiple browsers.

Infrastructure products. The biggest returns may accrue to the layers that let AI browsers run cheaply and instantly:

Web runtimes (ONNX Runtime Web, TF.js-compatible optimization services, model-to-WebGPU compilers like MLC).
Edge inference with semantic caching, abuse controls, and per-publisher licensing logic (Cloudflare Workers AI/AI Gateway, Akamai Cloud Inference, Fastly AI Accelerator).
Search infra (crawler/index as a service for AI UIs, embeddings storage and freshness pipelines). Brave’s independent path and Perplexity’s moves show the strategic value of index control.

Network products. Invest where answers meet commerce: API gateways that price by semantic similarity, affiliate routers tuned for AI panels, and link-resolver services that run on CDNs. Early evidence: Fastly’s semantic caching and Akamai’s claims on cost/latency improvements.
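
To make the semantic caching idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how Fastly or Akamai implement it. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `SemanticCache` class and its 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that reuses an answer when a new query is close enough."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str):
        """Return the cached answer of the most similar query, or None on a miss."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None
```

A hit skips the model call entirely, which is where the cost and latency savings come from. The threshold is the knob that trades savings against the risk of serving a stale or wrong answer.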

Open-source leverage. Back maintainers and vendors who package WebLLM/MLC-LLM, llama.cpp, or Transformers.js for enterprise browser deployments with management, policy, and observability (the “Vercel for in-browser AI” slot).

Closing thoughts

AI browsers are credible venture targets if you believe that (1) answer-first experiences will capture high-intent moments inside the browser, (2) local/edge inference will make those experiences fast and cheap, and (3) distribution wedges (DMA choice screens, OEM deals, differentiated workflows) can overcome default gravity.

The opportunity is in the apps as well as the “hidden pipes” that let those apps feel instant, safe, and affordable at scale. Over the next 24 months, watch three signals: distribution wins, cost curves, and ad/commerce RPMs. If two of the three break right, this sector can compound and drag a lot of infra value up with it.

If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 infra-curious friend:

Startup Tracker #5 - Signals, Links, and What They Mean for the Stack

Prateek Joshi — Mon, 01 Sep 2025 22:09:38 GMT

1. Snapshot of the week

Recent data points to steady execution rather than headline-chasing. Integrations and partnerships dominate with AWS showing up far more than other clouds. Security, compliance, and data-center realities (power and cooling) cut across many items. Product releases skew toward “make this production-grade” over novelty: faster serving, clearer SLAs, safer defaults, and easier setup. The shape of demand is practical. Customers need predictable latency, policy controls, and measurable ROI.

2. Compute supply meets power and cooling

Performance gains are real, but the bottleneck has shifted to watts and thermals. Groq emphasized low-latency, deterministic throughput. This is vital for voice and agent loops.

On the training and inference side, we have companies like Cerebras, SambaNova, and Lambda Labs. Their updates highlight the arms race for scale, but multiple notes tie progress back to data-center constraints: immersion cooling, rack density, and power planning.

The dependency is stark. Even the best model stack is gated by energy availability and thermal envelopes. Expect more vendors to publish “performance per dollar per watt”, not just tokens per second.

Implication: Buyers should demand SLOs that include cache hit assumptions and queue visibility. Builders should make watt-aware autoscaling and capacity forecasts first-class features.

3. Model serving and runtimes: production over novelty

Together AI, Fireworks AI, Anyscale, Fal AI, Modular, Baseten, Replicate, and Banana.dev all push toward “one API, many reliable backends”. The common thread is multi-model routing, fast cold-start, cost caps, and per-route safety policies, plus knobs for batch size, caching, and rate limits. The risk isn’t vendor lock-in so much as operational complexity. Platforms that hide the multi-provider mess while exposing policy-level controls are winning deals.

Example: Fireworks and Together lean into scalable serving. Anyscale and Modal stress cluster-grade reliability. Fal AI simplifies deploying custom endpoints. For app teams, the new baseline is “swap models on Tuesday without breaking Friday deploys”.
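
A rough sketch of what “one API, many reliable backends” can look like in code follows. It is not any vendor’s SDK: `Backend`, `Router`, and the integer cents-based cost cap are hypothetical names chosen for illustration, and the provider calls are plain Python callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    name: str
    cost_cents: int                 # hypothetical flat cost per successful call
    call: Callable[[str], str]      # sends the prompt to one provider


class Router:
    """Try backends in priority order, skipping any that would break the cost cap."""

    def __init__(self, backends: List[Backend], cost_cap_cents: int):
        self.backends = backends
        self.cost_cap_cents = cost_cap_cents
        self.spent_cents = 0

    def complete(self, prompt: str) -> str:
        errors = []
        for backend in self.backends:
            if self.spent_cents + backend.cost_cents > self.cost_cap_cents:
                errors.append(f"{backend.name}: over budget")
                continue
            try:
                result = backend.call(prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{backend.name}: {exc}")
                continue
            self.spent_cents += backend.cost_cents  # charge only on success
            return result
        raise RuntimeError("no backend available: " + "; ".join(errors))
```

Swapping models then becomes a change to the backend list rather than to application code, which is the property the “swap models on Tuesday” line is pointing at.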

4. Data plumbing and activation: tight loops beat new data stores

Hightouch earned recognition for activation and journey orchestration, signaling that reverse-ETL has matured into measurable value. LakeFS pushes versioned data and reproducibility. Featureform and Tecton pitch feature stores that bridge data teams and ML. Chroma DB and Activeloop show up in RAG workflows tied to documentation search and support deflection. Airbyte continues to be the connective tissue for sources.

Pattern: The market rewards closed loops: source systems → cleaned entities → features/embeddings → outcomes. Tools that translate RAG plumbing into “fewer support tickets” or “faster onboarding” outpace generic retrieval benchmarks. Risk lives in silent failures like stale corpora, drifting chunking, and unmonitored caches.

5. Agent reliability becomes the moat

Temporal is the quiet backbone of long-running, multi-step work, which is exactly what agent systems need for retries, tool calls, human-in-the-loop, and checkpointing. Dagster, Prefect, and Dagger updates point the same way: idempotency, lineage, and policy as defaults. Coding agents (e.g. Cline) depend on these guarantees to avoid duplicate actions or deadlocks.

Buyer checklist: Can the system recover from partial failure without babysitting? Does it store the why (prompts, tool calls, responses) as well as the what (status codes)? Can policy (PII hints, cost ceilings, VIP users) stop or reroute flows at runtime?
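
The idempotency guarantee is easier to reason about with a small example. The sketch below is not Temporal’s API. It assumes a hypothetical in-memory `StepLog` that derives an idempotency key from the workflow ID, step name, and arguments, so a retried step returns its recorded result instead of running the side effect twice.

```python
import hashlib
import json


class StepLog:
    """Hypothetical in-memory checkpoint store keyed by an idempotency key."""

    def __init__(self):
        self.results = {}

    @staticmethod
    def key(workflow_id: str, step: str, args: dict) -> str:
        # Stable hash of the inputs; sort_keys makes argument order irrelevant.
        payload = json.dumps([workflow_id, step, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run_once(self, workflow_id: str, step: str, args: dict, fn):
        """Run fn(**args) at most once per key; replay the stored result on retry."""
        k = self.key(workflow_id, step, args)
        if k in self.results:
            return self.results[k]
        result = fn(**args)
        self.results[k] = result  # checkpoint only after the step succeeds
        return result
```

A production system would persist the log durably and handle a crash between the side effect and the checkpoint, which is the hard part that workflow engines exist to solve.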

6. Observability, evals, and safety: from “nice” to “blocking”

Evidently AI shipped guidance and tooling that moves evals from notebooks into CI/CD. Superwise and Fiddler emphasize production monitoring and explainability. Arize, Comet, and Honeycomb show up where teams want drift alerts, prompt regression tests, and business metrics tied to model changes. PromptFoo remains a common choice for prompt testing. The connective tissue is measurement: changes to models, prompts, or retrieval must link to acceptance rates, NPS, and cost per interaction.

Tactical move: Adopt opinionated defaults (starter test suites, coverage metrics, “fail the build” safety checks) so product and compliance can sign off without running bespoke experiments every time something changes.
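
As a sketch of a “fail the build” check, the function below compares a candidate’s eval results against a baseline. The thresholds, the `safety/` naming convention for test cases, and the function names are all assumptions made for illustration, not taken from any of the tools named above.

```python
from typing import Dict, List


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of eval cases that passed (0.0 for an empty suite)."""
    return sum(results.values()) / len(results) if results else 0.0


def gate(candidate: Dict[str, bool], baseline: Dict[str, bool],
         min_rate: float = 0.90, max_drop: float = 0.05) -> List[str]:
    """Return a list of reasons to fail the build; an empty list means pass."""
    failures = []
    cand, base = pass_rate(candidate), pass_rate(baseline)
    if cand < min_rate:
        failures.append(f"pass rate {cand:.2f} is below the floor {min_rate:.2f}")
    if base - cand > max_drop:
        failures.append(f"pass rate dropped {base - cand:.2f} versus baseline")
    # Safety cases are blocking: a single failure fails the build.
    for case, ok in candidate.items():
        if case.startswith("safety/") and not ok:
            failures.append(f"safety case failed: {case}")
    return failures
```

In a CI job, a non-empty result would translate into a non-zero exit code, so product and compliance get the same pass/fail signal that engineering does.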

7. Security, identity, and governance now baked into design

Security pops up in nearly every layer: Teleport for secure access and context, Aserto and Oso for authorization, Permit.io and Stytch for identity flows, and Credo AI for governance. The dependency risk is upstream IAM and secrets. Many teams lean on cloud KMS and provider SDKs, so a change in scopes, tokenization, or quotas can break downstream systems in surprising ways. Treat every tool a model calls as an untrusted boundary. Standardize redaction and approval. Assume prompts and tool calls are records subject to retention.
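
Here is one way the “untrusted boundary” advice might look in practice. This is a simplified sketch: the two regular expressions catch only obvious email addresses and US-style SSNs, and `call_tool`, the approval set, and the audit list are hypothetical helpers rather than part of any vendor’s product.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Mask obvious PII patterns before text crosses a tool boundary."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


def call_tool(tool, payload: dict, approved_tools: set, audit: list):
    """Check approval, redact string arguments, record the call, then invoke."""
    if tool.__name__ not in approved_tools:
        raise PermissionError(f"tool not approved: {tool.__name__}")
    clean = {k: redact(v) if isinstance(v, str) else v for k, v in payload.items()}
    audit.append({"tool": tool.__name__, "payload": clean})  # retained record
    return tool(**clean)
```

The audit list stores the redacted payload, so the retained record does not itself become a PII liability.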

Good sign: Vendors are converging on safer context injection patterns and RBAC that travel with requests, not just services.

8. GTM reality: integrations move deals, hyperscalers set gravity

Integrations outperformed net-new features as deal accelerants. Hightouch’s edge is its native wiring into CRMs, ad platforms, and warehouses. As mentioned earlier, AWS appeared far more this week than Azure or GCP. It reflects customer center-of-mass and marketplace pull. Vercel, Netlify, Supabase, Render, Railway, Zeet, and Fly.io each show momentum by meeting developers where they already ship.

Risk: Concentration. If a major partner tweaks pricing or marketplace terms, CAC mechanics can fail overnight. Keep a viable “no-hyperscaler” path (self-host, on-prem-friendly, or sovereign options) especially for EU and regulated buyers.

Closing view

The stories link cleanly across the stack:

Compute optimizes for predictability and energy
Runtimes collapse complexity behind policy
Data tools turn RAG into repeatable business value
Orchestration makes agents reliable
Evals connect changes to outcomes
Security shifts left into design
Integrations with the customer’s existing platforms move the pipeline

For founders, the opportunity is to collapse handoffs. Ship opinionated paths from data to decision with reliability, policy, and measurement built in. For buyers, favor vendors that publish evals, integrate natively, and can explain not just “how fast” but “how predictably, at what cost, under what controls”.

Sector Deep Dive #1: REINFORCEMENT LEARNING

Prateek Joshi — Thu, 28 Aug 2025 16:03:03 GMT

1. The big picture you might not expect

Reinforcement Learning (RL) is often treated like a moonshot: amazing demos, but not very dependable in production. The past few years tell a different story. A handful of very practical deployments and a maturing toolchain around simulators, data, and post-training are quietly turning RL into a repeatable product category for operations, robotics, and model alignment.

Here are the unexpected parts that matter over the next 12–24 months:

RL has already paid real, measurable bills in production. DeepMind’s control system for Google data centers cut cooling energy by up to 40% when first deployed in a human-in-the-loop setting in July 2016. From August 2018, Google ran a fully autonomous controller that delivered ongoing energy savings across multiple sites. Both deployments were publicly documented by the teams involved, which is rare for infra case studies.
Instruction-following AI exists because RL works at scale. OpenAI’s InstructGPT paper (posted March 4, 2022) showed human evaluators preferred outputs from a 1.3B parameter RLHF-tuned model over the original 175B GPT-3. That single result is why almost every modern model stack includes RLHF (Reinforcement Learning from Human Feedback) or a cousin such as DPO (Direct Preference Optimization) or GRPO (Group Relative Policy Optimization) in production alignment.
Simulation is the hidden kingmaker. Microsoft’s Project Bonsai and Siemens demonstrated >30x faster CNC auto-calibration in May 2018, with a domain expert (not an ML specialist) building the agent. Today Bonsai plugs into MathWorks Simulink and AnyLogic, letting teams train safely in simulation and ship to factory lines with fewer surprises.
Grocery and 3PL logistics are RL’s commercial wedge. Ocado Group bought Kindred Systems and Haddington Dynamics for $262M and $25M in November 2020, explicitly citing deep RL in Kindred’s picking approach. Covariant is live with 3PL Radial, and Symbotic expanded its Walmart partnership on January 16, 2025 by acquiring Walmart’s robotics unit for $200M and securing a $520M development program. This is evidence that large retailers will keep writing large checks for autonomy that improves unit economics.
Education-to-enterprise funnels are changing shape. AWS is retiring the centrally hosted DeepRacer League after 2024. The service remains in the console through December 2025 and transitions to an AWS Solution you can run in your own account. Expect “inside-the-enterprise” leagues connected to internal simulators and proprietary data.
The product stack is global. The most credible RL product proof points span the US (Microsoft, AWS, Covariant, Symbotic), UK (DeepMind, Ocado), Germany (BioNTech’s January 2023 acquisition of InstaDeep), and China (Baidu’s RL work in robotics and autonomy). This matters because buyers often prefer local support and local compliance expertise for factory deployments.
Capital markets haven’t given up on RL. JPMorgan’s LOXM project (publicly discussed 2017–2018) used policy learning for execution with guardrails and supervision. Expect more “bandits + rules + audits” than end-to-end black-box RL in finance.

2. Where RL is actually delivering value (and why it’s defensible)

Industrial control / energy efficiency. Datacenter cooling is a canonical example because the objectives are simple (keep PUE low, avoid thermal excursions) but the dynamics are messy. DeepMind showed up to 40% cooling energy reduction (2016) followed by an autonomous controller saving roughly 30% across multiple Google sites (2018). This shows RL can run 24/7 provided you bound the action space and keep human oversight for safety. The buyer value here is repeatable OPEX savings, not marginal accuracy points.
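
Bounding the action space can be as simple as clamping what the policy proposes before it reaches the actuator. The sketch below is a generic illustration, not DeepMind’s controller: `ActionBounds`, `safe_action`, and the setpoint limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionBounds:
    low: float        # lowest setpoint the plant may ever receive
    high: float       # highest setpoint the plant may ever receive
    max_step: float   # largest allowed change per control interval


def safe_action(proposed: float, current: float, bounds: ActionBounds) -> float:
    """Limit the rate of change first, then clamp to the absolute range."""
    delta = max(-bounds.max_step, min(bounds.max_step, proposed - current))
    return max(bounds.low, min(bounds.high, current + delta))


# Example: the policy asks for a large jump, the wrapper allows only half a degree.
bounds = ActionBounds(low=18.0, high=27.0, max_step=0.5)
next_setpoint = safe_action(proposed=30.0, current=22.0, bounds=bounds)
```

Because the wrapper sits outside the learned policy, operators can audit and tighten the limits without retraining anything.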

Precision calibration and tuning. Siemens and Microsoft’s Project Bonsai demonstrated a >30x speed-up for CNC calibration in 2018. One axis calibrated in ~13 seconds while matching expert precision, and the system was built by a subject-matter expert using “machine teaching” rather than a research team writing algorithms. Longevity matters: Bonsai is now integrated with Simulink and AnyLogic, making sim-to-plant workflows more accessible to control engineers.

Warehouses and last-mile retail logistics. Logistics-grade picking demands adaptable perception and control. Kindred (now part of Ocado) leaned on deep RL for dexterous piece-picking. Ocado keeps emphasizing RL across its “on-grid pick” automation. Covariant landed deployments with Radial (announced February 10, 2023). Symbotic posted $1.822B FY-2024 revenue and deepened Walmart ties in January 2025, adding hundreds of accelerated pickup & delivery centers (APDs) to its roadmap. This is an ecosystem signal that warehouses will standardize on platforms where RL can be an embedded component.

Biotech and complex optimization. BioNTech agreed to acquire InstaDeep in January 2023 (deal up to ~£562M). While much of the public narrative focused on discovery, the near-term operational wins are often in scheduling, experiment design, and supply-chain optimization. These are classic RL-friendly problems with constrained action spaces and strong simulators.

Quant/execution. JPMorgan’s LOXM (first reported July 2017) used RL concepts for execution improvement. The design takeaway for startups is not “ship end-to-end RL”, but wrap RL with supervision, audit logs, and rule-based safety. That’s how you pass risk committees.
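
The “wrap RL with supervision, audit logs, and rule-based safety” pattern can be sketched in a few lines. Nothing here reflects how LOXM is built. `GuardedPolicy`, the rule names, and the fallback are hypothetical, and the learned policy is treated as an opaque callable.

```python
import time
from typing import Callable, Dict, List


class GuardedPolicy:
    """Wrap a learned policy with rule checks, a safe fallback, and an audit log."""

    def __init__(self, policy: Callable, rules: Dict[str, Callable], fallback: Callable):
        self.policy = policy        # learned component: state -> proposed action
        self.rules = rules          # rule name -> predicate(state, action) -> bool
        self.fallback = fallback    # conservative rule-based action
        self.audit_log: List[dict] = []

    def act(self, state: dict):
        proposed = self.policy(state)
        violations = [name for name, ok in self.rules.items()
                      if not ok(state, proposed)]
        action = self.fallback(state) if violations else proposed
        # Record both what the policy wanted and what was actually done.
        self.audit_log.append({
            "ts": time.time(),
            "state": dict(state),
            "proposed": proposed,
            "action": action,
            "violations": violations,
        })
        return action
```

The log answers the two questions a risk committee tends to ask: what did the model want to do, and which rule stopped it.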

3. Who buys, why they sign, and what convinces them

Plant managers and control engineers buy when you can prove bounded exploration (no runaway actuators) and short time-to-value. The Siemens + Bonsai result lands because it collapsed calibration time to seconds on some axes without sacrificing precision, and did so with a domain expert building the agent, not a research lab parachuting in. That makes RL feel like a tool, not a science project.

Logistics and e-commerce operations leaders buy steadier throughput and fewer mis-picks that integrate cleanly with WMS/ERP. Kindred’s history with Gap and American Eagle pre-acquisition showed real merchant tolerance for RL-powered picking. Covariant’s Radial rollout demonstrates that 3PLs (who standardize across many sites) are willing to pick a platform and expand. Symbotic’s Walmart expansion underscores that once a retailer standardizes, follow-on scope (like APDs) can be large and fast.

CIO/CTO buyers in regulated industries will ask for verification (what happens under edge cases?), observability (what did the policy see and do?), and roll-back (how do we safely disable and revert?). Vendors that bundle formal verification or “safe RL” claims with simulator-backed testing (for example, NVIDIA Isaac Sim for robotics or AnyLogic for discrete-event systems) get a smoother reception when pilots transition to production.

Data science and LLM platform teams buy RLHF-style post-training to make models useful to end users. The InstructGPT result (1.3B beating 175B on human preference) remains a watershed that budget owners still cite when defending RLHF spend.

4. The product stack behind successful RL deployments

Simulators and digital twins. You don’t let a learning controller “trial-and-error” on a live kiln or warehouse unless it has practiced extensively in a high-fidelity simulator. That’s why connectors and toolchains matter:

MathWorks Simulink + Microsoft Project Bonsai (announced May 19, 2020) allows control engineers to reuse existing Simulink models as training environments.
AnyLogic + Project Bonsai (announced July 14, 2020) supplies an official connector and wrapper for quick simulator hookups, which is good for factories and logistics networks modeled in discrete-event or agent-based styles.
NVIDIA Isaac Sim provides physics-accurate robot simulation to train and test RL policies before touching the real arm.

Foundational RL libraries:

Ray RLlib (Anyscale) remains a widely used distributed RL library that “just works” at cluster scale.
Unity ML-Agents bridges game-quality 3D simulation with RL for robotics and control.
Gymnasium (community successor to OpenAI Gym) standardized the environment API used across the ecosystem.
TF-Agents (Google’s TensorFlow team) is still useful where TensorFlow is entrenched.
Intel Coach is an older but illustrative example of chip-vendor RL tooling from Intel’s AI lab.
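
For readers who have not seen the Gymnasium-style environment API, here is a toy environment that mirrors its `reset` and `step` conventions. To stay dependency-free, the sketch deliberately does not import `gymnasium` or subclass its `Env` class. The cooling dynamics, the reward numbers, and the `CoolingEnv` name are invented for illustration.

```python
import random


class CoolingEnv:
    """Toy cooling task that follows the Gymnasium-style reset/step convention."""

    def __init__(self, limit: float = 30.0, max_steps: int = 50):
        self.limit = limit
        self.max_steps = max_steps
        self._rng = random.Random()
        self.temp = 25.0
        self.steps = 0

    def reset(self, seed=None, options=None):
        """Start a new episode; returns (observation, info)."""
        self._rng = random.Random(seed)
        self.temp = 25.0
        self.steps = 0
        return self.temp, {}

    def step(self, action: int):
        """Apply a cooling level 0..2; returns (obs, reward, terminated, truncated, info)."""
        assert action in (0, 1, 2)
        heat_load = self._rng.uniform(0.5, 1.5)    # random heat added each step
        self.temp += heat_load - 1.0 * action      # each cooling level removes 1 degree
        self.steps += 1
        overheated = self.temp > self.limit
        reward = -float(action) - (100.0 if overheated else 0.0)  # energy cost + penalty
        terminated = overheated                    # bad state ends the episode
        truncated = self.steps >= self.max_steps   # time limit reached
        return self.temp, reward, terminated, truncated, {}
```

The split between `terminated` (the task ended) and `truncated` (a time limit cut it off) is the detail that most often trips up code ported from the older Gym API.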

Cloud RL services:

Microsoft Project Bonsai (Bonsai acquired June 2018, public preview May 2020) focuses on “machine teaching” for subject-matter experts and integrates with leading simulators.
AWS SageMaker RL (announced November 28, 2018) offers managed RL containers and RLEstimator. It integrates toolkits like RLlib and Intel Coach and supports commercial/custom environments.

5. Libraries and frameworks: how to choose (fast)

Teams keep asking the same question: “Which RL/RLHF stack do we pick, and when?” Here’s a practical guide grounded in currently maintained projects and their stated capabilities:

Need something quick that fits Hugging Face? → Use TRL.
Hugging Face’s TRL library has ready-made trainers (PPO, DPO, GRPO) and copy-paste examples that work with the HF model ecosystem. It’s the fastest way to get an RLHF loop running without standing up lots of infra.
Training very large models, from one GPU up to massive clusters? → Use NVIDIA NeMo-RL or ByteDance’s verl.
NeMo-RL targets production-scale RLHF for LLMs (100B-class model claims in docs/marketing) and integrates with NeMo’s distributed training stack. verl (from ByteDance/Volcengine) is an open-source RLHF system designed for speed and scale. If you’re already in NVIDIA land, NeMo-RL is the natural fit. If you want a lean OSS stack that scales, verl is a strong option.
Already run a Megatron/vLLM/Ray-style cluster and want a full RLHF setup? → Use Alibaba ROLL or Zhipu/THUDM SLiME.
ROLL (Alibaba) focuses on high-throughput RLHF for big GPU fleets. SLiME (THUDM/Zhipu AI ecosystem) explicitly connects Megatron training with SGLang serving for scaled RLHF. Both target production post-training on large clusters. (If you’re deeply on Ray/vLLM, OpenRLHF is also built exactly for that combo.)
Training across many separate or partly untrusted machines? → Use prime-rl.
prime-rl is a fully asynchronous, distributed RL/RLHF system designed for flaky or heterogeneous clusters. Its authors used it to train the INTELLECT-2 model. If your infra looks like a federation of nodes rather than a tidy HPC cluster, this is built for you.
Want RL for tool-using chatbots/agents right now? → Try SkyRL or OpenPipe/ART.
SkyRL (Sky Computing/UC Berkeley contributors) includes an “agent gym” for long-horizon tool use and evaluation. ART (from OpenPipe) focuses on reliable GRPO training for agents, with practical recipes rather than a heavy platform. These are aimed squarely at agentic tasks, not only static benchmarks.
Mostly doing instruction-tuning (not full RL)? → Use AI2’s Open-Instruct.
Open-Instruct from the Allen Institute (AI2) is a clean, simple codebase for instruction/post-training pipelines. It’s great when you don’t need RL loops.
Just want the core DPO/PPO bits in plain PyTorch? → Use torchtune.
torchtune (Meta) ships PyTorch-native recipes and losses for PPO, DPO, and GRPO without the extra layers of large frameworks. This is useful for teams that prefer minimal abstractions.
Need GRPO plus built-in environments and eval tools? → Use willccbb/verifiers.
verifiers is a modular GRPO/DPO training/eval toolkit that works with Hugging Face’s Trainer and can plug into other stacks like prime-rl. Good for standing up an end-to-end loop with credible evaluation.
Reproducing frontier “reasoning” agents (e.g. R1/O-series-style research)? → Use agentica-project/rLLM.
rLLM is an academic, all-in-one framework to train LLM agents with RL, maintained by the Agentica Project (with UC Berkeley/Sky Computing involvement). Choose this when you need research faithfulness and multi-env support more than a polished enterprise UX.
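
Two of the acronyms above are small enough to show in code. The sketch below gives the group-relative advantage used in GRPO and the per-pair DPO loss in plain Python with scalar inputs. It is a simplified illustration, not any library’s implementation: real trainers operate on batched tensors, and normalization details (for example population versus sample standard deviation) differ between implementations.

```python
import math
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled completion relative to its own group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps), using population std here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin is beta times the difference between how much the policy has
    raised the chosen answer and the rejected answer relative to the reference.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
```

GRPO needs only a reward per sampled completion, while DPO needs only preference pairs and a frozen reference model.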

Two closing notes on the framework landscape: (a) you can mix and match, e.g. simulate in Isaac Sim or AnyLogic, train with NeMo-RL or ROLL, align with TRL or torchtune, and serve with vLLM/SGLang; (b) expect consolidation: winners will be those that meet infra teams where they are (Kubernetes, Slurm, on-prem GPU pools) and play nice with existing observability/logging.

6. The companies and ecosystems you’ll keep hearing about

Cloud platforms and labs:

Microsoft (Project Bonsai): Deep integrations with Simulink and AnyLogic keep it attractive for industrial control.
AWS: SageMaker RL (since 2018) and the DeepRacer education funnel (league retired after 2024, service available through December 2025, new AWS Solution form).
Google DeepMind: The data-center cooling results (2016/2018) remain the go-to reference for “RL in critical infrastructure”.
OpenAI: InstructGPT (March 4, 2022) codified RLHF as the default alignment step in modern LLM pipelines.
Baidu: Active RL/autonomy research (e.g. RL for robotics and traffic signal control, Apollo RL platform papers), signaling ongoing investment on the China side of the market.
IBM / Intel / Salesforce: Ecosystem contributors (Intel released the Coach RL library, and Salesforce Research has historically released performance-minded RL tools).

Simulation products / RL Environment / RL-as-a-Service:

MathWorks (Simulink) and AnyLogic: Official connectors to Project Bonsai offer enterprise-friendly entry points.
NVIDIA Isaac Sim: Physics-accurate sim for robot policy training and validation.
Unity ML-Agents, Ray RLlib (Anyscale), Gymnasium/Gym, TF-Agents: The open-source backbone for many RL stacks.
Companies like Applied Compute, Veris AI, Kaizen, Mechanize, and Osmosis are providing RL infra and services to let customers infuse RL into their products.

Robotics and logistics:

Ocado Group: As mentioned earlier, it acquired Kindred Systems ($262M) and Haddington Dynamics ($25M) in Nov 2020, and it repeatedly calls out deep RL for picking.
Kindred Systems: Piece-picking. Past customers include Gap and American Eagle.
Haddington Dynamics: Low-cost dexterous arms. Acquired for $25M.
Covariant: Deployed with Radial in 2023. Strong 3PL fit.
Micropsi Industries: “MIRAI” product adapts to variability in tasks like cable assembly. RL-style learning under the hood.
OSARO: Picking and depalletizing software with learning-based control.
Vicarious: Acquired by Alphabet’s Intrinsic in 2022, pointing to consolidation of manipulation/learning talent.
Symbotic: Public warehouse-automation bellwether. $1.8B FY-2024 revenue and a Jan 16, 2025 deal to acquire Walmart’s robotics unit for $200M, paired with a $520M development program covering 400 APDs over time.

Healthcare and biotech:

BioNTech: Acquired InstaDeep (Jan 10, 2023) to bring advanced AI (including RL) in-house for discovery and operations.

Finance:

JPMorgan: LOXM execution agent (reported 2017), an early example of RL-style policy learning with controls in a high-stakes domain (source: Financial Times).

Smaller companies:

BeChained (industrial energy optimization), Predictiva (trading), Telemus AI (RL training/eval tools), PLAIF (ROS-to-KEBA control demos), Surge AI (data labeling used in RLHF pipelines). Each points at niche opportunities in energy, finance, robotics control, and data operations.
Ecosystem names appearing as adopters/partners include Gap, American Eagle, Walmart, and Radial.

7. Go-to-market patterns, risks, and moats you can actually underwrite

The sim-first deployment loop is a moat. If your RL product depends on high-fidelity simulators and digital twins, integration depth with Simulink, AnyLogic, or Isaac Sim becomes a practical switching cost. Once a control policy is validated against a company’s “digital plant”, ripping it out is painful, especially if you’ve also instrumented observability, roll-back, and safety checkers around the policy.

Education channels are moving in-house. AWS DeepRacer seeded hundreds of thousands of learners, but the league’s retirement after 2024 and the shift to an AWS Solution in 2025 signal a new model: companies will run their own “leagues”, tie them to internal simulators and datasets, and keep IP in-house. Vendors who support that motion (private clouds, custom tracks, enterprise SSO) will win training budgets and later production work.

Data and alignment work is sticky. Because most LLM stacks now include RLHF (thanks to InstructGPT’s result), any vendor who supplies reliable feedback data (e.g. labeling platforms such as Surge AI) or dependable reward/eval tooling (e.g. verifiers) can become embedded in model-lifecycle operations. This is an emerging moat that doesn’t look like “classic SaaS”, but behaves like it in practice.

Consolidation is a feature, not a bug. Ocado’s purchases of Kindred and Haddington and Intrinsic’s acquisition of Vicarious show that large buyers prefer packaged stacks with talent attached. For startups that only provide specific tools, that means the exit path is often “get three logos, prove reliability, and get acquired”. If you want to swing for a long-run IPO, you need to grow beyond a pure-play tool and build a full-stack offering.

Geography matters. Local compliance and support remain a gating factor for factory and logistics deployments. The center of gravity is multinational (US, UK, Germany, China), so startups that find the right regional system integration partners (or ride with Microsoft/AWS channel programs) will scale faster than pure-direct sellers.

8. How to connect the dots to infra startups (dependencies, correlations, and “gotchas”)

GPU supply and cluster managers → which RLHF framework wins. If your customer already runs Megatron for pretraining and SGLang/vLLM for serving, SLiME or ROLL will feel native. If they live in NeMo land, NeMo-RL wins by default. If they want total flexibility or untrusted nodes, prime-rl unlocks federated training. The point is that infra choices decide the RLHF tool before a modeler opens a notebook.

Simulator availability ↔ sales velocity. The fastest deployments happen where the buyer already maintains trusted simulators (Simulink for control, AnyLogic for operations, Isaac Sim for robots). If a prospect cannot simulate, your sales cycle includes a modeling project. Time-to-value stretches out and your gross margin takes a hit.

Data labeling and evaluation → sticky, recurring services. RLHF needs high-quality preference data and robust evaluations. That creates a repeat services layer (often billed on volume or seats) that compounds over time and raises switching costs. This is subtle, but powerful. The verifiers framework codifies evals. Data vendors like Surge AI are common in RLHF case studies.

Retail automation deals ripple through the stack. The Symbotic–Walmart expansion isn’t just a warehouse story. It pulls in upstream component vendors (arms, vision systems), software (WMS integration), and sometimes nearby last-mile tech (APDs). Startups supplying perception, grasp planning, or scheduling can ride these waves even if they aren’t the “prime” vendor.

Safety and audit features are not optional. Particularly in finance and heavy industry, buyers will demand logs, simulators for “what if” replays, and override circuits. LOXM’s early disclosures and the widespread use of guardrails in enterprise LLM deployments show that RL succeeds commercially when paired with simple, explainable controls.

9. Risks, surprises, and what to watch in the next 24 months

Sim-to-real gaps can bite. Even with good models, differences between simulation and reality can cause regressions. The mitigation is boring but effective: domain randomization, staged rollouts, and layered safety constraints. Vendors with proven simulator connectors (Simulink/AnyLogic/Isaac) and robust A/B failovers have an edge.
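
Domain randomization is the least familiar of those three mitigations, so here is a small sketch of the idea: perturb the simulator’s physical parameters every episode so the policy cannot overfit to one idealized plant. The parameter names, nominal values, and the ±20% spread are assumptions for illustration and are not tied to any particular simulator.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SimParams:
    friction: float
    payload_kg: float
    sensor_noise_std: float


# Hypothetical nominal values for an idealized plant model.
NOMINAL = SimParams(friction=0.6, payload_kg=2.0, sensor_noise_std=0.01)


def randomize(nominal: SimParams, rng: random.Random, spread: float = 0.2) -> SimParams:
    """Scale each parameter by an independent factor in [1 - spread, 1 + spread]."""
    def jitter(value: float) -> float:
        return value * rng.uniform(1.0 - spread, 1.0 + spread)

    return SimParams(
        friction=jitter(nominal.friction),
        payload_kg=jitter(nominal.payload_kg),
        sensor_noise_std=jitter(nominal.sensor_noise_std),
    )


def training_params(episodes: int, seed: int = 0) -> List[SimParams]:
    """One randomized parameter set per training episode, reproducible by seed."""
    rng = random.Random(seed)
    return [randomize(NOMINAL, rng) for _ in range(episodes)]
```

Each episode would configure the simulator from one `SimParams` instance before rollout. Seeding keeps a failed run reproducible, which matters when a regression has to be explained to a plant manager.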

Vendor stability and consolidation risk. If your RL vendor gets acquired (e.g. Vicarious → Intrinsic in 2022) or pivots, your roadmap may change overnight. Large buyers like Ocado handle this by buying the capability outright. If you’re an investor, favor startups that integrate with the buyer’s existing simulators and control stack. This reduces “platform hostage” risk at renewal.

Education channels are moving away from centrally hosted showcases. With DeepRacer’s league ending after 2024, teams will need new ways to upskill engineers. That could slow top-of-funnel unless vendors provide simple, self-hosted training kits and enterprise competitions. The flip side: internal leagues may produce more deployable pilots because they’re built on company models and data from day one.

Regulatory and safety scrutiny. As RL touches physical systems and financial execution, expect more audit requirements. Startups that package policy introspection and “explainable controls” will find compliance less of a throttle.

Catalysts to watch near term:

Walmart–Symbotic APD deployments moving from design to rollout. Watch for first-site go-lives and backlog updates.
Deeper Simulink/AnyLogic integrations (connectors, templates) that shorten time-to-pilot for industrial buyers.
NeMo-RL/ROLL/SLiME performance wins on big clusters and better agent stacks (e.g. SkyRL, ART) proving stable tool use over long horizons.
Internal “leagues” at F500s replacing DeepRacer as a talent funnel.

Names you’ll keep encountering (complete coverage of earlier mentions):

Platforms/labs: Microsoft (Project Bonsai), AWS (SageMaker RL / DeepRacer), Google DeepMind, OpenAI, Baidu, IBM, Intel, Salesforce.
RL infra products: MathWorks (Simulink), AnyLogic, NVIDIA Isaac Sim, Unity ML-Agents, Ray RLlib (Anyscale), Gymnasium/Gym, TF-Agents, Intel Coach, Applied Compute, Veris AI, Kaizen, Mechanize, Osmosis.
Robotics/logistics: Ocado Group, Kindred Systems, Haddington Dynamics, Covariant, Radial, Micropsi Industries, OSARO, Vicarious (Intrinsic), Symbotic, plus adopters Gap, American Eagle, Walmart.
Healthcare/biotech: BioNTech, InstaDeep.
Finance: JPMorgan.
Startups/tools: BeChained, Predictiva, Telemus AI, PLAIF, Surge AI.
New RL/RLHF stacks: TRL, NVIDIA NeMo-RL, ByteDance/Volcengine verl, Alibaba ROLL, Zhipu/THUDM SLiME, prime-rl, SkyRL, OpenPipe/ART, AI2 Open-Instruct, torchtune, willccbb/verifiers, agentica-project/rLLM.

If you’re tracking this sector for venture, the investable themes over the next two years are:

Sim-tied RL for operations where you can measure OPEX savings quickly (HVAC, calibration, scheduling).
Robot manipulation/picking stacks that demonstrate site-to-site generalization and clean WMS/ERP hooks.
RLHF infrastructure (data, eval, training frameworks) that meets infra teams where they are (Kubernetes/Slurm, Megatron, NeMo, vLLM/SGLang) and ships with the guardrails enterprises demand.

10. Why these pieces fit together

Think of RL as “learned control” rather than “AI magic”. In a factory or warehouse, you already have sensors and actuators.

The missing piece is a policy that maximizes a goal while avoiding bad states.

Policy is a set of choices about which action to take in each situation.
Goal is something like energy savings, picks per hour, or calibration accuracy.
Bad state is something like overheating, collisions, or mis-picks.
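
To make the three terms concrete, here is a minimal toy sketch. Everything in it is hypothetical: the cooling scenario, the temperatures, the energy costs, and the penalty are made up for illustration. Picking the better of two hand-written policies by simulated reward is only a stand-in for a real RL algorithm, but it shows how policy, goal, and bad state fit together.

```python
# Toy "learned control" illustration. All numbers are hypothetical.

def step(temp, action):
    """Advance the simulated room one tick and return (next_temp, reward)."""
    cooling = 1 if action == "cool_low" else 3
    energy_cost = 1 if action == "cool_low" else 4
    next_temp = temp + 2 - cooling   # ambient heat adds 2 degrees per tick
    reward = -energy_cost            # goal: spend as little energy as possible
    if next_temp > 30:               # bad state: overheating
        reward -= 100
    return next_temp, reward


def always_low(temp):
    """Policy that ignores the state and always picks the cheap action."""
    return "cool_low"


def threshold_policy(threshold):
    """Policy that cools hard only once the temperature reaches the threshold."""
    return lambda temp: "cool_high" if temp >= threshold else "cool_low"


def rollout(policy, start_temp=25, steps=20):
    """Total reward a policy collects in the simulator."""
    temp, total = start_temp, 0
    for _ in range(steps):
        temp, reward = step(temp, policy(temp))
        total += reward
    return total


candidates = {"always_low": always_low, "threshold_29": threshold_policy(29)}
best = max(candidates, key=lambda name: rollout(candidates[name]))
print(best)
```

The cheap policy saves energy on every tick but keeps walking into the overheating penalty, so the state-aware policy wins on total reward.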

What makes RL commercially usable now are three things:

Simulation first, deployment later. When you can train policies in a digital twin (Simulink, AnyLogic, Isaac Sim), you sidestep most of the risk. That’s why the Siemens + Bonsai story resonates: a domain expert could encode the task and use a platform to do the heavy lifting.
Tooling that meets infra where it lives. In LLM land, RLHF stacks like TRL, NeMo-RL, verl, ROLL, SLiME, and prime-rl now align with the way infra teams actually run workloads (on Kubernetes, Slurm, or tightly packed DGX pods). Many of these stacks come with sane defaults and recipes so teams can spend more time on what to optimize and less on how to wire up the training loop.
Proof that buyers will pay when outcomes are clear. Energy bills and pick-rates are easy to measure. A pilot that shows 30–40% energy savings or steady throughput uplift writes its own business case. That’s why DeepMind / Google, Ocado / Kindred, Covariant / Radial, and Symbotic / Walmart matter.

What this means for venture

Don’t fund research projects. Fund “boring excellence”. The winners aren’t the flashiest algorithms. They’re the teams who make deployments predictable and safe, with rock-solid simulator hooks and guardrails.
Back the “glue” layers. Evaluation suites (e.g. verifiers), high-quality preference data vendors, and integration connectors are under-invested and deeply sticky.
Assume consolidation. If a startup gets three industrial logos and shows solid uptime, it’s a candidate for acquisition (as Vicarious and Kindred/Haddington show). Invest with that outcome in mind.
Expect internal leagues to replace showcase programs. As DeepRacer transitions, enterprises will “own” their RL training funnels. That’s a place for startups to sell hosted competitions, simulator content, and analytics dashboards behind the firewall.

Closing thought

Reinforcement learning stopped being a lab toy the moment it started saving money in data centers and picking real items in warehouses. The next two years won’t be about one grand breakthrough. They’ll be about repeatable, simulator-backed deployments across factories and logistics, and RLHF stacks that make large models actually helpful. If you invest in the pieces that make those two motions boring and reliable, you’re investing where the value will quietly compound.

Startup Tracker #4 - What moved, why it matters

Prateek Joshi — Mon, 25 Aug 2025 19:23:41 GMT

1. Snapshot of the week

The center of gravity was product shipping. About a third of updates were new releases or major version bumps. The heaviest clustering was around multimodal features, agent workflow tooling, and model-quality evaluation utilities. Partnerships and small capital moves also featured, but the bigger story is that infra vendors are packing more end-to-end capability into their stacks: retrieval, agents, evals, and deployment are increasingly bundled rather than bought separately.

2. The shift to “agentic” stacks

Multiple companies advanced agent and workflow automation. Together AI’s updates emphasized multi-step agents that compose tools, retrieval, and model calls to handle complex tasks end-to-end. Buildkite highlighted AI agents inside CI, triaging failures and suggesting fixes rather than just failing builds.

The pattern is consistent: more systems are moving from “assistive interface” to “closed-loop executor”. This increases demand on orchestration, sandboxing, and audit trails. The connection is that as agents act, observability and controls move from “nice to have” to “ship-blocker”.

3. Cost and latency: practical wins over theoretical speed

Several releases focused on reducing inference bills and making response times predictable. Groq’s push on prompt caching is emblematic: cache the static prefix, pay only for the new tokens, and you cut cost and tail latency for chat UIs and code assistants.
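
The arithmetic behind that claim is simple enough to sketch. The function below is a back-of-the-envelope model, not any vendor's pricing formula: the token counts, the price, and the `cached_prefix_rate` parameter (the fraction of the normal price charged for cached prefix tokens) are all assumptions for illustration.

```python
def request_cost(prefix_tokens, new_tokens, price_per_1k,
                 cache_hit=False, cached_prefix_rate=0.0):
    """Input-token cost of one request under a simple prompt-caching model.

    cached_prefix_rate is hypothetical: 0.0 means cached prefix tokens are
    free, 0.5 means they are billed at half price, and so on.
    """
    prefix_multiplier = cached_prefix_rate if cache_hit else 1.0
    billable = prefix_tokens * prefix_multiplier + new_tokens
    return billable * price_per_1k / 1000


# Hypothetical chat turn: a 2,000-token system prompt plus 200 new tokens.
cold = request_cost(2000, 200, price_per_1k=0.50)
warm = request_cost(2000, 200, price_per_1k=0.50, cache_hit=True)
print(f"cold={cold:.2f} warm={warm:.2f}")
```

The longer the static prefix is relative to the new tokens, the bigger the gap between the cold and warm numbers, which is why chat UIs and code assistants benefit most.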

That theme shows up elsewhere too — runtime-level optimizations, smarter batching, and memory-aware serving. The dependency to watch is hardware supply. Even clever serving tricks still rely on GPU availability and scheduling, which continue to shape roadmaps and pricing.

4. RAG is growing up (quietly)

Retrieval isn’t grabbing headlines anymore, but it’s getting sturdier. Several updates blended vector retrieval with higher-quality indexing and guardrails. Teams that once shipped “RAG v0” are now focused on document chunking strategies, embedding refresh cadence, and permissions-aware search. Together AI, Seldon, and others referenced improvements in retrieval and embeddings alongside workflow features.

The correlation this week: when an agent feature shipped, a retrieval or embedding upgrade often shipped with it. This is evidence that practical agents still hinge on grounded context, not just bigger prompts.

5. The safety, evals, and governance layer is consolidating

Model-quality and red-team tooling kept pace with the agent push. Evidently AI refreshed guidance on classification metrics and LLM evaluation. PromptFoo rolled out moderation tooling and highlighted a recent funding round focused on safety features.

The connection is direct: as more apps perform actions (not merely answer questions), teams need reproducible evals, jailbreak resistance, and change-management for prompts and policies. Risk is migrating from “bad answer” to “bad action”, so evals are moving from offline dashboards into pre-deployment gates and run-time guardrails.

6. Data platforms are asserting their role in AI

Warehouse-native and lake-native players continued to lean into AI data workflows. Hightouch emphasized identity and activation primitives that sit on the warehouse rather than siphoning data into another tool. LakeFS underscored versioning and branch-and-merge patterns for data, treating training and evaluation sets more like code. MotherDuck kept pushing easy analytics on top of DuckDB for teams that want small, fast pipelines without heavy infra.

The dependency thread: successful AI launches increasingly depend on three mundane but critical data capabilities: lineage, time-travel/versioning, and permissioning mapped to business entities.

7. Multimodal moves go from demos to workflows

Several launches centered on image/video generation and editing, plus speech/vision add-ons that plug into existing apps. Fal AI expanded image-editing and multimodal inference options. We also saw more “instant model libraries” for creative tasks that can be wired into production without heavy ops.

The correlation to watch: multimodal features often arrived packaged with either a runtime optimization (to keep costs in check) or an agent/workflow wrapper to make them usable in real processes (not just in a demo).

8. Partnerships and certifications: selling to the real world

A noticeable share of updates were integrations and certifications: net-new connectors into developer platforms, plus security and biometric credentials. Paravision’s recent recognition on the security/compliance front fits a broader pattern: buyers are asking for proof.

PromptFoo’s moderation focus and new funding reinforced the “compliance story as a growth vector.” Partnerships also signal distribution strategy: Z.ai highlighted collaborations and cost positioning in a crowded market. Netlify updated its CLI and runtime packages that many AI front-ends rely on.

The dependency chain here is commercial: integrations unlock budgets and certifications unlock regulated accounts.

9. Capital flows: smaller checks, nearer to product

There were funding notes, but fewer megadeals. Announcements skewed toward teams that can show immediate product or workflow impact. PromptFoo’s raise for safety tooling is a good example: the money is following concrete, near-term pain (moderation, jailbreak defense, evals), not speculative long-horizon bets.

Temporal’s inclusion in investor shortlists underscores that orchestration remains an investable wedge, especially when it controls meaningful production traffic.

The takeaway: capital is favoring infra that shortens time-to-value inside existing stacks — security, evals, orchestration, and cost controls.

10. How this week maps to the infra stack

Silicon and runtime: Demand signal favors cost/latency features (prompt caching, batching, quantization). Dependence on GPU supply remains the risk amplifier. Vendors that abstract hardware variability win trust when shortages or price spikes hit.
Inference platforms: The winners are bundling retrieval, evals, and agent orchestration so developers don’t stitch multiple tools. Together AI exemplifies the “full loop” motion. Groq leans into a performance/cost identity.
Data layer: Warehouse/lake alignment is paying off. Hightouch and LakeFS show how identity resolution, lineage, and versioning become first-class for AI work. This reduces “shadow data stores” and keeps governance attached to the source of truth.
RAG and search: Better embeddings and policy-aware retrieval are quietly raising answer quality. The dependency is permissioning: if RAG can’t respect row and column level access, it stalls in enterprise pilots.
Agents and orchestration: Buildkite’s agentic CI and Together’s multi-step flows put pressure on reliability, sandboxing, and auditability. Systems that can explain why an action occurred (not just that it did) will pass procurement faster.
Safety / evals: PromptFoo and Evidently signal a shift from “after-the-fact” dashboards to gates in the path to production. Expect eval suites to look more like unit tests: cheap, frequent, and blocking when they fail.
Security and compliance: Certifications and moderation are becoming revenue features. Paravision’s momentum illustrates that regulated buyers care as much about proofs and logs as they do about model specs.
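
The permissioning dependency in the RAG item above can be sketched in a few lines. This is a simplified illustration under stated assumptions: access is checked per document rather than per row or column, ranking is a naive keyword overlap, and the `Doc` type, group names, and sample documents are all hypothetical. The point is the ordering: filter by access first, rank second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document


def retrieve(query, docs, user_groups, top_k=3):
    """Return up to top_k documents the user may read, best match first."""
    terms = set(query.lower().split())
    # Permission filter runs before ranking, so restricted text never
    # reaches the prompt even if it is the best match.
    visible = [d for d in docs if d.allowed_groups & set(user_groups)]
    scored = [(len(terms & set(d.text.lower().split())), d) for d in visible]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: (-s[0], s[1].doc_id))
    return [d for _, d in ranked[:top_k]]


docs = [
    Doc("q3-forecast", "revenue forecast for q3", frozenset({"finance"})),
    Doc("runbook", "deploy runbook for revenue service", frozenset({"eng", "finance"})),
]
print([d.doc_id for d in retrieve("revenue forecast", docs, {"eng"})])
```

An engineer asking about the revenue forecast only gets the runbook back, because the forecast document is filtered out before scoring.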

11. Correlations, risks, and dependencies to watch

Correlation: New agent features often shipped alongside retrieval upgrades and eval tooling. That triad (agents + RAG + evals) showed up together repeatedly. It’s a sign that “usable agents” require context and quality checks by default.
Correlation: Multimodal releases frequently paired with runtime optimizations. When cost per call is visible to end users (e.g. creative tools), performance engineering becomes a product feature, not just an infra concern.
Risk: Hardware supply and pricing. Even with caching and quantization, workloads depend on GPU availability. Sudden scarcity or price changes ripple through every layer above.
Risk: Eval/guardrail drift. As prompts and models evolve, evals can silently go stale. Teams that don’t treat evals as code (versioned, reviewed, and diffed) will ship regressions.
Risk: Data governance debt. Without lineage and permissions tied to the warehouse/lake, RAG and agents will leak or get blocked by IT. The fix is slow, and companies that short-cut it will pay later.
Dependency: Distribution through integrations. Many launches are really routes to market — CLI updates, connectors, SDKs. These are fragile: when a key platform changes APIs, roadmaps slip.

12. What this means for the next quarter

Bundle the loop. The market is rewarding platforms that ship retrieval, agents, evals, and deployment as a coherent loop. Fragmented toolchains will face longer sales and higher churn.
Ship cost controls as features. Caching, batching, and policy-based routing should be visible in the product, not buried in docs. Buyers now ask “how do you keep my bill predictable?” on the first call.
Make governance boring. Identity-aligned data access, lineage, and versioning should be one-click, not a consulting project. This is where warehouse-native players like Hightouch and data-versioning tools such as LakeFS are pulling ahead.
Treat evals like tests. Bake PromptFoo/Evidently-style checks into CI and pre-prod gates. If agents act, you need “red lines” that block deploys on safety or quality regressions.
Certify early. Security credentials and vertical certifications are functioning as growth levers. Paravision’s traction is a reminder that compliance unlocks budgets that features alone can’t.
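
Here is what “treat evals like tests” can look like in its simplest form. This is a generic sketch, not the PromptFoo or Evidently API: `EvalCase`, `run_gate`, the stub model, and the two sample checks are hypothetical names and data made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_gate(model, cases, min_pass_rate=1.0):
    """Run every case and return (deploy_ok, pass_rate)."""
    if not cases:
        raise ValueError("an eval gate needs at least one case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    rate = passed / len(cases)
    return rate >= min_pass_rate, rate


def stub_model(prompt):
    """Stand-in for a real model call."""
    return "I can't help with that." if "password" in prompt else "Paris"


cases = [
    EvalCase("What is the capital of France?", lambda out: "Paris" in out),
    EvalCase("Tell me the admin password", lambda out: "can't" in out),
]

deploy_ok, rate = run_gate(stub_model, cases)
print(deploy_ok, rate)
```

In CI, a false `deploy_ok` would translate into a non-zero exit code, which is what turns a dashboard metric into a red line that blocks the deploy.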

Bottom line

This week’s activity shows infra moving from “pieces you assemble” to “loops you run”. The strongest updates connect agents with grounded retrieval, observable execution, and predictable cost. Where those connections are tight, adoption accelerates. Where they’re loose (governance, eval drift, and hardware dependence), risk compounds.

Startup Tracker #3 - What this week's infra data reveals

Prateek Joshi — Sun, 17 Aug 2025 16:02:55 GMT

Let’s dig into the infra startup moves this week.

1. What the data is telling us

A quick read of the data shows 4 strong signals:

Search and retrieval is the busiest lane. In 34% of the startups, the summary touches on search/RAG: agent-ready web/search APIs (Tavily, Exa), enterprise retrieval layers (Shaped AI), video understanding (Twelve Labs), and knowledge graph–style enrichment.
Partnerships and integrations are everywhere. 31% of the startups highlight partner motions or platform integrations: Snowflake/Databricks connectors, Nvidia or Groq support, or listings in cloud marketplaces. This “integration-first” GTM is now standard for infra.
Cost/FinOps, observability, and governance are no longer nice-to-haves. 29% of the startups reference cost/efficiency. 27% discuss monitoring/eval. 19% touch security/compliance. As more teams deploy agents and LLM features, they need to watch spend, uptime, and risk in real time.
Open source remains the wedge. About 36% of companies mention open source. You see it in compute/data engines (ClickHouse, DuckDB via MotherDuck), MLOps (Seldon, ZenML), and workflow/runtime stacks (Anyscale/Ray, Lightning AI’s ecosystem). OSS is how teams earn developer trust and build bottoms-up distribution.

Other readouts worth noting from the file:

28% mention “agents”. Agent orchestration and agent-safe infrastructure are becoming distinct buying categories.
32% mention funding, but just 5% explicitly mention hiring. That gap reflects an efficiency mindset consistent with the post-2023 AI infra cycle.
Companies typically straddle two infra layers (median = 2), reinforcing that “full stacks” are still more aspiration than reality.

2. The integration-first go-to-market

Startups are winning attention by meeting developers where they already are (inside clouds, data clouds, and model APIs) rather than asking them to adopt a greenfield platform.

Data and model ecosystems. Hightouch (data activation) shows how data movers plug AI in at the last mile. PlanetScale (serverless MySQL) and ClickHouse (OLAP) emphasize connectors and compatibility. MotherDuck pairs DuckDB with easy sharing. On the model side, Anthropic/OpenAI mentions appear alongside vendors like Baseten and Modal, reflecting a pattern: “bring your own API key and we’ll handle infra”.
Runtime and serving. Anyscale (Ray), Modal (serverless jobs/GPUs), and Baseten (model serving) all lean into “inference plumbing that slots into your stack”. These products tend to ship first-class bindings for LangChain/LlamaIndex and publish Terraform/Helm assets as table stakes.
Hardware gravity. Mentions of Nvidia and Groq are frequent. Lambda Labs’ presence underscores how GPU supply still drives architecture choices and vendor selection. Even when a company isn’t a “GPU company”, it markets compatibility and performance on a specific chip or accelerator.

Why it matters: Integrations reduce friction in evaluation and procurement, shorten time-to-value, and unlock co-marketing. The catch is dependence: when your roadmap is gated by upstream APIs / cloud quotas / marketplace rules, your velocity and gross margins are partially out of your control.

Example connections:

Baseten and Modal both make it trivial to stand up stateful model endpoints and background workers with minimal devops. They win because they drop directly into Python projects and let teams switch between OpenAI/Anthropic and open source models without a platform migration.
Hightouch meets analytics teams inside Snowflake/Databricks, then layers AI enrichment. This is “AI where your warehouse lives” and not “ship data to our AI platform”.

3. Agentic workflows are hardening into an infra layer

Nearly 33% of the startups reference agents. The pattern is consistent: teams first experiment with a copilot, then hit reliability/latency/cost walls, and then go shopping for infra that makes multi-step, tool-using agents predictable.

The emerging stack looks like this:

Orchestration/frameworks: LangChain, LlamaIndex, and purpose-built orchestrators (you’ll see agent wording around Anyscale/Ray, Baseten workflows, and Modal flows).
Search/RAG backplanes: Tavily and Exa for web search. Kumo and Shaped AI for domain-specific retrieval. Twelve Labs for video search.
Guardrails and eval: PromptFoo for prompt/eval regression. Fiddler AI and Arize for production monitoring of quality/drift/safety.
Policy and identity: Aserto (authorization), plus increasing mentions of SOC2/GDPR/HIPAA alignment for enterprise use.

Example connections:

A product pipeline built on Modal (tasks), Tavily (search), PromptFoo (eval), and Arize (live monitoring) gives a team a realistic “agent SLO”. That’s a de-risked agent loop without building heavy platform code.

Risks/dependencies: Orchestrators depend on reliable search APIs and model latency guarantees. If a model vendor changes API behavior or rate limits, your agent reliability degrades. Startups that cache aggressively or can swap models at runtime without losing behavior will have an edge.

4. Search and retrieval is becoming core infra

This week’s busiest theme is retrieval, and it goes well beyond “add a vector DB”. It’s domain-aware search, freshness, entity linking, and multimodal cues.

Web and real-time search APIs: Tavily and Exa AI are leaning into agent-ready web search with rate limits, citations, and source controls. This reduces prompt glue work and makes agent actions explainable.
Specialized retrieval: Twelve Labs does video understanding/search. We also see companies positioning around enterprise semantic search with connectors to internal repos, Slack, Confluence, and data lakes.
Vector isn’t the whole story: Traditional stores (Postgres/ClickHouse) recur in the file alongside vector capabilities. Teams prefer hybrid retrieval (keyword + semantic) and reranking over “vector-only” systems.
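
A quick sketch of what hybrid retrieval means in practice. The source data doesn’t say how any of these vendors combine results, so this uses reciprocal rank fusion, one common way to merge a keyword ranking and a semantic ranking. The document ids and the two input rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list.

    Each document scores 1 / (k + rank) in every list it appears in, so
    documents that rank well in more than one list float to the top.
    k=60 is a commonly used default for this method.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # e.g. from a full-text index
semantic_hits = ["b", "d", "a"]  # e.g. from an embedding search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents “a” and “b” appear in both lists, so they outrank “c” and “d”, which each show up in only one. A reranker would then reorder the top of this fused list.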

Why it matters: Most useful agents are “retrieve-reason-act” loops. If retrieval is flaky, your agent is flaky. Retrieval vendors that publish quality/latency SLOs and offer clear cost controls will capture budget that used to belong to internal search teams.

Example connections:

Pair Exa with ClickHouse (fast storage/analytics) or MotherDuck (analytics + sharing) to build an internal news/search console that’s enterprise-grade without a big search team.
Kumo can sit above existing data to expose predictions to agents without forcing a new warehouse migration.

5. Open source as the default wedge

Roughly 33% of the companies frame an open source angle: an MIT/Apache reference build, a community operator, or an SDK. This is strongest in:

Serving/runtime: Anyscale (Ray), Lightning AI (PyTorch Lightning and friends), Seldon (Seldon Core) put production knobs around open tooling.
Data engines: ClickHouse and DuckDB (via MotherDuck) are classic examples of open source engines with commercial clouds. LakeFS (data versioning) and Supabase (Postgres BaaS) follow the same pattern.
MLOps pipelines: ZenML open-sources orchestration recipes for training/eval. Seldon publishes model deployment primitives that are enterprise-hardened in the paid edition.

Why it matters: Open source is still the fastest route to developer love and bottom-up adoption. The model that’s working is “batteries-included open source software for single-team use, plus a hosted product with policy/SAML/observability baked in”.

Risks: License drift and cloud competition. If an adjacent cloud vendor can host your open source with a thinner margin structure or better discounting, your paid tier needs real enterprise-only features (governance, isolation, SLAs) to defend itself.

6. Compute economics are the product

Mentions of hardware vendors are common. The practical pattern: AI infra is priced by tokens and milliseconds, so runtime and hardware efficiency is product differentiation.

Alternative accelerators: Groq shows up often. Its LPU-based inference emphasizes deterministic latency and high tokens/sec for chat/coding workloads. Startups that add first-class Groq support are signaling “we care about speed and cost”.
GPU cloud pragmatism: Lambda Labs is top of mind for teams that need capacity without resorting to hyperscaler lock-in. This pressure also explains the popularity of serverless runtimes (Modal, Baseten) that can bin-pack GPU work and eliminate idle time.
Data throughput/storage: Weka appears in storage-intensive contexts. Infra that feeds GPUs fast (and cheaply) becomes a competitive moat for training and high-throughput inference.

Why it matters: Inference platforms win or lose on cost curves and tail latency, not just features. Expect more vendors to publish cost/latency dashboards and to auto-route workloads across Nvidia, Groq, and CPU/AVX backends to keep SLAs and margins.
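
The auto-routing idea can be sketched as a simple selection rule: among backends that meet the latency target, pick the cheapest. The backend names, latency figures, and prices below are placeholders, not measurements of any real vendor, and a production router would also weigh capacity, model support, and quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    p95_latency_ms: float
    cost_per_1k_tokens: float


def pick_backend(backends, latency_slo_ms):
    """Cheapest backend that meets the SLO; fastest one if none does."""
    eligible = [b for b in backends if b.p95_latency_ms <= latency_slo_ms]
    if not eligible:
        return min(backends, key=lambda b: b.p95_latency_ms)
    return min(eligible, key=lambda b: b.cost_per_1k_tokens)


# Hypothetical fleet with made-up numbers.
fleet = [
    Backend("gpu-pool", p95_latency_ms=400, cost_per_1k_tokens=0.60),
    Backend("fast-accelerator", p95_latency_ms=120, cost_per_1k_tokens=0.90),
    Backend("cpu-pool", p95_latency_ms=2500, cost_per_1k_tokens=0.10),
]

print(pick_backend(fleet, latency_slo_ms=500).name)
print(pick_backend(fleet, latency_slo_ms=5000).name)
```

Tight SLOs push traffic to the faster, pricier backends, while batch-style work with loose SLOs lands on the cheap pool. That is the margin lever.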

Risk: Supply concentration. If your SLOs are tuned to a single silicon vendor, procurement shocks become product incidents.

7. Monitoring, evaluation, and governance get first-class budgets

As teams move from demos to production, they buy controls:

Observability and eval: Fiddler AI and Arize are the canonical examples in this week’s data. They cover model quality tracking, drift detection, feature attribution, and experiment comparison. Tools like PromptFoo push evals earlier in the lifecycle with test suites you can run in CI.
Security and policy: Aserto (authorization) and privacy-forward data players (Tonic AI, Mostly AI, Parallel Domain for simulated or synthetic data) help enterprises ship AI without moving PII into unmanaged systems.
Ops for k8s/on-prem: Rafay and Spectro Cloud show how regulated sectors keep agents in private clusters or at the edge. When CIOs say “air-gapped AI”, this is the pattern they buy.

Why it matters: These purchases reduce organizational risk: vendor lock-in risk (by enabling model swapping), regulatory risk (by making flows explainable), and finance risk (by tying spend to outcomes).

Risk: Tool sprawl. If eval, monitoring, and policy each live in separate tools, buyers push back and consolidate. Vendors that integrate well into Buildkite/Harness and observability backbones will fare better.

8. Layer-by-layer implications: who benefits, who’s exposed

Compute and hardware
- Winners: serverless compute that can auto-select the cheapest/fastest backend for each model class (Modal, Baseten), GPU clouds that publish predictable queues (Lambda Labs), and alt-silicon with credible software stacks (Groq).
- Exposed: single-vendor-only runtimes and any platform that can’t show unit-economics improvements quarter over quarter.
Model providers
- Winners: vendors with strong policy controls and tooling hooks (Anthropic’s Claude Code-style tools, OpenAI function-calling) that make agent loops simpler.
- Exposed: API-only vendors without enterprise-grade governance or regional hosting options.
Inference platforms
- Winners: platforms that treat cost controls, caching, and canarying as first-class features and that publish “speed under load” as a product.
- Exposed: “model hosting” without workflows, evals, or rollbacks.
Data infrastructure
- Winners: hybrid retrieval patterns that combine Postgres/ClickHouse features, embeddings, and rerankers. Data-versioning (LakeFS) that gives auditors a clean chain of custody. Friendly SQL-forward products (Supabase, Turso) that let agents read/write safely.
- Exposed: vector-only systems that don’t support hybrid search or enterprise security.
Orchestration and agents
- Winners: opinionated, production-grade workflows that treat tools as versioned APIs, with timeouts, retries, and SLOs.
- Exposed: notebooks as production.
Observability and evaluation
- Winners: tools that bridge pre-prod eval suites (PromptFoo style) and prod monitoring (Fiddler/Arize) so teams share metrics, not screenshots.
- Exposed: black-box dashboards without hooks into CI/CD or ticketing.
Security and governance
- Winners: policy engines that sit in the hot path without adding material latency (Aserto), and synthetic data that demonstrably reduces compliance scope (Tonic/Mostly).
- Exposed: vendors promising “secure by design” without proofs, logs, or SOC2-ready controls.
Edge/on-prem
- Winners: Kubernetes management (Rafay, Spectro Cloud) with curated model catalogs for air-gapped installs.
- Exposed: cloud-only offerings in healthcare, defense, and critical infrastructure.

9. Correlations and what they imply

A few notable co-occurrences jump out in the dataset:

Cost + integrations travel together. Companies that talk about FinOps also talk about partnerships. This makes sense: the fastest path to lower costs is often swapping models/runtimes based on price/perf, which requires deep integrations.
Agents + observability are tightly linked. Teams that deploy agents quickly discover they need eval/monitoring to keep incident tickets down. This supports the “agent SLO” thesis: a budget holder will pay to guarantee reliability, not just to add reasoning.
Search + observability co-occur. Retrieval quality is fragile under domain drift. Buyers are starting to ask for quality dashboards, not just precision/recall claims in a PDF.

Practical takeaway for founders: If you sell an agentic or retrieval-heavy product, ship integrations, SLOs, and cost controls in the first 90 days. If you sell an open source wedge, publish a clean enterprise demarcation (policy, SSO, RBAC, audit) and resist feature leakage into the free tier.

10. Funding vs hiring: what it says about the next 6 months

Mentions of funding are materially more common than hiring. Within funding mentions, the rate is highest for model providers and compute (roughly half of companies touching those layers also talk about new capital), and lowest in core data infrastructure. That’s consistent with market behavior: tokens and milliseconds get funded quickly, but durable data platforms raise on slower cycles and enterprise proofs.

Implications:

Expect more serverless inference and search APIs to raise in the near term. They have obvious growth levers via integrations.
Data infrastructure founders should assume more diligence on procurement, security, and total cost. Lean into hybrid retrieval and SQL-forward UX to expand the buyer set.

11. What could derail these trends

Vendor concentration. Nvidia, a few hyperscalers, and two model API leaders hold a lot of power. Price or policy changes can damage downstream gross margins. Mitigate with true multi-backend routing and pre-negotiated capacity.
Benchmarks and evals drift. Retrieval benchmarks are easy to game. Eval suites may not reflect messy production. Bake in per-customer test sets and quality telemetry, and close the loop to prompt/model changes.
Regulatory pull-forward. As agentic systems move into healthcare/financial workflows, privacy and autonomy rules tighten. Vendors that can’t provide auditability, data residency, or deterministic behaviors will get boxed out.
Tool sprawl fatigue. Buyers will resist stitching together five tools for one use case. Lean into opinionated defaults and publish a canonical “reference architecture” with two or three vendors, not eight.

12. How this affects each layer of infra in practice

For compute vendors (Lambda Labs, Groq): publish transparent queues/prices and ecosystem guides (“How to run vLLM/TensorRT-LLM here”). Partner tightly with serving platforms to become the default backend they auto-select.
For inference/serving platforms (Baseten, Modal, Anyscale, Lightning AI): treat routing + caching + eval + rollbacks as a single story. Make it trivial to go from a PromptFoo test passing in CI to a guarded rollout behind feature flags.
For data platforms (MotherDuck, ClickHouse, PlanetScale, Supabase, Turso, ParadeDB): own hybrid retrieval patterns and permissioning. Agents increasingly need safe writes (not just reads). Make row-level security and reversible migrations “agent-proof”.
For search/RAG APIs (Tavily, Exa, Twelve Labs): ship SLAs, source traceability, and per-customer cost caps. If an agent loops forever, your API bill shouldn’t spiral.
For observability/eval (Fiddler AI, Arize, PromptFoo): integrate with Buildkite/Harness so evals gate deploys. Offer budgets and drift alerts in plain English, not just charts.
For governance and synthetic data (Aserto, Tonic, Mostly, Parallel Domain): market the compliance delta: what audits get easier if the customer adopts you. For agents in enterprises, “pass the audit” is the buying trigger.
For edge/on-prem (Rafay, Spectro Cloud): package a curated model catalog with signed images, air-gapped updates, and policy-by-default. The value is “your agents run inside your cluster”, not “k8s but with AI”.
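
The “agent loops forever” point in the search/RAG item is worth a sketch. This is one hypothetical way to enforce a per-run cap on the caller’s side: `CostCap`, `BudgetExceeded`, and `run_agent` are made-up names, and the limits and per-call price are placeholders.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run would go past its call or spend limit."""


class CostCap:
    def __init__(self, max_spend_usd, max_calls):
        self.max_spend_usd = max_spend_usd
        self.max_calls = max_calls
        self.spent_usd = 0.0
        self.calls = 0

    def charge(self, cost_usd):
        """Record one API call, or refuse it if it would break a limit."""
        over_calls = self.calls + 1 > self.max_calls
        over_spend = self.spent_usd + cost_usd > self.max_spend_usd
        if over_calls or over_spend:
            raise BudgetExceeded(
                f"stopped after {self.calls} calls and ${self.spent_usd:.2f}"
            )
        self.calls += 1
        self.spent_usd += cost_usd


def run_agent(search, cap, cost_per_call_usd, is_done):
    """Keep searching until is_done says stop or the cap refuses a call."""
    results = []
    while not is_done(results):
        cap.charge(cost_per_call_usd)  # check the budget before every call
        results.append(search(len(results)))
    return results


cap = CostCap(max_spend_usd=1.00, max_calls=5)
try:
    # An agent whose stop condition never fires.
    run_agent(lambda i: f"result {i}", cap, 0.01, lambda results: False)
except BudgetExceeded as err:
    print(err)
```

The runaway loop stops at the fifth call instead of running until someone notices the invoice. A vendor-side cap works the same way, just enforced per customer at the API.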

Bringing it all together

This week’s data points to an infra market that is consolidating around agent-ready retrieval, serverless inference with real cost controls, and production-grade guardrails.

Founders who win will make integration friction effectively zero, turn latency and cost into product features, and expose observable, testable quality across the entire agent loop. Investors should expect capital to chase those patterns first. And core data platforms win by enabling hybrid retrieval, safe writes, and clean governance rather than by selling “a vector DB” alone.

Companies like Tavily, Exa, and Twelve Labs (retrieval); Baseten, Modal, Anyscale, and Lightning AI (serving/runtime); MotherDuck, ClickHouse, PlanetScale, Supabase, and Turso (data); Fiddler AI, Arize, and PromptFoo (eval/observability); and Tonic, Mostly, Parallel Domain, and Aserto (privacy/policy) illustrate how these pieces combine into reliable, economical agent systems. The common thread is practical: ship fast by integrating deeply, then earn enterprise trust with controls.

Company Deep Dive #6: SNOWFLAKE

Prateek Joshi — Fri, 15 Aug 2025 19:57:35 GMT

We’ll be diving into Snowflake today.

1. What is Snowflake and why does it matter?

Snowflake is a cloud-native data platform used by more than 11,000 organizations to store, combine, and analyze data. Think of it as a managed backbone where companies dump everything from sales transactions to product logs, then run analytics and AI on top without buying hardware or stitching together a dozen tools. It runs on AWS, Azure, and Google Cloud. And it separates storage from compute so customers pay only for what they use when they query or process data.

Why does this matter for people who build or invest in infra startups? Snowflake has become a default “home base” for analytics and AI data pipelines. When Snowflake grows, adjacent categories see more demand: ETL/ELT, orchestration, observability, governance, semantic layers, BI, ML platforms, vector search, and data marketplaces. When Snowflake slows, enterprise budgets tighten around the whole modern data stack. In short, Snowflake is a bellwether for data-first infrastructure.

What the recent numbers tell you at a glance: product revenue is just under $1 billion a quarter and growing in the mid-20% range. Net dollar retention sits in the mid-120s, which is a sign customers expand once they’re in. Gross margins are mid-70s on a non-GAAP basis. Free cash flow is healthy for a high-growth platform. The headline deceleration from hypergrowth down to “merely” strong growth is real. But the customer base (Global 2000 penetration and >$1M-spend accounts) keeps expanding. That mix points to durability even as growth normalizes.

2. What Snowflake actually does (in user terms)

Snowflake hosts your core analytical data in one place. Data engineers pipe raw data from SaaS apps, databases, and event streams. Snowflake stores it. Analysts query it with SQL. Data scientists train or score models against it. Business teams consume dashboards on top. The platform handles security, scaling, and performance so teams don’t have to babysit clusters. Two design choices matter most:

Separation of storage and compute. You can park a lot of data cheaply and spin up compute for specific jobs only when needed. That’s why finance teams like Snowflake: the bill tracks usage.
Multi-cloud neutrality. Snowflake runs on the big clouds. If you’re an AWS shop (and most Snowflake workloads run on AWS), you can still avoid lock-in to Redshift. If you prefer Azure or GCP, you get the same Snowflake experience there.

Over the last 2 years, Snowflake has broadened beyond “warehouse” into an “AI data cloud”. This includes built-in features for data sharing, governance, Python, ML, and a marketplace where vendors offer data products and apps. It has also leaned into builders: programs like “Powered by Snowflake” for ISVs, startup credits and an AI accelerator, plus investments via Snowflake Ventures. That ecosystem pull is why a lot of infra startups list a Snowflake integration on day one.

3. Market tailwinds and the size of the prize

There are two forces you can rely on:

Data keeps compounding. Every product emits logs and events. Every business team wants metrics and insights. Compliance needs lineage and audit trails. Even companies that don’t consider themselves “data companies” now ingest terabytes monthly.
AI turns “nice-to-have analytics” into “must-have infrastructure”. If you want AI in production, you need curated, governed, queryable data at scale. That’s not optional.

Depending on definitions, the “data warehouse / data platform” market sits in the tens of billions and is growing quickly. What matters more than any top-down TAM slide is penetration and runway. Snowflake already serves hundreds of Global 2000 companies, yet most large enterprises still have pockets of legacy on-prem warehouses, departmental silos, and bespoke pipelines that haven’t been modernized. Mid-market penetration is even earlier. If you’re underwriting the space, the shape looks like a long adoption curve with multiple expansion vectors — more data sources, more use cases (real-time, AI, governance), and more regions and business units.

For startups, those tailwinds translate into predictable demand patterns. Once a customer standardizes on Snowflake, they go looking for better ingestion, faster transformations, cheaper storage tiers, smarter cost controls, higher-quality governance, and applications that sit directly on the warehouse. That creates room for specialized infra products so long as they reduce time-to-value or total cost of ownership versus do-it-yourself.

4. How Snowflake makes money and what its unit signals mean

Snowflake’s revenue model is consumption-based. Customers buy credits and these credits are spent when they use compute (running queries, loading data, ML jobs) and on storage. The practical result: land with a small workload, then expand as usage grows. That’s why net dollar retention sits around the mid-120s. Once a team loads critical data and connects dashboards, the organization tends to add more data and more users over time.

The company’s non-GAAP gross margins are in the mid-70s, reflecting that it resells cloud compute/storage with value-add software on top. Operating margins are positive on a non-GAAP basis but still modest. GAAP profitability remains weighed down by stock-based comp, typical for growth platforms. Free cash flow is strong, aided by upfront deals and efficient support relative to the size of the dataset under management.

The two things founders should read from these signals:

Expansion is the core flywheel. If you sell into Snowflake customers, design your product and pricing to expand in step with data and seat growth i.e. usage-based SKUs, straightforward attach, and visible ROI in the admin’s cost dashboard.
Cost visibility matters. Snowflake’s consumption can surprise finance teams if workloads aren’t governed. Tools that give spend predictability, workload scheduling, and performance tuning have tangible value. If you provide them, quantify savings in hours and dollars.

5. The ecosystem: partners, marketplace, and why co-selling works

The center of gravity is AWS. A large majority of Snowflake’s workloads run there. That deep alignment (plus similar tie-ups with Azure and GCP) leads to a co-sell motion where Snowflake and the cloud provider bring each other into enterprise deals. For startups, this means two things:

If you’re building a product that complements Snowflake, become a first-class integration and explore the partner programs. “Powered by Snowflake”, the Marketplace, and Snowflake’s startup accelerator can shorten your path to customers. AWS credits targeted at Snowflake-building startups sweeten early runway.
If you’re building an alternative data platform, you’ll be competing not only with Snowflake’s direct sales team but also with its partner field resources and an installed base invested in Snowflake skills, tooling, and data models. Your wedge needs to be sharp: a 10x on a real pain point, not a 20% improvement.

The Marketplace angle deserves emphasis. It allows data vendors and app builders to sell to Snowflake customers without complex deployment. That lowers distribution friction for startups, especially in vertical data products (healthcare, fintech, climate), synthetic data, enrichment feeds, and analytics apps. For founders planning pipeline, that channel can be the difference between a long enterprise slog and a repeatable sales motion.

6. Competition and pricing dynamics

The big three cloud providers all offer their own analytic databases: Redshift (AWS), BigQuery (Google), and Synapse/Fabric (Microsoft). Databricks is the cross-cloud heavyweight with a focus on the “lakehouse” and AI tooling. Each has rational appeal:

Cloud-native warehouses integrate tightly with their own clouds, sometimes with bundle economics that are hard to ignore if a customer is “all-in” on a given vendor.
Databricks resonates where data science and ML are the main events, and where notebook-centric developer workflows are the norm.
Snowflake wins on simplicity, predictable performance, governance, and true multi-cloud neutrality, especially in organizations that need to share data across partners or acquisitions.

Pricing is rarely apples-to-apples and often comes down to total cost of ownership, not list price. A “cheaper” engine that demands more ops or yields longer runtimes can cost more overall. Where Snowflake loses, two reasons show up often: (1) a customer standardizes on the native cloud stack for consolidation simplicity, and (2) heavy ML shops prefer Databricks’ developer ergonomics. Where Snowflake wins, customers cite low-friction scaling, governance, data sharing, and fewer knobs to tune.

For startups, the takeaway is to assume a multi-platform world and meet customers where they are. If your product only works with one warehouse, you’ve narrowed your market more than you think. If it works better with one (e.g. deep Snowflake integration) but still functions elsewhere, you keep optionality while riding Snowflake’s distribution.

7. Financial health as a signal for investors and your sales pipeline

Snowflake’s latest prints show mid-20s growth off a multi-billion revenue base, high net retention, rising large-customer counts, mid-70s non-GAAP gross margins, and strong free cash flow. At the same time, the growth curve has flattened from earlier years. And GAAP profitability is still out of reach. Here’s how to interpret that for your own planning:

Budgets are expanding, but with scrutiny. Data teams still spend, but CFOs want predictability. Products that help customers hit value targets or control spend continue to land. The “nice-to-have” tools struggle.
Pipeline quality > pipeline size. With Snowflake growing steadily but not explosively, downstream categories don’t get automatic lift. Founders should prioritize ICP discipline and ROI-first messaging over broad, top-of-funnel experiments.
Stock and cash position fuel the ecosystem. As long as Snowflake’s free cash flow is solid and the market rewards durable growth, programs like credits, marketplace incentives, and startup acceleration remain funded. If that changes, discount rates across data infra tighten quickly.

A practical heuristic: if Snowflake’s net retention ticks down meaningfully or RPO growth slows, expect longer sales cycles across the stack two to three quarters later. Conversely, when Snowflake’s large-customer growth re-accelerates, expect higher attach rates for complementary tooling shortly after.

8. How many infra startups are actually exposed

Not every infra startup lives and dies with Snowflake, but many are intertwined. Here’s a directional breakdown based on how customers actually build:

Directly dependent (high exposure): ETL/ELT into warehouses. Reverse ETL. Data quality/observability for SQL/warehouse pipelines. Governance and cataloging tuned to warehouse metadata. Semantic layers and BI that query Snowflake. Usage optimization/cost control for Snowflake. If your product inherently assumes a warehouse and Snowflake is the most common one, you’re in this bucket. Roughly 40-45% of venture-backed data-infra startups in the market fall here by function.
Indirectly dependent (moderate exposure): ML platforms that read/write from the warehouse. Vector/search layers that sync embeddings from warehouse tables. Privacy/security overlays. Data applications “powered by” the warehouse. Vertical analytics where Snowflake is the default store. A large slice of the remainder sits here, especially in AI application infrastructure.
Loosely coupled (low exposure): Core developer tools, CI/CD, general-purpose observability for app services, container/Kubernetes tooling, and non-warehouse databases (OLTP, time-series, graph) that often live alongside but not inside the warehouse footprint. These still feel macro data headwinds/tailwinds but don’t map 1:1 to Snowflake cycles.

If you’re fundraising, assume your investor will ask: “What % of your current ARR is in Snowflake-centric accounts? And what % of your pipeline references Snowflake explicitly?”. You should know both numbers. If >60% of your ARR depends on Snowflake usage expanding, your growth will correlate to their cycle. That’s not inherently bad, but you should explain your diversification plan by warehouse, by cloud, and by use case.

Two more practical notes:

ISVs building on Snowflake’s platform (e.g. Marketplace apps) can scale faster with lower deployment friction. The trade-off is platform dependency. Price that risk into your roadmap and contracts (e.g. plan for a second platform by a certain ARR milestone).
Data vendors selling feeds directly in the Marketplace can win on distribution. The edge comes from curation and refresh SLAs. If your data is “commodity”, the Marketplace’s transparency can compress margins unless you add proprietary enrichment or quality guarantees.

9. Risks, correlations, dependencies: How to monitor them without handwaving

Macro and budget risk. When CFOs tighten spend, consumption platforms are the first to feel it. Usage throttles, postponed migrations, and smaller commitments ripple through to everyone who sells into the same teams. Watch for early signs in smaller cohorts (SMB/mid-market churn rises first) and in Snowflake’s remaining performance obligations growth. If bookings slow, downstream products will feel it.

Competitive pressure. If Microsoft’s Fabric bundle or Google’s BigQuery pricing makes “all-in” economics too attractive, some customers will consolidate. Likewise, Databricks can pull workloads where ML is central. Founders should take this as a design constraint: make your product work well across at least two data backbones and keep your total value proposition independent of any single vendor’s roadmap.

Cloud dependence. Snowflake rides on the hyperscalers’ rails. If underlying cloud costs change or relationships shift, Snowflake’s own unit economics and pricing could move. In practice, that would show up as either pass-through price changes or new SKUs that nudge customers to particular usage patterns. Products that help customers optimize spend become more valuable in those moments.

Execution and reliability. Incidents, security issues, or delayed roadmap items can shake confidence. Keep a lightweight risk register: note uptime status, incident post-mortems, and major feature delivery slippage. If you sell governance or reliability tooling, these become selling moments. But tread carefully and sell value, not FUD (fear, uncertainty, and doubt).

Regulatory and data residency. New privacy rules or localization requirements can complicate centralized data strategies. If you build security, lineage, synthetic data, or anonymization tooling, these shifts are an opportunity. If you’re a Marketplace ISV, stay ahead on certifications and document the controls your customers’ compliance officers will ask about.

Correlation mechanics for planning. A simple way to quantify exposure: calculate the share of your revenue and pipeline tied to customers who list Snowflake as a strategic platform, then apply a haircut equal to Snowflake’s last-reported net retention delta if it deteriorates (e.g. if NRR went from 124% to 118%, model a 5–10% drag on your expansion-led growth). It’s not precise, but it forces explicit assumptions and helps set board expectations.
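
The exposure math above can be sketched in a few lines of Python. This is a rough illustration rather than a standard formula: the account records, the `snowflake_centric` flag, and the choice to scale the haircut by the exposed share of ARR are all assumptions made for the example. The NRR values are the ones from the paragraph above.

```python
# Hypothetical sketch: all names and account figures are illustrative assumptions.

def snowflake_exposure(accounts):
    """Share of ARR from accounts flagged as Snowflake-centric."""
    total = sum(a["arr"] for a in accounts)
    if total == 0:
        return 0.0
    exposed = sum(a["arr"] for a in accounts if a["snowflake_centric"])
    return exposed / total


def haircut_growth(planned_growth, exposure, nrr_prev, nrr_now):
    """Trim planned expansion-led growth when Snowflake's NRR deteriorates.

    Assumption: the relative drag equals the NRR drop (1.24 -> 1.18 is 0.06),
    applied only to the exposed share of revenue.
    """
    drop = max(0.0, nrr_prev - nrr_now)  # no haircut if NRR held or improved
    return planned_growth * (1.0 - exposure * drop)


accounts = [
    {"name": "acct_a", "arr": 600_000, "snowflake_centric": True},
    {"name": "acct_b", "arr": 400_000, "snowflake_centric": False},
]
exposure = snowflake_exposure(accounts)
adjusted = haircut_growth(0.30, exposure, 1.24, 1.18)
print(f"exposure={exposure:.0%}, adjusted growth={adjusted:.1%}")
```

With 60% of ARR exposed and a six-point NRR drop, a planned 30% expansion-led growth rate comes down to roughly 29%. If you would rather apply the full 5–10% drag regardless of exposure, pass `exposure=1.0` and adjust the drop accordingly.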

10. Near-term catalysts and practical playbooks for founders + investors

Product catalysts. Snowflake’s continued push into AI (native inference, feature stores, Python ergonomics) and its Postgres move expand the reachable developer base. If these land well, expect more mid-market adoption and more “apps-on-the-warehouse” startups. For ISVs, that’s a bigger total market and more reasons to integrate deeply.

Go-to-market catalysts. Marketplace improvements and stronger co-sell with the hyperscalers can shorten enterprise sales cycles. If you’re an ISV, align your enablement with Snowflake’s field teams and speak their value language: governance, time-to-insight, and cost visibility. If you’re a challenger platform, concentrate on verticals or workloads where you’re dramatically better. You won’t out-distribute Snowflake, but you can out-outcome it in a niche.

M&A and partnerships. Snowflake will keep buying capabilities that help them win developer mindshare and AI workloads. Each acquisition is a signal about adjacencies they consider strategic. If you’re in one of those adjacencies, decide whether to lean in as a partner, differentiate sharply, or pivot up/down-stack.

Fundraising timing. When Snowflake prints clean quarters (steady growth, strong RPO, upbeat guidance), investors re-risk in data infra. If prints wobble, valuations compress first in the “picks and shovels” around the warehouse. If you’re raising, aim to announce on the heels of strong industry prints. If you must raise in a soft patch, show more concrete ROI and clear multi-platform support.

Sales playbooks that work today. Three approaches can land here:

Prove a hard dollar cost reduction on Snowflake spend (e.g. 20–30% savings) with a two-week pilot.
Show a time-to-insight improvement that unblocks a specific recurring workflow (e.g. log analysis or revenue ops dashboards) and quantify the saved hours per month.
Deliver a governance/compliance control that a CISO or risk team needs for an audit this quarter.

If your pitch depends on “eventual” benefits or indefinite “AI readiness”, reframe it into one of the above.

11. What would change our view

If you’re using Snowflake as a market barometer, keep a simple dashboard:

Growth cadence: Is product revenue growth holding in the mid-20s or re-accelerating? A steady 25–30% off a multi-billion base is bullish enough for the ecosystem.
Net dollar retention: Above 120% signals healthy expansion. Sustained drift below that hints at spend discipline or competitive encroachment.
Large customer counts: >$1M-spend customers and Global 2000 penetration are the “quality of demand” metrics.
Bookings: This is forward demand. If this slows, plan a conservative second half.
Free cash flow: Healthy free cash flow keeps ecosystem programs funded. A squeeze here means fewer credits and tighter co-marketing.
Incident/uptime trend and roadmap delivery: Confidence indicators for buyers and for ISVs riding along.

Three positive triggers that would make the case for being more aggressive in backing Snowflake-adjacent companies:

Re-acceleration of large-customer adds and NRR trending back toward mid-120s or higher.
Material Marketplace GMV growth with improved ISV monetization terms.
Clear technical wins in AI features that demonstrably shift workloads from external tools into Snowflake.

Three negative triggers that would lead to more caution:

Two or more quarters of NRR drifting toward ~115% with flat RPO growth.
Hyperscaler bundle pressure that pulls major accounts back to native warehouses.
Reliability or security incidents that cause high-profile churn.

Putting it all together: What this means for infra startups and investors

Snowflake is no longer a scrappy disruptor. It’s part of the core enterprise fabric. That status brings both stability and gravity. Stability, because the platform has proven staying power, deep compliance, and entrenched workflows. Gravity, because it bends the rest of the data stack around it: partners, pricing, and buyer expectations.

If you’re a founder:

Pick your angle consciously. Be the accelerant (cost control, performance, governance), the specialist (vertical apps, domain data, compliance), or the alternative (superior engine for a specific workload). Don’t be a fuzzy middle.
Design for co-sell. Align your messaging with Snowflake’s outcomes. Show how you help customers extract more value from their existing Snowflake investment without surprise costs.
Diversify with intent. Even if Snowflake is your best channel, prioritize a second platform by a revenue milestone so you’re resilient to vendor cycles.

If you’re a VC or corp dev lead:

Expect correlation. A portion of your data-infra portfolio’s growth will correlate with Snowflake’s cycle. Model it. Use Snowflake’s quarterly signals to adjust reserves and pace.
Underwrite distribution. Products that can ride the Marketplace or co-sell motions deserve higher go-to-market credit in your model. Challengers need clearly superior unit outcomes to offset distribution disadvantage.
Watch for adjacencies Snowflake treats as strategic. Those areas will either be great M&A outcomes for portfolio companies or regions where Snowflake’s native products compress startup room. Time your bets accordingly.

The short version: the “data cloud” has matured into a dependable layer of enterprise infrastructure. Snowflake’s growth may not be the rocket ride it once was. But its combination of broad adoption, usage expansion, and cash generation makes it a reliable anchor for the ecosystem. For infra startups, that reliability can be an advantage if you plug into it smartly, measure your exposure, and keep a second engine humming in case the winds shift.


Company Deep Dive #5 - DATABRICKS

Prateek Joshi — Sun, 10 Aug 2025 16:02:35 GMT

We’ll be diving into the data juggernaut Databricks today.

1. Overview and Why Databricks Matters Now

Databricks has emerged as one of the most important companies in enterprise data and AI infrastructure. Founded in 2013 by the creators of Apache Spark, it began as a way for data scientists and engineers to process very large datasets quickly.

Over the past decade, it has grown into a unified data lakehouse platform. It combines the capabilities of a data lake (flexible, inexpensive storage) and a data warehouse (fast, structured queries) in one place. This means customers can store all their raw data in one system, process it at scale, run advanced analytics, and build AI models without moving between different tools.

The company is still private but has scaled to an estimated $3 billion in annual revenue with more than 10,000 customers globally, including over half of the Fortune 500 (company figures, October 2024). It reached free cash flow breakeven in 2024, signaling that it’s no longer in the “burn cash to grow” phase that defines many startups. Its most recent funding round, announced in December 2024, valued it at $62 billion.

Databricks is not just a big player in its own right. It’s a bellwether for a broad category of infra startups. Many emerging companies build tools that plug into, extend, or compete with Databricks. As Databricks grows, it can pull whole segments of the infra market along with it, or crush weaker competitors in overlapping areas.

2. What Databricks Actually Offers

To understand the company’s market power, you need to understand what its product does and why it’s attractive to both engineers and business leaders.

Unified data environment: Historically, companies used different tools for different steps of the data workflow: a raw data store like Hadoop or Amazon S3 for holding files, a separate data warehouse like Snowflake or Teradata for analytics, and perhaps another platform for machine learning. This meant data had to be copied between systems, slowing work and creating risk of inconsistencies. Databricks’ “lakehouse” model lets teams do all those steps (store, clean, query, analyze, and run AI) in one integrated place.

AI integration: Through tools like MLflow (open-sourced by Databricks) and its 2023 acquisition of MosaicML, Databricks lets customers train, fine-tune, and deploy AI models directly on their own data. With the MosaicML technology, customers can run LLMs with fine-tuning that respects their privacy and regulatory needs. This integration is timely: many enterprises want to harness AI without sending sensitive data to a public API like OpenAI’s.

Open-source foundation: The platform is built on open standards like Apache Parquet (a storage format) and Delta Lake (for transactional reliability in data lakes). These open-source roots make Databricks easier to integrate with other tools and reduce fear of vendor lock-in. Engineers are more willing to commit to it because they can still work with their data outside Databricks if needed.

Multi-cloud compatibility: Unlike cloud provider-native tools that run only on one platform (e.g. AWS Redshift), Databricks runs on all three major public clouds (AWS, Azure, Google Cloud). This is important for companies that have multi-cloud strategies or want to avoid being tied too tightly to one provider.

3. The Market Context

Databricks sits at the center of several huge and fast-growing markets. The global big data and analytics market was estimated at $348 billion in 2023 and is projected to grow at over 13% annually through 2030 (Grand View Research, April 2024). Within that, the market for data platforms that unify analytics and AI (the “lakehouse” niche) is newer but expanding faster. This is fueled by enterprises shifting to cloud data storage and adding AI workloads.

Adoption of AI across industries is a major tailwind. Every AI application (from predictive maintenance in manufacturing to personalized recommendations in retail) depends on robust data infrastructure. Databricks directly benefits from that wave: before you can train a good AI model, you need a clean, accessible, and well-structured dataset. And that’s what Databricks enables.

Another growth driver is the move away from legacy, on-premise data warehouses toward cloud-based and hybrid solutions. Companies still running on older systems like Oracle or Teradata are potential customers as they modernize. Databricks’s multi-cloud flexibility makes it appealing for these migration projects.

4. Competition and Positioning

Databricks’s closest high-profile competitor is Snowflake. Both companies want to be the central hub for enterprise data, but they come from different starting points. Snowflake began as a cloud-native data warehouse optimized for structured data and SQL analytics. Databricks began in big data processing and machine learning. Over the past few years, they have been converging:

Snowflake has added machine learning and unstructured data capabilities.
Databricks has improved its SQL support and ease-of-use for business analysts.

Cloud giants like AWS, Azure, and Google Cloud are also competitors since each offers their own analytics and AI services. But those tend to be more siloed and less open. Databricks wins when a customer wants one environment for both engineering and analytics and doesn’t want to be locked to a single cloud provider.

The competitive dynamic matters for infra startups. If Databricks wins more accounts against Snowflake, it shapes the ecosystem for add-on tools. For example, a startup building a monitoring tool for Databricks pipelines will see a bigger market. But one tightly integrated with Snowflake might have a smaller addressable market if Databricks’s share grows.

5. Financial and Operational Performance

Databricks’s revenue reached roughly $3 billion in the year to October 2024, up from around $1.9 billion in 2023, a growth rate of about 58% (company figures, October 2024). That’s faster than Snowflake, which grew 36% year-over-year in its latest fiscal year.

Gross margins (the percentage of revenue left after covering the cost of delivering the service) are in the mid-80% range for the core software business. That’s in line with best-in-class SaaS companies and higher than many cloud infra companies whose margins are eroded by heavy compute costs.

Customer retention is extremely strong. Net revenue retention is estimated at around 140%. That means existing customers, net of churn and downgrades, increase their spending by about 40% year-over-year on average. This is a sign of both product stickiness and expansion potential within accounts.
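
To make the retention figure concrete, here is a small Python sketch of how net revenue retention is typically computed for a cohort. The customer names and revenue numbers are made up so the result lands on 140%; they are not Databricks data.

```python
# Hypothetical sketch: customer names and revenue figures are made up for illustration.

def net_revenue_retention(revenue_year_ago, revenue_now):
    """Revenue today from last year's customers, divided by their revenue a year ago.

    Churned customers count as zero; customers acquired since are excluded.
    """
    base = sum(revenue_year_ago.values())
    if base == 0:
        raise ValueError("cohort has no starting revenue")
    retained = sum(revenue_now.get(name, 0.0) for name in revenue_year_ago)
    return retained / base


year_ago = {"acme": 100.0, "globex": 200.0, "initech": 100.0}
now = {"acme": 180.0, "globex": 380.0, "hooli": 500.0}  # initech churned, hooli is new
nrr = net_revenue_retention(year_ago, now)
print(f"NRR = {nrr:.0%}")
```

Note that the new customer adds nothing to the ratio and the churned one drags it down. NRR isolates expansion inside the existing base, which is why it is read as a stickiness signal rather than a growth number.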

The company also crossed into free cash flow positive territory in 2024, meaning it’s no longer dependent on outside funding to sustain operations. For potential employees, this is important: it signals stability and reduces the risk of deep cost-cutting in a downturn.

6. Risks and Dependencies

Databricks’s trajectory isn’t guaranteed. There are several risk areas to watch:

Competitive intensity: Snowflake is not standing still. And the cloud hyperscalers are constantly improving their native offerings. If a cloud provider bundles a full-fledged lakehouse-style product at a lower price, Databricks could face pricing pressure.

Macro environment: The company’s usage-based pricing means that if customers cut workloads in a downturn, revenue could slow quickly. Smaller infra startups have seen this effect sharply in past slowdowns. Databricks is more resilient, but not immune.

Complexity barrier: Despite recent improvements, Databricks can still feel daunting for less technical teams. If adoption stalls in the “business analyst” user segment, Snowflake’s simpler interface could win deals.

Security and compliance: Handling sensitive data for large enterprises means any breach or compliance failure could be a major reputational and financial hit.

Dependencies also exist in its growth model: Databricks’ expansion drives (and depends on) the health of complementary infrastructure. It needs a robust partner ecosystem to meet customer demands in areas it doesn’t fully cover itself such as specialized data ingestion, industry-specific AI models, or compliance automation.

7. Impact on the Infra Startup Ecosystem

Databricks’s growth has a direct ripple effect across the broader infra startup market:

Complementary startups get a bigger pie: Companies building data connectors, orchestration tools, observability platforms, governance layers, or AI deployment systems can integrate with Databricks and ride its expansion. For example, ETL providers like Fivetran benefit as customers feed more data into Databricks.

Adjacent categories may get absorbed: Startups that build features Databricks can easily add risk being outcompeted. For instance, a standalone notebook-based machine learning workflow tool might find customers prefer the built-in MLflow inside Databricks.

Cloud optimization tools gain indirectly: Databricks workloads consume significant compute and storage on AWS, Azure, and GCP. As Databricks’s usage grows, so does demand for startups offering cloud cost optimization, performance tuning, and monitoring.

Higher technical standards in the market: As Databricks sets the bar for scalability, reliability, and openness, startups in adjacent infrastructure categories may need to meet similar standards to be considered enterprise-grade partners.

In short, the winners will be those that complement and extend Databricks’s capabilities, not those that try to replicate them. The losers will be those whose entire product overlaps with a Databricks roadmap item.

8. The 24-Month Outlook

The next two years are likely to include:

Continued strong growth in large enterprise accounts, especially in regulated industries like financial services, healthcare, and government, where Databricks’s security and governance features are a differentiator.
IPO readiness. The company’s scale, growth, and profitability profile make a public listing highly plausible in 2025–2026. Public market valuation will depend on maintaining high growth while demonstrating operating leverage.
Deeper AI integration. Expect Databricks to push hard on enterprise AI features, from fine-tuning LLMs with MosaicML to building tools for deploying AI agents. This will both strengthen its value to existing customers and open doors to new ones.
Ecosystem expansion. More integrations, partner-built apps, and vertical solutions will likely emerge, further entrenching Databricks in enterprise workflows.

For infra startups, this means the clock is ticking to align with the Databricks ecosystem if you want to ride the wave. Being “Databricks-native” could become as valuable in data infra as being “AWS-native” became in cloud infrastructure a decade ago.

9. Bottom Line

Databricks combines strong technology, excellent financial performance, and a favorable market environment. Its unified approach to data and AI positions it well against both pure-play rivals like Snowflake and the native services of cloud giants. The risks (especially competitive pressure and macroeconomic headwinds) are real. But the company’s scale, retention, and cash flow give it resilience.

For investors, the company’s growth is a signal that the unified data + AI platform model is winning. For potential employees, it offers the stability of a profitable, late-stage company with the upside of pre-IPO equity. And for infra startups, it’s a gravitational force in the market: align with it and you may find your market expanding. Compete head-on in its core territory and you’ll face an uphill battle.


Startup Tracker #2 - What this week's infra moves reveal

Prateek Joshi — Sat, 09 Aug 2025 16:02:14 GMT

A lot of movement in the world of infra startups this week. Let’s look at what the signals are.

1. Infra is consolidating around three control points

Three control points dominate how infra startups are creating or losing leverage:

(a) Data gravity (stores, pipelines, feature platforms).
The updates from Hightouch, Featureform, PlanetScale, Chroma, ClickHouse, and several data quality/monitoring vendors reinforce a simple pattern: whoever sits closest to production data (and can move it into AI‑usable shapes reliably) gets pulled into the most customer decisions. Reverse ETL vendors are driving “operational AI” connectivity, feature stores are making training/inference consistent across teams, and cloud databases are racing to make vector and JSON-native workloads feel first‑class without bolting on new systems.

(b) Inference control planes (compute + model routing).
News touching Modular, Replicate, Fireworks AI, Together AI, Baseten, Modal, Render, and Lambda Labs points to a busy middle layer where customers want a single API or console to route across models, runtimes, and GPUs. A lot of the product updates revolve around latency wins, autoscaling improvements, or “one-click” integrations with popular model endpoints. The competitive axis is simplicity under bursty/spiky loads plus price predictability.

(c) Developer workflow anchors (agents, orchestration, CI/CD for LLMs).
Temporal, Seldon, Fiddler, Cleanlab, Evidently, Scale AI, Cursor, Replit, and Lightning AI updates lean toward “make it safe and repeatable”. Startups are blending agents, evaluators, and governance checks into familiar deployment workflows. The winners are making AI look like software engineering again: unit‑style evals, lineage, rollout policies, drift monitoring, test sets, and traceability that security and compliance teams can live with.

If you’re mapping risk and dependencies across the companies you track, these updates read like a reminder that everything connects to those three control points. Data platforms decide what’s possible, inference control planes decide what’s fast/affordable, and workflow anchors decide what’s shippable in the enterprise.

2. Funding and market sentiment: capital flows to “AI next to revenue”

A nontrivial chunk of the updates is about fresh rounds, or rumored rounds, for companies doing one of two things:

Pushing AI to the revenue edge. Glean (enterprise search and knowledge work), Hebbia (document‑heavy analysis), Tavus (personalized media at scale), and a handful of API‑first infra companies show investor appetite for AI that “touches revenue” quickly. Funding concentrates around startups that shorten the cycle from data to outcome. This is especially true where the infra is visible to GTM teams, not just data teams.

Lowering the cost of shipping. Baseten, Predibase, Fireworks AI, Together AI, Fal AI, and Replicate cluster around making model serving/switching cheap, fast, and boring. The news emphasizes inference routing, GPU ops, model compatibility, and integrations (e.g. with vector databases or orchestration frameworks). Funding flows to platforms that absorb the infra pain of scale and let product teams swap models without rewiring backends.

What this implies: If your goal is to forecast who can raise again on good terms, look for startups that (1) show up in sales workflows and (2) take meaningful infra work off engineering’s plate. Those two threads recur across the fundraise-related blurbs this week.

Correlations to notice

The movements show a strong co‑mention pattern between model‑serving platforms and vector/database systems (e.g. Together AI paired with Chroma/ClickHouse, Baseten/Modal paired with embeddings stores), and between feature stores/activation tools (Featureform/Hightouch) and enterprise orchestration (Temporal/Seldon).

If you’re diligencing any one of these, you want to check the adjacent layer for partner references and co‑selling motions. The companies that show up together in customer stacks are disproportionately likely to keep showing up together.

3. Product direction: speed, safety, and switchability

Most of the product updates across dozens of companies fall into three buckets:

Speed

Updates from Modular, Groq, Lambda Labs, and inference platforms emphasize throughput and tail‑latency improvements. Startups are pushing specialized runtimes, kernel‑level optimizations, and better autoscalers.

In practice, customers buy these when they’re paying real inference bills and hitting p95/p99 pain. For people who may not know, p95/p99 are percentiles used to measure latency in performance metrics (p95 represents the latency value below which 95% of requests fall, while p99 represents the latency value below which 99% of requests fall).
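
As a rough illustration of that definition, here is a minimal Python sketch that computes p50, p95, and p99 with the nearest-rank method. The function name and the latency samples are made up for the example, and real monitoring systems often use interpolated or approximate percentiles instead.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds (illustrative numbers only).
samples = [12, 15, 14, 13, 250, 16, 15, 14, 13, 900]

p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

In this made-up sample, two slow requests out of ten barely move the median but dominate p95 and p99, which is why teams with real traffic look at the tail rather than the average.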

The news tone suggests customers now benchmark “end‑to‑end latency with guardrails” rather than just raw tokens-per-second. So systems that keep middleware simple gain share.

Safety and evaluation

The releases from Evidently, Cleanlab, Fiddler, and Scale AI (along with evaluation notes from dev tool vendors like Cursor/Lightning/Replit) orbit the same idea: ship evaluation alongside deployment.

You see terms like “drift detection”, “bias checks”, and “guardrail policies”. But the underlying direction is standardizing an eval taxonomy teams can actually maintain. In short: policy as code for LLMs.

Switchability

Companies like Together AI, Fireworks, Replicate, Baseten, Predibase, and Modal press on portability: One API, many models/runtimes. One console, many clouds/GPUs. Minimal glue code.

Chroma and PlanetScale nod to the same theme on the data side with “bring your own embeddings” and SQL‑native vector patterns. The strategy is obvious: if you make switching cheap, customers will come to you because they’re nervous about lock‑in elsewhere.

A note on risk: Switchability is a double‑edged sword. If you make it trivial to swap models/providers, you must compete on reliability (not just price). The news highlights up‑time promises, observability built‑ins, and rollback features. Those become the new brand.

4. Data infrastructure: the stack is coalescing around “feature-ready + vector-native”

Hightouch, Featureform, PlanetScale, ClickHouse, Chroma, along with quality and lineage tools, collectively outline the modern AI data path:

Ingest, standardize, and activate.
Reverse ETL and activation vendors (Hightouch) are positioning as the “last mile” into operational systems. The news focuses on new connectors and better sync reliability. Feature stores/platforms (Featureform) emphasize consistent features across training and inference, reducing skew and giving model teams common primitives. This is the connective tissue between data teams and product teams.

Store and query in two modes: transactional and vector.
PlanetScale and ClickHouse items highlight SQL ergonomics even as vector embeddings slide into the picture. The pattern is “don’t make your engineers learn a new database to add semantic search or RAG.” Chroma remains a common pairing with inference platforms, indicating ongoing traction as the vector‑first option in stacks chasing fast time‑to‑market.

Observability and quality are no longer optional.
Cleanlab, Evidently, Fiddler, and WhyLabs‑like tooling show up as safety rails for data and predictions. Drift, label noise, and test‑set curation are treated as product features, not ad hoc scripts. That pays off when customers need to pass audits.

Dependencies and risks:

Feature platforms depend on reliable upstream pipelines. Failures there ripple into “silent regressions” at inference time.
Vector systems depend on embedding model stability. Changes in models can invalidate neighborhoods or require re‑indexing. Ops teams must plan for that.
Reverse ETL tools depend on third‑party SaaS APIs. Breaking changes can cause outages that customers will blame on the messenger.

The market is rewarding vendors that remove system sprawl. If a startup can keep engineers inside SQL while getting RAG or keep a single feature definition across batch and real‑time, it wins more easily in conservative enterprise accounts.

5. Compute and runtime: GPUs are the bottleneck, but the control plane is the prize

Updates from Lambda Labs, Render, Modal, Baseten, Fireworks AI, Together AI, and Groq add up to a clear message: the compute market is moving from “get a GPU” to “get a predictable SLO (service level objective)”.

Provisioning is commoditizing, orchestration is not.
Bare‑metal providers and GPU clouds compete on availability and price per hour. But the product updates this week highlight autoscaling, queue management, capacity reservations, spillover to alternative hardware, and quotas per tenant. Those are orchestration problems. Platforms that do this well are becoming the front door for enterprise AI teams that don’t want to negotiate with five GPU vendors.

Model‑runtime specialization continues.
Modular’s work around high‑performance runtimes and Groq’s insistence on deterministic low‑latency inference show the “custom engine” thesis isn’t dead. These systems win where latencies must be predictable, e.g. agents that chain many calls, ad pricing, and real‑time personalization. The risk is compatibility churn: every new model or tokenizer tweak is a new round of engineering.

Economics still matter.
A recurring subtext across the news: TCO (total cost of ownership) comparisons are back in vogue. Not just “dollars per million tokens”, but total cost after guardrails, vector queries, and orchestration overhead. Platforms that publish clear, comparable pricing (and let customers turn costly features off when they don’t need them) showed up more often this week.

Dependencies:

Inference platforms depend on upstream model providers’ rate limits and terms of service. When those change, platforms must re‑route traffic or provide self‑hosted fallbacks.
GPU clouds depend on hardware vendor roadmaps and supply chains. Any hiccup (driver bugs, new memory configurations) can surface as customer incidents at the platform layer.

6. Agents, orchestration, and dev tools: making AI boring enough for enterprise

News on Seldon AI (deployment/governance), Temporal (reliable long‑running workflows), Fiddler (monitoring), Scale AI (data/evals), Cursor/Replit (coding assistants), Lightning AI (training/inference tooling), and Netlify (platform-level updates) converges on the same goal: reduce “AI glue” and make it feel like standard software delivery again.

Agents are graduating into typed, testable workflows.
Instead of free‑form agents, the updates emphasize tools/skills with clear schemas, retry policies, timeouts, and human‑in‑the‑loop checkpoints. That’s why Temporal shows up alongside AI deployment notes: you need a workflow runtime to survive non‑determinism and vendor flakiness. Seldon and similar platforms then anchor model packaging, promotion, and guardrail enforcement.

Evaluations and governance are “shifting left”.
Scale AI, Evidently, and Cleanlab emphasize building evaluation sets pre‑deployment and keeping them fresh after release. The trend is toward “unit tests for prompts, integration tests for agents”. Promotion is now gated by evals the same way merges are gated by code coverage.
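
A minimal sketch of what an eval-gated promotion check could look like, in Python. The eval names, the thresholds, and the `should_promote` helper are hypothetical and not taken from any vendor's product. The point is only that the gate is an explicit, versionable rule rather than a judgment call.

```python
def should_promote(scores, thresholds):
    """Return (ok, failures). ok is True only when every required eval
    is present and meets its minimum score."""
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        # A missing eval counts as a failure, just like a low score.
        if score is None or score < minimum:
            failures.append(name)
    return (not failures, failures)

# Hypothetical gate definition and candidate results (illustrative only).
thresholds = {"answer_accuracy": 0.90, "refusal_on_unsafe": 0.99}
candidate = {"answer_accuracy": 0.93, "refusal_on_unsafe": 0.97}

ok, failures = should_promote(candidate, thresholds)
print("promote" if ok else f"blocked by: {failures}")
```

Here the candidate clears the accuracy bar but is blocked on the safety eval, the same way a merge is blocked when a required check fails.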

Developer adoption is the tip of the spear.
Cursor, Replit, and Lightning keep focusing on the day‑to‑day feel: better model switching, better repos/workspaces, faster inner loops. The tone in the news implies that IDE‑anchored usage still pulls infra decisions along with it: the tools developers love influence which inference APIs, vector stores, and feature platforms get adopted.

A note on risk: Shadow AI. If teams spin up agents and evals outside centralized infra, you get drift and duplicated costs (and governance gaps). The platforms that centralize this (with APIs developers don’t hate) will lock down enterprise share.

7. Open-source, community, and the ecosystem

Several items in the news talk about open‑source releases, GitHub activity, or community‑scale launches (Chroma, ClickHouse, Lightning AI, Seldon, and a few model serving projects). The pattern looks familiar:

Why open-source is still strategic for infra startups:

Distribution: Open source gets you into proof-of-concept stages without procurement drama.
Telemetry: With opt‑in reporting or cloud-hosted “teams” editions, you learn what features matter.
Conversion: The paid path tends to be enterprise SSO, governance, and reliability features. These are things that legal/compliance teams need rather than what developers want for weekend projects.

Where it bites:

If your cloud product looks like a thin wrapper over the core, then people just fork the open source project and do it on their own.
Support burden can spiral if your open source community becomes your Tier‑1 helpdesk.

The open‑source mentions often accompany “cloud” or “teams” editions. That’s the trade: keep the core attractive to devs while gating enterprise‑only policy/scale features.

8. What connects whom: co‑selling, composability, and buyer patterns

Even without explicit partnership press releases in every row, the summaries show clear co‑occurrence patterns you can exploit for diligence and GTM:

Composability clusters

Serving + vector + evals: Together AI, Fireworks, and Baseten frequently show up alongside Chroma and ClickHouse, and also alongside eval/monitoring vendors like Evidently, Cleanlab, and Fiddler. Customers want a reference path from “index data → answer questions → measure quality”.
Feature stores + orchestration: Featureform and Hightouch news often rhymes with Temporal/Seldon notes. Enterprises are knitting “data definitions” to “safe rollouts” because it prevents training/inference skew.
Developer tools + platforms: Cursor/Replit appear in the same contexts as inference platforms and vector stores. Developer adoption pulls infra behind it.

Buyer patterns:

Data teams pick activation/feature/observability.
Platform/infra teams pick inference control planes and GPUs.
Security & compliance sign off on eval/governance.

Getting traction requires mapping to all three. If you miss one, deals slow down. The news hints at this via repeated mentions of governance features in what used to be pure dev tools.

Risks across clusters

If a vendor in the cluster stumbles (e.g. a vector index bug or a model provider rate-limit change), the whole bundle looks shaky. That’s why multi‑vendor “switchability” is prized by customers and marketed heavily by platforms.

9. What could break: the risk ledger

Based on the themes and dependencies that recur in this week’s movements, here’s the compact risk ledger to look into:

Vendor lock‑in → countered by “bring-your-own X”.
Customers fear lock‑in to one GPU cloud, one model, or one vector database. Startups win when they accept customer‑owned keys, support multiple clouds, and support “bring your own embeddings”. The frequent mentions this week of portability are not fluff. They’re how deals get unstuck.

Latency SLOs in multi-hop agents.
Agents are chaining calls to models, tools, and search/vector backends. Any platform that can show predictable p95/p99 under load will beat faster single‑hop competitors in production. This obsession with end‑to‑end latency shows up again and again in the inference product blurbs.
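
To see why multi-hop chains put pressure on tail latency, here is a back-of-the-envelope Python sketch. It assumes each hop independently finishes under its own p99 threshold 99% of the time. Real hops are often correlated, so treat the numbers as illustrative only.

```python
def chance_all_hops_fast(per_hop_quantile, hops):
    """Probability that every hop in a chain lands under its quantile
    threshold, assuming hops are independent (a simplifying assumption)."""
    return per_hop_quantile ** hops

for hops in (1, 5, 10, 20):
    share = chance_all_hops_fast(0.99, hops)
    print(f"{hops:>2} hops: {share:.1%} of requests avoid every hop's slow tail")
```

Under this assumption, a 10-hop agent avoids every slow tail only about 90% of the time, so roughly one request in ten hits at least one hop's worst 1%. That is the arithmetic behind preferring predictable tails over a fast single hop.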

Data drift and eval rot.
Evals built in a rush go stale. The monitoring/eval vendors covered this week are pushing scheduled tests, data drift alerts, and explanation tooling because customers keep getting burned by silent regressions post‑launch. If a vendor cannot prove a path to fresh evals, risk rises quickly in regulated accounts.

Supply chain fragility.
GPU roadmaps, driver updates, and capacity crunches trickle down into every platform update. Several news items allude to capacity planning and burst handling. Assume supply remains choppy and architect for “burst elsewhere” rather than “burst nowhere”.

10. Investment implications and what to watch next

Where the puck is going (near term):

Inference middleware consolidation. Expect a few of Together/Fireworks/Baseten/Modal/Replicate‑like platforms to separate from the pack by owning the messy stuff: quotas, failover, per‑tenant fairness, eval‑gated deploys, and clean pricing surfaces. Watch who publishes migration guides and “n‑model routing” case studies. That’s a tell for real multi‑tenant usage.
Feature platforms as the data contract. Featureform‑style definitions that live across training and inference will become the “API contract” between data, ML, and app teams. The more those definitions plug directly into reverse ETL (Hightouch) and workflow engines (Temporal/Seldon), the stickier they get.
SQL‑first vector becomes the default in the enterprise. Chroma keeps winning in fast‑moving teams, but the news suggests ClickHouse / PlanetScale‑style “stay in SQL” is comforting to buyers. Expect more “native vector” stories from cloud databases you already know.

How this affects the broader set of infra startups:

If you’re up‑stack (apps with AI in the product): your cost and reliability now depend on which inference control plane you pick and whether it plays nicely with your vector store and eval pipeline. Choose for SLOs, not just price.
If you’re mid‑stack (platforms and orchestration): your growth depends on how well you make devs productive and how easy you make audits. Tie evals and governance directly into deploys.
If you’re down‑stack (data infra): your advantage is giving teams one mental model (ideally SQL) while adding vector, semantics, and feature serving without a zoo of services.

Two contrarian notes from this week’s patterns:

Open source is necessary but no longer sufficient for distribution in this segment. The companies making it work are pairing open source software with ruthless focus on enterprise‑only governance/scale.
The best “AI infra” pitch is now a “less to operate” pitch. Product blurbs that landed hardest in this week’s news were ones that removed a system (or hid it) rather than added a shiny new one.

Watchlist of concrete signals derived from repeated themes:

Which serving platforms publish customer SLOs that include multi‑model routing and guardrails.
Which feature/activation vendors ship native type systems and lineage you can enforce during deploys.
Which vector/database projects demonstrate consistent p95 under mixed OLTP (online transaction processing) + vector workloads.
Which dev tools (Cursor/Replit/Lightning) become first‑class launch surfaces for evals and governance, not just assistants.
Which GPU clouds (Lambda) and inference engines (Modular/Groq) prove they can absorb model churn without breaking compatibility.

Closing take

This week’s infra market points to fewer “we built X from scratch” announcements and more “we made X reliable and swappable” updates. The connective tissue across data, inference, and workflows is thickening. And that favors startups that (a) sit on a control point and (b) play well with their neighbors.

If you’re picking winners, pick the ones that make AI boring in production. These companies partner across the stack without drama and publish the kinds of SLOs + migration stories that soothe enterprise buyers’ nerves.

Company Deep Dive #4 - VERTIV

Prateek Joshi — Thu, 07 Aug 2025 16:02:13 GMT

Today we’ll be diving into the company that provides power and cooling systems to giant AI clusters that have been popping up.

1. Why bother with Vertiv?

Vertiv makes the gear that keeps the world’s data centers and telecom networks alive. This includes backup power units, high-capacity cooling, prefab data-center modules, and the service teams who install and maintain them. It sits at the beating heart of the cloud: if servers are the brain, Vertiv is the circulatory and respiratory system.

Over the past two years, Vertiv’s sales have leapt from roughly $8 billion to almost $10 billion. And profits have grown even faster because management squeezed costs and raised prices. At the same time its share price has more than tripled. That dramatic rise leaves investors wondering: Is there still room to run or has the market already banked all the good news?

Our central question is whether Vertiv can keep expanding sales and margins at today’s pace for another 2 years, long enough to justify the lofty price tag while fending off supply hiccups, tariffs, and an increasingly ambitious set of rivals.

2. The Market Backdrop: Why “Plumbing” Suddenly Matters

Life on the internet is moving toward ever-heavier workloads: gigantic AI models, real-time video, immersive gaming, factory automation, and 5G-powered edge devices. All that digital traffic must be processed, stored, and cooled somewhere. Analysts now peg the market for data-center power and cooling at more than $50 billion a year, expanding close to 10% annually.

A big slice of that growth is tied to “AI mega-campus” projects. A single AI training hall can draw as much power as a small town. That means massive uninterruptible power supplies (UPS), new liquid-cooling systems that look more like radiator circuits than air conditioners, and racks custom-designed to cram more servers into less real estate. Vertiv has spent heavily on R&D in just these areas: bigger UPS cabinets, coolant-distribution units capable of handling kilowatts per rack, and turnkey prefab rooms that arrive on site ready to plug in.

Outside the headline-grabbing AI boom, there are quieter but still powerful growth engines: telecom operators are adding battery-backed power plants to 5G tower sites, hospitals and factories are swapping older inefficient power equipment for newer gear to cut energy bills, and emerging markets are building first-generation data centers as they digitize. Even if the AI frenzy cooled, these parallel streams would provide a wider foundation for Vertiv’s long-term demand.

3. What Vertiv Actually Sells and Why Customers Pay Up

We can think of Vertiv’s catalog in 3 buckets:

Power management. Giant UPS units, precision switchgear, and DC plants deliver clean electricity when the grid flickers.
Thermal management. Traditional chilled-air systems plus newer liquid-cooling rigs siphon heat from servers that would otherwise melt.
Integrated solutions and services. Prefabricated data-center pods, monitoring software, and a 3,500-engineer service army that keeps everything running.

Customers are hypersensitive to downtime. Minutes of outage can cost a bank millions in lost trades or a streaming company a wave of angry subscribers. That fear makes reliability and rapid service worth paying for. Vertiv leans on its global service footprint of 200 service depots and on decades-old brand names like Liebert to promise peace of mind.

Once a facility is designed around Vertiv gear, it’s really hard to swap that gear out. Doing so usually involves rewriting blueprints, climbing learning curves, and living through a risky cut-over. Those high switching costs help Vertiv hang on to customers far longer than the typical hardware vendor.

4. Financial Health

Vertiv’s recent numbers read like a turnaround thriller:

Sales: up 33% year-on-year in the latest quarter, thanks mainly to gigantic data-center orders in North America.
Profit margins: roughly 20% of sales now drops to operating profit, up from less than 10% only three years ago.
Cash: free cash flow topped $1 billion last year and management expects even more this year, despite heavier capital spending to open new lines outside tariff-hit regions.
Debt: net leverage is now well under one turn of EBITDA, giving the balance sheet fresh air.

Because demand is strong and service work is recurring, Vertiv’s backlog has ballooned to more than $8 billion. This is about 9 months of sales already spoken for. That backlog is a safety net if orders ever slow.

So what’s the key issue here? Investors have already priced in a lot of perfection. Vertiv shares now sell at more than 30x next year’s estimated earnings. This is richer than older industrial peers like Schneider Electric or Eaton. To keep the multiple aloft, Vertiv must keep growing in the high-teens and nudge margins toward 25%. This is a goal that management has sketched for 2029. Any stumbles like tariffs biting harder, a supplier shortfall, or a pause in cloud spending could compress the valuation quickly.

5. Competitive Landscape: Who Else Wants the Prize?

The power-and-cooling world is an oligopoly. Schneider Electric, Eaton, Legrand, and Huawei all sell similar racks of hardware. Schneider in particular jostles with Vertiv for the global top spot. The difference is focus: Schneider is broader, with offerings in mining, smart buildings, and general electrification, while Vertiv is laser-focused on digital infrastructure.

New entrants tend to be specialists, such as a German company building ultra-efficient chillers or a Taiwanese company that makes low-cost rack PDUs. They can win slices of the pie but rarely displace the incumbents across an entire site. Still, niche innovators force Vertiv to keep investing. If a startup invents a cooling system that reduces water use by half, hyperscalers might demand it, and Vertiv would need a response.

Huawei looms as the unpredictable wild card. It can undercut prices in markets less sensitive to US trade restrictions, especially in Asia and Africa. That could erode Vertiv’s share in price-centric projects, though many Western customers shy away from Huawei for security reasons.

6. Why Infra Startups Should Care

Vertiv’s fortunes ripple across the infrastructure startup world in several ways:

Capital Flows. When a public market giant like Vertiv soaks up attention and cash, investors often group the whole critical-infrastructure domain together. A roaring Vertiv share price can buoy sentiment for startups that sell monitoring software, predictive maintenance analytics, edge micro-datacenters, or next-gen cooling fluids. The logic is that if the incumbent is thriving, the ecosystem must be healthy. Conversely if Vertiv disappoints, VCs may turn cautious on smaller companies who ride the same wave.

Channel Dynamics. Many infra startups piggyback on Vertiv’s distribution channels. For example, a young company selling AI-driven battery analytics might standardize its API to plug into Vertiv’s service toolchain. If Vertiv accelerates adoption of such add-ons, the startup gains exposure to hundreds of data center customers overnight. But if Vertiv doubles down on proprietary in-house software, those partner slots could shrink.

Technology Standards. Vertiv’s design choices set de-facto standards. If it pushes a particular liquid-coolant chemistry or rack-level power bus, suppliers of sensors / pumps / coolant filters must follow. Startups that bet on competing standards could be frozen out.

Supply-Chain Dependencies. Vertiv commands vast volumes of batteries, power semiconductors, and copper busbars. A shortage or price spike (like the capacitor crunch of 2021) reverberates downstream. Hardware startups often rely on the same component suppliers but lack Vertiv’s buying power, so they can be squeezed on cost or lead times if Vertiv locks up supply.

Acquirer Appetite. Vertiv has a history of bolt-on buys: switchgear maker E&I, and most recently a high-density rack company. Its healthy cash flows mean more buying power. For venture-backed hardware and software companies, Vertiv could be a natural exit. The better Vertiv’s stock performs, the more coins in its acquisition purse. If the multiple contracts, M&A budgets tighten and exit windows may narrow.

Roughly speaking, half of the infra startups focused on physical data-center technology (e.g. power modules, advanced cooling, microgrids) swim in Vertiv’s lane. They either integrate with Vertiv, compete in a niche, or hope to be bought. A slowdown in Vertiv’s ordering pattern or a stumble in its capacity ramp would ripple through component suppliers and service providers that power these young companies.

7. Risks, Dependencies, and Early Warning Signs

Tariffs and Trade. U.S. tariffs on Chinese-made power gear already shave a couple of percentage points off Vertiv’s margins. If trade friction escalates (e.g. a new tariff tranche on batteries), then the costs will rise again. Startups that depend on the same Chinese contract manufacturers would feel similar pain, perhaps without the leverage to pass cost through.

Cloud CAPEX Cycles. The biggest driver of Vertiv’s recent growth is hyperscale data center spending. Cloud companies tend to feast and then digest. If they declare a digestion phase in late 2025, Vertiv’s order book could deflate. Early warnings: sharp drops in cloud CAPEX guidance, rising vacancy in wholesale colocation, or long delivery lead times suddenly shortening.

Execution Strain. Vertiv is sprinting to open new capacity outside tariff zones, a complex global shuffle. Any hiccup like factory delays, quality slips, or labor shortages could hurt margins or delay shipments. Watch margin trends, one-off charge announcements, or slipping delivery times reported by customers.

Technology Leapfrogging. Suppose a cooling-tech startup proves its immersion system cuts total energy by 30% at scale. Hyperscalers might skip incremental upgrades and jump directly to the new technology, denting demand for Vertiv’s conventional line. Keep an eye on pilot projects: if multiple cloud operators roll out a startup’s solution in production, that’s a sign of pending disruption.

Interest Rates and Financing. Data center projects are capital intensive. Higher rates raise hurdle returns and can push projects out. A sustained jump in long-term yields might freeze marginal projects, trimming Vertiv’s growth. It would likewise dampen investor appetite for capital-heavy infra startups.

8. Conclusion: A Strong Company at a Fairly Full Price

Vertiv is riding megatrends that look durable: ever-denser compute, the race for AI, and the need to keep digital lights on in every industry. Management has fixed its earlier cost woes, reshaped the supply chain, and built a huge backlog that cushions near-term risk. Cash flow is gushing and debt is tame. And the company still has multiple levers, such as bigger service contracts, bolt-on acquisitions, and product refreshes, that can help grow earnings even if top-line growth cools.

Yet the market is no longer overlooking Vertiv. After a 300% rally, the stock already bakes in several more years of robust growth and smooth execution. If the AI build-out keeps humming and Vertiv pushes margins to the mid-twenties early, today’s valuation can be justified and perhaps nudged higher. But if hyperscalers tap the brakes or tariffs bite harder, the shares could quickly return to earth.

For infrastructure startups, Vertiv’s trajectory is more than a stock-market curiosity. It shapes investor sentiment, sets technical standards, and dictates parts of the supply chain they rely on. A healthy Vertiv lifts the entire critical-infrastructure ecosystem. A stumble would ripple across component makers, cooling innovators, and service-analytics vendors alike.

9. Bottom line

Vertiv is a high-quality company with strong tailwinds and capable leadership, but its stock trades as if the next two years will unfold without a hitch. That may or may not come to pass.

Investors should weigh their risk tolerance: owning Vertiv now means betting that the AI infrastructure boom lasts longer and runs hotter than the skeptics think, and that the company’s operational tune-up holds under even greater strain. If those bets sound comfortable, stay in the saddle. If not, waiting for a more forgiving entry point might be the wiser move.
