Sector Deep Dive #3: INFERENCE CLOUD PLATFORMS
Companies that build, enable, and sell inference platforms for AI applications
1. What inference cloud platforms are and why they matter now
“Inference” is the phase where trained AI models answer requests in the wild: they classify, summarize, code, chat, recommend, or generate images and video. Inference cloud platforms are the providers that run this at scale and expose it as APIs or managed endpoints. The group spans:
Hyperscalers: AWS Bedrock, Azure OpenAI, Google Vertex AI
Model companies with hosted APIs: OpenAI, Anthropic, Cohere, Mistral, xAI
Specialized AI clouds: CoreWeave, Lambda, Together AI, Fireworks AI, Modal, Replicate, Anyscale
Open-source inference engines (NVIDIA TensorRT-LLM, vLLM, Triton, Ollama) that these platforms increasingly use under the hood
Two things are happening at once. First, end-user demand is rising as enterprises shift from pilots to production: companies are redesigning workflows and seeing cost reductions in most functions where they actually deploy AI. Second, the supply side is scaling dramatically. Microsoft guided to a record ~$30 billion of capex for the current quarter (Jul 30, 2025), and Alphabet lifted 2025 capex plans to ~$85 billion, largely for AI data centers.
Taken together, that means inference capacity (not model training alone) will drive value creation, especially over the next two years as real usage compounds into steady, billable traffic. AMD’s leadership has been explicit that inference demand is set to outgrow training, with CEO Lisa Su calling out rapid acceleration.
2. Demand outlook: pricing is falling while usage rises
Developers can now buy high-end model outputs for a fraction of last year’s cost. OpenAI’s GPT-4o launched on May 13, 2024 at roughly half the price of GPT-4-Turbo and with higher rate limits. Google followed with large price cuts to Gemini 1.5 Pro (announced Sep 24, 2024; effective Oct 1, 2024) and later rolled newer 2.x models into its lineup with low-cost “Flash” tiers.
Cohere and Mistral publish similarly aggressive pricing for their command-and-reasoning families. On the ultra-low-cost end, DeepSeek’s R1 reasoning API lists input at roughly $0.55 per million tokens and $2.19 for output.
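To make those rates concrete, here is a back-of-envelope calculation in Python using the DeepSeek R1 list prices quoted above; the request shape (2,000 input tokens, 500 output tokens) is an illustrative assumption, not a benchmark.

```python
# Back-of-envelope request costs at the DeepSeek R1 list prices quoted above
# ($0.55 per 1M input tokens, $2.19 per 1M output tokens).
INPUT_PRICE_PER_M = 0.55    # USD per million input tokens
OUTPUT_PRICE_PER_M = 2.19   # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the per-million-token rates above."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

cost = request_cost(input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f} per request")                        # ~$0.0022
print(f"${cost * 1_000_000:,.0f} per million requests")  # ~$2,195
```

At those rates, token spend barely registers during pilots; it only becomes a budget line once traffic turns into steady, high-volume production, which is exactly the shift the demand data points to.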
Lower unit prices haven’t slowed demand. If anything, they invite more usage and new types of applications (voice agents, video generation, background automation). Inference requests are becoming embedded in daily processes rather than sporadic “pilot” bursts.
3. Market structure: who’s selling what
Hyperscalers:
AWS, Microsoft, and Google anchor the managed-platform tier: wrapping multiple models, safety filters, observability, and enterprise controls under one bill and SLA. AWS Bedrock and Azure OpenAI have achieved FedRAMP High in their government clouds, and Google secured FedRAMP High for selected components like Vertex AI Vector Search. This matters for regulated demand, where compliance can be the gating factor.
Model API companies:
OpenAI, Anthropic, Cohere, Mistral, and xAI expose models directly and are often available through the hyperscalers too. Current public pricing pages let builders compare per-million-token rates and pick “fast-cheap” or “smart-expensive” options.
Specialized AI clouds and serverless GPU providers:
CoreWeave, Together AI, Fireworks AI, Modal, Replicate, and others focus on cost-efficient throughput, burst capacity, and developer experience. CoreWeave’s S-1 (filed Mar 3, 2025) revealed $1.92 billion of 2024 revenue, but also heavy customer concentration (Microsoft at ~62% in 2024 per S-1 analysis and press). It financed expansion with a $7.5 billion debt facility led by Blackstone and Magnetar on May 17, 2024.
Together AI raised a $305 million Series B on Feb 20, 2025 and crossed $100 million annualized revenue around that time, per Bloomberg/Crunchbase reporting. Fireworks AI reported rapid ARR growth in 2025 and is reportedly exploring a raise at ~$4 billion valuation. Modal and Replicate illustrate the “serverless GPU” model that charges per-second or per-GPU-hour for inference runs, with public pricing examples for L-series GPUs.
Open-source inference engines:
Under the hood, many platforms are converging on a few high-performance engines. vLLM (57k+ GitHub stars as of Sep 7, 2025) and NVIDIA’s TensorRT-LLM (actively releasing throughout 2024–2025) are two of the most visible, while NVIDIA Triton remains common as a serving runtime. On the “local” side, Ollama’s explosive adoption (152k+ stars as of Sep 6–7, 2025) signals a strong DIY and edge-inference movement. This is often a precursor to enterprise demand for managed versions.
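For a sense of what that local, DIY tier looks like in practice, here is a minimal sketch that calls an Ollama server over its local HTTP API. It assumes Ollama is already running on its default port with a model pulled; the model name is a placeholder.

```python
# Minimal local-inference call against an Ollama server assumed to be running on its
# default port (11434) with a model already pulled. The model name is illustrative.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1",   # assumes this model has been pulled locally beforehand
    "prompt": "Summarize why inference costs fall as batch sizes rise.",
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```

The same pattern scales down to a laptop or up to an on-prem GPU box, which is why local serving so often precedes a request for the managed, supported version.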
China and the rest of the world:
In China, Baidu, Alibaba (Qwen), ByteDance (Doubao), and startups like DeepSeek are pushing aggressive capability-to-cost curves. Baidu announced model upgrades and price cuts on Apr 24–25, 2025. Alibaba publishes granular Qwen API pricing. ByteDance promotes Doubao access via Volcano Engine. DeepSeek lists low per-million-token pricing for its R1 reasoning model.
4. Economics in plain terms: what drives costs and margins
Inference costs scale with three levers (a rough cost sketch follows the list):
compute per request (model size, precision, and decoding strategy)
utilization (keeping GPUs busy with batching and scheduling)
data movement (egress and inter-AZ (availability-zone) / region traffic)
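A rough sketch of how the first two levers translate into cost per million output tokens (the third lever, data movement, is priced separately below). Every number here is an illustrative assumption, not a quote from any provider.

```python
# Illustrative serving-cost math for the first two levers: compute per request and
# utilization. All figures are assumptions for the sketch, not published prices.
def compute_cost_per_million_tokens(
    gpu_hour_price: float,     # what an hour of the GPU costs
    tokens_per_second: float,  # sustained decode throughput on that GPU
    utilization: float,        # fraction of the hour spent on billable work
) -> float:
    useful_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hour_price / useful_tokens_per_hour * 1e6

# A hypothetical $2.50/hr GPU decoding 1,500 tokens/s:
print(compute_cost_per_million_tokens(2.50, 1500, 0.4))  # ~$1.16 per 1M tokens at 40% utilization
print(compute_cost_per_million_tokens(2.50, 1500, 0.8))  # ~$0.58 per 1M tokens at 80% utilization
```

Doubling utilization halves serving cost per token without touching the hardware, which is why scheduling and batching get so much engineering attention.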
Providers increasingly use FP8/FP4 kernels, paged attention, speculative decoding, and “in-flight batching” to boost throughput, which is exactly what TensorRT-LLM and vLLM are optimized for.
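As a concrete example of what that batching looks like from the developer side, here is a minimal vLLM sketch: submitting many prompts at once lets the engine schedule and batch them internally. The model name and sampling settings are placeholders, and feature support varies by model and vLLM version.

```python
# Minimal vLLM sketch: many prompts submitted together are batched "in flight" by the
# engine to keep the GPU busy. Model name and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumes the weights are accessible
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize ticket #{i} in one sentence." for i in range(64)]
outputs = llm.generate(prompts, params)  # the engine schedules and batches these internally

for out in outputs[:3]:
    print(out.outputs[0].text.strip())
```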
At the cloud-network level, egress fees and inter-AZ traffic still matter for TCO (total cost of ownership). In 2024, Google removed exit fees for customers migrating off its cloud (Jan 11–12, 2024), and AWS followed in March 2024. But normal egress still applies for day-to-day operations. Typical AWS data transfer out runs roughly $0.09/GB for the first 10 TB/month in many regions, with inter-AZ charges around $0.01/GB.
Put practically: bandwidth adds up if an application streams images or video from inference outputs or moves embeddings between services.
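A quick worked example at the ~$0.09/GB rate cited above, with purely illustrative traffic assumptions:

```python
# Worked example at the ~$0.09/GB egress rate cited above; traffic figures are illustrative.
requests_per_day = 1_000_000
image_kb_per_response = 200  # an app streaming modest image outputs to users
gb_per_month = requests_per_day * image_kb_per_response / 1e6 * 30  # KB -> GB, 30 days
egress_cost = gb_per_month * 0.09
print(f"{gb_per_month:,.0f} GB/month -> ${egress_cost:,.0f}/month in egress alone")
# ~6,000 GB/month -> ~$540/month, before any inter-AZ traffic between services
```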
That’s why many inference platforms co-locate vector search, caches, and storage with serving to cut cross-service data charges. It’s also why some startups prefer specialized AI clouds where pricing bundles compute, storage, and networking tightly.
5. Reliability and compliance: what enterprises actually ask for
Large buyers care about uptime SLAs, security attestations, and government cloud options. That said, outages do happen. OpenAI reported notable incidents in 2023–2024 (including a DDoS-related disruption on Nov 8, 2023 and a service impairment on Jun 4, 2024), which buyers often cite when asking about multi-vendor failover and local fallback.
For IT leaders, this translates into a simple rule:
pick at least two model providers (direct or via a hyperscaler) and deploy a fallback path on an open-source engine (vLLM/TensorRT-LLM) where feasible.
This reduces outage risk and allows cost routing as prices change.
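A minimal sketch of that rule, assuming the OpenAI-compatible chat interface that many hosted providers and vLLM’s own server expose; the base URLs, environment variables, and model names below are placeholders, not real endpoints.

```python
# Sketch of the two-providers-plus-local-fallback rule using an OpenAI-compatible
# chat interface. Base URLs, env vars, and model names are placeholders.
import os
from openai import OpenAI

ROUTES = [
    # (client, model) pairs tried in order: primary vendor, secondary vendor, local vLLM
    (OpenAI(api_key=os.environ["PRIMARY_KEY"]), "primary-flagship-model"),
    (OpenAI(api_key=os.environ["SECONDARY_KEY"],
            base_url="https://api.secondary-vendor.example/v1"), "secondary-model"),
    (OpenAI(api_key="unused", base_url="http://localhost:8000/v1"),  # vLLM's OpenAI-compatible server
     "local-distilled-model"),
]

def chat_with_fallback(messages):
    last_error = None
    for client, model in ROUTES:
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            return resp.choices[0].message.content
        except Exception as err:  # on outage or rate limit, fall through to the next route
            last_error = err
    raise RuntimeError(f"All routes failed: {last_error}")

print(chat_with_fallback([{"role": "user", "content": "Ping"}]))
```

The same routing table doubles as a cost lever: reorder the routes when a provider cuts prices and traffic follows.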
6. Competitive dynamics: price wars meet platform bundling
Price cuts are not just marketing. They pressure everyone (especially startups) to improve GPU utilization and lower serving costs. Google’s 64%/52% cuts on Gemini 1.5 Pro in late 2024 set a precedent. OpenAI’s GPT-4o (May 13, 2024) cut price and boosted rate limits. xAI’s Grok 4 and DeepSeek R1 added low-cost reasoning options in 2025.
Hyperscalers also bundle: the model call is one API, but buyers are really purchasing governance, observability, private networking, enterprise auth, and in some cases FedRAMP posture. That bundle is hard for startups to match, which is why specialized AI clouds differentiate on raw performance, GPU availability, or easy developer workflows.
CoreWeave is the bellwether for specialized AI clouds. Its S-1 showed explosive revenue growth to $1.92 billion in 2024 alongside heavy customer concentration. Those figures show how capital intensive inference clouds are and how much their fate can hinge on a few anchor customers.
7. How this connects to infrastructure startups and how many are affected
Where the dependency lies: Even if a startup doesn’t sell an “inference API”, much of the AI infra startup stack is downstream of inference demand:
Model-serving and orchestration (BentoML/Ray Serve/Anyscale, vLLM, Triton): revenue aligns with request volumes and concurrency, so as inference scales, these grow (a minimal serving sketch follows this list). Anyscale’s 2025 partnerships signal a continued push toward managed Ray-based serving.
Data layer (vector databases, feature stores): more inference means more embeddings, more caching, and more retrieval.
Observability/security (guardrails, evals, tracing): production inference requires red-teaming, safety checks, and run-time monitoring.
Networking and acceleration (NICs, smart switches, CUDA/ROCm kernels): token throughput and tail-latency are network-sensitive.
Edge/enterprise local (Ollama-style local serving, on-prem Triton/TensorRT-LLM): regulated workloads and cost control push some inference to customer hardware.
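For a flavor of the serving/orchestration layer in the first item, here is a minimal Ray Serve sketch; the replica count is arbitrary and the “model” is a stub standing in for a real engine such as vLLM or TensorRT-LLM.

```python
# Illustrative shape of the serving/orchestration layer: a Ray Serve deployment whose
# replicas handle concurrent requests. The "model" here is a stub for the sketch.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class SummarizeDeployment:
    def __init__(self):
        # In practice this would load a model or wrap a vLLM/TensorRT-LLM engine.
        self.prefix = "summary: "

    async def __call__(self, request: Request) -> str:
        payload = await request.json()
        return self.prefix + payload.get("text", "")[:200]

app = SummarizeDeployment.bind()
serve.run(app)  # deploys the app and serves HTTP on Ray Serve's default port (8000)
```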
A rough share of affected infra startups:
Based on public venture reports showing AI absorbing an outsized share of VC dollars in H1 2025 (EY: ~$49.2 billion into gen-AI in H1 2025, already above full-year 2024) and the visible skew toward application-layer deals, at least half (and plausibly 60–70%) of AI infrastructure startups have their fortunes tied to inference growth, whether through direct revenue or adjacent data/observability spend.
This is an estimate, not a hard count. But it matches what’s seen in fund flows and market maps: most infra projects today pitch either “cheaper, faster inference” or “better pipelines and retrieval for inference.”
Correlation and dependency:
When hyperscalers raise capex and roll out enterprise controls, application teams are more likely to ship production features, which directly lifts inference calls and, in turn, demand for the whole downstream stack. Conversely, if a big buyer slows deployments or centralizes on one provider for cost reasons, adjacent infra (observability, vector DBs, orchestration) may see delayed projects.
8. Risks over the next 24 months and what to watch
(a) Supply chain and power constraints
GPU allocations are still tight, and data-center interconnect queues are long in power-constrained regions. Big Tech’s capex is surging (Microsoft ~$30 B this quarter, Alphabet ~$85 B for 2025), but a lot of that converts into capacity only when land, power, and cooling are ready. Watch for delays tied to grid interconnects and specialized high-density builds.
(b) Price compression
The fall in per-token pricing is likely to continue, with new “reasoning” models (DeepSeek R1, xAI Grok 4) pushing price-performance down further. Platforms must keep utilization high (via batching, caching, or model distillation) to avoid margin squeeze.
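Caching is the simplest of those levers to illustrate. The sketch below is a naive exact-match response cache; production systems lean on prefix or semantic caching, but even this shape avoids paying for inference twice on identical prompts.

```python
# Naive exact-match response cache: the crudest version of the "caching" lever named above.
# Production systems typically use prefix or semantic caching; this only skips recompute
# for literally identical (model, prompt) pairs.
import hashlib

_CACHE: dict[str, str] = {}

def _key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_generate(model: str, prompt: str, generate_fn) -> str:
    k = _key(model, prompt)
    if k not in _CACHE:
        _CACHE[k] = generate_fn(model, prompt)  # only pay for inference on a cache miss
    return _CACHE[k]
```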
(c) Outages and concentration
Incidents at a single model provider can knock out large fractions of traffic for hours; the OpenAI incidents in Nov 2023 and Jun 2024 are a reminder. SRE teams will press for multi-provider routing and local fallbacks.
(d) Compliance and data locality
The good news is that FedRAMP High and similar authorizations are arriving for major services. The challenge is that sensitive workloads still need private networking, key management, and clear data-use policies. Delays in rolling out “enterprise safeguards” can stall big deals.
(e) Macro and investor sentiment
Inference clouds are capital intensive. Public market reception to specialized AI clouds has been mixed. If public comps wobble, late-stage private rounds could slow, impacting the partner ecosystem.
9. How to think about winners
Platforms that control utilization
Winners will squeeze more requests per GPU hour. The technology stack is clear: engines like TensorRT-LLM and vLLM, plus tricks like FP8/FP4 quantization and speculative decoding, drive throughput without hurting quality. Those gains are hard to reverse and compound every quarter.
Platforms that own regulated channels
FedRAMP High and sector certifications unlock budgets that smaller vendors can’t access quickly. AWS, Microsoft, and Google’s moves in 2024–2025 are strategic moats in US public sector and highly regulated industries.
Platforms with balanced customer bases
Revenue concentration is a risk in this subsector. The more diversified the top customers and geographies, the sturdier the cash flows through cycles.
Global low-cost challengers
The China ecosystem (Baidu, Alibaba, DeepSeek, ByteDance) is pushing cost down dramatically. While export controls and data residency limit cross-border usage, the pricing pressure they exert globally will influence buyer expectations.
10. Practical guidance for investors and operators
For venture investors
Ask any infra startup how they (a) keep GPUs hot (utilization), (b) cut token compute per request (quantization, distillation, caching), and (c) minimize bandwidth charges (co-location, compression, RAG locality). You’re looking for companies able to maintain gross margins as price per token keeps sliding.
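One way to frame that diligence question is a simple unit-economics check; the prices and serving costs below are illustrative assumptions.

```python
# Diligence-style unit-economics check: can gross margin survive another round of price
# cuts without a matching utilization gain? All numbers are illustrative assumptions.
def gross_margin(price_per_m_tokens: float, serving_cost_per_m_tokens: float) -> float:
    return 1.0 - serving_cost_per_m_tokens / price_per_m_tokens

today = gross_margin(price_per_m_tokens=2.00, serving_cost_per_m_tokens=0.60)      # 70%
after_cut = gross_margin(price_per_m_tokens=1.00, serving_cost_per_m_tokens=0.60)  # 40%
after_cut_better_util = gross_margin(1.00, 0.30)  # 70% again, if serving cost halves too
print(f"{today:.0%} -> {after_cut:.0%} -> {after_cut_better_util:.0%}")
```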
Validate multi-provider integrations. If a startup depends on one model vendor or one cloud region, treat that as concentration risk, much like you would a single large customer.
Finally, watch the “serverless GPU” abstraction layer. Modal and Replicate show that per-second billing and instant scale can beat reserved instances for bursty workloads. Adoption there could shift where “platform” margins accrue (to the servers-on-demand layer).
For corporate buyers and product teams
Lock in at least two model paths (via your cloud of choice plus a direct API) and a local fallback using vLLM or TensorRT-LLM for mission-critical flows. Budget for bandwidth where outputs are large (images, video) and keep RAG stores co-located with serving to avoid inter-AZ/region fees. Remember that migrating off a cloud may waive exit fees now, but normal egress still applies to daily operations.
For founders at the application layer
Lean into cheaper “flash” tiers for non-critical tasks and reserve expensive reasoning models for high-value steps. Many teams are carving workload graphs so that 70–90% of tokens go to low-cost models and only the “hard” paths hit premium models. That keeps unit economics sane as your user base grows.
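A minimal sketch of that routing pattern, with a stand-in heuristic where real teams would use evals, task labels, or a small router model; the tier names are placeholders, not real model IDs.

```python
# Sketch of the workload-splitting pattern above: send most traffic to a cheap "flash"
# tier and reserve the expensive reasoning tier for hard paths. The classifier is a
# stand-in heuristic and the tier names are placeholders.
CHEAP_TIER = "flash-small-model"
PREMIUM_TIER = "reasoning-large-model"

def pick_model(task: dict) -> str:
    hard = (
        task.get("requires_tool_use", False)
        or task.get("expected_output_tokens", 0) > 2_000
        or task.get("risk", "low") == "high"
    )
    return PREMIUM_TIER if hard else CHEAP_TIER

tasks = [
    {"name": "classify support email", "expected_output_tokens": 50},
    {"name": "draft legal clause", "risk": "high"},
]
for t in tasks:
    print(t["name"], "->", pick_model(t))
```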
11. What could change the call (24-month horizon)
Positive surprises
A step-change in throughput (e.g. widespread FP4 adoption or a new serving breakthrough) that halves cost per request again. Suddenly many more use cases become profitable. Keep an eye on TensorRT-LLM and vLLM releases.
Faster regulatory certifications (FedRAMP High/DoD IL-5 for more services) unlocking pent-up demand in government and healthcare.
Public-market validation of specialized AI clouds (smoother IPO outcomes and rising multiples) that lowers the sector’s cost of capital and speeds build-outs.
Negative surprises
Power or interconnect delays slow data-center rollouts, creating capacity gaps during peak demand. The capex is committed, but lead-times can slip.
Major, prolonged outages trigger widespread buyer mandates for on-prem inference, temporarily shifting spend away from managed cloud endpoints.
An extended price war that compresses gross margins faster than utilization improvements can compensate. Particularly painful for smaller providers without proprietary hardware access.
12. How many infrastructure startups will this affect and how?
Estimated share affected:
Given H1-2025 venture flows (~$49.2 billion into generative AI in H1 alone) and the clear skew of infra pitches toward serving, orchestration, vector/RAG, and observability, a reasonable estimate is that 60–70% of AI infrastructure startups are directly tied to inference adoption curves either as primary revenue (serving) or adjacent spend (data/observability). That range reflects uncertainty: public sources break out “AI” as a whole, not “inference infra” specifically.
Correlation pathways:
Capex → capacity → price: Hyperscaler capex raises capacity. More capacity tends to push per-token prices down. Lower prices increase app usage. More usage drives infra demand.
Compliance unlock → big-ticket buyers: FedRAMP High or equivalent certifications unlock multi-year contracts. Once signed, these generate steady inference flows that support data and observability partners.
Model competition → routing: As DeepSeek/xAI/Mistral cut costs, app teams start routing workloads by cost/quality. That forces startups to integrate multiple providers and invest in evaluation and guardrails, benefiting orchestration and tooling companies.
Key dependencies:
GPU supply and scheduling tech (TensorRT-LLM, vLLM, advanced schedulers) are foundational. Without them, margins erode.
Network costs determine whether RAG-heavy apps scale profitably. “Exit” fee waivers don’t change daily egress economics.
Customer concentration (CoreWeave-Microsoft) shows platform-level fragility that can cascade to smaller partners.
13. Regional notes: US, Europe, Asia
United States: The US remains the center of gravity for both demand and supply. FedRAMP authorizations and record hyperscaler capex point to continued expansion.
Europe: Regulatory focus on switching costs pushed clouds to waive exit fees, which may encourage multi-cloud inference strategies (EU Data Act and broader scrutiny played a role in the 2024 fee changes).
China: A separate but fast-moving market with intense price competition (Baidu, Alibaba Qwen, DeepSeek, ByteDance). Even if cross-border use is limited, the global effect shows up in buyer expectations about what a “fair” price per million tokens should be.
14. What to monitor (practical checklist for the next 6–24 months)
Capex guidance: Microsoft, Alphabet, Amazon quarterly updates. If spending flattens earlier than expected, expect tighter capacity growth and slower price cuts.
Price changes: OpenAI, Google, Anthropic, Mistral, Cohere, xAI, DeepSeek pricing pages. Track cuts or new “flash/reasoning” tiers.
Engine releases: vLLM and TensorRT-LLM release notes. Watch for features like better batching, quantization, and scheduler upgrades that change GPU economics.
Compliance milestones: New FedRAMP/IL-4/5 authorizations across Bedrock, Vertex AI, Azure OpenAI. These correlate with large RFPs in public sector and highly regulated industries.
Incident history: Model/API outages on vendor status pages and developer forums. See if customers adopt multi-provider routing as standard.
Public comps: Watch CoreWeave and any follow-ons. Stock performance and disclosures can influence late-stage private rounds and M&A appetite across inference tools.
15. Bottom line
Inference cloud platforms are moving from novelty to utility. Prices are trending down, capacity is trending up, and compliance gates are opening. In the next two years, the most value will accrue to platforms that do three things well: (1) keep GPUs highly utilized with advanced serving stacks (vLLM, TensorRT-LLM, Triton-style backends), (2) meet enterprise governance and compliance needs at scale, and (3) diversify customers and geographies to reduce concentration risk.
For venture investors, that creates two attractive pockets: (a) “picks and shovels” that make inference cheaper (serving engines, schedulers, compression, agentic runtimes) and (b) “adjacent infrastructure” that becomes non-optional as inference scales (vector/RAG stores, eval/observability/guardrails, privacy/security layers). For founders and buyers, the operating playbook is: multi-model routing, local fallbacks, and ruthless attention to utilization and bandwidth.
The punchline: the growth of inference will pull a majority of AI infrastructure startups along with it. The exact share is uncertain. But current capex, pricing, and adoption signals make it hard to see a different center of gravity for AI infrastructure in the near term. Keep tracking capex guidance, price sheets, engine releases, and compliance wins. That’s where the next two years of winners will be decided.