Article Information

Category: AI & Automation

Published: May 26, 2026

Author: Chris de Gruijter

Reading Time: 12 min

Microsoft and Uber Hit the Token Wall — What It Means If You Actually Pay the Bill

A post has been making the rounds: "Microsoft and Uber are stopping using Claude, but not for the reason you think." The claim is that Uber blew through its entire annual AI budget by May, and that Microsoft told employees to stop because token costs now exceed their salaries. Then it pivots into "tokens are the new oil" and "this is why it can't be a bubble." Most of it is directionally true and badly distorted at the same time. I pay Claude Code bills every month for a one-person agency (plus the occasional freelancer), so I went and checked the actual reporting. Here is what really happened, what the viral version gets wrong, and what the token economy looks like from the bottom of the pyramid rather than the top.

The viral version versus what actually happened

The short, punchy version is engineered to go viral, so it rounds everything up to the most dramatic phrasing. The real story is more specific — and more interesting, because the specifics are where the lesson lives. Two things are genuinely true: Microsoft is pulling back on internal Claude Code use, and Uber spent its 2026 AI coding budget far faster than planned. Almost everything else in the viral post needs a footnote.

Neither company is "stopping using Claude." Both are reacting to one specific product (Claude Code, the agentic terminal tool) and to one specific pricing scheme: per-token, usage-based billing. That distinction matters, because it is the pricing that broke, not the quality of the tool. If anything, the quality is what caused the problem.

What Microsoft actually did

Microsoft rolled out Claude Code internally in December 2025, starting in its Experiences & Devices division. Within months, token-based billing had consumed the team's annual AI budget, and reporting indicates Microsoft is now cancelling most of those direct Claude Code licenses and steering engineers toward GitHub Copilot CLI instead, with the change landing around June 30. Individual engineers were reportedly spending between $500 and $2,000 a month on tokens.

The "token cost exceeds employee salaries" line is real, but it did not come from a Microsoft memo telling staff to down tools. It came from Bryan Catanzaro, VP of applied deep learning at NVIDIA, describing his own team: the cost of the compute his engineers use is now well beyond the cost of the engineers themselves. That is a startling statement about where the industry is heading — but it is a quote from a chip company executive, not a company-wide Microsoft ban.

Crucially, this is about internal seat licenses, not a strategic divorce. Microsoft's roughly $5 billion investment in Anthropic and its Foundry deal — which gives Azure customers access to Claude models — are untouched. Microsoft pulled a tool from its own engineers' hands because the unit economics did not work, while continuing to sell Claude to its own customers. Hold that contradiction; we will come back to it.

What happened at Uber

Uber's story is the more vivid one. It deployed Claude Code (alongside Cursor) in December 2025 and burned through its entire 2026 AI budget in roughly four months. That means the money was gone by spring rather than "ending in May with seven months left," as the viral version has it, though the spirit of the claim holds. The adoption curve is the real headline: agentic usage jumped from 32% of engineers in February to 84% by March. By spring, about 95% of engineers were using AI tools monthly, and roughly 70% of committed code originated from them.

Average spend landed around $150–$250 per engineer per month, with heavy users hitting $2,000. Uber's CTO reportedly spent $1,200 in a single two-hour demo session. And here is the detail that explains everything: Uber put engineers on leaderboards ranked by how much they used the tool. When you gamify consumption of a metered resource, you should not be shocked when consumption goes vertical. Roughly 11% of backend updates ended up fully agent-generated — so the spend bought real output. It just bought far more of it, far faster, than the budget assumed.

The counterintuitive part: it breaks because it is good

The instinct is to read these stories as "AI coding tools are too expensive and not worth it." That is the wrong lesson. The tools are expensive because they are worth it. Under usage-based pricing, your bill scales with how useful the tool is. The better Claude Code gets at one-shotting a refactor, the more you reach for it, the more tokens you burn, the bigger the invoice. Quality and cost move in the same direction.

This is the opposite of every SaaS pricing model engineers grew up with. A per-seat license has a fixed cost; using it more is free, so improvement is pure upside. Token billing inverts that. There is no plateau where you have "paid for" the tool and now use it freely. Every keystroke of value has a marginal cost attached. That is the wall Microsoft and Uber hit — not a quality ceiling, a metering one.

With token-based pricing, the work gets more expensive the more you use it and the better it gets. The math breaks not because the tool is bad, but because it is good enough that engineers never want to stop.

Let us fix the numbers

The viral post says "1 million tokens from Claude's most advanced Opus 4.7 costs $25." That number is real but mislabelled. Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. The $25 is the output rate, not a flat per-million price — and in real workloads you consume far more input (your codebase, context, tool results) than output. So a blended cost is its own calculation, not a single sticker price.
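To make the "blended cost is its own calculation" point concrete, here is a minimal sketch that prices a session at the two Opus 4.7 rates quoted above. The function names and the 900k/100k token split are invented for illustration, not measured data.

```python
# Opus 4.7 list prices as quoted in this article, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at standard (uncached, non-batch) rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


def blended_rate(input_tokens: int, output_tokens: int) -> float:
    """Effective USD price per million tokens for a given input/output mix."""
    total_tokens = input_tokens + output_tokens
    return request_cost(input_tokens, output_tokens) / total_tokens * 1_000_000


# Hypothetical session: 900k tokens of context read, 100k tokens generated.
cost = request_cost(900_000, 100_000)
rate = blended_rate(900_000, 100_000)
print(f"cost ${cost:.2f}, blended ${rate:.2f} per million tokens")
```

With that input-heavy mix the blended rate comes out at about $7 per million, much closer to the $5 input rate than to the $25 sticker figure.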

There is a second, sneakier cost lever. Opus 4.7 shipped with a new tokenizer that can produce up to 35% more tokens for the same input text. So even though the per-token rate did not change from 4.6, the same prompt can cost more simply because it now counts as more tokens. If you migrated workloads up a version and your bill crept higher for no obvious reason, that is why.
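The arithmetic behind that creep is simple enough to sketch. This assumes the per-token rate is unchanged and that the extra tokens apply evenly across whatever text you send; the $1,000 baseline is a made-up figure, and 35% is the worst case quoted above rather than a typical one.

```python
def cost_after_tokenizer_change(baseline_usd: float, token_inflation: float) -> float:
    """Cost of the same text once it tokenizes into `token_inflation` more tokens.

    The per-token price is assumed unchanged, so cost scales with token count.
    """
    return baseline_usd * (1.0 + token_inflation)


# Hypothetical $1,000/month workload at 0%, 10% and the 35% worst case.
for inflation in (0.0, 0.10, 0.35):
    new_cost = cost_after_tokenizer_change(1_000.0, inflation)
    print(f"{inflation:.0%} more tokens -> ${new_cost:,.2f}")
```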

Working in the other direction are two big discounts most people ignore. Prompt caching drops cached input reads to roughly 10% of the standard rate — about $0.50 per million instead of $5 — which is enormous when you are sending the same system prompt and reference files on every turn. And the Batch API halves both input and output for anything you can run asynchronously. Opus 4.7 at batch rates is $2.50 / $12.50 per million. The companies getting wrecked are largely the ones paying full freight on every token.
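A rough sketch of what those two discounts do to a bill, using only the rates quoted above. The token counts are hypothetical. The sketch treats caching and batching as separate scenarios and ignores any extra charge for writing to the cache in the first place, so read it as an upper bound on the savings rather than an exact invoice.

```python
# Opus 4.7 rates as quoted in this article, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00
CACHE_READ_FACTOR = 0.10  # cached input reads billed at ~10% of the input rate
BATCH_FACTOR = 0.50       # Batch API halves both input and output rates


def interactive_turn_cost(cached_input: int, fresh_input: int, output: int) -> float:
    """USD cost of one interactive turn where part of the input is a cache read."""
    billable_input = cached_input * CACHE_READ_FACTOR + fresh_input
    return (billable_input * INPUT_USD_PER_MTOK
            + output * OUTPUT_USD_PER_MTOK) / 1_000_000


def batch_job_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of an asynchronous job at batch rates."""
    full_price = (input_tokens * INPUT_USD_PER_MTOK
                  + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return full_price * BATCH_FACTOR


# Hypothetical turn: 200k-token stable prefix, 5k new input, 2k output.
uncached = interactive_turn_cost(0, 205_000, 2_000)
cached = interactive_turn_cost(200_000, 5_000, 2_000)
print(f"per turn: ${uncached:.3f} uncached vs ${cached:.3f} cached")

# Hypothetical overnight job: 1M input tokens, 200k output tokens.
print(f"batch job: ${batch_job_cost(1_000_000, 200_000):.2f}")
```

In this example the cached turn costs roughly a sixth of the uncached one, and that gap repeats on every turn that reuses the same prefix.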

As for "a top engineer uses 30 million to 1 billion tokens a day, which is $700 to $20k": the low end is plausible for someone running agents hard, but a billion tokens a day per person is hyperbole. The grounded data points — $2,000/month for heavy users, $1,200 in a two-hour demo — are dramatic enough without inflation. The honest framing is that an aggressive agentic workflow can cost tens of dollars an hour, and a fleet of autonomous agents running unattended overnight can cost a great deal more. You do not need fictional billion-token engineers to make the point.
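You can sanity-check the viral figures against the quoted rates yourself. The sketch below brackets a daily token count between "every token billed as input" and "every token billed as output", then backs out the per-million rate the viral numbers imply. The helper names are mine, and the calculation ignores caching and batch discounts.

```python
INPUT_USD_PER_MTOK = 5.00    # Opus 4.7 input rate quoted above
OUTPUT_USD_PER_MTOK = 25.00  # Opus 4.7 output rate quoted above


def daily_cost_bounds(tokens_per_day: int) -> tuple[float, float]:
    """(all-input, all-output) USD bounds for a daily token count."""
    low = tokens_per_day * INPUT_USD_PER_MTOK / 1_000_000
    high = tokens_per_day * OUTPUT_USD_PER_MTOK / 1_000_000
    return low, high


def implied_rate(usd: float, tokens: int) -> float:
    """USD per million tokens implied by a claimed spend and token count."""
    return usd / tokens * 1_000_000


for tokens, claimed_usd in ((30_000_000, 700), (1_000_000_000, 20_000)):
    low, high = daily_cost_bounds(tokens)
    rate = implied_rate(claimed_usd, tokens)
    print(f"{tokens:,} tokens/day: ${low:,.0f} to ${high:,.0f}; "
          f"claimed ${claimed_usd:,} implies ${rate:.2f} per million")
```

On these numbers the viral figures only hold if nearly every token is billed close to the output rate, which is the opposite of how input-heavy agentic workloads behave.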

Tokens are the new oil — and that is not just a slogan

The "tokens are the new oil" framing in the viral post is borrowed from Jensen Huang, and it is the part most worth taking seriously. NVIDIA's pitch is that data centres are becoming "token factories" — facilities whose entire output is tokens, and whose revenue is bounded by physics. At GTC 2026, Huang reduced it to a formula: Revenue = (Tokens per Watt) × (Available Gigawatts).

Read that again, because it is the whole game. If a region only has so many gigawatts of power to allocate, then the only way to grow output is to generate more tokens per watt. Power is the hard ceiling; efficiency is the only lever left. NVIDIA claims its architectures — Hopper, Blackwell, and the Vera Rubin generation — have driven a millionfold increase in inference throughput per megawatt across six generations. That is why every new GPU generation matters to the people paying token bills, even if they never touch the hardware: better performance-per-watt is the only path to cheaper tokens.
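One way to read the formula is as a proportionality. In the sketch below, R is revenue, η is tokens per watt, P is available power, and k is a constant that folds in unit conversions and the price per token. The constant k is my assumption, not part of the quoted formula.

```latex
\[
R = k \,\eta\, P
\qquad\Longrightarrow\qquad
\frac{R_2}{R_1} = \frac{\eta_2}{\eta_1}
\quad \text{when } P \text{ is fixed.}
\]
```

With the gigawatts pinned, doubling tokens per watt doubles the ceiling on output, and nothing else in the expression can move it.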

This reframes Anthropic's incentives. To lower the price it charges customers, Anthropic does not just need cheaper inference software — it needs the most efficient silicon it can get its hands on, running in the most power-dense configuration possible. Which leads directly to the deal.

The NVIDIA–Microsoft–Anthropic deal

On 18–19 November 2025, the three companies announced a tightly interlocked partnership. Anthropic committed to buy roughly $30 billion of Azure compute capacity and to bring on an additional gigawatt of capacity. NVIDIA committed to invest up to $10 billion in Anthropic; Microsoft up to $5 billion. Anthropic agreed to optimise its Claude models for NVIDIA's architecture: the Grace Blackwell GB200 systems the viral post name-checks, and the generations after them. Anthropic's valuation reportedly jumped to around $350 billion in the process. By May 2026 there were even reports of Anthropic and Microsoft discussing Microsoft's own Maia 200 custom silicon for Claude.

So yes — the viral post's core mechanism is right. Anthropic buying more NVIDIA compute is, partly, how it intends to bring per-token costs down. But notice the circularity: Microsoft invests in Anthropic, Anthropic spends that money on Azure (Microsoft) and NVIDIA chips, NVIDIA invests back into Anthropic. Money is moving in a loop between three companies who all benefit from the loop continuing to spin. That is not automatically a problem. It is also exactly the pattern people point at when they use the word "bubble."

So is it a bubble?

The viral post's rhetorical move is: "Shortages in chips, memory, energy, electricity, tokens — how can that be a bubble?" It is a clean line, and it is also a false binary. Shortages and a bubble are not mutually exclusive. They routinely coexist.

Demand can be real, constraints can be real, and valuations can still run ahead of the cash flows that eventually have to justify them. The railways were a genuine revolution and a financial mania. Fibre in the late 1990s was laid for real demand that materialised — a decade later, after most of the companies that laid it went bankrupt. "There is a real shortage of the thing" tells you the technology is useful. It tells you nothing about whether the prices being paid for it today will look sane in five years.

The Microsoft and Uber stories are actually the strongest evidence on both sides. Bull case: usage is so high that companies physically cannot stop their engineers from spending. Bear case: the unit economics are so broken that one of the most sophisticated buyers on earth yanked the tool rather than keep paying. Both are true. That is what a transition looks like before the price of the input — energy — falls or efficiency catches up. My honest read: the demand is not a bubble, but some of the financing structures and valuations built on top of it might be. Those are different claims, and the viral post collapses them into one.

The other force on price: competition is coming

The bear case is not just about financing — it is about the moat. The premium both OpenAI and Anthropic charge assumes their frontier models stay meaningfully ahead of everything cheaper. That assumption is exactly what is under pressure. In a May 2026 segment, CNBC's Deirdre Bosa lays out how Chinese labs like DeepSeek are matching American frontier capability at a fraction of the cost, with a wave of American and European challengers building toward the same price point — and enterprise traffic already starting to shift toward the cheaper models.

This sharpens the bubble question into something concrete. OpenAI and Anthropic are pitching IPO investors on a premium moat — and that moat is eroding fastest in the enterprise segments they most need to dominate, the same segments where Microsoft and Uber just felt the cost in their own budgets. If a "good enough" model lands at a tenth of the price, the calculus that justified $2,000-a-month engineers changes overnight.

But the deeper point in that segment is the one I keep coming back to: this is a software problem as much as a money problem. Right now the industry is brute-forcing capability with hardware — throwing more silicon and more power at models so enormous that running one can demand on the order of 2TB of VRAM. That is the expensive, inelegant phase. The durable wins will come from software and efficiency catching up to the infrastructure: smaller models that punch above their weight, better quantisation, smarter routing, cheaper inference per token. Whoever closes that gap fastest — not whoever spends the most — sets the price everyone else has to match. Which, conveniently, is the same reason I would rather stay portable than bet my whole workflow on one lab staying ahead.

What this means if you are not Microsoft

Here is the part the macro takes skip. I am not Uber. I am one developer running a small agency, and I pay these bills out of my own revenue, not a venture-backed "AI budget." When a metered resource can cost $2,000 a month for a heavy user, cost discipline stops being a nice-to-have and becomes a core engineering skill. The good news: almost everything that wrecked Microsoft and Uber is avoidable if you are paying attention, precisely because they were not optimising. Spending was effectively uncapped until the budget ran out, and at Uber engineers were ranked on usage leaderboards.

The single biggest lever is model routing. Opus is the deepest reasoner and the most expensive; you do not need it to rename variables, write a commit message, or grind through a batch of boilerplate. Sonnet handles the overwhelming majority of real coding work, and Haiku is genuinely capable for high-frequency, low-stakes tasks at a fraction of the price. Treating one model as your hammer for every nail is how the bill explodes. Matching model tier to task difficulty is the cheapest optimisation there is.

Route by difficulty. Opus for architecture and gnarly debugging, Sonnet for everyday coding, Haiku for frequent worker-style tasks. Most of my agent fleet runs on the cheaper tiers by default.
Cache aggressively. A stable system prompt and reference files cached at ~10% of input cost turns your most repeated tokens into your cheapest ones. This is the discount Microsoft-scale rollouts often leave on the table.
Batch what is not interactive. Overnight audits, bulk content passes, and report generation do not need a real-time response — half price via the Batch API.
Keep context lean. Token-hungry MCP server schemas and bloated context windows cost money on every single turn and degrade quality. Load only what the current task needs.
Watch usage like a metered utility. Because it is one. Knowing your daily burn before the invoice arrives is the difference between a surprise and a decision.
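The routing and metering habits above can be sketched in a few lines. This is an illustration under stated assumptions, not my production setup: the task names, the tier labels, the `route` function and the `BurnMeter` class are all hypothetical, and you would map the tier labels to real model IDs and feed the meter from your provider's actual usage data.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical task kinds mapped to tier labels (not real model IDs).
TIER_BY_TASK = {
    "architecture": "opus",
    "debugging": "opus",
    "feature": "sonnet",
    "refactor": "sonnet",
    "rename": "haiku",
    "commit_message": "haiku",
    "boilerplate": "haiku",
}
DEFAULT_TIER = "sonnet"  # unknown tasks fall through to the everyday tier


def route(task_kind: str, interactive: bool = True) -> dict:
    """Pick a tier by task difficulty and flag non-interactive work for batching."""
    return {
        "tier": TIER_BY_TASK.get(task_kind, DEFAULT_TIER),
        "use_batch": not interactive,
    }


@dataclass
class BurnMeter:
    """Tracks spend per day so an overrun shows up before the invoice does."""
    daily_budget_usd: float
    spent_by_day: dict = field(default_factory=dict)

    def record(self, day: date, usd: float) -> bool:
        """Add spend for `day`; return True if that day is now over budget."""
        self.spent_by_day[day] = self.spent_by_day.get(day, 0.0) + usd
        return self.spent_by_day[day] > self.daily_budget_usd


print(route("commit_message"))                    # cheap tier, real-time
print(route("architecture"))                      # top tier, real-time
print(route("nightly_audit", interactive=False))  # default tier, batched

meter = BurnMeter(daily_budget_usd=20.0)
today = date(2026, 5, 26)
print(meter.record(today, 15.0))  # False: still under the daily budget
print(meter.record(today, 6.0))   # True: the day is now over budget
```

The design choice worth copying is the default: anything not explicitly marked as hard lands on a cheaper tier, so reaching for the most expensive model is a deliberate act rather than the path of least resistance.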

Why a model-agnostic setup matters more now

There is a strategic point underneath the tactical one. When the price of your most-used tool can move with a new tokenizer, a pricing change, or a single partnership announcement, you do not want to be locked into one provider's billing model. I have written before about building a model-agnostic development setup where my coding standards, security rules, and workflows live in a provider-neutral format that every tool loads. The payoff is exactly this scenario: if Anthropic's token costs spike, or a cheaper frontier model appears, switching is a thin adapter file and thirty minutes — not a rewrite of how I work.
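To show what a "thin adapter" can look like in practice, here is a minimal sketch. It is not my actual configuration: `CodingModel`, `EchoModel`, `SHARED_RULES` and `run_task` are hypothetical names, and the stand-in adapter only echoes its input where a real one would call a vendor SDK.

```python
from typing import Protocol

# Provider-neutral rules that every adapter receives unchanged (placeholder text).
SHARED_RULES = "Follow the team coding standards and security rules."


class CodingModel(Protocol):
    """The only surface the rest of the workflow is allowed to depend on."""

    def complete(self, system: str, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for illustration; a real adapter would call a vendor SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, system: str, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def run_task(model: CodingModel, prompt: str) -> str:
    """Workflow code sees only the protocol, never a specific provider."""
    return model.complete(SHARED_RULES, prompt)


# Switching providers means swapping the adapter object, nothing else.
print(run_task(EchoModel("provider-a"), "rename x to user_count"))
print(run_task(EchoModel("provider-b"), "rename x to user_count"))
```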

Microsoft's move proves the point at scale. They are not abandoning Claude's quality; they are routing engineers to a tool with billing they can predict. You can make the same kind of move on a smaller scale, deliberately, before the invoice forces your hand. The teams that get hurt are the ones who built their entire workflow around one provider's pricing and treated it as fixed. It is not fixed. It is the most volatile input in your stack.

The takeaway

Strip away the viral framing and the real story is simpler and more useful than "Big Tech is quitting Claude." Agentic coding tools got good enough that usage went exponential, usage-based pricing turned that into exponential cost, and even Microsoft and Uber discovered there is no free plateau. The fix the industry is reaching for — more efficient NVIDIA silicon, more gigawatts, more tokens per watt — is real and it is the right lever, because energy is the actual ceiling. Whether the financing built on top of all that is a bubble is a separate question the shortage argument does not answer.

For the rest of us, the lesson is mundane and immediately actionable: tokens are a metered utility, treat them like one. Route by difficulty, cache relentlessly, batch the non-interactive work, keep your context lean, and stay portable enough to switch when the economics shift. Microsoft had to learn that lesson with a budget the size of a small country. You can learn it for the price of paying attention to your own dashboard.

Frequently Asked Questions

Did Microsoft and Uber actually stop using Claude?

No. Neither company stopped using Claude models broadly. Microsoft is reportedly cancelling most of its internal Claude Code seat licenses and moving engineers to GitHub Copilot CLI, while keeping its multi-billion-dollar Anthropic investment and Foundry deal intact. Uber did not stop either — it simply exhausted its allocated 2026 AI coding budget far faster than planned. The "stopping using Claude" framing conflates one agentic tool and one budget line with abandoning the company.

How much does Claude Opus 4.7 actually cost per million tokens?

Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. The viral "$25 per million" figure is the output rate, not a flat price — and real workloads use more input than output. Prompt caching cuts cached input reads to roughly $0.50 per million (about 10% of standard), and the Batch API halves both rates to $2.50 / $12.50 per million for asynchronous work.

Why does better AI lead to higher costs under token pricing?

Because usage-based billing ties your cost directly to how much you use the tool, and you use a tool more when it is better. There is no fixed-cost plateau like a per-seat license. Every unit of value Claude Code produces has a marginal token cost, so improvements in quality drive consumption up, which drives the bill up. That is the structural reason Microsoft and Uber overran their budgets.

What is the NVIDIA, Microsoft and Anthropic deal about?

In the deal, announced in November 2025, Anthropic committed to buy roughly $30 billion of Azure compute and add a gigawatt of capacity; NVIDIA committed up to $10 billion of investment and Microsoft up to $5 billion. Anthropic also agreed to optimise Claude for NVIDIA architectures like the Grace Blackwell GB200. The goal is more efficient, lower-cost token generation, though the arrangement is notably circular, with money cycling between the three companies.

Does the AI compute shortage prove it is not a bubble?

No — shortages and bubbles can coexist. Genuine demand and real supply constraints tell you a technology is useful; they say nothing about whether current valuations and financing structures are sustainable. The railway and fibre-optic build-outs were both real revolutions and financial manias. The honest read is that the demand for AI is not a bubble, but some valuations and circular financing arrangements built on top of it might be.

How can a small team avoid the cost problems Microsoft and Uber hit?

Treat tokens as a metered utility. Route tasks by difficulty (Opus for hard reasoning, Sonnet for everyday coding, Haiku for frequent low-stakes work), cache stable prompts and reference files to cut repeated input to ~10% cost, batch non-interactive jobs at half price, keep context windows lean, and monitor daily burn. Most of what wrecked the big rollouts came from not optimising at all: spending was effectively uncapped until the budget ran out, and Uber ranked engineers on usage leaderboards.