The Bill Nobody Budgeted For
There's a moment that happens in engineering organizations about six months after they ship their first AI-powered feature. Usage grows, the feature gets popular, and someone looks at the API invoice and says: "Wait — we're spending how much?"
It's not a small number. And the reason it's not a small number is that nobody on the team was thinking in tokens when they built the thing. They were thinking in features, in sprints, in story points — the units of cost they've always tracked. Tokens didn't exist in that mental model, so they didn't get managed.
That's about to change. AI cost management is becoming a core engineering discipline, and the teams that treat it as an afterthought are going to learn the hard way.
What Tokens Actually Are
Every word, punctuation mark, and whitespace character you send to a language model gets broken into tokens — roughly 0.75 words per token on average, though the actual split depends on the content. The model processes those tokens and generates output tokens in response. You pay for both.
What makes this different from most infrastructure costs is that it scales in two directions you don't fully control: the length of what you send, and the length of what the model decides to say back. A vague system prompt that runs 2,000 tokens and a model that produces 1,500-word answers will cost you ten times more than a tight 200-token prompt with constrained output — for the same functional result.
That ratio is where the money goes.
The Cost Reality Across Models
Here's a concrete example. A simple AI-powered task: a code comment generator. The developer pastes a function, the model writes the JSDoc block. Simple, useful, the kind of thing teams ship in a sprint and then forget about.
Assume a 200-token system prompt, a 150-token code input, and 100 tokens of generated output. That's 350 tokens in, 100 tokens out per request.
| Model | Input | Output | Cost per Request |
|-------|-------|--------|-----------------|
| Claude Haiku 4.5 | $0.80 / MTok | $4 / MTok | ~$0.0007 |
| Claude Sonnet 4.6 | $3 / MTok | $15 / MTok | ~$0.0026 |
| Claude Opus 4.7 | $15 / MTok | $75 / MTok | ~$0.0128 |
Those numbers look small in isolation. They don't stay small.
At 1,000 requests per day — a moderate engineering team using the tool regularly:
| Model | Daily Cost | Monthly Cost |
|-------|-----------|-------------|
| Haiku 4.5 | ~$0.70 | ~$21 |
| Sonnet 4.6 | ~$2.60 | ~$78 |
| Opus 4.7 | ~$12.80 | ~$384 |
At 10,000 requests per day across a larger organization:
| Model | Daily Cost | Monthly Cost |
|-------|-----------|-------------|
| Haiku 4.5 | ~$7 | ~$210 |
| Sonnet 4.6 | ~$26 | ~$780 |
| Opus 4.7 | ~$128 | ~$3,840 |
The same feature. The same task. An 18x difference in monthly cost depending on which model you chose and whether you thought about it at all.
For a JSDoc generator, Haiku produces results that are indistinguishable from Opus in the vast majority of cases. The capability gap matters for complex reasoning tasks — it largely doesn't matter for well-defined, low-complexity operations.
Six Strategies That Cut Costs 40–80%
This isn't theoretical. These are the levers that make the difference between an AI feature that stays in budget and one that triggers an emergency spending review.
1. Right-size the model to the task.
This is the highest-leverage move and the one teams skip most often. Not every task needs Opus. Opus is exceptional at complex reasoning, nuanced analysis, and tasks where quality variance is high and consequences are real. Most internal tooling doesn't qualify.
Categorize your AI use cases. Simple classification, short generation, low-stakes summarization — these belong on Haiku. Complex reasoning, code generation for critical systems, anything where output quality directly affects a user — Sonnet or Opus. The delta is significant enough that this decision alone can cut your bill by 60–80% on tasks currently running on a model they don't need.
2. Optimize your prompts like you'd optimize a query.
System prompts are paid for on every single request. A system prompt that drifted to 1,500 tokens through iterative additions gets sent with every API call, every time, forever. Audit your system prompts. Cut everything that isn't load-bearing. Instructions that don't change the output don't belong in the prompt.
The same principle applies to context. If you're sending 3,000 tokens of context when 800 relevant tokens would produce the same answer, you're paying for 2,200 tokens of noise on every call.
3. Constrain output length.
Models don't naturally write short answers — they generate until they decide to stop, which usually means more tokens than you needed. Set max_tokens intentionally. A feature that needs a one-paragraph summary should be configured to produce a one-paragraph summary, not a model left to decide how much is enough.
This is not just a cost optimization. Shorter, constrained outputs are often better for the user interface anyway. Verbose AI responses frequently get ignored. Concise ones get read.
4. Use prompt caching for repeated context.
If your system prompt or large context blocks are stable across requests — shared instructions, a reference document, a consistent persona — use prompt caching. Anthropic's API caches prompt prefixes for up to five minutes, charging a fraction of the standard input rate for cache hits. On workloads where the same 2,000-token system prompt is being sent with every request, caching turns that cost from full price to roughly 10% of full price.
For any AI feature with high request volume and a consistent system prompt, this single change can cut the input token cost by 80–90%.
5. Build skills for repeated operations.
Rather than sending the model a broad instruction and hoping it reasons correctly each time, build specific tool definitions for repeated operations. Well-defined tools route the model's output through a predictable structure, which reduces both the complexity of the system prompt and the verbosity of the output. Less ambiguity in the prompt means less token consumption reasoning through it.
6. Implement caching at the application layer.
Some AI responses are stable enough to cache. A career summary, an FAQ response, a boilerplate code snippet — if the same input will produce effectively the same output, store the result and skip the API call entirely. This is basic engineering applied to AI infrastructure. The cheapest token is the one you never send.
What This Means for Engineering Teams
The estimation conversation needs to expand.
When a developer proposes an AI-powered feature, the team needs to ask three cost questions before the sprint starts: What model is this using and why? How many tokens does each request consume? What's the expected request volume at steady state?
Right now, most teams are asking none of these questions. The feature gets built, the model gets chosen by default, the system prompt grows organically, and nobody looks at token usage until the invoice arrives.
The engineers who will be most valuable in AI-integrated organizations aren't just the ones who can prompt models effectively. They're the ones who can design AI features that are both capable and cost-efficient — who understand the cost surface of the tools they're working with and make deliberate tradeoffs rather than defaulting to the most capable model because it was easier than thinking about it.
That's not a different skill from good engineering. It's the same skill applied to a new set of constraints.
A Line Item That Isn't Going Away
AI infrastructure is not going to become free. Model costs will decrease over time, but usage will scale faster — and the organizations that build responsible token management into their engineering culture from the beginning will have a meaningful structural advantage over the ones that treat it as someone else's problem.
Estimation has always been about accountability — knowing what something costs before you commit to it. Tokens are a cost. Treat them like one.
