How to Measure and Monitor LLM Token Usage (Before You Can Optimize It)
Most teams start optimizing LLM token usage by tweaking prompts. They compress instructions, shorten examples, and trim system messages.
Then they check their bill and find it barely moved.
The problem isn't that optimization doesn't work — it's that they optimized the wrong thing. Without measurement, you're guessing where your tokens go. And most guesses are wrong.
Teams optimize the wrong thing
The intuitive assumption is that the first generation pass — your prompt and the model's initial response — is the primary cost driver. So teams focus on making that prompt smaller and the response more concise.
But research on real-world LLM agent usage tells a different story. A 2026 analysis of token usage patterns in code review agents found that iterative review loops consumed 59.4% of total tokens on average. The initial generation was a small fraction of the total cost.
Similarly, in coding agent sessions, input context growth from conversation history typically dwarfs the original prompt. By message 20, you might be paying for 80,000+ tokens of history on every turn — while your original prompt was 50 tokens.
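A back-of-the-envelope sketch makes the growth concrete. The 4,000-tokens-per-exchange figure below is an assumption for illustration, not a measured value — the point is the shape of the curve, not the exact numbers:

```python
def cumulative_input_tokens(turns, prompt_tokens=50, avg_turn_tokens=4_000):
    """Model per-turn input cost when the full history is resent each turn.

    Assumes each exchange (request + response) adds ~avg_turn_tokens of
    history, and that all accumulated history is included as input on
    every subsequent turn. Both numbers are illustrative assumptions.
    """
    per_turn_input = []
    history = prompt_tokens
    for _ in range(turns):
        per_turn_input.append(history)  # you pay for all history so far
        history += avg_turn_tokens      # this turn's exchange is appended
    return per_turn_input

costs = cumulative_input_tokens(20)
# turn 1 pays 50 input tokens; turn 20 pays 50 + 19 * 4000 = 76,050
```

The original 50-token prompt is noise by turn 20 — the history dominates, which is why trimming the prompt barely moves the bill.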
The lesson: measure first, then optimize the actual hotspot.
Claude Code's built-in measurement tools
If you're using Claude Code, you have three measurement tools available immediately:
/cost — Shows the running cost of your current session. Check this periodically to understand which conversations are expensive and when costs spike. A session that suddenly jumps from $0.30 to $1.20 on a single turn tells you something went wrong.
/context — Shows what's currently in your context window. This is the single most useful debugging tool for token waste. You'll often discover that your context contains full file dumps, stale conversation history, or tool definitions you forgot were loaded.
/mcp — Shows which MCP servers are connected and their tool counts. If you see 30+ tools from servers you rarely use, that's immediate overhead you can cut.
These three commands together give you a quick picture of where your tokens are going right now — which is often very different from where you think they're going.
API-level measurement
For production applications using the LLM APIs directly, you have more granular measurement options:
Token Count API (pre-flight check): Before sending a large request, count the tokens first. This lets you catch unexpected context bloat before it costs money. If your prompt is supposed to be 5,000 tokens and the count comes back at 45,000, something is wrong — and you caught it before paying for it.
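A minimal sketch of that guard, with the token counter injected as a callable so it runs offline — in production you'd wrap your provider's token-counting endpoint (the function name, threshold, and error type here are our own, not any SDK's API):

```python
class ContextBloatError(RuntimeError):
    """Raised when a request's token count far exceeds expectations."""


def preflight_check(messages, count_tokens, expected_tokens, tolerance=2.0):
    """Count tokens before sending; abort if the count blows past the budget.

    `count_tokens` is any callable that returns the input token count for
    `messages` - in production this would call a token-counting API; here
    it is injected so the guard itself is testable without a network call.
    `tolerance` is the allowed multiple of the expected size (illustrative).
    """
    actual = count_tokens(messages)
    if actual > expected_tokens * tolerance:
        raise ContextBloatError(
            f"expected ~{expected_tokens} input tokens, counted {actual}"
        )
    return actual


# A prompt that comes back near its expected size passes...
ok = preflight_check([{"role": "user", "content": "hi"}],
                     count_tokens=lambda msgs: 5_000,
                     expected_tokens=5_000)
# ...while a 45,000-token count against a 5,000-token budget raises
# ContextBloatError before any money is spent.
```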
Usage & Cost API (post-hoc analysis): Break your usage down by:
- Model (are you accidentally using Opus for simple tasks?)
- Cache reads vs. cache writes (is your caching actually working?)
- Context-window tier (are you hitting premium pricing above 200K tokens?)
- Input vs. output tokens (is the cost in your prompt or the response?)
These breakdowns often reveal surprises. Teams regularly discover that one endpoint is responsible for 60% of their bill, or that their cache hit rate is 20% when they assumed it was 80%.
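The aggregation itself is simple once you have raw usage records. The field names below (`model`, `endpoint`, `input_tokens`, `output_tokens`) are illustrative, not a specific provider's schema:

```python
from collections import defaultdict


def breakdown(usage_records, key):
    """Aggregate total token counts along one dimension of usage records.

    Each record is a dict like those returned by a usage/cost reporting
    API (field names here are assumptions for illustration).
    """
    totals = defaultdict(int)
    for rec in usage_records:
        totals[rec[key]] += rec["input_tokens"] + rec["output_tokens"]
    return dict(totals)


records = [
    {"model": "opus", "endpoint": "/review", "input_tokens": 9_000, "output_tokens": 1_000},
    {"model": "haiku", "endpoint": "/label", "input_tokens": 400, "output_tokens": 100},
    {"model": "opus", "endpoint": "/review", "input_tokens": 7_000, "output_tokens": 500},
]
by_model = breakdown(records, "model")        # {'opus': 17500, 'haiku': 500}
by_endpoint = breakdown(records, "endpoint")  # {'/review': 17500, '/label': 500}
```

Slicing the same records by different keys is how you find the one endpoint quietly eating most of the bill.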
Identifying the real hotspots
Once you have measurement data, look for these common patterns:
Input context growth: Does your per-turn input token count grow linearly through a conversation? That's conversation history accumulation. The fix is session splitting — see our post on context engineering.
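One shape session splitting can take is summarize-and-reset: once accumulated history crosses a budget, condense the older turns into a short recap and keep only the most recent exchange verbatim. Everything below — the threshold, the tuple format, the crude token estimate for the recap — is an illustrative sketch, not a prescribed implementation:

```python
def maybe_split_session(history, summarize, max_history_tokens=30_000):
    """Replace old history with a summary once it exceeds a token budget.

    `history` is a list of (role, text, token_count) tuples; `summarize`
    is any callable that condenses old turns into a short recap (in
    practice, a cheap model call). The budget and the rough
    4-chars-per-token estimate for the recap are illustrative.
    """
    total = sum(tokens for _, _, tokens in history)
    if total <= max_history_tokens:
        return history
    recap = summarize(history[:-2])  # condense everything but the last exchange
    return [("system", recap, len(recap) // 4)] + history[-2:]


history = [
    ("user", "long exploration...", 25_000),
    ("assistant", "long answer...", 15_000),
    ("user", "final question", 10),
    ("assistant", "final answer", 10),
]
split = maybe_split_session(history, summarize=lambda h: "recap of earlier turns")
# 40,020 tokens of history collapse to a recap plus the last exchange.
```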
Low cache hit rates: Are your cache read tokens much lower than your total input tokens? Your prompt structure probably isn't cache-friendly. See designing for prompt cache hits.
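The hit rate itself is one division, but it's worth pinning down the denominator — cache writes and uncached input both count against you. Field names mirror the cache read/write split that usage reports typically expose; exact names vary by provider:

```python
def cache_hit_rate(cache_read_tokens, cache_write_tokens, uncached_input_tokens):
    """Fraction of total input tokens that were served from the prompt cache."""
    total = cache_read_tokens + cache_write_tokens + uncached_input_tokens
    return cache_read_tokens / total if total else 0.0


# 8,000 cached reads out of 10,000 total input tokens -> 80% hit rate.
rate = cache_hit_rate(cache_read_tokens=8_000,
                      cache_write_tokens=1_000,
                      uncached_input_tokens=1_000)
```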
High tool overhead: Is a large, constant chunk of your input identical across requests, consisting mostly of tool definitions? You're paying for tools you're not using. See cutting tool overhead.
Expensive review loops: Are you seeing multiple back-and-forth turns where the model generates code, you reject it, and it tries again? These review cycles are the single biggest cost multiplier. Each iteration pays for the full context plus new generation.
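The multiplier compounds because each rejected attempt also stays in the context for the next try. A rough cost model, with illustrative numbers:

```python
def review_loop_tokens(context_tokens, generation_tokens, iterations):
    """Total tokens for a generate/reject/retry loop.

    Every iteration re-pays the full context as input plus a fresh
    generation as output, and each rejected attempt is appended to the
    context for the next try. Numbers are illustrative.
    """
    total = 0
    ctx = context_tokens
    for _ in range(iterations):
        total += ctx + generation_tokens  # input + output for this attempt
        ctx += generation_tokens          # rejected attempt stays in context
    return total


one_shot = review_loop_tokens(10_000, 1_000, iterations=1)   # 11,000 tokens
four_tries = review_loop_tokens(10_000, 1_000, iterations=4) # 50,000 tokens
```

Under these assumptions, four review iterations cost roughly 4.5x a clean first pass — which is how review loops end up dominating the bill.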
Subagent costs: Agent frameworks that spawn subagents can be surprisingly expensive. Claude Code's docs note that agent-team patterns use approximately 7x more tokens than standard sessions. Subagents are valuable for noisy exploration (keeping experimental context out of the main thread), but should be used deliberately, not by default.
A measurement-first workflow
Here's a practical workflow for token optimization:
Step 1: Baseline. Run your typical workload for a day and record total tokens, cost, and per-request breakdowns. Don't change anything yet.
Step 2: Identify the hotspot. Look at your data and find the single biggest source of waste. Is it context growth? Tool overhead? Review loops? Low cache hits? Pick the one that accounts for the most spend.
Step 3: Apply a targeted fix. Address only the top hotspot. Don't try to optimize everything at once. If context growth is the issue, implement session splitting. If tool overhead is the problem, prune your MCP servers.
Step 4: Re-measure. Run the same workload again and compare. Did the fix work? By how much? Sometimes a fix that should save 40% only saves 5% — which tells you your diagnosis was wrong.
Step 5: Repeat. Move to the next hotspot. Returns diminish with each iteration, so stop when the ROI of further optimization no longer justifies the effort.
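The re-measure step in particular benefits from a small comparison helper, so you're checking numbers rather than impressions. The category names and figures below are illustrative:

```python
def compare_runs(baseline, after):
    """Compare two measurement runs of the same workload.

    Each run is a dict of {category: token_count}; returns the percentage
    change per category so you can verify the fix hit its target.
    """
    return {
        category: round(100 * (after.get(category, 0) - count) / count, 1)
        for category, count in baseline.items()
        if count
    }


delta = compare_runs(
    {"history_input": 80_000, "tool_defs": 12_000, "output": 6_000},
    {"history_input": 30_000, "tool_defs": 12_000, "output": 6_200},
)
# {'history_input': -62.5, 'tool_defs': 0.0, 'output': 3.3}
```

Here a session-splitting fix cut history input by 62.5% while tool overhead stayed flat — confirming both the diagnosis and the next hotspot.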
Setting up ongoing monitoring
Once you've done the initial optimization, set up lightweight monitoring to prevent regression:
- Cost alerts: Set a daily or weekly budget threshold. If your bill exceeds the threshold, investigate. Regressions are common — a new feature, an updated prompt, or a new MCP server can silently increase token usage.
- Per-endpoint tracking: If you have multiple API endpoints calling LLMs, track cost per endpoint. This lets you catch a single endpoint that starts burning tokens due to a code change.
- Cache hit rate tracking: Monitor your cache hit rate over time. A sudden drop usually means something changed in your prompt structure.
- Community tools: For Claude Code, tools like claude-code-usage-monitor provide real-time token tracking beyond what the built-in /cost command shows.
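A cost alert doesn't need to be elaborate — a threshold check over daily spend catches most regressions. In a real setup you'd pull these figures from your billing or usage API and page someone instead of returning a list; this is a minimal offline sketch:

```python
def check_budget(daily_costs, threshold):
    """Return the days whose spend exceeds a budget threshold.

    `daily_costs` maps a date string to dollars spent; `threshold` is the
    daily budget. Both the data source and the alerting action are left
    out - in production, wire this to your usage API and alerting system.
    """
    return [day for day, cost in sorted(daily_costs.items()) if cost > threshold]


over = check_budget({"2025-01-01": 4.0, "2025-01-02": 12.5}, threshold=10.0)
# ['2025-01-02'] - a day that blew the $10 budget and deserves a look.
```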
The goal isn't constant monitoring — it's catching regressions quickly before they compound into a large bill.
This post is part of our complete LLM token optimization strategy guide. Once you know where your tokens go, start with the highest-ROI fixes: context engineering, prompt cache design, and reducing tool overhead.