LLM Token Optimization Strategies: The Complete Guide for 2026

LLM APIs are priced by the token. Every token you send and receive has a cost. As you scale from prototype to production — from tens of requests to thousands per day — the difference between optimized and unoptimized token usage can be tens of thousands of dollars per year.

This guide covers the full landscape of LLM token optimization strategies. It's based on research across Anthropic's documentation, real-world usage data, and academic findings on retrieval-augmented generation and long-context performance.

The core thesis: token optimization is a context-engineering problem, not a prompt-shortening problem. Most teams waste time making prompts shorter when the real cost drivers are bloated context, idle tool schemas, and stale conversation history.

Why token optimization matters now

Three trends make token optimization increasingly important:

  1. Pricing tiers: Anthropic's models move to premium long-context pricing once input exceeds 200K tokens. Staying below that threshold can significantly reduce per-token costs.
  2. Agent architectures: Coding agents, tool-use workflows, and multi-step reasoning all multiply token usage. A single agent session can consume 10–100x more tokens than a simple API call.
  3. Diminishing returns on long context: Research consistently shows that relevant information buried in the middle of a long context is used less reliably. More tokens don't just cost more — they can produce worse results.

The strategies below are organized from highest to lowest ROI for most teams.

1. Context engineering and session management

The single biggest source of token waste in LLM applications is context bloat — sending far more context than the model needs for the current step.

The key strategies:

  • Split work into phases: Do discovery, implementation, and verification in separate sessions. Stale context from failed attempts charges you on every subsequent turn while degrading quality.
  • Just-in-time retrieval: Pull in exactly the information needed, right when it's needed. Targeted file reads and LSP navigation beat repo dumps. Research on iterative repository retrieval (RepoCoder) showed >10% improvement in accuracy over in-file completion while using less context.
  • Repo memory: Put durable project knowledge (architecture, conventions, build commands) in structured config files like CLAUDE.md that load automatically, rather than typing it into every conversation.
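
The just-in-time retrieval idea can be sketched as a function that returns only the lines around a symbol match instead of pasting whole files into the prompt. This is a minimal illustration, not a production retriever; the repo contents and symbol names are hypothetical:

```python
def retrieve_context(files: dict[str, str], symbol: str, window: int = 2) -> str:
    """Return only the lines surrounding matches of `symbol`,
    instead of dumping every file into the prompt."""
    snippets = []
    for path, text in files.items():
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if symbol in line:
                start = max(0, i - window)
                end = min(len(lines), i + window + 1)
                snippet = "\n".join(lines[start:end])
                snippets.append(f"# {path}:{start + 1}-{end}\n{snippet}")
    return "\n\n".join(snippets)

# Hypothetical two-file repo: only the file mentioning the symbol is retrieved.
repo = {
    "auth.py": "import jwt\n\ndef issue_token(user):\n    return jwt.encode({'sub': user}, KEY)",
    "billing.py": "def charge(user, amount):\n    ...",
}
context = retrieve_context(repo, "issue_token")
```

The same shape applies whether the lookup is a grep, an LSP "go to definition," or an embedding search: the model sees the five relevant lines, not the five thousand irrelevant ones.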

This is the single most impactful optimization for most teams. Read the full deep dive: Context Engineering: Why Reducing Token Usage Isn't About Shorter Prompts

2. Provider-specific API techniques

Each LLM provider has features specifically designed to reduce costs. Most developers don't use them, or use them incorrectly.

The key strategies:

  • Prompt caching: Anthropic cache reads cost 0.1x base input price — a 90% discount. But hits require 100% identical prefix segments, so prompt structure matters enormously.
  • Structured outputs: Tool schemas and JSON mode eliminate the retry loops caused by malformed responses. Every eliminated retry is a full API call you didn't pay for.
  • Batch APIs: OpenAI's Batch API offers 50% savings for non-time-sensitive workloads.
  • Output constraints: Set realistic max_tokens, ask for diffs instead of full rewrites, and use stop sequences.
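
As a sketch of the stable-prefix and output-constraint points together, here is a request builder in the shape of Anthropic's Messages API (payload construction only, no network call; the model name and system text are placeholders):

```python
STABLE_SYSTEM = "You are a code-review assistant. Follow the team style guide."

def build_request(user_input: str, max_tokens: int = 1024) -> dict:
    """Stable content first (cacheable prefix), variable user input last."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": max_tokens,      # a realistic cap, not the maximum allowed
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM,
                # cache breakpoint: marks the end of the reusable prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_input}],
    }

# The system block is byte-identical across requests, so cache reads can apply.
a = build_request("Review this diff: ...")
b = build_request("Summarize the changelog.")
```

Because only the final `messages` entry varies, every request shares the same cacheable prefix.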

Read the full deep dive: How to Reduce OpenAI and Claude API Token Costs

3. Tool and schema overhead reduction

A source of waste most developers don't know about: tool definitions are included in every API request. Real-world setups have measured 55K–134K tokens of tool-definition overhead before any work starts.

The key strategies:

  • Disable unused MCP servers: Each server's tool definitions load on every request whether you use them or not.
  • On-demand tool loading: Use a tool-search pattern to load tools only when needed. This reduced one setup's overhead from 134K to 8.7K tokens — a ~94% reduction.
  • Prefer CLI tools: When a direct command-line tool does the job, it avoids the schema overhead of the MCP layer.
  • Progressive disclosure: Use Skills or equivalent patterns where full instructions load only when triggered.
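
A minimal sketch of the tool-search pattern, using a hypothetical tool registry: only short descriptions are indexed up front, and a full schema ships with the request only when it matches the task:

```python
# Lightweight index: a one-line description per tool.
TOOL_INDEX = {
    "search_issues": "search the issue tracker by keyword",
    "run_tests": "run the project test suite",
    "deploy": "deploy a build to an environment",
}

# Full schemas can run to hundreds of tokens each; keep them out of the index.
FULL_SCHEMAS = {
    name: {"name": name, "description": desc, "input_schema": {"type": "object"}}
    for name, desc in TOOL_INDEX.items()
}

def select_tools(task: str, limit: int = 2) -> list[dict]:
    """Include only schemas whose description overlaps the task words."""
    task_words = set(task.lower().split())
    scored = sorted(
        ((len(task_words & set(desc.split())), name) for name, desc in TOOL_INDEX.items()),
        reverse=True,
    )
    return [FULL_SCHEMAS[name] for score, name in scored[:limit] if score > 0]

tools = select_tools("run the failing test suite")
```

A real implementation would use embeddings or a proper search index rather than word overlap, but the economics are the same: the request carries two schemas instead of two hundred.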

Read the full deep dive: Cut MCP and Tool Overhead to Save Thousands of Tokens Per Request

4. Prompt cache architecture

Caching isn't a toggle — it's an architecture. Most teams enable prompt caching but get low hit rates because their prompts aren't designed for it.

The key strategies:

  • Stable prefix pattern: Put stable content (system instructions, tool definitions) first and variable content (user input) last.
  • Multi-tier caching: Use breakpoints to cache sections that change at different rates independently.
  • Avoid cache busters: Timestamps in system prompts, shuffled few-shot examples, and dynamic tool lists all destroy cache hit rates.
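
The multi-tier idea can be sketched as ordering system blocks by how often they change, with a cache breakpoint after each tier (field names follow Anthropic's Messages API, which allows up to four breakpoints; the tier contents here are illustrative):

```python
def build_system_blocks(instructions: str, tool_docs: str, session_docs: str) -> list[dict]:
    """Slowest-changing content first. With a breakpoint after each tier,
    a change to session_docs leaves instructions and tool_docs cached.
    Note: no timestamps or shuffled examples; those would bust every tier."""
    tiers = [instructions, tool_docs, session_docs]
    return [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text in tiers
    ]

blocks = build_system_blocks("core rules", "tool reference", "today's retrieved docs")
```

The payoff: when only the fastest-changing tier updates, cache reads still cover everything before it.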

Read the full deep dive: Designing for Prompt Cache Hits: How to Save 90% on Input Tokens

5. Model routing and right-sizing

Not every task needs your most expensive model. A routing layer that sends easy tasks to cheap models and hard tasks to expensive ones can cut costs by 40–60%.

The key strategies:

  • Task-based routing: Classification, extraction, and formatting go to Haiku/GPT-4o-mini. Complex reasoning and architecture decisions go to Opus/GPT-4.
  • Thinking/effort controls: Extended thinking burns output tokens (the expensive kind). Dial it down for simple tasks.
  • Subagent model selection: Route simple subagent work to cheaper models. Agent teams use ~7x more tokens than standard sessions, so model choice matters more.
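
A first-cut router can be as simple as a lookup from task type to model and thinking budget. The model names below are placeholders, and a production router would classify tasks rather than trust a caller-supplied label:

```python
# Task types that a small model handles reliably.
CHEAP_TASKS = {"classify", "extract", "format"}

def route(task_type: str) -> dict:
    """Send routine tasks to a small model with thinking off;
    reserve the large model and extended thinking for hard reasoning."""
    if task_type in CHEAP_TASKS:
        return {"model": "small-model", "extended_thinking": False}
    return {"model": "large-model", "extended_thinking": True}

cheap = route("extract")
hard = route("architecture_review")
```

Even this crude split captures most of the savings, because the cheap task types tend to dominate request volume.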

Read more: 5 Ways to Reduce Your LLM API Costs Today

6. Measurement and monitoring

You can't optimize what you can't measure. And most teams optimize the wrong thing because they haven't measured where their tokens actually go.

The key strategies:

  • Use built-in tools: Claude Code's /cost, /context, and /mcp commands reveal real-time token usage.
  • API-level tracking: Token Count API (pre-flight checks) and Usage & Cost API (post-hoc breakdowns by model, cache, and context tier).
  • Find the real hotspots: Research shows that review and rework loops consume ~59% of tokens on average — not the initial generation. Input context growth, not prompt size, is usually the main cost driver.
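
One way to find the hotspots is to tag each API call with a phase and accumulate the usage numbers the API already returns. A minimal sketch, with illustrative phase names and token counts:

```python
from collections import defaultdict

class UsageTracker:
    """Accumulate token usage per phase to see where tokens actually go."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, phase: str, input_tokens: int, output_tokens: int) -> None:
        self.totals[phase]["input"] += input_tokens
        self.totals[phase]["output"] += output_tokens

    def hotspot(self) -> str:
        """Return the phase with the highest combined token count."""
        return max(self.totals, key=lambda p: sum(self.totals[p].values()))

tracker = UsageTracker()
tracker.record("generate", 8_000, 2_000)   # initial generation
tracker.record("review", 20_000, 1_500)    # review/rework loop, turn 1
tracker.record("review", 26_000, 1_200)    # review/rework loop, turn 2
```

In a setup like this, the growing input counts on the review turns make the real cost driver visible: context accumulation in the rework loop, not the first generation.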

Read the full deep dive: How to Measure and Monitor LLM Token Usage

The 3 highest-ROI changes

If you only have time for three optimizations, research and production data suggest these deliver the most impact:

  1. Spec in one session, implement in a fresh one. Resetting context between phases eliminates the compounding cost of stale history. This is free to implement and immediately reduces token usage on every subsequent turn.

  2. Replace repo dumps with targeted retrieval. Use code intelligence, LSP navigation, and focused file reads instead of dumping entire files or directories into context. Less context, better results, lower cost.

  3. Prune tools and MCP servers, then rely on caching for the stable remainder. Disable unused servers, switch to on-demand tool loading, and make sure your remaining tool definitions are cache-friendly. This attacks the constant overhead that charges you on every single request.

These three target the recurring token leaks that appear on almost every turn: stale history, irrelevant code context, and idle tool schemas.

These strategies transfer across providers

While the examples in this guide reference Claude and OpenAI specifically, the underlying problems — finite attention, long-context degradation, retrieval vs. dumping, and tool-schema overhead — are not provider-specific. The same strategies apply to Gemini, Codex, and any other LLM-based tool or API.

The fundamentals don't change: send the right context at the right time, measure where your tokens go, and optimize the actual hotspots.


For more specific topics, explore our other guides: