Designing for Prompt Cache Hits: How to Save 90% on LLM Input Tokens

Prompt caching is the most powerful cost-reduction feature available on modern LLM APIs. Anthropic's cache reads cost 0.1x the base input price — a 90% discount. OpenAI automatically caches prompt prefixes as well, with cached-input discounts that range from roughly 50% to 90% depending on the model.

But many teams enable caching and then wonder why their cache hit rates are low. The problem is almost always the same: their prompts aren't designed for caching.

Caching isn't a toggle you flip. It's an architecture you build around.

How prompt caching works (the critical detail)

Prompt caching works by storing a processed version of your input prefix. On subsequent requests, if the prefix matches exactly, the cached version is reused — skipping the expensive work of running those prefix tokens through the model again.

The critical detail: cache hits require an exactly identical prefix. If even one token in the cached portion changes between requests, that segment is a cache miss. You pay the full input price for it, plus a cache write premium (25% extra on Anthropic's default cache) to re-cache it.

This means cache design is really about maximizing the size of your stable prefix — the portion of your prompt that stays identical across requests.
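
To make this concrete: on Anthropic's API you opt in by placing a cache_control marker at the end of the prefix you want cached. Below is a minimal sketch using the Anthropic Python SDK; the model name and prompt text are placeholders, and a real prefix has to exceed a minimum length (roughly 1,024 tokens on most models) to be cached at all.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a support assistant for Acme. ..."  # thousands of tokens in practice

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Everything up to and including this block is cached. The next
            # request must send this exact text to get a cache hit.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# The first call writes the cache; identical prefixes on later calls read it.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)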

The stable prefix pattern

The fundamental design pattern for cache-friendly prompts is simple:

[Stable content — cached]    → System instructions, background, tool definitions
[Semi-stable content]        → Few-shot examples, reference docs
[Variable content — not cached] → User input, conversation history

Everything that stays the same across requests goes first. Everything that changes goes last.

This seems obvious, but most developers structure their prompts the other way around — putting the user's question first and the context after. Inverting this order is often the single biggest improvement in cache hit rates.
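
In code, this is purely a question of assembly order. Here is a sketch with a hypothetical build_request helper; the function and constants are made up for illustration and are not part of any SDK.

SYSTEM_PROMPT = "You are a support assistant for Acme. Follow the style guide below."
REFERENCE_DOCS = "..."  # long, rarely-changing reference material

def build_request(user_input: str) -> dict:
    """Hypothetical helper: stable content first, variable content last."""
    return {
        # Stable: byte-for-byte identical on every request, so it can be cached.
        "system": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCS,
        # Variable: changes per request, so it sits after the cacheable prefix.
        "messages": [{"role": "user", "content": user_input}],
    }

request = build_request("How do I reset my password?")

# Cache-hostile version of the same prompt (don't do this): interpolating the
# question into the system text makes every request a unique prefix.
# system = f"Answer this question: {user_input}\n\n" + REFERENCE_DOCS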

What to put in your stable prefix

The best candidates for caching are:

  • System instructions: Your model's persona, rules, constraints, output format requirements. These rarely change between requests.
  • Tool definitions: If you're using function calling, tool schemas are typically identical across requests. At 500–2,000 tokens per tool, caching 10 tools saves 5,000–20,000 tokens per request.
  • Background context: Project documentation, API references, style guides — anything that provides context but doesn't change per-request.
  • Few-shot examples: If you use consistent examples, they're prime caching material. Just don't shuffle them between requests (see below).
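
Tool definitions are an especially easy win on Anthropic's API, because a single cache_control marker on the last tool caches the whole tools array. A sketch, with a made-up tool schema:

tools = [
    {
        "name": "get_order_status",
        "description": "Look up the status of a customer order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    # ... the rest of your tools, always in the same order ...
]

# A marker on the last tool caches every tool definition above it too.
tools[-1]["cache_control"] = {"type": "ephemeral"}

# Pass tools=tools to client.messages.create() as in the earlier sketch; the
# cached tool block only hits if the schemas and their order are unchanged.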

Cache-busting mistakes

These are the most common patterns that accidentally destroy cache hit rates:

Timestamps in system prompts. "Today's date is March 6, 2026" in your system prompt means the cache invalidates every day. If you need the model to know the date, put it in the variable section after the cached prefix, or update it less frequently.

Shuffled few-shot examples. If you randomize the order of your examples on each request "for variety," every order is a unique prefix. Pick a fixed order and stick with it.

Dynamic tool lists. If your available tools change between requests — some tools enabled, some disabled — the tool definition section changes and the cache misses. Either load all tools consistently or use the on-demand tool loading pattern (see reducing tool overhead).

Per-user context in the prefix. Putting user-specific data (name, preferences, history) into the system prompt means each user gets a unique prefix. Move user context to the variable section.

Version strings or build hashes. Embedding deployment metadata in your prompt invalidates the cache on every deploy.
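
The fix for most of these is the same move: volatile values go after the cached prefix, not inside it. A sketch of the pattern, with a hypothetical helper and names:

from datetime import date

# Cached prefix: no dates, no user names, no build hashes.
STABLE_SYSTEM_PROMPT = "You are a support assistant for Acme. Follow the rules above."

def build_variable_turn(user_name: str, user_question: str) -> dict:
    """Hypothetical helper: volatile values live in the uncached user turn."""
    context = f"Today's date: {date.today().isoformat()}\nCustomer name: {user_name}"
    return {"role": "user", "content": f"{context}\n\n{user_question}"}

# Cache-hostile version (don't do this): the prefix changes every day and per user.
# system = f"Today is {date.today()}. You are helping {user_name}."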

Multi-tier caching with breakpoints

Different parts of your prompt change at different rates:

  • System instructions: change rarely (monthly)
  • Tool definitions: change occasionally (weekly)
  • Background docs: change sometimes (as docs update)
  • User conversation: changes every request

Anthropic supports explicit cache breakpoints that let you cache these tiers independently. A change in your background docs doesn't invalidate the cache for your system instructions and tool definitions.

The pattern:

[System instructions]           → Breakpoint 1 (stable for months)
[Tool definitions]              → Breakpoint 2 (stable for weeks)
[Background docs / examples]    → Breakpoint 3 (stable for days)
[User input + conversation]     → Not cached

If you update your background docs, only that segment re-caches. The first two tiers still hit.
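
Translated into Anthropic's API, each tier ends with its own cache_control marker. One wrinkle: in the API, the tools block sits before the system blocks in the cached prefix, so the breakpoints below run tools, then system instructions, then background docs. A sketch with placeholder content:

import anthropic

client = anthropic.Anthropic()

SYSTEM_INSTRUCTIONS = "..."   # persona, rules, output format (stable for months)
BACKGROUND_DOCS = "..."       # project docs, style guide, examples (stable for days)

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model name
    max_tokens=1024,
    tools=[
        {
            "name": "search_docs",
            "description": "Search the internal documentation.",
            "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
            # Breakpoint: tool definitions (stable for weeks).
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        # Breakpoint: system instructions.
        {"type": "text", "text": SYSTEM_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}},
        # Breakpoint: background docs / examples. Editing these only re-caches
        # this tier; the tiers above still hit.
        {"type": "text", "text": BACKGROUND_DOCS, "cache_control": {"type": "ephemeral"}},
    ],
    # Variable portion: user input and conversation history, not cached.
    messages=[{"role": "user", "content": "Summarize yesterday's incident report."}],
)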

Measuring cache performance

You can't optimize what you don't measure. Key metrics for prompt caching:

  • Cache hit rate: What percentage of requests hit the cache vs. miss? Aim for 80%+ on steady-state traffic.
  • Cache read vs. write tokens: In your API usage dashboard, compare cached read tokens to cache write tokens. High write-to-read ratios indicate frequent cache misses.
  • Cost per request before vs. after: Track the actual cost impact. A well-designed caching setup can reduce input token costs by 70–90%.

Both Anthropic and OpenAI provide usage breakdowns that separate cached from uncached token counts. Monitor these regularly to catch cache-busting regressions.
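
With the Anthropic SDK, these counts come back on every response in the usage object. A minimal sketch of aggregating them; the hit-rate definition here is token-weighted and is my own convention, not an official metric:

def cache_stats(responses) -> dict:
    """Aggregate cache metrics from a batch of Anthropic message responses."""
    reads = sum(r.usage.cache_read_input_tokens or 0 for r in responses)
    writes = sum(r.usage.cache_creation_input_tokens or 0 for r in responses)
    uncached = sum(r.usage.input_tokens for r in responses)   # tokens never cached
    total = reads + writes + uncached
    return {
        # Token-weighted hit rate: share of cacheable tokens served from cache.
        "cache_hit_rate": reads / (reads + writes) if (reads + writes) else 0.0,
        "cached_share_of_input": reads / total if total else 0.0,
        "write_to_read_ratio": writes / reads if reads else float("inf"),
    }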

The economics at scale

Let's make this concrete. Consider a production app making 10,000 API calls per day with 20,000 input tokens per call:

Without caching:

  • 200M input tokens/day at full price

With well-designed caching (15,000 stable tokens, 5,000 variable, 85% hit rate):

  • 15,000 tokens × 85% = 12,750 tokens at 0.1x price (cache reads)
  • 15,000 tokens × 15% = 2,250 tokens at full price plus the 25% cache write premium (cache misses)
  • 5,000 tokens always at full price (variable portion)
  • Effective cost reduction: roughly 55% on input tokens

At Anthropic's Claude Sonnet pricing ($3 per million input tokens), that's the difference between spending roughly $600/day and $270/day on input tokens alone. Over a year, you're looking at close to $120,000 in savings — from an architecture change, not a feature cut.
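
As a sanity check, here's the arithmetic in a few lines, using Anthropic's published Sonnet input price and the 25% cache write premium; adjust the constants for your own model and traffic.

CALLS_PER_DAY = 10_000
STABLE_TOKENS, VARIABLE_TOKENS = 15_000, 5_000
HIT_RATE = 0.85

INPUT_PRICE = 3.00 / 1_000_000      # $ per input token (Claude Sonnet)
CACHE_READ = 0.1 * INPUT_PRICE      # 90% discount on cache reads
CACHE_WRITE = 1.25 * INPUT_PRICE    # 25% premium on cache writes

baseline = CALLS_PER_DAY * (STABLE_TOKENS + VARIABLE_TOKENS) * INPUT_PRICE

cached = CALLS_PER_DAY * (
    STABLE_TOKENS * HIT_RATE * CACHE_READ            # prefix served from cache
    + STABLE_TOKENS * (1 - HIT_RATE) * CACHE_WRITE   # misses re-write the cache
    + VARIABLE_TOKENS * INPUT_PRICE                  # variable portion, full price
)

print(f"without caching: ${baseline:,.0f}/day")             # about $600/day
print(f"with caching:    ${cached:,.0f}/day")               # about $270/day
print(f"annual savings:  ${(baseline - cached) * 365:,.0f}")  # roughly $120,000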


This post is part of our complete LLM token optimization strategy guide. For related topics, see reducing OpenAI and Claude API token costs and cutting MCP and tool overhead.