How to Reduce OpenAI and Claude API Token Costs: A Developer's Guide

If you're building on the OpenAI or Anthropic APIs, your token bill is probably higher than it needs to be. Both platforms have features specifically designed to reduce costs — but most developers don't use them, or use them incorrectly.

This post covers the API-level techniques that make the biggest difference for both providers.

How LLM API pricing actually works

Before optimizing, understand the cost model:

  • Input tokens (your prompt, system instructions, context) are charged at one rate
  • Output tokens (the model's response) are charged at a higher rate — typically 3–5x more than input
  • Cached input tokens are dramatically cheaper — Anthropic charges 0.1x the base input rate for cache reads
  • Thinking/reasoning tokens (extended thinking in Claude, chain-of-thought in OpenAI) are billed as output tokens — the expensive kind

Most developers focus on making prompts shorter (input tokens), but output tokens and thinking tokens are often the bigger per-token cost driver. And cached inputs are so cheap that the real optimization is often increasing cache-eligible input while reducing everything else.
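To make the asymmetry concrete, here is a minimal cost estimator. The default rates are illustrative placeholders, not current list prices — check your provider's pricing page:

```python
def estimate_cost(input_tokens, output_tokens, cached_tokens=0,
                  input_rate=3.00, output_rate=15.00, cache_read_mult=0.1):
    """Estimate request cost in dollars. Rates are per million tokens;
    the defaults are illustrative, not real list prices."""
    uncached = input_tokens - cached_tokens
    return (uncached * input_rate
            + cached_tokens * input_rate * cache_read_mult
            + output_tokens * output_rate) / 1_000_000
```

With these example rates, a million cached input tokens costs a tenth of a million uncached ones, and a million output tokens costs five times a million input tokens — which is why trimming output usually beats trimming input.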

Prompt caching: the single biggest cost lever

Anthropic's prompt caching is the highest-ROI feature most teams underuse. Cache reads cost 0.1x the base input price — effectively a 90% discount on input tokens that hit the cache.

But there's a catch: a cache hit requires an exactly identical prefix. If even one token changes in the cached portion, everything from that point on misses the cache, and re-caching it is billed at the (higher) write rate.

The practical implications:

  • Put stable content first: System instructions, background context, and tool definitions should come at the beginning of your prompt. Variable content (user input, conversation history) goes at the end.
  • Don't put timestamps in your system prompt: A common mistake. If your system prompt includes "Today's date is March 6, 2026", the cache invalidates every day.
  • Use explicit cache breakpoints: When different sections change at different rates, use breakpoints so a change in one section doesn't invalidate everything.
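The guidelines above can be sketched as a request builder that keeps stable content first. Field names follow Anthropic's `cache_control` convention; verify them against the current API reference before relying on this shape:

```python
def build_request(model, tools, system_prompt, user_message):
    """Order the prompt stable-first: tools, then system instructions,
    then the variable user turn. cache_control marks the end of a
    cacheable prefix segment (Anthropic-style field names)."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,  # stable: tool definitions rarely change
        "system": [{
            "type": "text",
            "text": system_prompt,  # stable: no timestamps in here
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }],
        "messages": [{"role": "user", "content": user_message}],  # variable
    }
```

Because only `messages` varies between calls, every request after the first should read the tools-plus-system prefix from cache.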

OpenAI offers similar prefix caching on longer prompts, and their Batch API provides 50% cost savings for non-time-sensitive workloads.

For a deep dive on designing your prompt architecture for maximum cache hits, see our post on designing for prompt cache hits.

Structured outputs: stop paying for retries

One of the most expensive patterns in LLM usage is the retry loop: you ask for structured data, the model returns something malformed, you parse it, fail, and call the API again.

Both providers now support structured output modes that eliminate this:

  • Anthropic's tool use: Define a JSON schema for the expected output. The model returns valid JSON matching your schema. No parsing errors, no retries.
  • OpenAI's Structured Outputs: Similar schema-based approach; the API guarantees the response conforms to the JSON schema you supply.

Even without strict schema mode, you can reduce retry rates by:

  • Providing a clear output format in your prompt with an example
  • Using JSON mode (available on both platforms) to at least guarantee valid JSON
  • Setting explicit tool schemas that constrain the output shape
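As a sketch of that last point, here is a tool definition that pins the output shape. This follows Anthropic's tool format; OpenAI's Structured Outputs takes a comparable JSON Schema. The tool name and fields are hypothetical:

```python
# Hypothetical extraction tool: constrains output to three known fields
# instead of free-form prose the caller has to parse.
record_invoice = {
    "name": "record_invoice",
    "description": "Record fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "due_date": {"type": "string"},  # e.g. an ISO 8601 date
        },
        "required": ["vendor", "total", "due_date"],
    },
}
```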

Every eliminated retry is a full API call you didn't have to pay for.

Model routing: the right model for every task

Not every API call needs your most expensive model. This is the simplest cost optimization, yet most teams use one model for everything.

A practical routing framework:

| Task type | Recommended model | Why |
|---|---|---|
| Classification, extraction, formatting | Haiku / GPT-4o-mini | Fast, cheap, high accuracy on simple tasks |
| Code generation, analysis, writing | Sonnet / GPT-4o | Good balance of quality and cost |
| Complex reasoning, architecture, multi-step planning | Opus / GPT-4 / o1 | Only when you need the extra capability |

The key insight: a simple routing layer that sends easy tasks to cheap models and hard tasks to expensive ones can cut costs by 40–60% with no quality loss on the tasks that matter.

In practice, this can be as simple as an if-statement based on task type, or as sophisticated as a classifier that routes based on estimated complexity.
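The if-statement version really is that simple. A minimal sketch, with placeholder model names standing in for whatever tiers you actually deploy:

```python
# Minimal task-type router; model identifiers are illustrative placeholders.
CHEAP_TASKS = {"classification", "extraction", "formatting"}
MID_TASKS = {"codegen", "analysis", "writing"}

def route_model(task_type: str) -> str:
    if task_type in CHEAP_TASKS:
        return "small-model"      # e.g. Haiku / GPT-4o-mini
    if task_type in MID_TASKS:
        return "mid-model"        # e.g. Sonnet / GPT-4o
    return "frontier-model"       # complex reasoning and planning
```

When task type isn't known up front, the same function signature can wrap a cheap classifier call that estimates complexity first.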

Thinking and effort controls

Extended thinking (Claude) and chain-of-thought reasoning (OpenAI o-series) are powerful but expensive. Thinking tokens are billed as output tokens — the most expensive token type.

Both providers offer controls:

  • Anthropic's budget tokens: Set a thinking budget to cap how many tokens the model spends reasoning. Lower budgets work fine for simpler tasks.
  • Effort settings (e.g. OpenAI's reasoning_effort): Control how aggressively the model spends tokens across text, tool calls, and thinking. Lower effort means fewer tool calls and less reasoning overhead.

The rule of thumb: use high thinking effort for architecture decisions, debugging complex issues, and multi-step planning. Use low or no thinking for extraction, formatting, classification, and simple code changes.
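That rule of thumb can be encoded as a small helper. The field names follow Anthropic's extended-thinking parameter as documented; the budget values are illustrative starting points, not recommendations:

```python
def thinking_config(complexity: str) -> dict:
    """Map task complexity to a thinking setting. Field names follow
    Anthropic's extended-thinking parameter; budgets are illustrative."""
    if complexity == "simple":          # extraction, formatting, classification
        return {"type": "disabled"}
    budget = 4_000 if complexity == "moderate" else 16_000
    return {"type": "enabled", "budget_tokens": budget}
```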

Constrain your outputs

Output tokens are expensive. Don't let the model generate more than you need:

  • Set realistic max_tokens: If you expect a classification label, set max_tokens to 50, not 4,096. This prevents runaway responses.
  • Ask for diffs, not rewrites: When modifying code, ask for the specific changes rather than the entire file.
  • Request structured formats: "Return a JSON object with fields X, Y, Z" produces much less output than "Explain your analysis."
  • Stop sequences: Use stop sequences to cut off generation when the useful output is done.

A response that's 200 tokens of precise, structured output is almost always more useful — and cheaper — than 2,000 tokens of prose.
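Putting the max_tokens and stop-sequence advice together, a per-task cap table might look like this. Parameter names are Anthropic-style (`max_tokens`, `stop_sequences`); OpenAI's rough equivalents are `max_tokens` and `stop`. The values are starting points to tune, not prescriptions:

```python
def output_limits(task: str) -> dict:
    """Per-task output caps (Anthropic-style parameter names; values
    are illustrative starting points)."""
    if task == "classify":
        # A label fits in a few tokens; stop at the first newline.
        return {"max_tokens": 50, "stop_sequences": ["\n"]}
    if task == "extract_json":
        return {"max_tokens": 500}
    return {"max_tokens": 2048}         # general generation
```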


This post is part of our complete LLM token optimization strategy guide. For related topics, see designing for prompt cache hits and cutting MCP and tool overhead.