How to Reduce OpenAI and Claude API Token Costs: A Developer's Guide

If you're building on the OpenAI or Anthropic APIs, your token bill is probably higher than it needs to be. Both platforms have features specifically designed to reduce costs — but most developers don't use them, or use them incorrectly.

This post covers the API-level techniques that make the biggest difference for both providers.

How LLM API pricing actually works

Before optimizing, understand the cost model:

  • Input tokens (your prompt, system instructions, context) are charged at one rate
  • Output tokens (the model's response) are charged at a higher rate — typically 3–5x more than input
  • Cached input tokens are dramatically cheaper — Anthropic charges 0.1x the base input rate for cache reads
  • Thinking/reasoning tokens (extended thinking in Claude, chain-of-thought in OpenAI) are billed as output tokens — the expensive kind

Most developers focus on making prompts shorter (input tokens), but output tokens and thinking tokens are often the bigger per-token cost driver. And cached inputs are so cheap that the real optimization is often increasing cache-eligible input while reducing everything else.
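
To see how these rates interact, here's a small worked example. The dollar figures are placeholders chosen only to show the relative weight of each token type, not any provider's current price list:

```python
# Illustrative rates only -- placeholders, not any provider's current price list.
INPUT_RATE = 3.00                  # $ per million input tokens (assumed)
OUTPUT_RATE = 15.00                # $ per million output tokens (assumed, 5x input)
CACHED_RATE = 0.1 * INPUT_RATE     # cache reads at 0.1x the base input rate

def request_cost(input_tokens, cached_tokens, output_tokens, thinking_tokens=0):
    """Estimate one request's cost. Thinking tokens bill at the output rate."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * INPUT_RATE
        + cached_tokens * CACHED_RATE
        + (output_tokens + thinking_tokens) * OUTPUT_RATE
    ) / 1_000_000

# 10K-token prompt with 8K cached, 500 output tokens, 2K thinking tokens:
print(f"${request_cost(10_000, 8_000, 500, 2_000):.4f}")  # thinking alone costs more than the whole prompt
```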

Heads-up for Opus 4.7 users: Anthropic's newest flagship model ships with a new tokenizer that uses up to 35% more tokens for the same text compared to Opus 4.6 (most pronounced for code and structured data, minimal for plain English). The per-token price is identical ($5/$25/MTok), but effective cost per request is higher. Benchmark your actual workloads before assuming Opus 4.7 costs the same as 4.6.

Heads-up for GPT-5.5 users: OpenAI's newest flagship (released April 2026) applies a long-context premium for requests exceeding 272K input tokens — the full session is charged at 2x input and 1.5x output rates. If your workloads regularly exceed this threshold, factor this into cost projections. Also note: GPT-5.5 Pro has no cached-input discount, unlike standard GPT-5.5.

Prompt caching: the single biggest cost lever

Anthropic's prompt caching is the highest-ROI feature most teams underuse. Cache reads cost 0.1x the base input price — effectively a 90% discount on input tokens that hit the cache.

But there's a catch: cache hits require 100% identical prefix segments. If even one token changes in the cached portion, the entire segment is a cache miss and you pay the full write cost.

The practical implications:

  • Put stable content first: System instructions, background context, and tool definitions should come at the beginning of your prompt. Variable content (user input, conversation history) goes at the end.
  • Don't put timestamps in your system prompt: A common mistake. If your system prompt includes "Today's date is March 6, 2026", the cache invalidates every day.
  • Use explicit cache breakpoints: When different sections change at different rates, use breakpoints so a change in one section doesn't invalidate everything (see the sketch below).
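
Here's what that ordering looks like with the Anthropic Python SDK. This is a minimal sketch: the model ID and prompt contents are placeholders, and the prefix must exceed the model's minimum cacheable length to be stored at all.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "...several thousand tokens of instructions, policies, and few-shot examples..."
user_input = "Summarize the attached incident report in three bullet points."

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID; substitute whatever you actually run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,              # stable content first
            "cache_control": {"type": "ephemeral"},    # cache breakpoint: everything up to here is cacheable
        }
    ],
    messages=[
        {"role": "user", "content": user_input},       # variable content after the cached prefix
    ],
)

# The usage block shows whether the prefix was written to cache or read from it.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```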

OpenAI caches repeated prefixes automatically. GPT-5.5 (the current flagship, released April 2026) offers a 90% cached-input discount ($0.50/MTok cached input), matching Anthropic's rate — but GPT-5.5 Pro has no cached-input discount. GPT-5.4 and its mini/nano variants continue to offer 90% cached discounts. Older GPT-4.x models got a 50% discount. One caveat: OpenAI's reasoning models (o3, o4-mini) cache at a 75% discount, not 90%. Google Gemini offers both implicit caching (automatic) and explicit caching — Gemini 2.5+ models get 90% savings while Gemini 2.0 models get 75%, both with configurable TTLs.

For a deep dive on designing your prompt architecture for maximum cache hits, see our post on designing for prompt cache hits.

Structured outputs: stop paying for retries

One of the most expensive patterns in LLM usage is the retry loop: you ask for structured data, the model returns something malformed, you parse it, fail, and call the API again.

Both providers now support structured output modes that eliminate this:

  • Anthropic's tool use: Define a JSON schema for the expected output. The model returns valid JSON matching your schema. No parsing errors, no retries.
  • OpenAI's Structured Outputs: Similar schema-based approach with guaranteed valid JSON (a sketch follows this list).
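
On the OpenAI side, a schema-constrained request looks roughly like this. The model name, schema, and input text are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

raw_invoice_text = "ACME Corp, Invoice #1042. Total due: $1,200.00 USD."

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,   # strict mode requires a closed schema
}

response = client.chat.completions.create(
    model="gpt-4o",  # assumed: any model that supports Structured Outputs
    messages=[{"role": "user", "content": "Extract the invoice fields from: " + raw_invoice_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
)

invoice = json.loads(response.choices[0].message.content)  # parses cleanly, no retry loop
```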

Even without strict schema mode, you can reduce retry rates by:

  • Providing a clear output format in your prompt with an example
  • Using JSON mode on OpenAI to guarantee syntactically valid JSON, or prefilling the assistant turn with an opening brace on Anthropic to steer the model toward JSON-only output
  • Setting explicit tool schemas that constrain the output shape

Every eliminated retry is a full API call you didn't have to pay for.

Model routing: the right model for every task

Not every API call needs your most expensive model. This is the simplest cost optimization, yet most teams use one model for everything.

A practical routing framework:

| Task type | Model tier | Notes |
|---|---|---|
| Classification, extraction, formatting | Small (Haiku 4.5, gpt-5.4-nano at $0.20/MTok) | Fast, cheap, high accuracy on simple tasks |
| Code generation, analysis, writing | Medium (Sonnet 4.6, GPT-5.5 at $5/$30/MTok, or GPT-5.4 at $2.50/$15/MTok) | GPT-5.5 is OpenAI's newest flagship (April 2026); GPT-5.4 is the cheaper alternative |
| Reasoning-heavy tasks at scale | Reasoning (o4-mini at $1.10/$4.40/MTok, o3 at $2/$8/MTok) | o3 dropped 80% in price (April 2026); strong reasoning is now mid-tier cost |
| Complex reasoning, architecture, deep research | Large (Opus 4.7 or Opus 4.6, GPT-5.5 Pro at $30/$180/MTok, o3-pro at $5/$20/MTok) | Only when you need maximum capability; Opus 4.7's new tokenizer adds up to 35% more tokens vs 4.6 for code-heavy inputs, and GPT-5.5 Pro has no cached-input discount |

The key insight: a simple routing layer that sends easy tasks to cheap models and hard tasks to expensive ones can cut costs by 40–60% with no quality loss on the tasks that matter.

A notable shift in April 2026: OpenAI cut o3's price by 80% (from $10/$40 to $2/$8 per MTok). At that price, o3 is now cheaper than GPT-5.5 standard — but with stronger reasoning. For applications that previously avoided o3 due to cost, this warrants a reassessment. o4-mini ($1.10/$4.40) also enters as a capable reasoning option at a fraction of the cost.

In practice, this can be as simple as an if-statement based on task type, or as sophisticated as a classifier that routes based on estimated complexity.
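
A sketch of the if-statement version, with placeholder model IDs standing in for whichever small/medium/large tiers you settle on:

```python
# Placeholder model IDs -- substitute the actual small/medium/large models you use.
SMALL, MEDIUM, LARGE = "small-model-id", "medium-model-id", "large-model-id"

ROUTES = {
    "classification": SMALL,
    "extraction": SMALL,
    "formatting": SMALL,
    "code_generation": MEDIUM,
    "analysis": MEDIUM,
    "architecture": LARGE,
    "deep_research": LARGE,
}

def pick_model(task_type: str) -> str:
    """Send known-easy tasks to the cheap tier; default to the medium tier when unsure."""
    return ROUTES.get(task_type, MEDIUM)

print(pick_model("classification"))  # small-model-id
print(pick_model("something_new"))   # medium-model-id (safe default)
```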

Thinking and effort controls

Extended thinking (Claude) and chain-of-thought reasoning (OpenAI o-series) are powerful but expensive. Thinking tokens are billed as output tokens — the most expensive token type.

Both providers offer controls:

  • Anthropic's effort parameter: On Claude 4.6 models, use adaptive thinking with the effort parameter (low/medium/high) to let the model calibrate its reasoning per task. Claude Opus 4.7 adds an xhigh effort level (above high) for tasks requiring the deepest reasoning — but this also generates the most thinking tokens, so use it sparingly. On older Claude models, set an explicit budget_tokens cap. Lower effort or budget means fewer thinking tokens.
  • Task Budgets (Opus 4.7 public beta): Specify a task_budget to cap total tokens the model can spend across an entire long-running agentic session. The model prioritizes the most important work when it knows it's budget-constrained — useful for preventing runaway costs on open-ended tasks.
  • Suppress thinking output: Claude 4.6 models now support thinking.display: "omitted" (available March 2026). This strips the thinking content from the API response for faster streaming — the model still reasons internally, and the signature is preserved for multi-turn continuity, but you don't pay output tokens for thinking you'll discard. Useful for production pipelines where the reasoning trace isn't shown to users.
  • OpenAI's reasoning effort: Similarly, OpenAI's o-series models offer reasoning effort controls to limit thinking tokens on simpler tasks.

The rule of thumb: use high thinking effort for architecture decisions, debugging complex issues, and multi-step planning. Use low or no thinking for extraction, formatting, classification, and simple code changes.
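
A sketch of both knobs under assumed model IDs. The adaptive effort parameter, xhigh level, and Task Budgets described above are newer controls with their own request fields not shown here; this only covers the long-standing thinking budget and reasoning effort settings:

```python
import anthropic
from openai import OpenAI

# Anthropic: cap extended thinking with an explicit token budget.
claude = anthropic.Anthropic()
claude_response = claude.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID
    max_tokens=2048,            # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},  # keep the budget small for simple tasks
    messages=[{"role": "user", "content": "Classify this ticket as bug, feature, or question: ..."}],
)

# OpenAI: dial reasoning effort down on o-series models for easy requests.
openai_client = OpenAI()
openai_response = openai_client.chat.completions.create(
    model="o4-mini",          # assumed reasoning model
    reasoning_effort="low",   # fewer reasoning (thinking) tokens generated
    messages=[{"role": "user", "content": "Extract the order ID from: ..."}],
)
```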

Advisor Tool: Opus-level quality at Sonnet-level cost

A new cost pattern launched in April 2026: the advisor tool (beta) lets you pair a cheap executor model with a high-intelligence advisor model that's only consulted when the executor is stuck.

Instead of running an entire coding agent session on Opus 4.6 or 4.7, you run it on Sonnet 4.6 (or Haiku 4.5) and add the advisor tool to your tools array. The executor handles most turns independently. When it reaches a decision point too complex to resolve alone, it calls the advisor — which generates a focused 400–700 token response at Opus rates. Typical agentic sessions with 3 advisor consultations out of 25 turns run roughly 73% cheaper than Opus solo (Sonnet executor + Opus advisor), or 87% cheaper with Haiku as executor. Both Opus 4.6 and Opus 4.7 are available as the advisor model at the same $5/$25/MTok pricing; Opus 4.7 offers stronger coding capability but its new tokenizer may generate slightly more tokens per consultation.

The mechanism is the same as any other tool call: you declare the advisor tool in your tools array alongside the beta header advisor-tool-2026-03-01, and the executor decides when to invoke it. Advisor tokens are billed separately at the advisor model's Opus rates and reported in the usage block for per-tier cost tracking.

This is particularly useful for long-horizon agentic tasks where most turns are routine (reading files, making simple decisions) but occasional turns require deeper reasoning. The executor handles the routine work cheaply; the advisor handles the hard parts without you paying Opus rates for everything.

Fast Mode: speed at a premium

Claude Opus 4.6 now offers Fast Mode (research preview), which delivers significantly faster output at 6x standard pricing ($30/MTok input, $150/MTok output). This is the inverse of most optimization advice here — it's intentionally expensive, but useful for latency-critical production workloads where waiting is unacceptable.

Fast Mode is not available with the Batch API, and the two should not be confused: Batch is for async workloads where latency doesn't matter (50% off), while Fast Mode is for synchronous workloads where latency is the primary constraint (6x premium).

Batch APIs: 50% off for async workloads

All three major providers now offer batch processing at 50% cost savings:

  • OpenAI Batch API: Up to 50% off, 24-hour turnaround window. Supports chat completions, embeddings, and moderations.
  • Anthropic Message Batches: 50% off, most batches complete in under 1 hour. Up to 100,000 requests or 256 MB per batch.
  • Google Gemini Batch API: 50% off, 24-hour delivery window.

Batch is ideal for workloads that don't need real-time responses: classification pipelines, bulk summarization, embedding generation, data migration, and evaluation suites. If you're processing thousands of items nightly, batch pricing alone can halve that cost.
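
As a sketch, here's the Anthropic Message Batches flow for a nightly sentiment job. The model ID is an assumption and the polling is deliberately naive:

```python
import time
import anthropic

client = anthropic.Anthropic()
reviews = ["Great keyboard, arrived early.", "Died after two days, support never replied."]

# Submit everything as one batch: 50% off, no real-time latency requirement.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # assumed: a small model is enough for labels
                "max_tokens": 10,
                "messages": [{"role": "user", "content": f"Sentiment (positive/negative/neutral): {text}"}],
            },
        }
        for i, text in enumerate(reviews)
    ]
)

# Poll until processing ends, then stream the per-request results.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)
for entry in client.messages.batches.results(batch.id):
    print(entry.custom_id, entry.result)
```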

Predicted Outputs: latency, not cost

OpenAI's Predicted Outputs feature is worth understanding — but not for the reason most developers think. It reduces latency (up to 80% faster) when much of the output is known ahead of time, like editing code where most lines stay the same.

The common misconception: predicted outputs save money. They don't. Rejected prediction tokens are still billed at the full output token rate. The feature works via speculative decoding — the model verifies your prediction against what it would have generated, and you pay for both the accepted and rejected tokens.

Use it when latency matters and you can accurately predict most of the output. Don't use it expecting cost savings.
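
A sketch of the API shape, with the file path and model as assumptions. The usage block reports exactly how many predicted tokens were rejected and still billed:

```python
from openai import OpenAI

client = OpenAI()

existing_code = open("app.py").read()  # assumed input file; most of it will survive the edit

response = client.chat.completions.create(
    model="gpt-4o",  # assumed: a model that supports Predicted Outputs
    messages=[
        {
            "role": "user",
            "content": "Rename the class User to Account and return the full file:\n" + existing_code,
        }
    ],
    # Pass the unchanged file as the prediction. Matching tokens come back much faster,
    # but rejected prediction tokens are still billed at the full output rate.
    prediction={"type": "content", "content": existing_code},
)

details = response.usage.completion_tokens_details
print(details.accepted_prediction_tokens, details.rejected_prediction_tokens)
```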

Constrain your outputs

Output tokens are expensive. Don't let the model generate more than you need:

  • Set realistic max_tokens: If you expect a classification label, set max_tokens to 50, not 4,096. This prevents runaway responses.
  • Ask for diffs, not rewrites: When modifying code, ask for the specific changes rather than the entire file.
  • Request structured formats: "Return a JSON object with fields X, Y, Z" produces much less output than "Explain your analysis."
  • Stop sequences: Use stop sequences to cut off generation when the useful output is done.

A response that's 200 tokens of precise, structured output is almost always more useful — and cheaper — than 2,000 tokens of prose.
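
Putting a few of these constraints together, a single classification call might look like this (model ID assumed):

```python
import anthropic

client = anthropic.Anthropic()

review = "The keyboard died after two days and support never replied."

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed: a small, cheap model is plenty for a one-word label
    max_tokens=10,             # a label never needs 4,096 tokens
    stop_sequences=["\n"],     # stop as soon as the label line is complete
    messages=[
        {
            "role": "user",
            "content": f"Classify the sentiment as positive, negative, or neutral:\n{review}\nAnswer with one word.",
        }
    ],
)

print(response.content[0].text.strip())
```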

This post is part of our complete LLM token optimization strategy guide. For related topics, see designing for prompt cache hits and cutting MCP and tool overhead.