Cut MCP and Tool Overhead to Save Thousands of LLM Tokens Per Request
Here's something most developers don't realize: before your LLM agent does any actual work, it may be spending 55,000 to 134,000 tokens just describing the tools it could use.
Every tool definition — its name, description, parameter schema, examples — gets included in the input context on every single request. Connect a few MCP servers, enable some integrations, and suddenly a significant chunk of your token budget is consumed by tool schemas the model might never touch.
The numbers
Anthropic published measurements from real tool-use setups in their November 2025 analysis. The findings were stark:
- 55K to 134K tokens of tool-definition overhead in production setups, before any work started
- Switching to on-demand tool search reduced one example's overhead to approximately 8,700 tokens — an 85% reduction
- Tool accuracy actually improved because the model had less irrelevant schema to parse
That overhead isn't a one-time cost. It's included on every single API call in the conversation.
Why this happens
LLM tool use works by including the full tool definitions in the system prompt. The model needs to know what tools are available, what parameters they accept, and what they return.
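To make the cost concrete, here's a minimal sketch of what a single tool definition looks like in the shape the Anthropic Messages API expects (a name, a description, and a JSON Schema for parameters). The specific tool and the chars-per-token heuristic are illustrative assumptions, but every field shown here counts toward input tokens on every request that includes the tool:

```python
import json

# An illustrative tool definition (name, description, input_schema).
# Each field is billed as input tokens on every API call that carries it.
query_tool = {
    "name": "query_database",
    "description": "Run a read-only SQL query against the analytics database "
                   "and return the matching rows as JSON.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "The SQL query to run"},
            "limit": {"type": "integer", "description": "Max rows to return"},
        },
        "required": ["sql"],
    },
}

def estimate_tokens(obj) -> int:
    # Rough heuristic: ~4 characters per token for English/JSON text.
    return len(json.dumps(obj)) // 4

print(estimate_tokens(query_tool))  # even this tiny tool costs ~100 tokens
```

Multiply that by 30+ definitions and the overhead numbers above stop being surprising.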
With MCP (Model Context Protocol) servers, this scales quickly:
- A database connector might define 10+ tools for querying, inserting, updating
- A project management integration adds tools for tickets, sprints, comments
- A code search server adds tools for grep, file reading, symbol lookup
- Each tool includes a JSON schema for its parameters
Connect 3–4 MCP servers and you can easily have 30+ tool definitions in your context — most of which are irrelevant to the current task.
Community members in Cursor forums reported that just saying "hello" to a new chat incurred 13,000–20,000 tokens of baseline overhead from internal system prompts and tool definitions. That's before you've even asked a question.
Audit your tool overhead
Start by understanding what you're actually paying for:
- In Claude Code, use /mcp to see which MCP servers are connected and /context to see what's in your current context window
- Count how many tool definitions are active in a typical session
- Check which tools you actually use vs. which are just connected "in case"
Most developers find that they actively use 3–5 tools in a typical session but have 15–30 loaded.
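A quick audit can be sketched in a few lines. Everything here is illustrative — the tool shapes, the 25-tools-connected / 4-tools-used split, and the chars-per-token heuristic are assumptions standing in for your real session data:

```python
import json

def estimate_tokens(obj) -> int:
    return len(json.dumps(obj)) // 4   # rough ~4 chars/token heuristic

# Stand-ins for your connected tool definitions and a session's call log.
connected_tools = {
    f"tool_{i}": {"description": "..." * 20, "input_schema": {"type": "object"}}
    for i in range(25)                 # 25 tools loaded...
}
used_tools = {"tool_0", "tool_1", "tool_2", "tool_3"}  # ...4 actually called

total = sum(estimate_tokens(t) for t in connected_tools.values())
wasted = sum(estimate_tokens(t) for name, t in connected_tools.items()
             if name not in used_tools)
print(f"{total} tokens loaded, {wasted} spent on tools never used")
```

With the illustrative split above, 21 of 25 definitions — over 80% of the tool-schema budget — are paying for nothing.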
On-demand tool loading
The highest-impact fix is switching from "load all tools upfront" to "load tools when needed."
The pattern:
- Start with a minimal set of always-needed tools (file read, file write, shell)
- Include a tool search meta-tool that can find and load other tools by description
- When the model needs a capability it doesn't have, it searches for and loads the relevant tool
- The tool definition enters the context only for the turns where it's actually used
This is exactly how Anthropic achieved the 85% reduction in their benchmarks. Instead of 134K tokens of tool schemas loaded upfront, the model started with a lightweight search tool and pulled in specific tools as needed.
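The pattern above can be sketched as a small registry plus a search meta-tool. The tool library, matching logic, and function names here are hypothetical simplifications — a real implementation would do semantic matching and wire the loaded definition into the API request — but the shape is the same:

```python
# Illustrative tool library; in practice these come from your MCP servers.
TOOL_LIBRARY = {
    "create_ticket": {
        "description": "Create a ticket in the project tracker",
        "input_schema": {"type": "object",
                         "properties": {"title": {"type": "string"}}},
    },
    "run_query": {
        "description": "Run a read-only SQL query",
        "input_schema": {"type": "object",
                         "properties": {"sql": {"type": "string"}}},
    },
}

def search_tools(query: str) -> list[str]:
    """The lightweight meta-tool: return names whose description matches."""
    q = query.lower()
    return [name for name, t in TOOL_LIBRARY.items()
            if q in t["description"].lower()]

def load_tool(name: str, active_tools: dict) -> None:
    """Pull one full definition into the active context, only when needed."""
    active_tools[name] = TOOL_LIBRARY[name]

active = {}                            # context starts with no tool schemas
matches = search_tools("ticket")       # model asks for a capability...
load_tool(matches[0], active)          # ...and loads exactly one tool
print(list(active))                    # ['create_ticket']
```

The key property: the context pays for one small search tool up front, and each full schema only for the turns where it's actually in play.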
CLI tools vs. MCP
Sometimes the simplest optimization is to skip the MCP layer entirely.
MCP servers are valuable when you need structured, cross-platform tool definitions. But if a task can be accomplished with a direct CLI command, the overhead difference is significant:
- MCP approach: Tool definition schema (500–2,000 tokens) + call overhead + response parsing
- CLI approach: A single shell command with minimal context
For tasks like running tests, checking git status, or grepping files, a direct command-line tool often does the job with a fraction of the token cost. Anthropic's Claude Code docs explicitly recommend preferring direct CLI tools when they do the job.
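A back-of-the-envelope comparison makes the gap visible. The git-status tool schema below and the ~4-chars-per-token heuristic are illustrative assumptions, not measurements of any real MCP server:

```python
import json

def estimate_tokens(text: str) -> int:
    return len(text) // 4              # rough ~4 chars/token heuristic

# The same capability two ways: a hypothetical MCP tool schema vs. a shell command.
mcp_tool = {
    "name": "git_status",
    "description": "Report the working-tree status of a git repository, "
                   "including staged, unstaged, and untracked files.",
    "input_schema": {
        "type": "object",
        "properties": {"repo_path": {"type": "string",
                                     "description": "Path to the repository"}},
        "required": ["repo_path"],
    },
}
cli_call = "git status --porcelain"

mcp_cost = estimate_tokens(json.dumps(mcp_tool))   # schema alone, every turn
cli_cost = estimate_tokens(cli_call)               # one short command string
print(mcp_cost, cli_cost)
```

Even for this deliberately small schema, the MCP definition costs an order of magnitude more context than the bare command — and the schema is billed whether or not the tool is ever called.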
Progressive disclosure with Skills
Anthropic's Skills feature implements a pattern called progressive disclosure for tool-like instructions:
- A one-line description loads into every session (cheap)
- The full instructions only load when the skill is actually triggered (on demand)
- The model sees the menu of available skills but doesn't pay for the full playbook until it's needed
This is the same principle as on-demand tool loading, applied to workflow instructions. Instead of a 2,000-token deployment playbook living in your CLAUDE.md (loaded every session), it's a 20-token summary that expands only when deployment is relevant.
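The economics of progressive disclosure can be sketched directly. This is an illustrative model, not Anthropic's implementation — the skill names and playbook text are made up — but it shows why the always-loaded "menu" stays cheap:

```python
# Illustrative skills: a one-line summary (always loaded) plus a long
# playbook (loaded only when the skill is triggered).
SKILLS = {
    "deploy": {
        "summary": "How to deploy this service to production",
        "playbook": "1. Run the test suite...\n2. Tag the release...\n" * 50,
    },
    "incident": {
        "summary": "Steps for handling a production incident",
        "playbook": "1. Page the on-call engineer...\n" * 50,
    },
}

def session_baseline() -> str:
    """What every session carries: just the menu of one-liners."""
    return "\n".join(f"- {name}: {s['summary']}" for name, s in SKILLS.items())

def expand_skill(name: str) -> str:
    """Loaded on demand, only when the task actually matches the skill."""
    return SKILLS[name]["playbook"]

baseline = session_baseline()
full = expand_skill("deploy")
print(len(baseline), len(full))        # the baseline is a tiny fraction
```

Sessions that never deploy never pay for the deployment playbook; the only fixed cost is the two-line menu.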
The baseline you can't avoid (but can make cheaper)
Some tool overhead is unavoidable. The model needs to know its basic capabilities. Internal system prompts and core tool definitions are always present.
But this is where prompt caching becomes your ally. Since these definitions are identical on every request, they're perfect candidates for caching. At 0.1x the base input price, that 15,000-token baseline becomes effectively a 1,500-token cost on cache hits.
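Here's the arithmetic over a whole conversation, using the 0.1x cached-read multiplier mentioned above. The turn count is illustrative, and the model ignores the cache-write premium some providers charge on the first request:

```python
baseline_tokens = 15_000         # fixed system prompt + core tool definitions
turns = 20                       # API calls in one conversation
cache_read_multiplier = 0.1      # cached input reads vs. base input price

# Without caching: the full baseline is billed at 1.0x on every call.
uncached = baseline_tokens * turns

# With caching: first call pays full price, later calls read at 0.1x.
cached = baseline_tokens + baseline_tokens * cache_read_multiplier * (turns - 1)

print(uncached, cached)          # 300000 vs 43500 token-equivalents
```

Over 20 turns, the cached baseline costs roughly a seventh of the uncached one — and the saving grows with conversation length.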
The strategy: accept the baseline overhead, make it cache-friendly, and aggressively prune everything on top of it. See our guide to designing for prompt cache hits for the details.
Tool overhead is one of the easiest wins in token optimization because it requires zero changes to your actual prompts or workflows: you're just turning off things you don't need. This post is part of our complete LLM token optimization strategy guide. For related reading, see designing for prompt cache hits and how to measure and monitor LLM token usage.