System Prompt Token Counter
The system prompt in this example is approximately 180 tokens. That sounds small, but system prompts in production applications routinely run 500 to 3000 tokens once you add tool schemas, persona instructions, guardrails, and few-shot examples. Counting tokens before deployment helps you budget the context window accurately across GPT-4o, Claude, and Gemini.
Understanding Context Window Budget
Every LLM API call has a context window limit. The total tokens across all messages in the request must fit within that limit:
context_window = system_prompt + conversation_history + tool_definitions + response
| Model | Context Window |
|---|---|
| GPT-4o | 128k tokens |
| GPT-4o mini | 128k tokens |
| Claude 3.5 Sonnet | 200k tokens |
| Claude 3 Haiku | 200k tokens |
| Gemini 1.5 Pro | 1M tokens |
| Gemini 1.5 Flash | 1M tokens |
For chat applications, conversation history grows with each turn. A 500 token system prompt on a 128k window is fine for turn 1, but a long conversation can fill the window regardless of system prompt size. The system prompt is fixed cost; conversation history is variable cost.
Budget Guidelines by Application Type
Interactive chat and agents
Keep system prompts under 10 to 15% of the context window. This preserves space for conversation history and multi-step reasoning. A 1000 token system prompt is not inherently a problem on a 128k model, but combine it with a 50-turn conversation, verbose tool outputs, and a long code generation response, and you approach the limit.
Single-turn classification and extraction
System prompt length matters less here because conversation history does not accumulate. You can use longer prompts with many examples if they improve accuracy.
Tool-heavy agents
Tool definitions consume tokens. The OpenAI and Anthropic APIs accept tool schemas as part of the request; these count against the context window. A complex agent with 10 tools and detailed parameter descriptions can have 2000+ tokens in tool schemas alone before the system prompt.
Token Reduction Techniques
Remove redundant instructions
If your system prompt says “Be helpful, concise, and professional” and also “Respond in a helpful and professional manner with clear, concise answers,” that is the same instruction written twice. One version is enough.
Trim verbose phrasing
Compare:
Before: “It is of the utmost importance that you always remember to greet the customer by their first name whenever their name is available to you in the conversation.”
After: “Greet the customer by name when available.”
The shorter version conveys identical instructions at roughly a quarter of the token count.
Move examples out of the system prompt
Few-shot examples are effective but expensive. A system prompt with five examples at 200 tokens each adds 1000 tokens of fixed cost to every request. Alternatives:
- Use a retrieval system to fetch relevant examples at runtime based on the user’s query
- Put examples in the first user/assistant message pair rather than the system prompt
- Use a fine-tuned model where the examples have been baked into the weights
Use concise tool schemas
Tool descriptions should describe what the tool does and what its parameters mean, not write a paragraph about each one. Compare:
{
"description": "This tool allows you to look up information about a specific order by its unique identifier. You should use this tool whenever a customer asks about the status of their order, shipping information, or the items they purchased.",
"name": "order_lookup"
}
versus:
{
"description": "Returns order status, shipping info, and line items for a given order_id.",
"name": "order_lookup"
}
The second version is more useful to the model and uses fewer tokens.
Prompt Caching
If your system prompt is stable across requests, prompt caching eliminates most of the repeated input token cost.
Anthropic
Mark the end of your system prompt with a cache control breakpoint. Subsequent requests with the same prefix are served from cache at 10% of the normal input price for cache reads (cache write is 25% more expensive than normal input, amortized over subsequent reads).
system_prompt = [
{
"type": "text",
"text": "You are a helpful customer support agent...",
"cache_control": {"type": "ephemeral"}
}
]
Cache entries last 5 minutes with automatic refresh on use.
OpenAI
OpenAI applies prompt caching automatically for prompts over 1024 tokens. Cached token reads are billed at 50% of normal input price. You do not need to opt in, but you do need a stable prefix. Varying the system prompt between requests defeats the cache.
When caching matters
At $3 per million input tokens (GPT-4o pricing), a 1000 token system prompt costs $0.003 per request. At 1 million requests per day, that is $3000 per day, or $90,000 per month. With 50% caching, that drops to $1500 per day. The savings scale directly with request volume and system prompt length.