How much of the context window should a system prompt use?

For interactive chat applications, keep system prompts under 10 to 15% of the total context window. A 128k token context window with a 5000 token system prompt leaves 123k tokens for conversation history and responses. For single-turn classification or extraction tasks where conversation history does not accumulate, longer system prompts are less of a concern.

Do system prompt tokens count toward the cost of an API call?

Yes. All tokens in an API call (system, user, assistant, and tool result messages) count toward both input token billing and rate limits. The exception is prompt caching: if you cache a system prompt with Anthropic or OpenAI, repeated calls with the same cached prefix are billed at a lower rate (typically 10 to 25% of the normal input token price).

How can I reduce the token count of my system prompt?

Remove redundant instructions (if two rules say the same thing, keep one). Use shorter phrasing without losing meaning. Move few-shot examples out of the system prompt and into a separate message or a retrieval system. Replace lengthy descriptions of tools with concise parameter schemas. Avoid repeating information that the model already knows from training.

What is prompt caching and how does it help with system prompts?

Prompt caching lets you mark a prefix of your prompt (typically the system prompt) as cacheable. The provider stores the KV cache for that prefix and reuses it on subsequent calls where the same prefix appears. Anthropic charges 25% of normal input price for cache reads. OpenAI charges 50%. For high volume applications with a stable system prompt, this can cut input costs by 50 to 75% on the system prompt portion.

System Prompt Token Counter: Estimate Token Usage

System Prompt Token Counter

The system prompt in this example is approximately 180 tokens. That sounds small, but system prompts in production applications routinely run 500 to 3000 tokens once you add tool schemas, persona instructions, guardrails, and few-shot examples. Counting tokens before deployment helps you budget the context window accurately across GPT-4o, Claude, and Gemini.

Understanding Context Window Budget

Every LLM API call has a context window limit. The total tokens across all messages in the request must fit within that limit:

context_window = system_prompt + conversation_history + tool_definitions + response

Model	Context Window
GPT-4o	128k tokens
GPT-4o mini	128k tokens
Claude 3.5 Sonnet	200k tokens
Claude 3 Haiku	200k tokens
Gemini 1.5 Pro	1M tokens
Gemini 1.5 Flash	1M tokens

For chat applications, conversation history grows with each turn. A 500 token system prompt on a 128k window is fine for turn 1, but a long conversation can fill the window regardless of system prompt size. The system prompt is fixed cost; conversation history is variable cost.

Budget Guidelines by Application Type

Interactive chat and agents

Keep system prompts under 10 to 15% of the context window. This preserves space for conversation history and multi-step reasoning. A 1000 token system prompt is not inherently a problem on a 128k model, but combine it with a 50-turn conversation, verbose tool outputs, and a long code generation response, and you approach the limit.

Single-turn classification and extraction

System prompt length matters less here because conversation history does not accumulate. You can use longer prompts with many examples if they improve accuracy.

Tool-heavy agents

Tool definitions consume tokens. The OpenAI and Anthropic APIs accept tool schemas as part of the request; these count against the context window. A complex agent with 10 tools and detailed parameter descriptions can have 2000+ tokens in tool schemas alone before the system prompt.

Token Reduction Techniques

Remove redundant instructions

If your system prompt says “Be helpful, concise, and professional” and also “Respond in a helpful and professional manner with clear, concise answers,” that is the same instruction written twice. One version is enough.

Trim verbose phrasing

Compare:

Before: “It is of the utmost importance that you always remember to greet the customer by their first name whenever their name is available to you in the conversation.”

After: “Greet the customer by name when available.”

The shorter version conveys identical instructions at roughly a quarter of the token count.

Move examples out of the system prompt

Few-shot examples are effective but expensive. A system prompt with five examples at 200 tokens each adds 1000 tokens of fixed cost to every request. Alternatives:

Use a retrieval system to fetch relevant examples at runtime based on the user’s query
Put examples in the first user/assistant message pair rather than the system prompt
Use a fine-tuned model where the examples have been baked into the weights

Use concise tool schemas

Tool descriptions should describe what the tool does and what its parameters mean, not write a paragraph about each one. Compare:

{
  "description": "This tool allows you to look up information about a specific order by its unique identifier. You should use this tool whenever a customer asks about the status of their order, shipping information, or the items they purchased.",
  "name": "order_lookup"
}

versus:

{
  "description": "Returns order status, shipping info, and line items for a given order_id.",
  "name": "order_lookup"
}

The second version is more useful to the model and uses fewer tokens.

Prompt Caching

If your system prompt is stable across requests, prompt caching eliminates most of the repeated input token cost.

Anthropic

Mark the end of your system prompt with a cache control breakpoint. Subsequent requests with the same prefix are served from cache at 10% of the normal input price for cache reads (cache write is 25% more expensive than normal input, amortized over subsequent reads).

system_prompt = [
  {
    "type": "text",
    "text": "You are a helpful customer support agent...",
    "cache_control": {"type": "ephemeral"}
  }
]

Cache entries last 5 minutes with automatic refresh on use.

OpenAI

OpenAI applies prompt caching automatically for prompts over 1024 tokens. Cached token reads are billed at 50% of normal input price. You do not need to opt in, but you do need a stable prefix. Varying the system prompt between requests defeats the cache.

When caching matters

At $3 per million input tokens (GPT-4o pricing), a 1000 token system prompt costs $0.003 per request. At 1 million requests per day, that is $3000 per day, or $90,000 per month. With 50% caching, that drops to $1500 per day. The savings scale directly with request volume and system prompt length.