The Sample Prompt
The prompt loaded in the tool above is a system prompt for an LLM-powered code reviewer. It is 110 to 120 tokens, depending on the tokenizer. That may not sound like much, but this prompt is sent with every API call. At 10,000 calls per day, the system prompt alone accounts for over 1 million input tokens per day (10,000 × 110 = 1.1 million).
Understanding token count at the prompt design stage prevents cost surprises in production.
What Is a Token?
Tokenization is the process of splitting text into discrete units that the model processes. LLMs do not operate on characters or words. They operate on tokens. The mapping from text to tokens is defined by a vocabulary table built during the model’s training process.
Most modern LLMs use Byte Pair Encoding (BPE). BPE starts by treating every byte as its own token, then iteratively merges the most frequent adjacent pairs. After enough merges, common English words become single tokens, while rare words are split into subword units.
For example, with GPT-4’s cl100k_base tokenizer:
"PostgreSQL" → ["Post", "gre", "SQL"] (3 tokens)
"React" → ["React"] (1 token)
"TypeScript" → ["Type", "Script"] (2 tokens)
"the" → ["the"] (1 token)
Code tends to tokenize less efficiently than prose because it contains identifiers, symbols, and syntax that appear less frequently in training data. A 100-line code snippet might use more tokens than you expect from its character count.
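The BPE merge loop described earlier in this section can be sketched in a few lines. This is a toy, not how any production tokenizer is trained: real tokenizers learn merges over a large corpus, usually after a pre-tokenization step, and the function name `train_bpe` and the sample sentence are purely illustrative.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> tuple[list[bytes], list[tuple[bytes, bytes]]]:
    """Toy byte-level BPE. Returns the final token sequence and the learned merges."""
    # Start with every byte as its own token.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges: list[tuple[bytes, bytes]] = []

    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # No pair repeats, so another merge would not compress anything.
        merges.append((left, right))

        # Replace each occurrence of the most frequent pair with one merged token.
        merged: list[bytes] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return tokens, merges


sample = "the cat and the hat and the bat"
tokens, merges = train_bpe(sample, num_merges=10)
print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Frequent sequences such as "the " collapse into single tokens after a few merges, while rarer sequences stay split into smaller pieces, which is the same effect the examples above show at vocabulary scale.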
Why Token Count Matters
API cost
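LLM APIs are priced per token, with separate rates for input and output. Representative prices as of 2026 (prices change often, so check each provider’s pricing page before budgeting):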
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
A 200-token system prompt costs $0.0005 per call with GPT-4o. At 10,000 calls/day, that is $5/day or $150/month, just from the system prompt. If you trim it to 100 tokens, you save $75/month with no change in functionality.
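The arithmetic above is easy to reproduce for your own numbers. The helper below is a minimal sketch; `monthly_prompt_cost` is a hypothetical name, and the 30-day month is an assumption chosen to match the $150/month figure.

```python
def monthly_prompt_cost(
    prompt_tokens: int,
    calls_per_day: int,
    usd_per_million_input: float,
    days: int = 30,  # assumption: a 30-day month
) -> float:
    """Input cost in USD attributable to a fixed prompt that is sent on every call."""
    total_tokens = prompt_tokens * calls_per_day * days
    return total_tokens * usd_per_million_input / 1_000_000


before = monthly_prompt_cost(200, 10_000, 2.50)  # 200-token prompt at the GPT-4o input rate
after = monthly_prompt_cost(100, 10_000, 2.50)   # same prompt trimmed to 100 tokens
print(f"before: ${before:.2f}/month, after: ${after:.2f}/month, saved: ${before - after:.2f}")
```

Swapping in a different per-million rate from the table shows how much the same prompt costs on another model.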
Context window limits
The context window caps the total tokens (input + output) per request. If your prompt template, conversation history, and injected documents together approach the limit, the request may fail with an error, or your application may have to truncate history to make it fit. Measuring individual components lets you design prompts that leave headroom for output.
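One way to make the headroom check explicit is to subtract the measured components from the window before each request. This is a sketch under simple assumptions: `output_headroom` is a hypothetical helper, and the token counts passed in are example values that you would obtain from your tokenizer.

```python
def output_headroom(context_window: int, *component_tokens: int) -> int:
    """Tokens left for the model's output after all input components are counted."""
    used = sum(component_tokens)
    remaining = context_window - used
    if remaining <= 0:
        raise ValueError(
            f"input uses {used} tokens, which leaves no room in a "
            f"{context_window}-token context window"
        )
    return remaining


# Example values: system prompt, conversation history, injected documents.
print(output_headroom(128_000, 200, 12_000, 40_000))  # 75800
```

Failing early in your own code is usually easier to debug than an error returned by the API after the request has been sent.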
Response quality
Very long prompts do not always produce better results. Studies on “lost in the middle” effects show that models often attend less to information buried in the middle of a long context. Concise, well-structured prompts tend to produce more reliable responses than verbose ones.
How Different Tokenizers Handle the Same Text
Run the same prompt through multiple tokenizers and you will get different counts. This matters when migrating between providers or comparing costs:
# OpenAI: tiktoken library
import tiktoken
# GPT-4, GPT-4-turbo, GPT-3.5-turbo
enc_cl100k = tiktoken.get_encoding("cl100k_base")
# GPT-4o, GPT-4o-mini
enc_o200k = tiktoken.get_encoding("o200k_base")
text = "You are a senior software engineer reviewing code..."
print(len(enc_cl100k.encode(text))) # e.g., 117
print(len(enc_o200k.encode(text))) # e.g., 108 (o200k tends to be more efficient)
# Anthropic: anthropic library (token counting requires a recent SDK version)
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": text}],
)
print(response.input_tokens)
Anthropic does not publish the tokenizer for current Claude models as a standalone library. Use the count_tokens API endpoint instead, which requires an API key and a network request.
Practical Techniques for Reducing Token Count
Trim whitespace and redundancy
Repeated instructions, verbose role descriptions, and multiple examples of the same pattern all cost tokens. Be direct.
Before:
You are an expert, experienced senior software engineer with many years of experience
reviewing production code. You have deep expertise in best practices, security, and
performance. Please carefully review the following code and provide detailed feedback.
After:
Review the following code for quality, security, and performance issues.
The second version carries the same core instruction in well under half the tokens. Measure both versions with your target tokenizer to get the exact ratio.
Use structured formats for data
If you are injecting structured data, compact JSON is usually more token-efficient than prose descriptions:
# Prose (more tokens):
"The user's name is Alice, they are 30 years old, and their account status is active."
# JSON (fewer tokens):
{"name":"Alice","age":30,"status":"active"}
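If the data starts out as a Python object, the compact form shown above can be produced with the standard library. A small sketch: by default `json.dumps` puts a space after each comma and colon, and passing `separators=(",", ":")` removes them. Fewer characters usually means fewer tokens, but the exact saving depends on the tokenizer, so measure it rather than assume.

```python
import json

record = {"name": "Alice", "age": 30, "status": "active"}

default = json.dumps(record)                         # spaces after "," and ":"
compact = json.dumps(record, separators=(",", ":"))  # no optional whitespace

print(default)  # {"name": "Alice", "age": 30, "status": "active"}
print(compact)  # {"name":"Alice","age":30,"status":"active"}
print(len(default), "->", len(compact), "characters")
```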
Prefer system prompts for static instructions
Some APIs (and some caching implementations) treat the system prompt differently from the user turn. Keep static instructions in the system prompt where they may be cached.
Cache repeated context
OpenAI’s prompt caching (for GPT-4o) and Anthropic’s prompt caching (for Claude) reduce the cost of repeated prefixes. If your system prompt and any injected documents are always the same, they may be served from cache at a fraction of the full cost. Prompt caching typically requires the cached prefix to be at least 1,024 tokens, so a system prompt of a few hundred tokens will not qualify on its own. It benefits from caching only when it is part of a longer static prefix.
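To get a rough feel for what caching is worth, you can estimate the input cost of a call whose static prefix is a cache hit. Everything here is an assumption for illustration: `cached_input_cost` is a hypothetical helper, `cached_price_ratio` is a placeholder for whatever discount your provider actually applies, and the sketch ignores any extra charge for writing to the cache.

```python
def cached_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    usd_per_million_input: float,
    cached_price_ratio: float,        # placeholder: cached price / full price
    min_cacheable_tokens: int = 1024,  # minimum prefix length mentioned above
) -> float:
    """Estimated input cost (USD) of one call, assuming the prefix is a cache hit."""
    usd_per_token = usd_per_million_input / 1_000_000
    if prefix_tokens < min_cacheable_tokens:
        # Prefix is too short to be cached, so every token is billed at the full rate.
        return (prefix_tokens + suffix_tokens) * usd_per_token
    cached_part = prefix_tokens * usd_per_token * cached_price_ratio
    uncached_part = suffix_tokens * usd_per_token
    return cached_part + uncached_part


# A 2,000-token static prefix plus 500 tokens of per-request input, with a made-up 50% ratio.
print(cached_input_cost(2_000, 500, 2.50, cached_price_ratio=0.5))
# A 200-token prefix is below the minimum and gets no discount.
print(cached_input_cost(200, 500, 2.50, cached_price_ratio=0.5))
```

Caches match on prefixes, so the static content has to come first in the request and the per-request content last.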
The Context Window as a Budget
Think of the context window as a fixed budget that all tokens draw from:
[system prompt] + [conversation history] + [injected documents] + [output] ≤ context window
For a 128,000-token context window:
| Component | Typical allocation |
|---|---|
| System prompt | 100 to 500 tokens |
| Conversation history (last N turns) | 2,000 to 20,000 tokens |
| Injected documents / RAG context | 5,000 to 50,000 tokens |
| Reserved for output | 1,000 to 4,000 tokens |
When context is nearly full, you must choose what to drop. Common strategies:
- Sliding window: Drop the oldest conversation turns first.
- Summarization: Periodically compress older turns into a summary.
- RAG truncation: Rank retrieved documents by relevance and drop the lowest ranked ones if over budget.
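Two of the strategies above, the sliding window and RAG truncation, can be sketched as small functions that take a token budget. This is a minimal illustration and not a reference implementation: the function names are hypothetical, and `word_count` is a crude stand-in for a real tokenizer so the example runs without dependencies. Summarization is omitted because it needs a model call.

```python
from typing import Callable


def sliding_window(turns: list[str], budget: int, count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the most recent turns that fit in `budget`, dropping the oldest first."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def truncate_by_relevance(
    docs: list[tuple[float, str]], budget: int, count_tokens: Callable[[str], int]
) -> list[str]:
    """Keep the highest-scoring (score, text) documents that fit in `budget`."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(docs, key=lambda d: d[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            break  # everything ranked lower is dropped as well
        kept.append(text)
        used += cost
    return kept


def word_count(text: str) -> int:
    """Stand-in counter for the example only. Use your model's tokenizer in practice."""
    return len(text.split())


turns = ["a b c", "d e", "f g h i", "j"]
print(sliding_window(turns, budget=6, count_tokens=word_count))  # ['f g h i', 'j']

docs = [(0.2, "low relevance text"), (0.9, "top hit"), (0.5, "middle one here")]
print(truncate_by_relevance(docs, budget=5, count_tokens=word_count))  # ['top hit', 'middle one here']
```

Passing the counter in as a parameter keeps the trimming logic independent of any one provider’s tokenizer.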
Knowing the token count of each component is what makes these tradeoffs concrete instead of approximate.