Is "1 token = 4 characters" a reliable rule of thumb?

Only for plain English prose. The 4-character approximation breaks down significantly for source code (especially whitespace-heavy languages like Python), scientific notation, URLs, non-Latin scripts, and emoji. For accurate token counts use a tokenizer library (tiktoken for OpenAI, claude-tokenizer or the count_tokens API endpoint for Anthropic) rather than character-based estimation.

Do Claude and GPT-4o tokenize the same text the same way?

No. They use different tokenizer vocabularies. For typical English prose, Claude tends to use approximately 10–20% fewer tokens than GPT-4o for the same text. For source code and non-English languages the difference varies and can go in either direction. Never assume token counts are interchangeable across providers — always measure with the target model's tokenizer.

Why are output tokens more expensive than input tokens?

Generating each output token requires a full autoregressive forward pass through the model — the model runs once per output token, sequentially. Processing input tokens uses parallelized attention across the full context, which is computationally cheaper per token. The pricing asymmetry (typically 3–5× more for output) reflects this computational difference directly.

Does the count_tokens API include system prompt tokens in its count?

It depends on how you call it. If you pass only the user message to count_tokens, you get only the user message token count. The system prompt, conversation history, injected tool schemas, and any internal formatting tokens the provider adds are not included unless you pass them all explicitly in the same call. Your actual billed token count is always higher than a partial count_tokens call.

5 Token Counting Myths That Cost Engineering Teams Real Money

Last updated: 2026-05-21 · 9 min read

Part of the toksum.dev Guides series.

Inaccurate token counting is one of the most common sources of LLM cost overruns in production. Unlike compute costs that scale predictably with usage, token cost errors tend to compound: a bad estimate baked into a budget model is wrong every month, getting worse as volume grows. Most of these errors trace back to a small set of persistent myths about how tokenization works. Here are the five most expensive ones, with the evidence to debunk each.

Myth 1: "1 token = 4 characters" is universally true

The 4-characters-per-token rule of thumb comes from OpenAI's early documentation and was accurate for average English prose with the GPT-3 tokenizer. It has been repeated so often that many engineers treat it as a fundamental property of LLM tokenization. It is not. It is an average for one tokenizer applied to one type of text, and it breaks down in ways that matter for cost estimation.

For source code, the ratio is frequently 1:1 or even less than 1 character per token, because programming keywords, operators, and indentation whitespace are often split into multiple tokens. A Python function with heavy indentation and short variable names can tokenize at 2–3 characters per token, meaning your cost estimate using the 4-char rule is off by 30–50% in the wrong direction.

For non-English text, particularly languages that use non-Latin scripts — Chinese, Japanese, Korean, Arabic, Hindi, and many others — a single printed character can correspond to multiple tokens. A single Chinese character might be 1–3 tokens depending on the tokenizer vocabulary. Japanese text with kanji, hiragana, and katakana mixed together tokenizes very differently from English of equivalent visual length. For these languages, the 4-char rule can underestimate token counts by 2× or more, which directly translates to a 2× cost estimation error for international products.

For scientific and mathematical notation, including LaTeX, formulas, and numeric sequences with special characters, tokenization is highly variable and often unfavorable. A chemical formula or a complex math expression in LaTeX syntax may tokenize extremely inefficiently, with each symbol or subscript occupying a separate token.

For URLs and code identifiers, long camelCase or snake_case identifiers (like processUserAuthenticationRequest) are often split at word boundaries within the identifier, each sub-word becoming a separate token. A URL like https://api.example.com/v1/users/settings can be 15–20 tokens for 43 characters — less than 3 characters per token.

The fix: use the actual tokenizer for your target model. For OpenAI models, the tiktoken Python library is the reference implementation. For Anthropic models, use the count_tokens API endpoint or the official tokenizer library. For quick estimates without running code, use the toksum.dev token counter, which applies the correct tokenizer for each model family.

Myth 2: "Claude tokens equal GPT tokens" for the same text

Different LLM providers train different tokenizers on different corpora, producing different vocabularies. A token in GPT-4o's cl100k_base vocabulary is not the same unit as a token in Claude's tokenizer. For a given piece of text, the two models will produce different token counts, and the difference is not trivial.

For typical English prose and documentation, Claude generally tokenizes more efficiently than GPT-4o — meaning Claude uses fewer tokens to represent the same text. Empirical measurements across a variety of text types suggest Claude uses roughly 10–20% fewer tokens than GPT-4o for the same English prose. This has a counterintuitive implication: even if Claude's per-token rate is higher than GPT-4o's on paper, the effective per-word or per-character cost may be lower because fewer tokens are consumed.

The gap narrows or reverses for certain content types. For source code in verbose languages like Java or C#, the difference shrinks. For compact languages like C or Rust, Claude may be slightly less efficient. For non-English text, the comparison depends heavily on the specific language and script. The only way to know the true comparison for your specific content is to count tokens with both tokenizers on a representative sample of your actual data.

The practical consequence of this myth is systematic mispricing when teams migrate between providers. A team that builds a cost model for GPT-4o and then migrates to Claude without recounting tokens may find their actual Claude bills are lower than expected (because Claude uses fewer tokens) or they may set context window limits too conservatively (because they think the content is longer in tokens than it actually is). Either way, the assumption that tokens are interchangeable across providers is wrong and should be replaced with a measurement step at migration time. See the full migration guide for a checklist.

Myth 3: "Output tokens are cheaper than input tokens"

This myth is not just wrong — it is backwards. Output tokens are almost universally more expensive than input tokens, typically by a factor of 3 to 5 times. The origin of this misconception is likely a confusion between volume (output requests tend to have fewer tokens than input contexts) and price per token (output is more expensive regardless of volume).

At current pricing: GPT-4o charges $2.50/1M for input and $10.00/1M for output — a 4× premium for output. Claude 3.5 Sonnet charges $3.00/1M for input and $15.00/1M for output — a 5× premium. Gemini 1.5 Pro charges comparably asymmetric rates. There is no major production LLM where output tokens are cheaper than input tokens at standard rates. The computational reason is clear: generating each output token requires a full sequential forward pass through the model, while input tokens are processed in parallel, making output generation fundamentally more expensive per token.

This matters enormously for cost modeling and for product design. If your product generates long responses — detailed explanations, full code files, comprehensive reports — output tokens will dominate your bill even if your input contexts are large. The correct mental model is: input cost is driven by context length (how much you send), output cost is driven by response verbosity (how much the model generates). Both are controllable. Capping max output tokens is one of the most direct cost levers available, but it must be weighed against quality tradeoffs for your specific use case.

For workloads where output verbosity is the primary cost driver, consider model tiers: a smaller, cheaper model for initial drafting with a larger model for final polish; structured output formats that constrain response length; or streaming with early termination when the key information has been delivered. The token counter can help you measure the input/output ratio in your actual request logs to identify which side of the bill to optimize.

Myth 4: "Word count is a reliable token estimate"

Word count is a useful upper-bound heuristic for plain English text, where the relationship between words and tokens is reasonably stable (roughly 0.75 words per token, or 1.33 tokens per word for English prose). However, word count breaks down as a token estimator for any content type that is not plain English prose, and it can produce wildly inaccurate estimates for the content types most common in LLM API workloads.

Consider JSON data, which is one of the most common inputs to LLM APIs (structured data extraction, API response summarization, database record classification). JSON contains many short, repeated structural characters: braces, brackets, colons, commas, quotes. These characters do not map cleanly to "words" in any useful sense, but they absolutely count as tokens. A 1,000-character JSON object might have a nominal "word count" of 50–100 (counting string values as words) but a token count of 200–350 because every structural character and short key name is tokenized separately.

For code, word count is even less reliable. A 100-line Python file may have a "word count" of 200–400 if you split on whitespace, but a token count of 600–1,200 because operators, punctuation, string delimiters, and indentation spaces all tokenize individually. Budgeting for code processing based on word count will systematically underestimate costs by 2–4×.

For structured prompts with XML or markdown formatting, the formatting tags and punctuation add tokens that are not counted in a word-count estimate. A heavily formatted prompt with XML tags, numbered lists, headers, and code blocks has 20–40% more tokens than its "word count" suggests.

The fix is straightforward: use an actual tokenizer on a representative sample of your content type before committing to a cost model. Collect 50–100 real examples from your production data (or realistic synthetic data), run them through the appropriate tokenizer, compute the actual mean tokens per document, and use that empirical ratio for budgeting. Do this once and revisit whenever your content format or prompt structure changes significantly.

Myth 5: "My provider's count_tokens API is what I'll be billed"

The count_tokens API endpoint (available from both OpenAI and Anthropic) counts tokens in the payload you pass to it. What it does not count is everything else that gets added to your request server-side before the model sees it. This gap between what you count and what you are billed is a consistent source of budget surprises.

System prompt tokens. If you make a count_tokens call on just the user message, you are not counting the system prompt. The system prompt is included in the billed token count on every request. Pass the complete request payload — system prompt, full conversation history, and user message — to count_tokens to get an accurate pre-request estimate.

Tool/function definitions. When you use tool use or function calling, the complete JSON schema definitions of all tools in your tools array are included in the input token count. A complex tool definition with detailed parameter descriptions, enum values, and nested schemas can add 300–800 tokens to every request. These are billed as input tokens even though you never "wrote" them as part of your message — they come from your application code.

Special tokens and formatting. Both OpenAI and Anthropic add formatting tokens around message boundaries in the conversation — role markers, separator tokens, and structural delimiters. These are typically a small number (5–20 tokens per message turn) but accumulate in long conversations. In a 20-turn conversation, the formatting overhead alone can add 100–400 tokens to the billed count versus the raw text token count.

The usage field is authoritative. The only token count you should trust for billing reconciliation is the usage object in the actual API response — not a pre-request count_tokens call, not a character or word count, not a tokenizer library run on just the text. Log the usage.input_tokens and usage.output_tokens fields from every API response into your observability system. This is the ground truth for cost accounting and the source you should use when reconciling against provider invoices.

Practically, the gap between a naive count_tokens call on the user message and the actual billed token count is typically 15–40% depending on system prompt size, tool schema complexity, and conversation length. For cost planning purposes, measure this gap empirically on your production request mix and apply a conservative multiplier. An accurate token count also helps when comparing model costs — use the toksum.dev token counter to see provider-specific counts for your actual prompt text before committing to a model selection.

Frequently asked questions

Is "1 token = 4 characters" a reliable rule of thumb?: Only for plain English prose. The 4-character approximation breaks down significantly for source code (especially whitespace-heavy languages like Python), scientific notation, URLs, non-Latin scripts, and emoji. For accurate token counts use a tokenizer library (tiktoken for OpenAI, claude-tokenizer or the count_tokens API endpoint for Anthropic) rather than character-based estimation.
Do Claude and GPT-4o tokenize the same text the same way?: No. They use different tokenizer vocabularies. For typical English prose, Claude tends to use approximately 10–20% fewer tokens than GPT-4o for the same text. For source code and non-English languages the difference varies and can go in either direction. Never assume token counts are interchangeable across providers — always measure with the target model's tokenizer.
Why are output tokens more expensive than input tokens?: Generating each output token requires a full autoregressive forward pass through the model — the model runs once per output token, sequentially. Processing input tokens uses parallelized attention across the full context, which is computationally cheaper per token. The pricing asymmetry (typically 3–5× more for output) reflects this computational difference directly.
Does the count_tokens API include system prompt tokens in its count?: It depends on how you call it. If you pass only the user message to count_tokens, you get only the user message token count. The system prompt, conversation history, injected tool schemas, and any internal formatting tokens the provider adds are not included unless you pass them all explicitly in the same call. Your actual billed token count is always higher than a partial count_tokens call.

Tool

5 Token Counting Myths That Cost Engineering Teams Real Money

Myth 1: "1 token = 4 characters" is universally true

Myth 2: "Claude tokens equal GPT tokens" for the same text

Myth 3: "Output tokens are cheaper than input tokens"

Myth 4: "Word count is a reliable token estimate"

Myth 5: "My provider's count_tokens API is what I'll be billed"

Frequently asked questions

Related

AI Token Counter

How to Read LLM Pricing Pages

GPT-4o Mini vs Claude 3.5 Haiku