Tokenization Visualizer

See how text becomes tokens across languages and scripts

📊 Same Meaning, Different Token Counts

The same semantic content can require very different numbers of tokens depending on the language and script. This affects how much text fits in a model's context window and how much inference costs.

🇺🇸 English: 6 tokens
"Hello, how are you?"
Tokens: Hello , how are you ?

🇯🇵 Japanese: 11 tokens
"こんにちは、お元気ですか?"
Tokens (sample): こん です

🇹🇭 Thai: 18 tokens
"สวัสดี สบายดีไหม?"
Tokens: ##วัส ##ดี [space] ##บา ##ย ##ดี ##ไหม ? ...

🇸🇦 Arabic: 14 tokens
"مرحبا، كيف حالك؟"
Tokens: مر ##حب ##ا ، كي ##ف حال ##ك ؟ ...
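
A rough way to explore this kind of disparity without committing to a particular tokenizer is to compare code points and UTF-8 bytes for each greeting. The sketch below assumes a byte-level tokenizer, for which the UTF-8 byte count is a worst-case bound on the token count. It does not reproduce the token counts shown above, which depend on the specific vocabulary. `SAMPLES` and `size_profile` are illustrative names, not part of any library.

```python
# Rough size profile of the four greetings.
# Assumption: with a byte-level tokenizer, each UTF-8 byte becomes at most
# one token, so the byte count is an upper bound on the token count.

SAMPLES = {
    "English": "Hello, how are you?",
    "Japanese": "こんにちは、お元気ですか?",
    "Thai": "สวัสดี สบายดีไหม?",
    "Arabic": "مرحبا، كيف حالك؟",
}


def size_profile(text: str) -> dict:
    """Return code-point count, UTF-8 byte count, and bytes per code point."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {"chars": n_chars, "bytes": n_bytes, "bytes_per_char": n_bytes / n_chars}


for language, text in SAMPLES.items():
    p = size_profile(text)
    print(f"{language:9s} chars={p['chars']:3d} bytes={p['bytes']:3d} "
          f"bytes/char={p['bytes_per_char']:.2f}")
```

English stays at one byte per character, while the other three scripts need two or three bytes for most characters, which is one reason their token counts tend to be higher.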

⚠️ Why Tokenization Matters for Multilingual Models

Token Budget Inequality

If Thai needs 3x as many tokens as English for the same content, Thai users reach the context limit with one third as much content. This is a fairness issue.
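
The arithmetic can be made concrete with a small sketch. It uses the 6-token English and 18-token Thai greetings from this page as the ratio, and a context limit of 4096 tokens that is purely a hypothetical value chosen for illustration. `effective_context` is an illustrative helper name.

```python
def effective_context(context_limit: int, tokens_needed: float,
                      baseline_tokens_needed: float) -> float:
    """Express a context window in baseline-equivalent tokens.

    If a language needs `tokens_needed` tokens where the baseline language
    needs `baseline_tokens_needed`, the same window holds proportionally
    less content.
    """
    ratio = tokens_needed / baseline_tokens_needed
    return context_limit / ratio


ENGLISH_TOKENS = 6    # greeting token count from this page
THAI_TOKENS = 18      # greeting token count from this page
CONTEXT_LIMIT = 4096  # hypothetical context window, in tokens

thai_equivalent = effective_context(CONTEXT_LIMIT, THAI_TOKENS, ENGLISH_TOKENS)
print(f"A {CONTEXT_LIMIT}-token window holds about {thai_equivalent:.0f} "
      f"English-equivalent tokens of Thai content.")
```

A single greeting is a very small sample, so the 3x ratio should be read as an illustration rather than a measured average.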

Cost Disparity

API pricing is typically per token, so the same query costs more in languages that the tokenizer encodes less efficiently. Users writing in Chinese, Japanese, Korean, or Thai often pay more for equivalent content.
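
As a minimal sketch of the cost effect, the snippet below prices the four greetings from this page using their listed token counts. The price of 0.002 per 1,000 tokens is a made-up placeholder, not any provider's actual rate, and `query_cost` is an illustrative helper.

```python
def query_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request billed linearly per token."""
    return n_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.002  # hypothetical price per 1,000 tokens

# Token counts for the same greeting, as listed on this page.
TOKEN_COUNTS = {"English": 6, "Japanese": 11, "Arabic": 14, "Thai": 18}

baseline = query_cost(TOKEN_COUNTS["English"], PRICE_PER_1K)
relative_cost = {
    language: query_cost(n, PRICE_PER_1K) / baseline
    for language, n in TOKEN_COUNTS.items()
}

for language, factor in relative_cost.items():
    print(f"{language:9s} costs {factor:.2f}x the English query")
```

Because billing is linear in tokens, the relative cost is independent of the actual price and equals the ratio of token counts.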

[UNK] Failures

In tokenizers without a byte-level fallback, characters missing from the vocabulary become [UNK] tokens and lose all semantic information. This is common with rare scripts and emoji.
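
The difference between an [UNK] lookup and a byte-level fallback can be sketched with a toy vocabulary. Everything here is hypothetical: the three-entry `VOCAB`, the helper names, and the `<0xNN>` spelling of byte tokens, which is just one common way of writing them.

```python
UNK = "[UNK]"
VOCAB = {"hello", "world", "!"}  # toy vocabulary, for illustration only


def lookup_with_unk(pieces: list[str]) -> list[str]:
    """Replace every out-of-vocabulary piece with a single [UNK] token."""
    return [p if p in VOCAB else UNK for p in pieces]


def lookup_with_byte_fallback(pieces: list[str]) -> list[str]:
    """Replace every out-of-vocabulary piece with one token per UTF-8 byte."""
    out = []
    for p in pieces:
        if p in VOCAB:
            out.append(p)
        else:
            out.extend(f"<0x{b:02X}>" for b in p.encode("utf-8"))
    return out


pieces = ["hello", "\U0001F680", "!"]  # "\U0001F680" is the rocket emoji
print(lookup_with_unk(pieces))
print(lookup_with_byte_fallback(pieces))
```

With [UNK], the emoji collapses to one token that carries no information about which character it was. With the byte fallback, it costs four tokens but remains recoverable, which trades token count for losslessness.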

Subword Splits

When a word splits into many subwords, the model has to reconstruct its meaning from fragments instead of reading it from a single learned token. This hurts accuracy on low-resource languages.
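
To see how such splits arise, here is a minimal sketch of greedy longest-match-first subword splitting in the WordPiece style, where `##` marks a piece that continues a word. The vocabulary is a toy one invented for this example, and `wordpiece` is an illustrative function, not a call into any tokenizer library.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split one word by repeatedly taking the longest matching vocab piece.

    Pieces after the first are looked up with a '##' prefix. If no piece
    matches at some position, the whole word becomes a single unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
VOCAB = {"the", "token", "##ization", "un", "##believ", "##able"}

for w in ["the", "tokenization", "unbelievable", "xyz"]:
    print(w, "->", wordpiece(w, VOCAB))
```

Words that the vocabulary covers well stay whole, while rarer words break into several pieces. The number of pieces per word is a simple measure of how well a vocabulary fits a language.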

🧪 Tokenization Stress Tests

Each test case below reveals a common tokenization failure mode.

Emoji Overload: High Token Count
🔥💯👏🏽 This is fire!!! 🚀🚀🚀
Emoji often become multiple tokens or [UNK]. Skin tone modifiers are separate code points, so 👏🏽 is already two code points before tokenization begins.
Code-Switching + Emoji: Error Prone
I was like omggg 😭😭 pero también un poco feliz you know
Mixing English, Spanish, internet slang, and emoji. A perfect storm for tokenizers.
Numbers & Special Formats: Format Sensitive
The meeting is at 3:30pm on 12/25/2024 @ conference room #42-B
Times, dates, and alphanumeric codes often split unpredictably.
Long English Words: Subword Heavy
pneumonoultramicroscopicsilicovolcanoconiosis antidisestablishmentarianism
Even English has words that fragment into many subwords. This mostly affects rare medical and technical terms.
Thai (No Spaces): Script Challenge
กรุงเทพมหานคร อมรรัตนโกสินทร์ มหินทรายุธยา
Thai does not put spaces between words; spaces mark phrase or sentence breaks instead. Tokenizers must infer where words start and end.
URL: Structure Lost
https://example.com/path?query=value&foo=bar#section
URLs often become many fragments, losing their semantic structure as a "link".