See how text becomes tokens across languages and scripts
The same semantic content can require very different numbers of tokens depending on the language and script. This affects how much content fits in a model's context window and how much inference costs.
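One rough way to see where the gap starts is to count UTF-8 bytes. The sketch below assumes a byte-level tokenizer, for which the byte count is an upper bound on the token count before any merges are applied; real tokenizers will usually produce fewer tokens than this. The helper name `utf8_stats` and the sample greetings are illustrative choices, not part of any library.

```python
def utf8_stats(text: str) -> tuple[int, int, float]:
    """Return (code points, UTF-8 bytes, bytes per code point) for non-empty text."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars


# Illustrative greetings in three scripts (hypothetical sample data).
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Thai": "สวัสดี",
}

for name, text in samples.items():
    chars, nbytes, ratio = utf8_stats(text)
    print(f"{name}: {chars} code points, {nbytes} bytes, {ratio:.1f} bytes per code point")
```

Scripts whose characters need three bytes each start with a larger budget to compress, so they tend to end up with more tokens unless the vocabulary contains many merges for that script.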
If Thai needs 3x as many tokens as English for the same content, Thai users reach the context limit with one third as much content. This is a fairness issue.
API pricing is typically per token, so the same query costs more in languages with lower tokenizer efficiency. Users writing in CJK scripts or Thai often pay more for equivalent requests.
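The context-limit and pricing effects are simple arithmetic, sketched below. All numbers are hypothetical: the 3x ratio is the illustrative figure used above, and the context window size and per-1,000-token price are made-up values, not any provider's actual limits or rates.

```python
def effective_capacity(context_window: int, token_ratio: float) -> float:
    """Content that fits in the window, measured in English-equivalent tokens."""
    return context_window / token_ratio


def query_cost(english_tokens: int, token_ratio: float, price_per_1k: float) -> float:
    """Cost of a query that would take `english_tokens` tokens in English."""
    return english_tokens * token_ratio / 1000 * price_per_1k


CONTEXT_WINDOW = 8192   # hypothetical context size, in tokens
PRICE_PER_1K = 0.01     # hypothetical price per 1,000 tokens
RATIOS = {"English": 1.0, "Thai": 3.0}  # tokens relative to English (illustrative)

for language, ratio in RATIOS.items():
    capacity = effective_capacity(CONTEXT_WINDOW, ratio)
    cost = query_cost(1000, ratio, PRICE_PER_1K)
    print(f"{language}: fits {capacity:.0f} English-equivalent tokens, "
          f"1,000-token query costs {cost:.3f}")
```

With a fixed window and a fixed per-token price, both effects scale linearly with the token ratio: a language at 3x gets one third of the usable context and pays three times as much per query.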
In tokenizers without a byte-level fallback, characters not in the vocabulary become [UNK] tokens, losing all semantic information. This is common with rare scripts and emoji.
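A minimal sketch of the difference, using a hypothetical character-level vocabulary. The two encoder functions are toy illustrations, not any real tokenizer's API; the `<0xNN>` spelling for byte tokens follows a convention used by some tokenizers with byte fallback and is an assumption here.

```python
UNK = "[UNK]"
TOY_VOCAB = {"h", "e", "l", "o", " "}  # hypothetical tiny vocabulary


def encode_with_unk(text: str, vocab: set[str]) -> list[str]:
    """Map every out-of-vocabulary character to a single [UNK] token."""
    return [ch if ch in vocab else UNK for ch in text]


def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Replace out-of-vocabulary characters with one token per UTF-8 byte."""
    tokens: list[str] = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


text = "hello \U0001F642"  # ends with the slightly-smiling-face emoji
print(encode_with_unk(text, TOY_VOCAB))
print(encode_with_byte_fallback(text, TOY_VOCAB))
```

The [UNK] version cannot be decoded back to the original text, while the byte-fallback version can, at the price of four tokens for a single emoji.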
When words split into many subwords, the model sees fragments that carry little meaning on their own. This can hurt accuracy on low-resource languages.
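Fragmentation can be made concrete with a toy greedy longest-match tokenizer in the WordPiece style, where continuation pieces carry a `##` prefix. The vocabulary below is hypothetical: it contains whole-word pieces for an English word but only single characters for the Swahili word "kitabu" ("book"), mimicking a vocabulary trained mostly on English. The average number of subwords per word is sometimes called fertility.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def fertility(words: list[str], vocab: set[str]) -> float:
    """Average number of subword pieces per word."""
    return sum(len(wordpiece(w, vocab)) for w in words) / len(words)


# Hypothetical vocabulary, skewed toward English.
SUBWORD_VOCAB = {"token", "##ization", "k", "##i", "##t", "##a", "##b", "##u"}

for word in ["tokenization", "kitabu"]:
    print(word, wordpiece(word, SUBWORD_VOCAB))
```

Under this vocabulary the English word needs two pieces while the shorter Swahili word needs one piece per character, so the model has to reassemble the Swahili word from fragments that say little individually.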
Click to load each test case. These reveal common tokenization failure modes.