Design slice-based evaluation that exposes hidden failures
A single overall score can hide catastrophic failures in specific languages or scripts.
Test each supported language independently. Don't assume similar languages perform similarly.
Test each script family separately: Latin, Cyrillic, Arabic, CJK, Thai, and so on. Different scripts stress tokenizers differently.
Include code-switched text, which mixes languages in one sentence, such as "I need to comprar something". This is very common and often broken.
Include both very short and very long text. Each exposes different failure modes.
Include names, dates, numbers, and locale-specific formats. These often break across locales.
Formal vs. informal, technical vs. casual. Different registers can have different error rates.
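One way to make these slices concrete is to tag every evaluation example before scoring it. The sketch below is a minimal illustration, not a complete tagger: the `SCRIPT_PREFIXES` mapping, the function names, and the length thresholds are all assumptions chosen for the example. It infers script from Unicode character names, so it catches mixed-script text but not code-switching within one script (such as English and Spanish), which would need a language identifier.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a coarse script family. Extend it for the scripts you support.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "CJK": "CJK",
    "HIRAGANA": "CJK",
    "KATAKANA": "CJK",
    "HANGUL": "CJK",
    "THAI": "Thai",
}


def scripts_in(text):
    """Count alphabetic characters per coarse script family."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        prefix = unicodedata.name(ch, "").split(" ")[0]
        counts[SCRIPT_PREFIXES.get(prefix, "Other")] += 1
    return counts


def slice_tags(text, short_max=20, long_min=2000):
    """Return a set of slice labels for one example.

    short_max and long_min are illustrative character thresholds,
    not recommended values.
    """
    tags = set()
    counts = scripts_in(text)
    if counts:
        tags.add("script:" + counts.most_common(1)[0][0])
    if len(counts) > 1:
        tags.add("mixed-script")
    if len(text) <= short_max:
        tags.add("length:short")
    elif len(text) >= long_min:
        tags.add("length:long")
    return tags
```

An example can carry several tags at once, so a single failure may show up in more than one slice. Language, entity, and register tags are left out here because they usually come from dataset metadata or a separate classifier.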
| Slice | Metric | Value | Notes |
|---|---|---|---|
| Overall | Accuracy | __% | Baseline |
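
The remaining rows of this table can be filled in mechanically once each example has slice labels and a correct/incorrect judgment. The following is a rough sketch under assumed names: `accuracy_by_slice`, `to_table_rows`, and the `min_n` small-sample threshold are hypothetical, and the input is assumed to be a non-empty list of `(slice_labels, is_correct)` pairs however you produce them.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice.

    records: iterable of (slice_labels, is_correct) pairs, where
    slice_labels is any iterable of strings. Every record also counts
    toward the "Overall" row.
    Returns {slice: (accuracy, example_count)}.
    """
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for labels, is_correct in records:
        for label in ["Overall", *labels]:
            totals[label][0] += int(is_correct)
            totals[label][1] += 1
    return {label: (c / n, n) for label, (c, n) in totals.items()}


def to_table_rows(results, min_n=30):
    """Format results as markdown rows, worst slice first.

    min_n is an illustrative cutoff for flagging slices with too few
    examples to trust; it assumes results contains an "Overall" entry.
    """
    overall = results["Overall"][0]
    rows = []
    for label, (acc, n) in sorted(results.items(), key=lambda kv: kv[1][0]):
        if label == "Overall":
            note = "Baseline"
        else:
            note = f"{(acc - overall) * 100:+.1f} pts vs. overall, n={n}"
            if n < min_n:
                note += " (small sample)"
        rows.append(f"| {label} | Accuracy | {acc * 100:.1f}% | {note} |")
    return rows
```

Sorting worst-first puts the slices that the overall score is hiding at the top of the table, and the sample-size note helps avoid over-reading a slice with only a handful of examples.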
Use these to categorize your 10 failure examples: