
Multilingual Evaluation Planner

Design slice-based evaluation that exposes hidden failures


The Hidden Failure Problem

A single overall score can hide catastrophic failures in specific languages or scripts.

Overall Accuracy: 92%
Looks great! Ship it?

But wait... per-language breakdown:
🇺🇸 English: 97%
🇪🇸 Spanish: 94%
🇫🇷 French: 95%
🇯🇵 Japanese: 78%
🇹🇭 Thai: 43%
🔀 Code-Switch: 31%
⚠️ The 92% average hid accuracy of just 43% on Thai and 31% on code-switching, which means failure rates of 57% and 69%.
Without slices, you'd ship a model that fails badly for millions of users.
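The arithmetic behind this can be sketched in a few lines of Python. The page only gives percentages, so the per-slice sample counts below are hypothetical, chosen so that the per-slice accuracies match the breakdown above and the pooled accuracy rounds to 92%. The helper name `accuracy_by_slice` is likewise an illustrative assumption, not part of any library.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return (overall, per_slice) accuracy from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    per_slice = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_slice


# HYPOTHETICAL (correct, total) counts per slice. The sizes are invented
# so that the percentages match the breakdown; English dominates the set.
HYPOTHETICAL_COUNTS = {
    "English": (5820, 6000),
    "Spanish": (1316, 1400),
    "French": (1330, 1400),
    "Japanese": (546, 700),
    "Thai": (129, 300),
    "Code-Switch": (62, 200),
}

# Expand the counts into one record per evaluated example.
records = [
    (name, i < correct)
    for name, (correct, total) in HYPOTHETICAL_COUNTS.items()
    for i in range(total)
]

overall, per_slice = accuracy_by_slice(records)
print(f"Overall: {overall:.0%}")
# List the weakest slices first so failures are hard to miss.
for name, acc in sorted(per_slice.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} {acc:.0%}")
```

Because English makes up 60% of these hypothetical examples, the pooled score stays high even though the two smallest slices fail most of the time.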

Slice Categories You Should Test

🌍 Language Slices

Test each supported language independently. Don't assume similar languages perform similarly.

🔤 Script Slices

Latin, Cyrillic, Arabic, CJK, Thai, etc. Different scripts stress tokenizers differently.
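One rough way to assign test inputs to a script slice is sketched below. It is a heuristic under stated assumptions: Python's standard `unicodedata` module does not expose the Unicode Script property, so this uses the first word of each letter's Unicode character name instead. The function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most frequent script among the letters in `text`.

    Heuristic: take the first word of each letter's Unicode name
    ("LATIN", "CYRILLIC", "THAI", ...). This approximates, but is not,
    the Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


for sample in ["hello", "Привет", "สวัสดี", "你好", "12345"]:
    print(sample, dominant_script(sample))
```

Note that Japanese kana come back as `HIRAGANA` or `KATAKANA` and Korean as `HANGUL`, so a real slicing script would need its own mapping if those should be grouped under a single CJK slice.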

🔀 Code-Switching

"I need to comprar something" - mixing languages in one sentence. Very common, often broken.

📏 Length Slices

Very short text and very long text. Each exposes different failure modes.
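A minimal sketch of length bucketing follows, assuming the "<10 words" and ">100 words" cut-offs used in this planner's Special Cases list. The function name and the "medium" bucket are hypothetical additions for illustration.

```python
def length_slice(text, short_below=10, long_above=100):
    """Bucket text by whitespace-delimited word count.

    Caveat: whitespace splitting undercounts words in languages that
    do not put spaces between words (e.g. Thai, Chinese, Japanese),
    so those need a language-aware segmenter instead.
    """
    n_words = len(text.split())
    if n_words < short_below:
        return "short"
    if n_words > long_above:
        return "long"
    return "medium"


print(length_slice("I need to comprar something"))
print(length_slice(" ".join(["word"] * 150)))
```

The caveat in the docstring matters for this planner in particular: a naive word count would place most Thai or Chinese inputs in the short bucket regardless of their real length.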

🏷️ Entity Slices

Names, dates, numbers, and other locale-specific formats. Models often mishandle these when conventions differ across locales.
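As a small illustration of why date formats deserve their own test cases, the same string parses to two different dates depending on whether the month or the day is assumed to come first. The sample string is made up for this sketch.

```python
from datetime import datetime

raw = "03/04/2025"  # hypothetical input with an ambiguous format

# Month-first reading (common in the US).
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
# Day-first reading (common in much of Europe and elsewhere).
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2025-03-04
print(as_day_first.isoformat())    # 2025-04-03
```

Both parses succeed without error, which is exactly why this kind of failure stays hidden unless a test case checks the expected value.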

🎯 Domain Slices

Formal vs. informal, technical vs. casual. Different registers can have different error rates.

Select Your Evaluation Slices

🌍 Languages
🇺🇸 English
🇪🇸 Spanish
🇫🇷 French
🇩🇪 German
🇨🇳 Chinese
🇯🇵 Japanese
🇰🇷 Korean
🇸🇦 Arabic
🇮🇳 Hindi
🇧🇷 Portuguese
🇷🇺 Russian
🇹🇭 Thai
🔤 Scripts
Latin
Cyrillic
Arabic Script
CJK
Devanagari
Thai Script
🔀 Special Cases
Code-Switching
Short Text (<10 words)
Long Text (>100 words)
Noisy/Informal
Named Entities Heavy
Numbers/Dates
📝 Test Cases (10 minimum)
📊 Metrics Table Template
Slice   | Metric   | Value | Notes
Overall | Accuracy | __%   | Baseline
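If you would rather generate the table than type it, a short sketch like the following could work. The function name, the extra slice rows, and their notes are placeholders invented for illustration; values stay blank until you have real results.

```python
def render_metrics_table(rows):
    """Format (slice, metric, value, notes) rows as aligned plain text."""
    header = ("Slice", "Metric", "Value", "Notes")
    table = [header] + [tuple(str(cell) for cell in row) for row in rows]
    # Width of each column is the longest cell in that column.
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    lines = [
        " | ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    ]
    return "\n".join(lines)


# Placeholder rows: one overall baseline plus one row per selected slice.
rows = [
    ("Overall", "Accuracy", "__%", "Baseline"),
    ("Thai", "Accuracy", "__%", "Per-language slice"),
    ("Code-Switching", "Accuracy", "__%", "Stress slice"),
]
print(render_metrics_table(rows))
```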
🏷️ Failure Analysis Labels

Use these to categorize your 10 failure examples:

📏 Short text ambiguity
🏷️ Entity confusion
🔀 Code-switch error
🔤 Script/tokenization
🗣️ Slang/informal
📊 Number/date format
🎭 Sarcasm/tone
❓ Unknown/other
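Once failures are labeled, counting them per label shows where errors cluster. The sketch below assumes failures are stored as (text, label) pairs; the function name and the sample failures are hypothetical, while the label strings are copied from the list above.

```python
from collections import Counter

FAILURE_LABELS = {
    "Short text ambiguity",
    "Entity confusion",
    "Code-switch error",
    "Script/tokenization",
    "Slang/informal",
    "Number/date format",
    "Sarcasm/tone",
    "Unknown/other",
}


def tally_failures(labeled_failures):
    """Count failures per label, rejecting labels outside the agreed set."""
    counts = Counter()
    for _text, label in labeled_failures:
        if label not in FAILURE_LABELS:
            raise ValueError(f"Unknown failure label: {label!r}")
        counts[label] += 1
    return counts


# Hypothetical failure examples, for illustration only.
examples = [
    ("I need to comprar something", "Code-switch error"),
    ("ok", "Short text ambiguity"),
    ("03/04/2025", "Number/date format"),
    ("Quiero un coffee please", "Code-switch error"),
]
counts = tally_failures(examples)
print(counts.most_common())
```

Rejecting unknown labels keeps the tally comparable across reviewers; anything that does not fit goes under "Unknown/other" rather than a new ad hoc label.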
📋 EVAL_PLAN.md Checklist
✓ Metrics table (overall + per-slice)
✓ At least 3 slices selected
✓ At least 3 code-switch/stress test cases
✓ 10 test cases minimum
✓ Expected failure predictions (2-3)
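The checklist can also be checked mechanically. The sketch below assumes a hypothetical in-memory representation of the plan; the `EvalPlan` class, its field names, and `unmet_items` are all invented for illustration, and only the five checks themselves come from the checklist above.

```python
from dataclasses import dataclass, field


@dataclass
class EvalPlan:
    """Hypothetical in-memory form of EVAL_PLAN.md."""
    slices: list = field(default_factory=list)
    # (text, is_stress_case) pairs; stress covers code-switch cases too.
    test_cases: list = field(default_factory=list)
    # Slice names that have a row in the metrics table.
    metric_rows: list = field(default_factory=list)
    failure_predictions: list = field(default_factory=list)


def unmet_items(plan):
    """Return the checklist items the plan does not yet satisfy."""
    problems = []
    per_slice_rows = [name for name in plan.metric_rows if name != "Overall"]
    if "Overall" not in plan.metric_rows or not per_slice_rows:
        problems.append("Metrics table (overall + per-slice)")
    if len(plan.slices) < 3:
        problems.append("At least 3 slices selected")
    if sum(1 for _, is_stress in plan.test_cases if is_stress) < 3:
        problems.append("At least 3 code-switch/stress test cases")
    if len(plan.test_cases) < 10:
        problems.append("10 test cases minimum")
    if not 2 <= len(plan.failure_predictions) <= 3:
        problems.append("Expected failure predictions (2-3)")
    return problems


print(unmet_items(EvalPlan()))  # an empty plan fails every item
```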

🧠 Multilingual Evaluation Knowledge Check

1. Why is a single overall score risky in multilingual NLP?
a) It increases latency
b) It uses more RAM
c) It can hide failures in specific languages/scripts even if the average looks good
d) It makes tokenization impossible
Answer: c. Averages can mask poor performance on languages that make up a small share of the test data. A 92% overall score could hide 43% accuracy on Thai if most of the data is English.
2. What is "code-switching" in the context of multilingual NLP?
a) Switching between programming languages
b) A single input mixing multiple languages within the same text
c) Converting code to a different format
d) Changing the tokenizer at runtime
Answer: b. Code-switching like "I need to comprar something" is very common in multilingual communities and often breaks models trained on clean single-language data.
3. If your model fails on one language slice, what's your first move?
a) Ignore it if the overall score is high
b) Inspect errors and decide whether to adjust data, tests, or approach based on evidence
c) Delete that slice from evaluation
d) Remove that language from scope immediately
Answer: b. Always inspect errors first. The failure might stem from data quality, tokenization, or a fundamental model limitation. Evidence guides the fix.
4. A strong multilingual evaluation plan should include:
a) Only the overall accuracy metric
b) A confusion matrix without language breakdown
c) Per-slice metrics + error analysis + limitations statement
d) Just the model weights and a README
Answer: c. A reviewable eval plan needs a metrics table (overall + per-slice), labeled failure examples, and an explicit limitations statement so reviewers know what fails.