
The hidden layers behind 200 languages

  • Travis Lelle
  • Dec 4, 2025
  • 7 min read

Meta's NLLB project achieved something remarkable: a single AI model that translates between 200 languages with state-of-the-art quality. But the headline figure obscures an extraordinary story of engineering, linguistics, and human collaboration that rarely gets told. Behind the 44% improvement in translation quality lies a web of innovations spanning data mining across the internet, novel training techniques to prevent model collapse, and evaluation systems that required professional translators working across 204 languages for nearly a year.


This article explores the hidden layers of complexity that made NLLB possible, revealing what it actually takes to build AI systems that serve billions of speakers of underrepresented languages.


The combinatorial explosion problem nobody mentions


When you hear "200 languages," the scale might not register immediately. Consider the math: translating between any pair of 200 languages creates approximately 40,000 unique translation directions. Previous state-of-the-art systems handled roughly 100 languages. NLLB didn't just double the number of languages; it quadrupled the complexity of the problem.
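
A quick back-of-the-envelope check of those numbers (an illustrative snippet, not from the paper):

# Ordered pairs of distinct languages = translation directions.
def directions(n_languages: int) -> int:
    return n_languages * (n_languages - 1)

print(directions(100))   # 9,900 directions for a 100-language system
print(directions(200))   # 39,800 directions for 200 languages, roughly 4x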

The traditional approach would require training thousands of separate models, each specializing in a handful of language pairs. NLLB's researchers instead built a single 54.5 billion parameter model that handles all 40,602 translation directions simultaneously. It is the difference between staffing an entire department of specialist translators and hiring one person who can move between every pair of the world's major languages on demand.


The secret weapon is an architecture called Sparsely Gated Mixture of Experts (MoE). Instead of using all 54.5 billion parameters for every translation, the model activates only a subset of specialized "expert" subnetworks depending on the language pair. When translating Arabic, certain experts activate; when translating Mandarin, different experts engage. Researchers visualized these activation patterns and discovered something beautiful: languages within the same family naturally cluster together, routing to similar experts. Arabic dialects share experts. Indic languages written in Devanagari script share experts. The model learned linguistic relationships without being explicitly told about them.
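
To make the idea concrete, here is a minimal sketch of a top-2 sparsely gated MoE layer in PyTorch. It illustrates the general technique, not NLLB's actual implementation: the expert count, layer sizes, and class name are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Top-2 gated mixture of experts: each token is routed to 2 of n_experts FFNs."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Example: route 16 token vectors through the layer.
layer = SparseMoELayer()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])

The routing weights are learned end to end, which is how linguistically related languages come to share experts without anyone assigning them by hand.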


Mining a billion sentences from the internet wilderness


For high-resource languages like English, French, or Spanish, parallel training data is abundant: think EU Parliament transcripts, United Nations documents, and multilingual news archives. But for languages like Kamba (Kenya), Fula (West Africa), or Lao, such resources barely exist. Lingala, spoken by 45 million people across the Democratic Republic of Congo and neighboring countries, had only 3,260 Wikipedia articles. Swedish, with 10 million speakers, had 2.5 million articles.


NLLB's solution was to mine the internet at unprecedented scale using a system called LASER3. The core insight is that semantically equivalent sentences in different languages should map to nearby points in a mathematical space. LASER3 creates these "sentence embeddings" for 148 languages, then searches billions of web pages to find matching sentence pairs.
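
A simplified sketch of that mining step: embed candidate sentences from both languages and keep pairs whose similarity is unusually high relative to their other nearest neighbors, the margin criterion used in this line of mining work. The encoder, threshold, and brute-force search below are stand-ins; real mining uses LASER3 vectors and an approximate nearest-neighbor index over billions of sentences.

import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def margin_scores(src_emb, tgt_emb, k=4):
    """Score each src/tgt pair by cosine similarity divided by the average
    similarity to the k nearest neighbours on either side (ratio margin)."""
    sim = cosine_sim(src_emb, tgt_emb)                    # (n_src, n_tgt)
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # neighbourhood density per src sentence
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # neighbourhood density per tgt sentence
    return sim / (0.5 * (src_knn[:, None] + tgt_knn[None, :]))

# Toy example with random "sentence embeddings".
src = np.random.randn(1000, 1024)
tgt = np.random.randn(1200, 1024)
scores = margin_scores(src, tgt)
best_tgt = scores.argmax(axis=1)        # candidate translation for each src sentence
keep = scores.max(axis=1) > 1.06        # threshold tuned per language pair in practice
print(keep.sum(), "candidate pairs kept")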


The technical achievement here is remarkable. LASER3 uses a teacher-student training approach where an existing multilingual encoder (the "teacher") guides language-specific encoders (the "students") for languages with minimal training data. A language like Assamese, with limited digital presence, can benefit from its relationship to Bengali (same script family) through this process.
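
The teacher-student idea can be sketched in a few lines: the student encoder for a low-resource language is trained so that its sentence embeddings land where the frozen teacher puts the parallel high-resource sentence. The encoders and training loop below are illustrative placeholders, not the LASER3 training code.

import torch
import torch.nn as nn

# Placeholder encoders: in LASER3-style training the teacher is a frozen
# multilingual encoder and the student is a new language-specific encoder.
teacher = nn.EmbeddingBag(32000, 1024)   # stands in for the frozen teacher encoder
student = nn.EmbeddingBag(32000, 1024)   # stands in for the trainable student encoder
for p in teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
cosine = nn.CosineEmbeddingLoss()

def train_step(src_token_ids, tgt_token_ids):
    """Pull the student embedding of a low-resource sentence toward the
    teacher embedding of its parallel high-resource counterpart."""
    with torch.no_grad():
        target_vec = teacher(tgt_token_ids)     # e.g. the Bengali or English side
    student_vec = student(src_token_ids)        # e.g. the Assamese side
    loss = cosine(student_vec, target_vec, torch.ones(len(src_token_ids)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of token-id "sentences" (8 sentences, 12 tokens each).
src = torch.randint(0, 32000, (8, 12))
tgt = torch.randint(0, 32000, (8, 12))
print(train_step(src, tgt))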


The result: over 1.1 billion sentence pairs mined for 148 languages, creating parallel training data that simply did not exist before. This data pipeline represents years of engineering work, from language identification systems covering 218 languages to quality filters that remove noise, duplicates, and misaligned translations.


The problem of a model that learns too well


Training on rare languages introduces a counterintuitive challenge: the model learns them too well, too fast. With limited data, low-resource language pairs quickly "overfit," meaning the model essentially memorizes the small training set rather than learning generalizable translation patterns. Extended training improves high-resource languages while actively degrading performance on low-resource ones.


NLLB's researchers developed two complementary solutions. Curriculum learning stages the training process: the model first trains extensively on high-resource languages, and low-resource pairs are phased in later in the run. Researchers estimated how long each language pair could train before it began overfitting and timed its introduction so that it received roughly that much training by the end.
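
A minimal sketch of that scheduling logic, with made-up pair names and step counts; the real schedule was derived from each pair's observed overfitting point.

TOTAL_STEPS = 200_000

# Hypothetical per-pair "steps before overfitting" budgets (illustrative numbers).
steps_before_overfit = {
    ("eng_Latn", "fra_Latn"): 200_000,   # high-resource: train for the whole run
    ("eng_Latn", "asm_Beng"): 40_000,    # low-resource: only safe for ~40k steps
    ("eng_Latn", "kam_Latn"): 15_000,    # very low-resource: introduce near the end
}

def active_pairs(step):
    """A pair enters training late enough that it finishes at TOTAL_STEPS,
    so it never trains longer than its overfitting budget."""
    return [pair for pair, budget in steps_before_overfit.items()
            if step >= TOTAL_STEPS - budget]

print(active_pairs(step=10_000))     # only the high-resource pair
print(active_pairs(step=170_000))    # low-resource pairs have been phased in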


The second innovation, Expert Output Masking (EOM), randomly disables expert outputs during training. This prevents the model from over-relying on specific experts for rare languages, forcing it to develop more robust representations. Combined, these techniques improved translation quality for very low-resource languages by over 2 chrF++ points (a standard translation quality metric).
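
Expert Output Masking can be sketched in a few lines on top of the MoE layer above: during training, a token's expert contribution is randomly dropped so the model cannot lean on any single expert. The masking rate and per-slot granularity here are placeholders in the spirit of the technique, not the paper's exact formulation.

import torch

def expert_output_masking(expert_outputs, p_mask=0.2, training=True):
    """expert_outputs: (tokens, top_k, d_model) tensor of per-expert contributions.
    Randomly zero whole expert contributions per token during training."""
    if not training:
        return expert_outputs.sum(dim=1)
    keep = (torch.rand(expert_outputs.shape[:2]) > p_mask).float()   # (tokens, top_k)
    return (expert_outputs * keep.unsqueeze(-1)).sum(dim=1)

# Toy usage: 16 tokens, top-2 experts, 512-dim outputs.
print(expert_output_masking(torch.randn(16, 2, 512)).shape)  # torch.Size([16, 512])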


These solutions represent a broader principle in machine learning that practitioners know but outsiders rarely see: getting a model to work is only the first step. Getting it to work well across diverse conditions, without degrading performance on edge cases, requires entirely different innovations.


612,000 human translations: the evaluation challenge


How do you evaluate translation quality across 40,000 language pairs? You cannot simply ask bilingual speakers to rate outputs, because for many language pairs, qualified evaluators barely exist.


The NLLB team created FLORES-200, a benchmark dataset of 3,001 sentences professionally translated into 204 languages. The arithmetic is staggering: over 612,000 individual human translations, each requiring native-speaking linguistic professionals. Average turnaround time per language was 119 days, with some taking nearly 10 months.


The workflow involved multiple quality gates: initial translation, automated checks for language mismatches and excessive copying, independent review by separate QA teams, and post-editing for flagged issues. Languages required a 90% quality threshold to be included. For languages without standardized spelling or where professional translators were scarce, this threshold was harder to meet.


Traditional evaluation metrics like BLEU scores are notoriously dependent on tokenization, making cross-language comparisons unreliable. Chinese doesn't use spaces between words. Agglutinative languages like Finnish express in one word what English requires a full phrase to convey. The team developed spBLEU, using a standardized 256,000-token vocabulary covering all 200 languages, enabling meaningful comparisons across fundamentally different language structures.
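
The idea behind spBLEU is easy to sketch: tokenize both hypothesis and reference with one shared SentencePiece model and compute BLEU over the resulting pieces, so no language-specific word segmentation is involved. The snippet below uses the sentencepiece and sacrebleu packages; "spm.model" stands in for the shared 256,000-piece model, and recent sacreBLEU releases also ship FLORES SentencePiece tokenizers that handle this internally.

import sentencepiece as spm
import sacrebleu

# Shared SentencePiece model covering all languages (placeholder path).
sp = spm.SentencePieceProcessor(model_file="spm.model")

def spbleu(hypotheses, references):
    """BLEU computed on SentencePiece pieces instead of language-specific words."""
    hyp_pieces = [" ".join(sp.encode(h, out_type=str)) for h in hypotheses]
    ref_pieces = [" ".join(sp.encode(r, out_type=str)) for r in references]
    # tokenize="none": the text is already segmented by the shared SPM model.
    return sacrebleu.corpus_bleu(hyp_pieces, [ref_pieces], tokenize="none").score

print(spbleu(["这是一个测试句子。"], ["这是一个测试句。"]))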


For human evaluation, they created a new protocol called Cross-lingual Semantic Text Similarity (XSTS). Unlike traditional approaches focusing on fluency, XSTS emphasizes meaning preservation, recognizing that a translation conveying the correct meaning imperfectly may be more useful than a fluent translation that subtly distorts intent. Evaluators rated the same calibration sentences, enabling score normalization across different language pairs and evaluators with different standards.
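
One simple way to use those calibration sentences, sketched below with made-up numbers: estimate each evaluator's offset from the pooled calibration mean and subtract it from their raw XSTS scores. This is an illustrative normalization, not the exact adjustment described in the paper.

import statistics

# Scores each evaluator gave to the same shared calibration sentences (1-5 XSTS scale).
calibration = {
    "evaluator_a": [4.0, 3.5, 4.5, 4.0],   # relatively lenient
    "evaluator_b": [3.0, 2.5, 3.5, 3.0],   # relatively strict
}
pooled_mean = statistics.mean(s for scores in calibration.values() for s in scores)

def normalize(evaluator, raw_scores):
    """Shift an evaluator's scores by how far their calibration mean sits from the pool."""
    offset = statistics.mean(calibration[evaluator]) - pooled_mean
    return [s - offset for s in raw_scores]

print(normalize("evaluator_a", [4.2, 3.8]))   # pulled down toward the pooled scale
print(normalize("evaluator_b", [3.1, 2.9]))   # pushed up toward the pooled scale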


The toxicity problem hiding in the training data


When translation models hallucinate, they sometimes generate toxic content: profanity, hate speech, or disturbing text that wasn't present in the source. This is especially problematic for low-resource languages, where training data often comes from noisier sources. The NLLB team discovered that over 1 in 8 hallucinations in Tamil translations contained toxic text, traceable to problematic patterns in Common Crawl web data.


Addressing this required building toxicity detection for 200 languages from scratch. Professional translators created word lists of toxic terms for each language, which were then culturally adapted to include locally relevant terms. The resulting Toxicity-200 resource ranges from 36 entries for languages with limited vulgar vocabulary to 6,078 entries for morphologically rich languages like Czech, where a single root word generates thousands of variations.


A filtering pipeline removed training sentence pairs where one side contained significantly more toxic terms than the other. This removed approximately 30% of mined parallel sentences but resulted in a 5% improvement in translation quality and a 5% reduction in generated toxicity. The tradeoff reveals a hidden dimension of ML development: more data is not always better if that data contains systematic problems.
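
A stripped-down version of that filter, using hypothetical wordlists and whitespace tokenization; the real pipeline works with the Toxicity-200 lists and language-aware matching.

def count_toxic(sentence, wordlist):
    """Count wordlist hits in a whitespace-tokenized sentence (real matching is
    language-aware; many Toxicity-200 languages need different tokenization)."""
    tokens = sentence.lower().split()
    return sum(tok in wordlist for tok in tokens)

def keep_pair(src, tgt, src_toxic_words, tgt_toxic_words, max_imbalance=1):
    """Drop mined pairs where one side has noticeably more toxic terms than the
    other, a sign of hallucinated or misaligned content."""
    imbalance = abs(count_toxic(src, src_toxic_words) - count_toxic(tgt, tgt_toxic_words))
    return imbalance <= max_imbalance

# Toy wordlists and mined pairs (all placeholders).
src_list, tgt_list = {"badword"}, {"malmot"}
print(keep_pair("a clean sentence", "une phrase propre", src_list, tgt_list))       # True
print(keep_pair("a clean sentence", "malmot malmot malmot", src_list, tgt_list))    # False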


The human effort behind the algorithms


NLLB's 190-page paper lists 38 authors, but the project involved far more contributors. Before writing any code, the team interviewed 44 native speakers across 36 low-resource languages, with conversations averaging 1.5 hours each. These interviews shaped four guiding principles: prioritize underserved communities, open-source everything, maintain interdisciplinary collaboration, and practice reflexivity about potential harms.


The team included linguists who understood language family relationships, sociologists who considered community impact, ethicists who evaluated deployment risks, and translators who could verify outputs. This interdisciplinary approach influenced technical decisions: the choice to release models under open licenses, the development of toxicity safeguards, and the focus on languages that existing tools neglected.


Wikipedia editors began using NLLB within four months of launch, making it the third most-used translation engine on the platform. It achieved the lowest deletion rate (0.13%) among all translation services, meaning human editors found its outputs useful with minimal correction. The model now powers over 25 billion translations daily across Facebook and Instagram, supporting content moderation and information access in languages that were previously underserved.


Infrastructure at unprecedented scale


Training a 54.5 billion parameter model requires computational resources that few organizations can access. Meta's Research SuperCluster (RSC), which trained NLLB, contains 16,000 NVIDIA A100 GPUs delivering approximately 5 exaflops of AI compute. The storage system spans 1 exabyte with 16 TB per second throughput to GPUs.


For inference (actually using the model), the full 54.5 billion parameter version requires at least four 32GB GPUs working in parallel. Recognizing that this limits accessibility, the team developed distilled versions: 3.3 billion, 1.3 billion, and 600 million parameter models that run on more modest hardware while retaining much of the translation quality. Research found that up to 80% of the MoE experts could be pruned without significant quality loss, potentially enabling single-GPU inference.
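
If you want to try the distilled models, the checkpoints Meta released are available through Hugging Face transformers. The snippet below follows the documented usage pattern for the 600M model; it assumes a recent transformers version and downloads the checkpoint on first run.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"   # smallest distilled checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("No language should be left behind.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("lin_Latn"),  # target: Lingala
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])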


The environmental implications of training at this scale are substantial. While Meta reports 97% renewable energy for its data centers, the computational cost of large model training remains a consideration that responsible AI development must address. The efficiency of MoE architectures, which activate only a fraction of parameters per input, represents one approach to this challenge.


What this teaches us about building AI


NLLB demonstrates that breakthrough AI systems are not simply larger versions of existing models. They require innovations across the entire pipeline: data collection (LASER3 mining), architecture (Mixture of Experts), training dynamics (curriculum learning, Expert Output Masking), evaluation (FLORES-200, XSTS), and safety (Toxicity-200, ETOX).


Each of these components represents years of research and engineering by specialized teams. The data mining pipeline alone required building language identification for 218 languages, sentence embedding models for 148 languages, quality filtering systems, and deduplication algorithms. The evaluation dataset required coordinating professional translators across 204 languages for nearly a year.


For practitioners, NLLB offers lessons in training stability for imbalanced datasets, evaluation methodology for cross-lingual systems, and architecture choices for conditional computation. For everyone else, it reveals that AI achievements like "translating 200 languages" rest on foundations of human expertise, careful engineering, and deliberate design choices that headlines rarely capture.


The model is open-sourced, as are the FLORES-200 benchmark, Toxicity-200 wordlists, and evaluation protocols. Meta also announced $200,000 in grants for impactful applications. These decisions reflect the project's founding commitment: no language should be left behind, and the tools to achieve that goal should be available to everyone.


Technical details from NLLB Team (Costa-jussà, M.R., et al.), "No Language Left Behind: Scaling Human-Centered Machine Translation," arXiv:2207.04672, and "Scaling neural machine translation to 200 languages," Nature 630, 841–846 (2024). Infrastructure specifications from Meta AI and NVIDIA documentation.
