Meta AI Releases BELEBELE: The First Parallel Reading Comprehension Evaluation Benchmark for 122 Language Variants

[Screenshot of results from the BELEBELE research paper]


Artificial intelligence has made huge advances in recent years, but much of the progress has centered on the English language. Researchers at Meta AI are hoping to change that with the release of BELEBELE, a new benchmark dataset for evaluating AI systems across 122 language variants. This represents an important milestone towards inclusive AI that works well across diverse cultures and languages.


Why Multilingual AI Matters

Most existing AI datasets and models excel at English but struggle with other languages. This limits their usefulness for billions of non-English speakers worldwide. Multilingual AI aims to fix this by enabling systems to understand and communicate equally well across many languages.


Multilingual training resources allow AI models like chatbots to serve users across geographies. They also let researchers identify the limitations of current techniques and improve their algorithms. However, there has been a lack of diverse multilingual evaluation benchmarks to rigorously test AI capabilities across languages. BELEBELE helps fill this crucial gap.


Introducing the BELEBELE Benchmark

BELEBELE evaluates an AI's ability to read and comprehend text in different languages. It contains short passages and multiple-choice reading comprehension questions that test whether models truly understand the passage.
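To make the task format concrete, here is a minimal sketch of loading and inspecting one BELEBELE item. It assumes the dataset is published on the Hugging Face Hub as "facebook/belebele" with FLORES-200 language codes as configurations and the field names shown below; check the official repository for the exact schema.

```python
from datasets import load_dataset

# Each configuration is one of the 122 language variants, e.g. Modern Standard Arabic.
data = load_dataset("facebook/belebele", "arb_Arab", split="test")

example = data[0]
print(example["flores_passage"])                 # short passage (drawn from FLORES-200)
print(example["question"])                       # comprehension question
for i in range(1, 5):
    print(f"  ({i})", example[f"mc_answer{i}"])  # four answer options
print("correct:", example["correct_answer_num"])  # number of the correct option
```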


The key highlights of this exciting new benchmark are:


122 Language Variants - BELEBELE covers an unprecedented 122 language variants, including many low- and mid-resource languages where AI research is lacking.

Typological Diversity - The languages span 27 language families and 29 writing scripts, testing whether models are robust to very different grammars and orthographies.

Parallel Evaluation - The same 900 questions are translated into each of the 122 language variants, enabling direct comparison of model performance across languages (see the sketch after this list).

Careful Annotation - Human experts crafted challenging questions free of ambiguities or annotation biases that models can exploit.

Romanized Versions - Seven languages also appear in Romanized form, allowing researchers to evaluate models on transliterated input.

Discriminative Task - Questions require real comprehension, not superficial clues. Humans score 97% accuracy while AI models struggle, showing room for progress.

Overall, BELEBELE provides the scale, diversity, and quality needed to rigorously analyze multilingual AI abilities.
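To illustrate what the parallel design buys us, here is a hedged sketch of a cross-language comparison. Because every variant shares the same 900 questions, per-language accuracies are directly comparable; the dataset id and field names are assumptions carried over from the sketch above, and random_baseline is a hypothetical stand-in for any real model's answer-selection function.

```python
import random
from datasets import load_dataset

def random_baseline(passage, question, options):
    # Chance performance: ~25% expected accuracy on 4-way multiple choice.
    return random.randint(1, 4)

def accuracy(lang_code, predict):
    data = load_dataset("facebook/belebele", lang_code, split="test")
    hits = 0
    for ex in data:
        options = [ex[f"mc_answer{i}"] for i in range(1, 5)]
        pred = predict(ex["flores_passage"], ex["question"], options)
        hits += int(pred == int(ex["correct_answer_num"]))
    return hits / len(data)

# Identical questions in every variant, so any accuracy gap reflects the
# language itself rather than differences in test content.
for lang in ["eng_Latn", "swh_Latn", "zul_Latn"]:
    print(lang, accuracy(lang, random_baseline))
```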


BELEBELE in Action: Testing Popular AI Models

The researchers evaluated various state-of-the-art multilingual AI models using BELEBELE to demonstrate its capabilities.


They tested both masked language models (MLMs) like XLM-R and large language models (LLMs) like GPT-3.5 in settings ranging from zero-shot evaluation to full fine-tuning.
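The article does not spell out a single evaluation recipe, but a common way to run a causal LLM zero-shot on multiple-choice questions is to score each option by the log-likelihood the model assigns to it given the passage and question, then pick the highest-scoring option. Below is a hedged sketch of that general technique; "gpt2" is only a small placeholder checkpoint, not one of the models evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_option(context, option):
    # Log-likelihood of the option's tokens, conditioned on the context.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    option_ids = full_ids[0, ctx_len:]
    logprobs = torch.log_softmax(logits[0, ctx_len - 1:-1], dim=-1)
    token_lp = logprobs[torch.arange(len(option_ids)), option_ids]
    # Length-normalized; boundary tokenization effects are ignored in this sketch.
    return token_lp.mean().item()

def answer(passage, question, options):
    context = f"{passage}\nQuestion: {question}\nAnswer:"
    scores = [score_option(context, opt) for opt in options]
    return scores.index(max(scores)) + 1  # 1-indexed, matching the dataset
```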


Here are some fascinating findings from assessing these popular models on BELEBELE:


English-centric LLMs unsurprisingly performed best on high-resource languages but fared poorly on most others.

MLMs pre-trained on balanced multilingual data understood far more languages despite being much smaller than LLMs.

Larger model vocabulary size correlated with better comprehension of low-resource languages.

Machine translating test samples into English improved LLM zero-shot performance in 68 languages (a sketch of this translate-test setup appears below).
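As a rough illustration of that last finding, here is a hedged sketch of the translate-test idea: machine-translate the passage, question, and options into English, then let an English-centric model answer. It uses the open NLLB-200 checkpoint as an example translator and reuses the answer function sketched earlier; the paper's exact translation setup may differ.

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="zul_Latn",   # e.g. Zulu
    tgt_lang="eng_Latn",
)

def to_english(text):
    return translator(text)[0]["translation_text"]

def translate_test(ex):
    # `answer` is the zero-shot scorer sketched in the previous section.
    passage = to_english(ex["flores_passage"])
    question = to_english(ex["question"])
    options = [to_english(ex[f"mc_answer{i}"]) for i in range(1, 5)]
    return answer(passage, question, options)
```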

These results highlight current model limitations in multilingual comprehension. They also demonstrate BELEBELE's ability to provide fine-grained insights into model strengths and weaknesses across diverse languages.


Moving Towards Truly Multilingual AI

The BELEBELE paper concludes that much work remains to build AI systems that reliably understand text across the world's thousands of languages.


But BELEBELE brings us one step closer by letting researchers rigorously assess model capabilities across languages. Insights from BELEBELE can guide the development of better multilingual models and training techniques.


Multilingual comprehension is just one piece of the AI inclusivity puzzle. But benchmarks like BELEBELE lay the foundation for inclusive and ethical AI that works seamlessly across geographies.


The researchers hope the community will actively use BELEBELE to make AI accessible to more people worldwide, irrespective of language or culture. This aligns with Meta's mission of giving everyone a voice through technology.


BELEBELE represents an exciting milestone in enabling AI for all. We can look forward to more discoveries as researchers worldwide use it to make AI inclusive and empower users globally.



Source: The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants by Meta AI researchers

Visit the Paper and GitHub for more details.

All the credit for this research belongs to the researchers who worked on this project.

