AI-powered tools, such as ChatGPT and Google Translate, offer incredible opportunities for those who speak languages supported by these technologies. However, billions of people in the Global South, including Africa, cannot benefit from these advancements due to the lack of support for their native languages. This language gap affects not only generative AI and translation services but also other tools like autocomplete, transcription services, voice assistants, and content moderation on social media. The scarcity of training data is the primary reason behind the limited functionality of AI tools in many languages, especially low-resource languages.
The challenge of low-resource languages
AI tools are trained on vast amounts of data, and the Common Crawl dataset, which consists of billions of web pages, is a crucial source for training language models. However, this dataset is heavily dominated by a few languages, with English the most prominent. By contrast, Amharic and other languages of Africa, the Americas, and Oceania account for less than 0.1% of the Common Crawl. This scarcity of data hinders the effectiveness of AI tools for speakers of low-resource languages, a category that includes even widely spoken languages such as Hindi, Arabic, and Bengali.
The disparity is evident in how different languages are represented in AI training datasets. For example, Dutch, spoken by roughly 20 million people, has significantly more data in the Common Crawl than Amharic, even though the two languages have similar numbers of native speakers. This pattern is not limited to Dutch: many European languages are overrepresented compared with most Asian and African languages.
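One way to make this kind of comparison concrete is to divide each language's amount of web data by its number of speakers. The sketch below is illustrative only: the function name `representation_ratio` and both page counts are hypothetical placeholders, not measured Common Crawl figures. Only the speaker count of roughly 20 million comes from the text above, and it is reused for Amharic on the assumption that the two speaker populations are similar.

```python
def representation_ratio(pages_a, speakers_a, pages_b, speakers_b):
    """Return how many times more pages per speaker language A has than language B."""
    if min(pages_a, speakers_a, pages_b, speakers_b) <= 0:
        raise ValueError("all inputs must be positive")
    pages_per_speaker_a = pages_a / speakers_a
    pages_per_speaker_b = pages_b / speakers_b
    return pages_per_speaker_a / pages_per_speaker_b


# Placeholder inputs for illustration, NOT real corpus statistics.
dutch_pages = 1_000_000        # hypothetical page count
amharic_pages = 10_000         # hypothetical page count
dutch_speakers = 20_000_000    # "roughly 20 million" from the text
amharic_speakers = 20_000_000  # assumed similar, as the text states

ratio = representation_ratio(
    dutch_pages, dutch_speakers, amharic_pages, amharic_speakers
)
print(f"Pages per speaker, Dutch vs. Amharic: about {ratio:.0f} to 1")
```

With real page counts substituted for the placeholders, the same ratio gives a rough per-speaker measure of how over- or underrepresented a language is.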
Overcoming data scarcity
To address the lack of data for low-resource languages, researchers and startups are taking matters into their own hands. One example is Lesan, a startup focused on creating machine translation and speech technology for Ethiopian languages like Amharic and Tigrinya. Instead of relying on vast online resources, Lesan’s team collaborates directly with the community, leveraging the enthusiasm of students and language lovers to collect data. The process involves identifying high-quality datasets, digitizing and translating them, and aligning the original and translated versions for machine learning training.
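The final alignment step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Lesan's actual pipeline: it assumes the original and translated documents have already been split into the same number of lines, and the names `SentencePair`, `align_line_by_line`, and `max_length_ratio` are hypothetical. The translated strings are placeholders rather than real translations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    source: str
    target: str


def align_line_by_line(source_lines, target_lines, max_length_ratio=3.0):
    """Pair line i of the source with line i of the translation.

    Pairs are dropped when either side is empty or when one side is more
    than `max_length_ratio` times longer than the other, a crude signal
    that the two lines are probably not translations of each other.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("line counts differ; the documents need manual alignment first")

    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip blank lines on either side
        longer = max(len(src), len(tgt))
        shorter = min(len(src), len(tgt))
        if longer / shorter > max_length_ratio:
            continue  # skip suspicious length mismatches
        pairs.append(SentencePair(source=src, target=tgt))
    return pairs


# Placeholder data: the target strings stand in for real translations.
english = ["Good morning.", "", "Where is the library?"]
translated = ["<translation 1>", "<translation 2>", "<translation of line 3>"]

for pair in align_line_by_line(english, translated):
    print(f"{pair.source}\t{pair.target}")
```

Real projects typically need sentence-level alignment that tolerates merged or split sentences. The line-by-line pairing here only works when translators preserve the original line structure.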
African startups embrace AI technology
Lesan’s approach reflects a growing trend among African startups developing AI-powered tools for their native languages. These projects demonstrate that useful models can be built with small, curated datasets, challenging the notion that a single gigantic model is the only way to succeed. The initiatives taken by African researchers and entrepreneurs foster ownership of the technology, ensuring that the financial benefits stay within their communities.
Global efforts for linguistic inclusion
Beyond Africa, researchers worldwide are working on languages with smaller digital footprints, such as Jamaican Patois, Catalan, Sundanese, and Māori. Ethnologue’s assessment of digital language support rates Amharic as “vital,” meaning that machine translation tools, spell checkers, and speech processing are available for it. However, many other languages with millions of speakers still lack comparable digital support, leaving their speakers without AI-powered tools.
Efforts like the Distributed AI Research Institute (DAIR), GhanaNLP, Masakhane, and the Hugging Face AI collective demonstrate the power of collaboration and sharing insights. Researchers are working together to create solutions for their languages, making AI technology accessible to a broader range of linguistic communities. Unlike some tech giants, these initiatives promote transparency by freely sharing AI models and knowledge, empowering researchers to create language-specific solutions.
The language gap in AI tools presents a significant challenge for billions of people, especially those in the Global South. The scarcity of data in low-resource languages hampers the functionality of AI-powered tools, preventing many from benefiting from these technologies. However, through innovative approaches, collaboration, and sharing of insights, researchers and startups from Africa and worldwide are making strides in bridging the language gap and empowering linguistic communities with AI advancements. By prioritizing linguistic inclusion and supporting diverse languages, AI can become a transformative force for everyone, regardless of their language.