In the quiet village of Agara, nestled amidst rice paddies and groundnut fields in southwest Bangalore, Preethi P. is sitting at her sewing machine. Normally, she spends her days stitching clothes and mending garments, earning less than $1 a day for her work. However, today is different. Preethi is reading sentences in her native Kannada language into a smartphone app. She is one of 70 workers hired by Karya, a startup dedicated to gathering text, voice, and image data in India’s vernacular languages.
Preethi and her peers are part of an extensive global workforce operating in countries like India, Kenya, and the Philippines, responsible for collecting and labeling data essential for AI chatbots and virtual assistants. Unlike many other data contractors, Karya pays its workers generously, especially by local standards. After just three days of work with Karya, Preethi earned 4,500 rupees ($54), more than four times her monthly earnings as a tailor. This income allowed her to pay off her home repair loan installment, bringing relief to her family. Karya’s model showcases how technology can uplift rural communities.
Karya was founded in 2021, and its emergence coincided with the surge in generative AI technologies like ChatGPT, which has intensified tech companies’ hunger for data. India, in particular, is expected to have nearly one million data annotation workers by 2030, according to Nasscom. Karya stands out by offering its primarily female, rural workforce wages that can be up to 20 times the prevailing minimum wage, ensuring better-quality Indian-language data for tech companies.
Tech giants turn to Karya for quality data
Some of Silicon Valley’s leading players are now working with Karya to tackle the challenge of sourcing high-quality data for their AI products. Microsoft, for instance, has used Karya to acquire local speech data. The Bill & Melinda Gates Foundation is partnering with Karya to reduce gender bias in the data fed into large language models, the technology behind AI chatbots. Alphabet’s Google is also relying on Karya to gather speech data in 85 Indian districts, with plans to expand to all districts and languages.
Many AI services have been developed predominantly on English-language data, leaving non-English-speaking users underrepresented, particularly in countries like India. Google’s Manish Gupta points to the scale of the problem: more than 70 Indian languages spoken by millions have no digital corpus at all. As a result, AI systems struggle with grammar and vocabulary in South Asian languages. Karya’s efforts are aimed at closing this gap.
The importance of diverse and high-quality data cannot be overstated. Large language models trained on inadequate data can produce misinformation, harmful stereotypes, and even hate speech. Ensuring representation in training data is vital, and Karya is playing a crucial role in this regard.
Karya’s social impact
Karya, a social impact startup based in Bangalore, is not only expanding the pool of represented languages but also targeting workers in rural areas who might otherwise be overlooked. The company’s app is designed to work without internet access and offers voice support for those with limited literacy. Over 32,000 crowdsourced workers in India have logged into the app, completing 40 million paid digital tasks, including image recognition, video annotation, and speech annotation.
For Karya’s founder, Manu Chopra, the mission extends beyond data. Growing up in poverty in West Delhi, he understood the impact technology could have on poverty alleviation. Karya’s model allows workers to earn a fair income, ultimately helping combat poverty in rural communities.
Karya’s success in India is inspiring the company to explore opportunities beyond the country’s borders. The startup is in discussions to offer its platform as a service to organizations in Africa and South America, which could replicate the model to empower their own communities.
Over 30,000 educated women in rural India are actively working with Karya on projects aimed at reducing gender biases in AI datasets. This initiative, supported by the Bill & Melinda Gates Foundation, is a significant effort in Indian languages and will serve as a foundation for building datasets that promote gender equity in AI systems.
A path to a brighter future
In villages like Yelandur, women are eagerly participating in Karya’s projects, driven by the desire to earn and educate their children. As technology continues to advance, initiatives like Karya are not only reshaping the data industry but also opening doors to a brighter future for rural communities across the globe.
Karya’s innovative approach to data collection is transforming the AI landscape while addressing critical social issues like poverty and gender bias. As technology companies increasingly recognize the value of high-quality data, Karya’s model serves as a beacon of progress and inclusivity in the tech industry.