MICROSOFT PROJECT ELLORA
If a Hindi speaker needs to search for content on the Internet, they can either type a query in Devanagari script on their phone or speak the same query aloud. However, what about those who communicate in languages that are spoken by fewer than a million people or have little to no online presence? Microsoft Research’s Project ELLORA (Enabling Low Resource Languages) is assisting such languages in India.
“We utilise technology to work with low-resource languages, but we believe that these marginalised populations are aware of their own needs and wishes. We engage with them to identify their pain points and determine how technology can help,” Microsoft Research’s Kalika Bali told indianexpress.com. Bali is an expert in Natural Language Processing, which combines linguistics and artificial intelligence to teach computers how to comprehend spoken and written languages.
The primary objective of ELLORA, according to Bali, is to ensure that these languages, which have very few written resources and no digital presence at all, are not left behind by recent advances in language technology brought about by artificial intelligence (AI) and advanced natural language models. Moreover, a digital presence could help some of these languages survive the threat of extinction.
Microsoft Research (MSR) has decided to concentrate first on three of them: Gondi, Mundari, and Idu Mishmi. Nearly three million people speak Gondi in Madhya Pradesh, Maharashtra, Chhattisgarh, Andhra Pradesh, and Telangana. Mundari is spoken in Jharkhand, Odisha, and West Bengal.
According to Bali, some of the company’s longest-running work has been on Gondi, in collaboration with CGNet Swara in Chhattisgarh. CGNet Swara is an online service that enables Gondi speakers to phone in local news in their language.
“We’ve assisted with things such as Adivasi Radio, which served as a hub for getting information by phone in Gondi. We have also collaborated with them to develop a machine translation system, as access to information in their native languages is one of their greatest challenges,” said Bali.
MSR hopes to test this machine translation system in the field in the near future. If it is successful, Gondi speakers will be able to access any information available in Hindi in their own language. For the Idu Mishmi language of Arunachal Pradesh, MSR is collaborating with Pratham Books to build a digital dictionary.
On Mundari, MSR has collaborated with IIT-Kharagpur and GIZ, the German development agency. In Mundari’s case, the task is specific: design educational materials for children, because few accessible resources exist. “The objective is to construct the complete pipeline. We are working on developing a text-to-speech model that would enable the system to speak Mundari. We are also developing a model for machine translation. In fact, we have a tiny machine translation model available,” said Bali, adding that they are now evaluating the model and will also work on speech recognition.
The ultimate goal is to have a complete system in place for Mundari so that speakers can access information or use technology by speaking, listening, or typing into their phones in their native language. Bali further emphasised that their models for languages like Mundari do not rely on word-for-word translations. Instead, they ask native speakers to translate Hindi sentences into their own language, thereby creating the resources and data sets that are used to train the computer model.
Interactive Neural Machine Translation (INMT) is a technology they developed as part of this work that can predict the next word when translating between languages, for example from Hindi to Mundari. “It provides me with predicted suggestions in Mundari. It’s similar to the predictive text feature on smartphone keyboards, but it works in two languages,” Bali noted, adding that such technologies will also boost the effectiveness of human translators.
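The idea of next-word suggestions during translation can be illustrated with a toy sketch. This is not how INMT itself works (INMT uses a neural model); the lookup below only shows the interaction pattern. The tokens s1, t1 and so on are hypothetical placeholders rather than real Hindi or Mundari words, and the function name suggest is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical placeholder sentence pairs (source, target).
# s1..s5 and t1..t5 stand in for real words; they are NOT actual data.
PAIRS = [
    ("s1 s2 s3", "t1 t2 t3"),
    ("s1 s2 s4", "t1 t2 t4"),
    ("s5 s2 s3", "t5 t2 t3"),
]

def suggest(source, typed_prefix, pairs, k=3):
    """Suggest up to k next target words for a partly typed translation.

    Toy rule: look at stored pairs whose source shares words with the
    sentence being translated and whose target starts with what the
    translator has typed so far, then rank the words that come next
    by how much the sources overlap.
    """
    src_words = set(source.split())
    prefix = typed_prefix.split()
    scores = Counter()
    for src, tgt in pairs:
        overlap = len(src_words & set(src.split()))
        tgt_words = tgt.split()
        matches_prefix = tgt_words[:len(prefix)] == prefix
        if overlap and matches_prefix and len(tgt_words) > len(prefix):
            scores[tgt_words[len(prefix)]] += overlap
    return [word for word, _ in scores.most_common(k)]

# The translator is working on "s1 s2 s3" and has typed "t1 t2" so far.
suggestions = suggest("s1 s2 s3", "t1 t2", PAIRS)
print(suggestions)
```

A real system would replace the lookup with a trained translation model, but the loop is the same: the translator types, the system proposes the next word, and the translator accepts or overrides it.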
There is also the difficulty of ensuring that the models run on low-end phones. Because members of marginalised communities typically have access only to lower-end phones, the models must be optimised with this constraint in mind. “One of the major issues is that we want these models to function on mobile devices. We have spent a great deal of time figuring out how to reduce, compress, and quantise these models so that they can operate on a mobile phone,” Bali added.
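Quantisation is one common way to shrink a model for a phone: weights stored as 32-bit floats are mapped to small integers plus a shared scale factor. The sketch below is a minimal illustration of symmetric 8-bit quantisation on a plain list of numbers. It is an assumption about the general technique, not a description of MSR's actual pipeline, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error bounded by half the scale.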
Regarding the current buzz surrounding Large Language Models (LLMs) and their importance in translation systems, Bali said that the team had also conducted research on publicly available LLMs. However, additional effort will be required to make these models work with languages that have minimal or no data sets. “It is an open research question as to how these LLMs may be adapted to function with some of the smaller languages. And, you know, the solution may lie in constructing an additional layer on top of this technology. Or the issue may be a lack of sufficient data to feed into the foundation models. I do not believe we are really certain. It is a matter of open study to see how we can do this,” she said.
Currently, the ultimate objective of Project ELLORA remains unchanged: “That the gap between linguistically privileged and disadvantaged does not widen.”