Mastering the Most Complex Languages With NLU
Most AI-based applications launched today are available in English, but not in all -- or even most -- linguistic markets. It’s a challenge that must be addressed if AI is to progress on a relatively even keel. Otherwise, some languages risk getting left behind.
Language is not a synonym for English. That may seem obvious at first glance, but in the AI community the terms are often used interchangeably -- much to the frustration of people whose first language is not English and who are looking for AI solutions. The fact is, most AI-based applications launched today are available in English, but not in all -- or even most -- linguistic markets. And considering 80 percent of the world communicates in a mother tongue that isn’t English, this presents a big challenge.
It’s a challenge that must be addressed in order for technological advancements in AI to progress on a relatively even keel. Otherwise, some languages risk getting left behind. So, what are the challenges and what can be done about them? Let’s take a look.
Since only 20 percent of the world speaks English as a native language, how has it become the default language for tech? It has everything to do with the fact that many of the world’s tech leaders -- Google, Apple, Microsoft -- are US-based companies. Since these companies are the ones leading and driving innovation, it makes sense that the innovations will be in their language of business operation: English.
As Sarah al-Hussaini, AI expert and COO of ultimate.ai, explains, innovations are based on data collection, which is based on natural language. Because this process is extremely costly in both time and money, data collection has largely been limited to the innovator’s language of operation.
“There are therefore fewer [data] sets in...more than 4,500 other spoken languages in the world. The scaling of similar innovations is therefore a massive investment requirement and a major obstacle for the countries concerned. This is also the reason why in regions with fewer inhabitants on average (i.e. fewer buyers) -- like Scandinavia or the Benelux countries -- many technical innovations cannot be found in natural language (Google Home, for example, is not available in all these markets).”
In the NLP world, this is causing some major challenges, says post-doctoral AI researcher Maja Popovic, who has studied this topic extensively:
“English is the dominant language supported in NLP applications. Many authors of research papers even do not put the language in the paper -- an American professor pled to everyone in the community to abandon this practice and always specify the language, because [to reiterate] ‘language is not a synonym for English’.”
But as Dr. Popovic points out, this is more the case with Eastern European languages than with Western ones like French, German, and Spanish. The more commonly spoken languages are actually quite well supported in the NLP community. It’s an important distinction, because NLP -- and the larger sector of AI -- is a horizontal technology that will transform pretty much all industries. If this innovation and transformation happen only in the most commonly spoken languages, the world’s most complex languages risk getting left behind. From here, a gap could form that will only widen with time.
The main challenges NLP faces in these less represented languages are their rich morphologies and a lack of large data sets, which are key for training the AI. The South-Eastern European languages are the least represented and supported. Macedonian (with more than 1.3 million speakers), Bosnian (with 2 million) and Montenegrin (with 230,000) are at the bottom of this pack, followed by Bulgarian (with 15 million speakers), Serbian (with 8.6 million speakers), Croatian (with 6 million speakers), and finally Slovenian (with 2.5 million speakers), which is the most supported language of this bunch. That makes for just under 36 million people whose languages lack AI community support. Some of these languages, like Serbian, present unique challenges. As Dr. Popovic explains:
“Serbian is bi-alphabetical (it uses both Latin and Cyrillic scripts) so depending on the particular task, it might be no problem at all, [or] it might be tricky.”
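To make the bi-alphabetical challenge concrete, here is a minimal, hypothetical sketch of what an NLP pipeline might do first when handling Serbian text: detect which script a given string is written in. It simply counts characters in the Unicode Cyrillic block (U+0400 to U+04FF) -- a simplification, since real-world text can mix scripts, but it illustrates why the same language can need two processing paths.

```python
def dominant_script(text: str) -> str:
    """Guess whether Serbian text is written in Cyrillic or Latin script.

    Counts letters falling in the Unicode Cyrillic block (U+0400-U+04FF)
    against all other alphabetic characters.
    """
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")
    latin = sum(1 for ch in text if ch.isalpha()) - cyrillic
    return "cyrillic" if cyrillic > latin else "latin"

# The same greeting, "good day", in both Serbian scripts:
print(dominant_script("Добар дан"))  # -> cyrillic
print(dominant_script("Dobar dan"))  # -> latin
```

Depending on the task, a system might then process each script separately, or normalise everything to one script before training.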
This isn’t to say that these languages are being completely left behind. Luckily, there are some tech giants and research groups that operate in non-English languages. Dr. Popovic explains:
“The best-supported language is Czech -- a strong research group has already been working for many years on several NLP aspects such as syntax and parsing, POS tagging, machine translation.”
Russian, too, has received increasing support in recent years, thanks in large part to R&D investment by Russian search giant Yandex. North-Eastern European languages, like the Baltic tongues, have also received attention from the AI community in recent years, from universities and from Tilde -- an organisation working to enable multilingual communication in tech innovation. There’s a potential solution to the unique challenge posed by bi-alphabetical languages like Serbian, too. Serbian is quite similar to Croatian, so combining data from the two languages in an appropriate way has proven very helpful when training AI. The same can be said for other less represented languages as well.
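One simple, hedged illustration of how Serbian and Croatian data might be combined: Croatian is written only in the Latin script, so transliterating Serbian Cyrillic to Latin gives both languages a common representation before training. The one-to-one letter mapping below covers the 30 letters of the Serbian Cyrillic alphabet; a production pipeline would handle more edge cases, but the idea is this straightforward.

```python
# Serbian Cyrillic -> Latin letter mapping (uppercase forms).
CYR_TO_LAT = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Ђ": "Đ",
    "Е": "E", "Ж": "Ž", "З": "Z", "И": "I", "Ј": "J", "К": "K",
    "Л": "L", "Љ": "Lj", "М": "M", "Н": "N", "Њ": "Nj", "О": "O",
    "П": "P", "Р": "R", "С": "S", "Т": "T", "Ћ": "Ć", "У": "U",
    "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Č", "Џ": "Dž", "Ш": "Š",
}
# Derive the lowercase forms from the uppercase ones.
CYR_TO_LAT.update({c.lower(): l.lower() for c, l in CYR_TO_LAT.items()})

def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic letters; leave everything else as-is."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(to_latin("Добар дан"))  # -> Dobar dan
print(to_latin("Љубљана"))   # -> Ljubljana
```

After this normalisation, Serbian text (whichever script it arrived in) and Croatian text can be pooled into one larger training corpus -- a partial remedy for the small-data problem described above.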
Another potential solution is for governments to start taking action. As Sarah al-Hussaini points out:
“Government action can play a key role in democratising AI. First, government funding can promote research and development of technologies in national languages. For example, grants, subsidies, and tax benefits can be given to companies that use and / or create applied AI using natural language.”
Sarah has another potential solution: governments can assist in the collection of large databases in non-English-speaking markets, which can then be made publicly accessible. Reducing the obstacles to AI technology development in this way could be crucial to closing the innovation gap between English and non-English NLU and AI technologies.
Though we’ve focused on Eastern European languages in this article, that is not to say they are the least represented languages in the world of AI. As Dr. Popovic notes, compared to many African languages, these languages at least have a seat at the table. The more complex a language, and the fewer native speakers working in the tech industry, the more of these challenges it will face.
Finally, no article on investment in AI R&D in complex languages would be complete without at least touching on China. The Asian superpower dwarfs the US in terms of non-military AI R&D spending: 5.7 billion USD, compared to the US’s 1 billion USD. It probably goes without saying, then, that Mandarin and Cantonese are well supported in the NLP community. Indeed, organisations have even been set up to support Asian languages in general within the NLP community. So no, investment in AI has not been spread evenly across linguistic boundaries -- nowhere near. But with a dash of government regulation, a slice of creative linguistic training, and a splash of R&D investment by non-English tech stars, we can get closer to democratising the NLP landscape.
This way, one day, everyone -- no matter their mother tongue -- will be able to enjoy and benefit from the power of AI.
About our experts
Sarah Al-Hussaini is COO at ultimate.ai.
Dr Maja Popovic is a post-doctoral researcher at the ADAPT Centre in Dublin City University. Maja specialises in research on Natural Language Understanding & Processing. Before continuing her research in Dublin, she worked for the German Research Center for Artificial Intelligence (DFKI) and Humboldt University of Berlin.