December 19, 2016

In a new initiative to support language technology development, Tilde has undertaken an effort to create new multilingual open data sets for EU languages. The multilingual corpora will enable the technology community to develop key services such as high-quality machine translation systems for a range of languages and domains.

“The lack of language resources is one of the biggest obstacles to the development of language technology in Europe,” said Tilde CEO Andrejs Vasiļjevs. “In an effort to overcome this obstacle, Tilde has made a commitment to create new multilingual corpora for European languages – particularly the smaller languages that need them most – and make them openly available to developers. We invite the rest of the language technology community to join this effort to crack language barriers.”

As part of this initiative, Tilde will identify and collect open data sets in multiple languages and several key domains. In addition, Tilde will clean, align, and format the collected resources using its sophisticated data-processing tools, thus rendering the corpora usable for developing new products and services.

In 2017, Tilde plans to submit more than 10 million segments of multilingual open data for publication on the META-SHARE repository, maintained by the Multilingual Europe Technology Alliance, and on the EU Open Data Portal. These open data sets will be available to technology developers, researchers, localization companies, and machine translation providers. The corpora will provide a crucial resource for boosting the quality of machine translation engines, including the new breed of machine translation systems built with neural networks.

Tilde’s activities will be undertaken as part of the Open Data Incubator for Europe (ODINE), which aims to support the next generation of digital businesses and fast-track the development of new products and services.