The SEMO project. Semantic Analysis-Based Multilanguage Document Management System

Retrieving metadata from various documents and converting it into another format is one of the most significant problems faced by document processing systems. The problem becomes even more pressing when the documents originally exist only on paper and contain data in several languages. Various technologies are currently used for such tasks, but they usually focus on specific applications and cannot be used for other documents and processes. The industry needs a new universal data retrieval technology that can serve as a primary element in various systems for metadata retrieval and conversion into other formats for further processing. To solve this problem, the Tilde team implemented the SEMO project with the support of EU funding.

Goals of the project

The goal of the SEMO project is to develop a new intelligent technology that retrieves metadata from documents in both paper and electronic format, regardless of their type, structure and language. Paper documents are first digitized and passed through OCR, and are then entered into the system for metadata retrieval. The metadata retrieval tool analyzes the entered structured file, recognizes the metadata type, and classifies and then retrieves the document content. A special screening procedure estimates the probability of recognition and retrieval errors and assesses the quality of each processed document.

Application

With the successful implementation of the project, a universal technology is created that is suitable for use in various document processing systems. It may trigger rapid development in this field and accelerate the spread of document processing systems in the business and public sectors.
It will considerably increase the efficiency of document processing, improve data accuracy, and enable data to be supplied directly to any system or database, regardless of the data exchange format that the system uses to communicate with its external environment.

Publications

To promote the project, the research results were presented at the following academic conferences: Applied Linguistics in Research and Education, March 25–26, 2010, in St. Petersburg, Russia, and Time of Challenges and Opportunities: Issues, Solutions and Prospects, March 17–19, 2011, in Rezekne, Latvia.