Title: Challenges and methods for multilingual text mining
Citation: Proceedings of the 7th international conference on Language Resources and Evaluation (LREC) - ISBN 2-9517408-6-7
Publisher: European Language Resources Association ELRA
Publication Year: 2010
JRC N°: JRC57832
URI: http://www.lrec-conf.org/proceedings/lrec2010/keynotes/keynote2.pdf
Type: Articles in periodicals and books
Abstract: Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools obviously alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights by various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines (most of all: extreme simplicity) can be very restrictive and limiting, we believe to have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://press.jrc.it/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We will also touch upon the kind of language resources that would make it easier for all to develop highly multilingual text mining applications. We will argue that, to achieve this, the most needed resources would be uniform and simple multilingual dictionaries, corpora, and software tools.
JRC Directorate: Space, Security and Migration

Items in repository are protected by copyright, with all rights reserved, unless otherwise indicated.