A survey of methods to ease the development of highly multilingual text mining applications
Multilingual text processing is useful because the information content
found in different languages is complementary, both regarding facts and opinions.
While Information Extraction and other text mining software can, in principle, be
developed for many languages, most text analysis tools have only been applied to
small sets of languages because the development effort per language is large. Self-training
tools obviously alleviate the problem, but even the effort of providing training
data and of manually tuning the results is usually considerable. In this paper, we gather
insights by various multilingual system developers on how to minimise the effort of
developing natural language processing applications for many languages. We also
explain the main guidelines underlying our own effort to develop complex text mining
software for tens of languages. While these guidelines, most of all extreme simplicity,
can be very restrictive and limiting, we believe we have shown the feasibility
of the approach through the development of the Europe Media Monitor (EMM) family
of applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex
media monitoring tools that process and analyse up to 100,000 online news articles per
day in twenty to fifty languages. We will also touch upon the kinds of
language resources that would make it easier for all to develop highly multilingual text
mining applications. We will argue that—to achieve this—the most needed resources
would be freely available, simple, parallel and uniform multilingual dictionaries,
corpora and software tools.
STEINBERGER Ralf
2012-12-30
SPRINGER
JRC61764
1574-020X
http://www.springerlink.com/content/86656518k7116r2u/
https://publications.jrc.ec.europa.eu/repository/handle/JRC61764
10.1007/s10579-011-9165-9