Cross-lingual Named Entity Recognition

We present a mature text analysis application that relies heavily on multilingual Named Entity Recognition and Classification (NERC) and cross-lingual information aggregation. Its results are publicly accessible on the news analysis and exploration site NewsExplorer (http://press.jrc.it/NewsExplorer), which reached an average of 600,000 hits per day by the end of 2006. NewsExplorer fully integrates NERC with other tools such as monolingual and multilingual topic tracking (and thus cross-lingual document similarity calculation), name variant detection and merging, quotation recognition and social network detection. Multilingual and cross-lingual NERC is an integral component of this system, without which the application would not work. The main statement we want to make in this article is that both the quantity and the quality of NERC increase when it is applied to texts written in different languages, while at the same time NERC is itself an important contributor to cross-lingual text analysis applications. Improving the one will thus benefit the other, in a bootstrapping manner. We present the fully automatic, highly multilingual news analysis application NewsExplorer, which shows that different kinds of information on persons and organisations can be found in the news of different languages and presented to users in an aggregated way. This information includes the name variants, titles, age and nationality of persons, and more. The name variants gathered from the news, currently in ten languages, are at the same time an important ingredient for the automatic tracking of related news across these languages and for the detection of relations between persons mentioned in the news (social networks).
We furthermore make two claims. Firstly, highly multilingual applications benefit from very simple and preferably language-independent procedures and algorithms, which keep the language-specific effort needed for each language to a minimum. The effort required to add a new language to NewsExplorer with reasonable accuracy ranges from one week to several months, depending on the morphological complexity of the language involved. Secondly, the very high number of spelling variants found for single persons (not only in a multilingual context, but even when dealing with a single language) indicates that name variant detection and merging must be an integral part of NER in real-life applications. In the following sections of this paper, we first highlight the role of NERC in NewsExplorer (Section 2). Section 3 describes our algorithms for person, organisation and location name recognition. We then explain the closely linked tools for name variant detection and merging (Section 4) and social network detection (Section 5). Section 6 discusses ongoing and future work. In Section 7, we conclude the article and discuss the results. As the work described in this article covers several rather different aspects (NER for persons and for locations; name transliteration; social network detection), we mention prior art and evaluation results inside the various sections, together with the description of our own approach to the respective tasks.
2009-10-22
John Benjamins Publishing Co.
JRC53717
978-90-272-8922-3
Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice