Automatic Construction of Multilingual Name Dictionaries
This chapter is a contribution to the forthcoming book 'Learning Machine Translation', MIT Press, to be published in 2008.
ABSTRACT: Machine Translation and other Natural Language Processing systems often lose performance when they have to process texts containing unknown words, such as proper names. Proper name dictionaries are rare and can never be complete because new names are being made up all the time. One way to overcome this performance loss could be to recognise and mark named entities in a text before translating it, and to carry the named entities over untranslated. This would also help avoid the accidental translation of a name such as 'Bill Black', e.g. into French as 'Facture Noire'. An even better translation would be achieved if the target-language spelling of the name were used, and this seems crucial when translating from a different script, such as Chinese, Arabic or Cyrillic. We show that multilingual name dictionaries are also helpful for a number of other text analysis applications, including information retrieval, topic detection and tracking, relation and event extraction, and more. We then present a method and a system to recognise named entities of the types 'person' and, to some extent, 'organisation' in multilingual text collections, and to automatically identify which of the newly identified names are variants of a known name. By doing this for currently nineteen languages over the course of years, a multilingual name dictionary has been built up that to date contains over 630,000 names plus over 135,000 known variants, with up to 170 multilingual variants for a single name. The automatically generated name dictionary is used daily, for various purposes, in the publicly accessible multilingual news aggregation and analysis system NewsExplorer.
POULIQUEN Bruno;
STEINBERGER Ralf;
2009-02-02
MIT Press
JRC41746
978-0-262-07297-7
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11753
https://publications.jrc.ec.europa.eu/repository/handle/JRC41746