Title: Automatic Construction of Multilingual Name Dictionaries
Authors: POULIQUEN BRUNO; STEINBERGER RALF
Publisher: MIT Press
Publication Year: 2009
JRC N°: JRC41746
ISBN: 978-0-262-07297-7
URI: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11753
http://publications.jrc.ec.europa.eu/repository/handle/JRC41746
Type: Articles in books
Abstract: This chapter is a contribution to the forthcoming book 'Learning Machine Translation', MIT Press, to be published in 2008. ABSTRACT: Machine Translation and other Natural Language Processing systems often lose performance when they have to process texts containing unknown words, such as proper names. Proper name dictionaries are rare and can never be complete because new names are being made up all the time. One way to overcome this performance loss is to recognise and mark named entities in a text before translating it and to carry the named entities over untranslated. This would also help avoid the accidental translation of a name such as 'Bill Black', e.g. into French as 'Facture Noire'. An even better translation would be achieved if the target-language spelling of the name were used, and this seems crucial when translating from a different script, such as Chinese, Arabic or Cyrillic. We show that multilingual name dictionaries are also helpful for a number of other text analysis applications, including information retrieval, topic detection and tracking, and relation and event extraction, among others. We then present a method and a system to recognise named entities of the types 'person' and, to some extent, 'organisation' in multilingual text collections and to automatically identify which of the newly identified names are variants of a known name. Applying this to currently nineteen languages over the course of several years has produced a multilingual name dictionary that contains to date over 630,000 names plus over 135,000 known variants, with up to 170 multilingual variants for a single name. The automatically generated name dictionary is used daily, for various purposes, in the publicly accessible multilingual news aggregation and analysis system NewsExplorer.
JRC Institute: Institute for the Protection and Security of the Citizen


Items in repository are protected by copyright, with all rights reserved, unless otherwise indicated.