Building a Multilingual Named Entity-Annotated Corpus Using Annotation Projection

EHRMANN, Maud; TURCHI, Marco; STEINBERGER, Ralf

As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English version of a multi-parallel corpus and by projecting the annotations into all the other language versions. For the translation of English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database. For the projection, we incrementally apply different methods: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse.
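The incremental projection described in the abstract can be sketched as a three-stage cascade. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the definition of a consonant signature (lowercased letters with the vowels a, e, i, o, u, y removed), the normalised similarity score, and the 0.8 threshold are all hypothetical choices made for this example. Candidate translations are assumed to come from the machine translation system or the name database.

```python
from typing import Iterator, List, Optional, Tuple

VOWELS = set("aeiouy")  # assumption: the paper may define vowels differently


def consonant_signature(text: str) -> str:
    """Lowercase the text and keep only its non-vowel letters."""
    return "".join(ch for ch in text.lower() if ch.isalpha() and ch not in VOWELS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def windows(tokens: List[str], size: int) -> Iterator[Tuple[int, str]]:
    """Yield (start index, joined text) for every token window of the given size."""
    for start in range(len(tokens) - size + 1):
        yield start, " ".join(tokens[start:start + size])


def project_entity(
    candidates: List[str], tokens: List[str], min_similarity: float = 0.8
) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, method) for the first stage that finds a match, else None."""
    candidates = [c for c in candidates if c.split()]  # ignore empty candidates

    # Stage 1: perfect string matching.
    for cand in candidates:
        size = len(cand.split())
        for start, text in windows(tokens, size):
            if text == cand:
                return start, start + size, "string"

    # Stage 2: perfect consonant signature matching.
    for cand in candidates:
        size = len(cand.split())
        sig = consonant_signature(cand)
        for start, text in windows(tokens, size):
            if sig and consonant_signature(text) == sig:
                return start, start + size, "consonant_signature"

    # Stage 3: best edit-distance similarity at or above the threshold.
    best: Optional[Tuple[float, int, int]] = None
    for cand in candidates:
        size = len(cand.split())
        for start, text in windows(tokens, size):
            score = similarity(cand.lower(), text.lower())
            if score >= min_similarity and (best is None or score > best[0]):
                best = (score, start, start + size)
    return (best[1], best[2], "edit_distance") if best else None
```

Running the stages in this order means the stricter criteria are tried first, and the fuzzier edit-distance comparison is used only when no exact or signature match exists. For example, `project_entity(["Mohamed ElBaradei"], ["Mohamad", "ElBaradei"])` finds no exact match but succeeds at the consonant signature stage.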


2012-01-26

Socoma Ltd.

JRC65739

1313-8502

http://lml.bas.bg/~iva/ranlp2011/RANLR2011_Proceedings.PDF, https://publications.jrc.ec.europa.eu/repository/handle/JRC65739

