Title: The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
Authors: STEINBERGER Ralf; POULIQUEN Bruno; WIDIGER Anna; IGNAT Camelia; ERJAVEC Tomaz; TUFIS Dan; VARGA Dániel
Citation: The 5th International Conference on Language Resources and Evaluation - Proceedings p. 2142-2147
Publisher: European Language Resources Association (ELRA)
Publication Year: 2006
JRC Publication N°: JRC32786
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC32786
Type: Contributions to Conferences
Abstract: We present a new, unique and freely available parallel corpus covering all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. The UTF-8-encoded collection in XML format is accompanied by a tool that produces a bilingual paragraph-aligned parallel corpus for any of the 190+ possible language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-indexing software. Because it contains a considerable number of parallel texts in many languages, the JRC-Acquis is particularly suitable for testing and benchmarking text analysis software (for instance for alignment, sentence splitting and term extraction) across different languages.
JRC Institute: Institute for the Protection and Security of the Citizen
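
Note on the "190+" figure in the abstract: it is consistent with counting unordered pairs among 20 languages, since 20 × 19 / 2 = 190; documents in candidate-country languages would add further pairs. The following is a minimal sketch of that count, not part of the JRC-Acquis distribution. The two-letter language codes are an assumption made for illustration (the 20 official EU languages as of 2006); the record itself does not list them.

```python
from itertools import combinations
from math import comb

# Assumed for illustration: two-letter codes for the 20 official
# EU languages as of 2006. The catalog record does not list them.
LANGS = [
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
]

# Every unordered language pair, e.g. ("de", "en").
pairs = list(combinations(LANGS, 2))

# The enumerated count should agree with the closed form n * (n - 1) / 2.
assert len(pairs) == comb(len(LANGS), 2)

print(len(LANGS), "languages give", len(pairs), "language pairs")
```

Each additional language adds one pair per language already covered, which is why the abstract gives the figure as a lower bound.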
