|Title:||The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages|
|Authors:||STEINBERGER Ralf; POULIQUEN Bruno; WIDIGER Anna; IGNAT Camelia; ERJAVEC Tomaz; TUFIS Dan; VARGA Dániel|
|Citation:||The 5th International Conference on Language Resources and Evaluation - Proceedings p. 2142-2147|
|Publisher:||European Language Resources Association (ELRA)|
|JRC Publication N°:||JRC32786|
|Type:||Contributions to Conferences|
|Abstract:||We present a new, unique and freely available parallel corpus in all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. The UTF-8-encoded collection in XML format is accompanied by a tool that produces a bilingual paragraph-aligned parallel corpus for any of the 190+ possible language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-indexing software. Owing to the considerable number of parallel texts in many languages, the JRC-Acquis is particularly suitable for testing and benchmarking text analysis software (for instance for alignment, sentence splitting and term extraction) across different languages.|
|JRC Institute:||Institute for the Protection and Security of the Citizen|
Items in repository are protected by copyright, with all rights reserved, unless otherwise indicated.