The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
We present a new, unique and freely available parallel corpus covering all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. The UTF-8-encoded collection in XML format is accompanied by a tool that produces a bilingual paragraph-aligned parallel corpus for any of these language pairs. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-indexing software. Because of the considerable number of parallel texts in many languages, the JRC-Acquis is particularly suitable for testing and benchmarking text analysis software (for instance for alignment, sentence splitting and term extraction) across different languages.
STEINBERGER Ralf;
POULIQUEN Bruno;
WIDIGER Anna;
IGNAT Camelia;
ERJAVEC Tomaz;
TUFIS Dan;
VARGA Dániel;
2006-12-22
European Language Resources Association (ELRA)
JRC32786