The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages

We present a new, unique and freely available parallel corpus covering all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. The UTF-8-encoded collection in XML format is accompanied by a tool that produces a bilingual, paragraph-aligned parallel corpus for any of these language pairs. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-indexing software. Because it contains a considerable number of parallel texts in many languages, the JRC-Acquis is particularly suitable for testing and benchmarking text analysis software (for instance for alignment, sentence splitting and term extraction) across different languages.
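The figure of 190 language pairs follows from choosing 2 languages out of 20 without regard to order: 20 × 19 / 2 = 190. The short Python sketch below enumerates those pairs. The list of ISO 639-1 codes and the helper name `language_pairs` are assumptions made for illustration (the record itself does not list the languages), and the code is not part of the tool distributed with the corpus.

```python
from itertools import combinations

# Assumption: ISO 639-1 codes for the 20 official EU languages at the time
# of publication (2006). The record does not list them explicitly.
LANGUAGES = [
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
]


def language_pairs(languages):
    """Return every unordered pair of distinct language codes."""
    # Sorting first gives each pair a stable order, e.g. ("en", "fr").
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 190
```

Each additional language adds one pair with every language already present, which is why the count rises above 190 once candidate-country languages are included.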
2006-12-22
European Language Resources Association (ELRA)
JRC32786
Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice