DCEP - Digital Corpus of the European Parliament
The paper presents a new highly multilingual sentence-aligned parallel corpus consisting of various document types and covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institution. The paper also covers corpus statistics, the required preprocessing, sentence alignment, and the possible gains in statistical machine translation from adding this corpus to previously existing ones.
HAJLAOUI Najeh;
KOLOVRATNÍK David;
VAEYRYNEN Jaakko;
STEINBERGER Ralf;
VARGA Dániel;
2014-09-24
European Language Resources Association (ELRA)
JRC87087
978-2-9517408-8-4
http://www.lrec-conf.org/proceedings/lrec2014/pdf/943_Paper.pdf
https://publications.jrc.ec.europa.eu/repository/handle/JRC87087