Title: Massive Multilingual Corpus Compilation - Acquis Communautaire and Totale
Citation: JOURNAL ARCHIVES OF CONTROL SCIENCES vol. 15 no. 4 p. 519-540
Publication Year: 2005
JRC N°: JRC31129
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC31129
Type: Articles in Journals
Abstract: Large, uniformly encoded collections of texts, or corpora, are an invaluable source of data, not only for linguists but also for Language Technology tools. Multilingual parallel corpora are especially useful, as they enable, for example, the induction of translation knowledge in the shape of multilingual lexica or full-fledged machine translation models. The utility of such corpora is even greater if they are sentence-aligned between the languages and linguistically annotated. But parallel corpora, especially large ones, are still scarce and have so far been difficult to acquire. Recently, however, a large new source of parallel texts has become available on the Web, which contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different languages. The paper discusses the compilation of this text collection into the massively multilingual JRC-Acquis corpus, which is freely available for research use. Next, the text annotation tool “totale”, which performs multilingual text tokenization, tagging and lemmatisation, is presented. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. To train totale for seven different languages we have used the MULTEXT-East corpus and lexicons; we describe this resource and the training of totale, and its application to the Slovene part of the ACQUIS corpus. The paper concludes with the description of a preliminary experiment on alignment, and plans for the future.
JRC Institute: Institute for the Protection and Security of the Citizen

