Title: Massive Multilingual Corpus Compilation: Acquis Communautaire and Totale
Citation: The 2nd Language & Technology Conference - Human Language Technologies as a Challenge for Computer Science and Linguistics
Publisher: Association for Computing Machinery (ACM) and UAM Fundacja
Publication Year: 2005
JRC N°: JRC31095
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC31095
Type: Articles in periodicals and books
Abstract: The paper discusses the compilation of massively multilingual corpora, the EU ACQUIS corpus, and the corpus annotation tool “totale”. The ACQUIS text collection has recently become available on the Web and contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different languages. Such document collections can serve as the basis for multilingual parallel corpora of unprecedented size and variety of languages, useful as training and testing datasets for a host of different HLT applications. The paper describes the steps that were undertaken to turn the text collection into a linguistically annotated text corpus. In particular, we discuss the harvesting and wrapper induction of the corpus, and the use of its annotation with EuroVoc descriptors. Next, we present the text annotation tool “totale”, which performs multilingual text tokenization, tagging and lemmatisation. The tool implements a simple pipelined architecture which is, for the most part, fully trainable, requiring a word-level syntactically annotated text corpus and, optionally, a morphological lexicon. To train totale for seven different languages we have used the MULTEXT-East corpus and lexicons; we describe this resource, the training of totale, and its application to the ACQUIS corpus. Finally, we turn to the current experiments in aligning the corpus and the developments we plan to undertake in the future.
JRC Directorate: Space, Security and Migration