JRC Publications Repository

Multi-label Eurovoc classification for Eastern and Southern EU languages

Multi-label document classification is the task of automatically assigning multiple categories to the same document (e.g. a book is about cooking and about Austrian food). At least for Machine Learning approaches, this task is harder than standard (single-label) classification because it is not clear to the learning software whether the presence of a feature (typically a word) indicates one class or another (e.g. whether the presence of the word ‘salt’ indicates the category cooking or the category Austrian food). Multi-label classification is a real challenge if the number of classes is very high and if the training documents are unevenly distributed across categories. We present experiments with the JRC EuroVoc Indexer software JEX (Steinberger et al. 2012), which has been trained for all official EU languages on tens of thousands of documents per language to assign the thousands of class labels of the EuroVoc thesaurus. JEX is a multi-label classification system using a bag-of-words document representation. When applying such a tool, which uses word forms as classifier features, to languages as different as Germanic (e.g. English), Romance (e.g. French), Slavic (e.g. Czech or Polish) and Finno-Ugric languages (e.g. Estonian or Hungarian), the question arises of how much the classifier performance differs. It can be expected that the significantly higher ratio of word forms to lemmas in Slavic and Finno-Ugric languages has a negative impact on classifier performance, or that more training material would be needed for these more highly inflected languages to achieve the same performance. Similarly, one might wonder whether part-of-speech (POS) information is useful. JEX will soon be made available to parliamentary and other users. The experiments described in this chapter thus have practical relevance, as they can indicate to users and their technical partners whether they should invest in improving the software through linguistic pre-processing.
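The idea of multi-label classification over a bag-of-words representation can be illustrated with a minimal Python sketch. It is not the JEX algorithm: the per-label word profiles, the cosine scoring, the threshold value and all function names and toy documents are assumptions chosen only for illustration. Each document may carry several labels, and a word such as ‘salt’ contributes to the profile of every label it co-occurs with.

```python
from collections import Counter, defaultdict


def bag_of_words(text):
    """Count raw word forms (lowercased, whitespace-split, no lemmatisation)."""
    return Counter(text.lower().split())


def train(documents):
    """Build one word-count profile per label from (text, labels) pairs."""
    profiles = defaultdict(Counter)
    for text, labels in documents:
        bow = bag_of_words(text)
        for label in labels:
            # A document with several labels feeds every one of their profiles.
            profiles[label].update(bow)
    return profiles


def score(bow, profile):
    """Cosine similarity between a document vector and a label profile."""
    dot = sum(count * profile[word] for word, count in bow.items())
    norm_doc = sum(c * c for c in bow.values()) ** 0.5
    norm_profile = sum(c * c for c in profile.values()) ** 0.5
    if norm_doc == 0 or norm_profile == 0:
        return 0.0
    return dot / (norm_doc * norm_profile)


def classify(text, profiles, threshold=0.3):
    """Return every label whose profile scores at or above the threshold."""
    bow = bag_of_words(text)
    return sorted(
        label for label, profile in profiles.items()
        if score(bow, profile) >= threshold
    )


# Hypothetical toy training set: the second document has two labels.
TRAINING = [
    ("salt flour butter oven", {"cooking"}),
    ("salt schnitzel vienna strudel", {"cooking", "austrian food"}),
    ("vienna strudel schnitzel", {"austrian food"}),
]

if __name__ == "__main__":
    profiles = train(TRAINING)
    print(classify("schnitzel strudel salt", profiles))
    print(classify("flour butter oven", profiles))
    print(classify("schnitzels", profiles))
```

In this sketch the inflected form ‘schnitzels’ matches no feature seen in training, so no label is assigned. This is the kind of sparsity that a higher ratio of word forms to lemmas may cause when word forms are used directly as features.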
2013-02-08
Cambridge Scholar Press
JRC68883
978-1-4438-3878-8
http://www.c-s-p.org/flyers/Multilingual-Processing-in-Eastern-and-Southern-EU-Languages--Low-Resourced-Technologies-and-Transla1-4438-3878-0.htm
https://publications.jrc.ec.europa.eu/repository/handle/JRC68883
Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice