Multi-word Entity Classification in a Highly Multilingual Environment

This paper describes an approach for classifying millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. To classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained distantly supervised classifiers in 43 languages on MWEntities extracted from BabelNet. The best-performing classifier was a multi-class SVM using a TF.IDF-weighted data representation. Interestingly, a single classifier trained on a mix of all languages consistently outperformed classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.
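As a rough illustration of the token-based TF.IDF representation mentioned in the abstract, the following minimal sketch turns each entity name into a weighted token vector. It is not the authors' implementation: the whitespace tokeniser, the smoothed IDF formula, the L2 normalisation, and the toy entity names are all assumptions made for the example, and the SVM training step is left out.

```python
import math
from collections import Counter


def tokenize(entity):
    # Assumption: lowercase whitespace tokenisation; the tokeniser
    # actually used in the paper is not described in this record.
    return entity.lower().split()


def tfidf_vectors(entities):
    """Return one sparse TF.IDF vector (token -> weight) per entity name."""
    docs = [tokenize(e) for e in entities]
    n_docs = len(docs)

    # Document frequency: in how many entity names each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))

    # Smoothed IDF (one common variant, assumed here for illustration).
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {tok: count * idf[tok] for tok, count in tf.items()}
        # L2-normalise so that long and short names are comparable.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


# Hypothetical toy entities, not taken from the paper's data.
ENTITIES = [
    "University of Oxford",
    "University of Bologna",
    "Battle of Waterloo",
]
VECTORS = tfidf_vectors(ENTITIES)
```

In this toy set, a token shared by every name ("of") receives the lowest weight, while a token unique to one name ("oxford") receives the highest. Vectors of this kind would then be passed to a multi-class classifier such as an SVM.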
2018-01-12
The Association for Computational Linguistics (ACL)
JRC105656
978-1-945626-48-7
http://aclweb.org/anthology/W17-17, https://publications.jrc.ec.europa.eu/repository/handle/JRC105656
Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice