Title: Multi-word Entity Classification in a Highly Multilingual Environment
Publisher: The Association for Computational Linguistics (ACL)
Publication Year: 2017
JRC N°: JRC105656
ISBN: 978-1-945626-48-7
URI: http://aclweb.org/anthology/W17-17
Type: Articles in periodicals and books
Abstract: This paper describes an approach for the classification of millions of existing multiword entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. In order to classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was the multi-class SVM using a TF.IDF-weighted data representation. Interestingly, a single classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.
JRC Directorate: Joint Research Centre Corporate Activities
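
Illustrative note: the abstract states that entities are classified using only the tokens they contain, in a TF.IDF-weighted representation. The following is a minimal sketch of one common TF.IDF variant (term frequency times log inverse document frequency), using only the Python standard library. The function name, the whitespace tokenisation, the exact weighting formula, and the toy entity names are assumptions made for illustration and are not taken from the paper. The SVM training step is not shown.

```python
import math
from collections import Counter

def tfidf_vectors(entities):
    """Return one {token: weight} dict per entity name.

    Weight = (token count / entity length) * log(N / document frequency).
    This is one common TF.IDF variant, assumed here for illustration.
    """
    # Hypothetical tokenisation: lowercase and split on whitespace.
    docs = [Counter(name.lower().split()) for name in entities]
    n_docs = len(docs)
    # Document frequency: number of entity names containing each token.
    df = Counter(token for doc in docs for token in doc)
    return [
        {tok: (count / sum(doc.values())) * math.log(n_docs / df[tok])
         for tok, count in doc.items()}
        for doc in docs
    ]

# Toy, invented examples (not from the paper's data).
entities = [
    "University of Oxford",
    "University of Cambridge",
    "Battle of Hastings",
]
vectors = tfidf_vectors(entities)
```

In this toy set, a token that occurs in every name (such as "of") receives zero weight, while a token unique to one name is weighted more heavily than one shared by several names. Sparse vectors of this kind could then be passed to a multi-class classifier.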

