This paper describes an algorithm that uses feature extraction to represent documents in a reduced vector space. The algorithm is applied and evaluated in the context of the supervised classification of news articles from the Le Monde newspaper collection published in 2003 and 2004. We generate a document representation (or profile) in a space of 800 dimensions, each dimension being a semantic tag from a machine-readable dictionary. We address two issues: synonymy, which is handled by thematic conflation, and polysemy, for which we have developed a statistical method for word-sense disambiguation. We propose four variants of document profile generation, depending on whether or not a recursive system is used and whether or not a corrective factor for polysemous words is taken into account. To determine the best classifier provided by our algorithm, we evaluated 32 variants, defined by the algorithm type (the four variants above) and by three other parameters that influence the document representation: grammatical category selection, a 15% reduction of the profile, and a stop-list of semantic tags. The evaluation is carried out on a set of documents from six categories by calculating precision, recall and the F-measure, in order to determine the best algorithm with respect to threshold detection. Some parameters (such as profile reduction) have little influence on classifier performance, while others (the corrective factor for ambiguous words and the stop-list) improve it noticeably.
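The count of 32 variants follows from the abstract's description: two binary choices define the four profile-generation variants, and three further binary parameters multiply that by eight. The sketch below enumerates that grid and computes the three reported metrics from raw counts. It is only an illustration: the parameter names and both functions are hypothetical, and it assumes the balanced F-measure (F1), since the abstract does not say how precision and recall are weighted.

```python
from itertools import product

# Hypothetical names: the abstract describes these five on/off choices
# in words only and gives no identifiers for them.
PARAMETERS = (
    "recursive",            # recursive system used or not
    "polysemy_correction",  # corrective factor for polysemous words
    "pos_selection",        # grammatical category selection
    "profile_reduction",    # 15% reduction of the profile
    "tag_stoplist",         # stop-list of semantic tags
)


def enumerate_variants():
    """Return every on/off combination of the five binary parameters."""
    return [
        dict(zip(PARAMETERS, flags))
        for flags in product((False, True), repeat=len(PARAMETERS))
    ]


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall and balanced F-measure (F1, an assumption here)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


variants = enumerate_variants()
print(len(variants))  # 4 algorithm types x 8 parameter settings
```

In a setup like this, each of the six categories would be scored separately for every variant, and the variants then compared on the resulting F-measure.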
2006-11-27
JRC32638
https://publications.jrc.ec.europa.eu/repository/handle/JRC32638
