This paper describes an algorithm that uses feature extraction to represent documents in a reduced vector space. The algorithm is applied to, and evaluated on, the supervised classification of news articles from the newspaper Le Monde published in 2003 and 2004.
We generate a document representation (or profile) in a space of 800 dimensions, where each dimension corresponds to a semantic tag from a machine-readable dictionary. We address two issues: synonymy, which is handled by thematic conflation, and polysemy, for which we have developed a statistical method for word-sense disambiguation.
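As a rough illustration of the idea of a tag-based profile, the following minimal sketch maps words to semantic tags and accumulates tag weights. It is not the authors' algorithm: the toy dictionary, the tag names, the function `build_profile` and its `polysemy_correction` flag are all hypothetical, and the uniform split of a polysemous word's weight across its tags is an assumed stand-in for the statistical word-sense disambiguation mentioned above.

```python
from collections import Counter

# Hypothetical toy dictionary: each word maps to one or more semantic tags.
# Synonyms share a tag (conflation); polysemous words carry several tags.
TOY_DICTIONARY = {
    "bank": ["FINANCE", "GEOGRAPHY"],  # polysemous
    "loan": ["FINANCE"],
    "credit": ["FINANCE"],
    "river": ["GEOGRAPHY"],
}


def build_profile(tokens, dictionary, polysemy_correction=True):
    """Return a normalized {tag: weight} profile for a list of tokens.

    Assumption: with polysemy_correction enabled, a word with n tags
    contributes 1/n to each tag; otherwise it contributes 1 to each tag.
    """
    profile = Counter()
    for token in tokens:
        tags = dictionary.get(token, [])
        if not tags:
            continue  # word not covered by the dictionary
        weight = 1.0 / len(tags) if polysemy_correction else 1.0
        for tag in tags:
            profile[tag] += weight
    total = sum(profile.values())
    if total == 0:
        return {}
    return {tag: value / total for tag, value in profile.items()}


tokens = ["bank", "loan", "credit", "river"]
corrected = build_profile(tokens, TOY_DICTIONARY, polysemy_correction=True)
uncorrected = build_profile(tokens, TOY_DICTIONARY, polysemy_correction=False)
```

In this toy example the correction lowers the share that the ambiguous word "bank" contributes to each of its tags, which is the kind of effect a corrective factor for polysemous words is meant to have.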
We propose four variants of document profile generation, depending on whether or not a recursive system is used and whether or not a corrective factor for polysemous words is applied. To determine the best classifier produced by our algorithm, we evaluated 32 variants: the four algorithm types above combined with three other parameters that influence the document representation (grammatical category selection, a 15% reduction of the profile, and a stop-list of semantic tags). The evaluation is carried out on a set of documents from six categories by computing precision, recall and the F-measure, in order to determine the best algorithm with respect to threshold detection. Some parameters (such as profile reduction) have little influence on classifier performance, while others (the corrective factor for ambiguous words, the stop-list) improve it noticeably.
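The count of 32 variants is consistent with four algorithm types combined with three on/off parameters (4 × 2 × 2 × 2 = 32). The sketch below enumerates such a grid and computes per-category precision, recall and F-measure. The treatment of the three parameters as binary switches, the field order in each variant tuple, the function name `precision_recall_f1`, and the use of the balanced F1 form of the F-measure are assumptions made for illustration only.

```python
from itertools import product

# Assumed encoding: (recursive, corrective_factor) gives the four algorithm types.
ALGORITHM_TYPES = list(product([False, True], repeat=2))
# Assumed encoding: (category_selection, profile_reduction, tag_stop_list).
OPTION_FLAGS = list(product([False, True], repeat=3))

# Each variant is a 5-tuple of booleans.
VARIANTS = [algo + opts for algo, opts in product(ALGORITHM_TYPES, OPTION_FLAGS)]


def precision_recall_f1(true_labels, predicted_labels, category):
    """Compute precision, recall and balanced F-measure for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, not data from the paper.
true_labels = ["sport", "sport", "politics", "economy"]
predicted_labels = ["sport", "politics", "politics", "sport"]
scores = precision_recall_f1(true_labels, predicted_labels, "sport")
```

Running the same scoring function over the output of each of the 32 configurations would give one comparable score per variant and category.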
IGNAT Camelia;
ROUSSELOT Francois;
VINESSE Jerome;
2006-11-27
JRC32638
https://publications.jrc.ec.europa.eu/repository/handle/JRC32638