Title: Un Algorithme de Génération de Profil de Document Et Son Evaluation dans le Contexte de la Classification Thématique
Citation: The 8th International Conference on the Statistical Analysis of Textual Data p. 529-542
Publication Year: 2006
JRC N°: JRC32638
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC32638
Type: Articles in periodicals and books
Abstract: This paper describes an algorithm for document representation in a reduced vectorial space by a process of feature extraction. The algorithm is applied and evaluated in the context of the supervised classification of news articles from the collection of Le Monde newspaper issued in the years 2003 and 2004. We are generating a document representation (or profile), in a space of 800 dimensions, represented by semantic tags from a machine-readable dictionary. We are dealing with two issues: the synonymy handled by thematic conflation and polysemy for which we have developed a statistical method for word-sense disambiguation. We propose four variants for the profile generation (of a document) depending on whether a recursive system is used or not, and whether a corrective factor for polysemous words is taken into account or not. To determine the best classifier provided by our algorithm we have evaluated 32 variants, depending on the algorithm type (as previously) and on three other parameters that influence the document representation: grammatical category selection, 15% reduction of the profile, and a stop-list of semantic tags. The evaluation is done on a set of documents from six categories by calculating the precision, the recall and the F-measure to determine the best algorithm related to the threshold detection. Some parameters (like profile reduction) have low influence on the classifier performance and others (corrective factor for the ambiguous words, stop-list) improve it noticeably.
JRC Directorate:Space, Security and Migration

Files in This Item:
There are no files associated with this item.

Items in repository are protected by copyright, with all rights reserved, unless otherwise indicated.