Text Representation Using Semantic Tags in the Context of Automatic Classification (Représentation de textes à l'aide d'étiquettes sémantiques dans le cadre de la classification automatique)

IGNAT, Camelia; ROUSSELOT, François

This paper describes an algorithm that represents documents in a reduced vector space through feature extraction. The algorithm is evaluated on the supervised classification of news articles. Each document is represented by a profile made up of semantic tags taken from a machine-readable dictionary. Synonymy is handled by thematic conflation, and polysemy by a statistical word-sense disambiguation method that we developed. We propose four variants of profile generation, depending on whether a recursive system is used and whether a corrective factor for polysemous words is applied. We evaluated 32 configurations, defined by the algorithm variant and three other parameters: grammatical category selection, 15% reduction of the profile, and a stop-list of semantic tags. Some parameters, such as profile reduction, have little influence on classifier performance, while others (the corrective factor for ambiguous words and the stop-list) improve performance noticeably.
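
To make the idea of a tag-based profile concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy dictionary that maps each word to the semantic tags of its senses, and it models the corrective factor for polysemous words as an even split of a word's weight across its candidate tags, which is a simplification standing in for the statistical disambiguation mentioned in the abstract. The names `TOY_DICTIONARY`, `build_profile`, `tag_stoplist`, `use_corrective_factor` and `keep_fraction` are all hypothetical, and the recursive variant and grammatical category selection are not reproduced.

```python
import math
from collections import Counter

# Hypothetical toy dictionary: word -> semantic tags, one per sense.
# Words sharing a tag ("loan", "credit") are conflated under that tag.
TOY_DICTIONARY = {
    "bank": ["FINANCE", "GEOGRAPHY"],  # polysemous word
    "loan": ["FINANCE"],
    "credit": ["FINANCE"],
    "river": ["GEOGRAPHY"],
}


def build_profile(tokens, dictionary, tag_stoplist=frozenset(),
                  use_corrective_factor=True, keep_fraction=1.0):
    """Return a normalized {tag: weight} profile for a list of tokens.

    Assumption: the corrective factor divides a word's contribution by
    its number of candidate tags. The paper's actual factor may differ.
    """
    weights = Counter()
    for token in tokens:
        tags = dictionary.get(token.lower(), [])
        if not tags:
            continue  # word not covered by the dictionary
        weight = 1.0 / len(tags) if use_corrective_factor else 1.0
        for tag in tags:
            if tag not in tag_stoplist:
                weights[tag] += weight
    if not weights:
        return {}
    # Profile reduction: keep only the top-weighted share of the tags.
    n_keep = max(1, math.ceil(len(weights) * keep_fraction))
    kept = weights.most_common(n_keep)
    total = sum(w for _, w in kept)
    return {tag: w / total for tag, w in kept}


if __name__ == "__main__":
    tokens = "The bank approved the loan and the credit".split()
    print(build_profile(tokens, TOY_DICTIONARY))
    print(build_profile(tokens, TOY_DICTIONARY, use_corrective_factor=False))
```

In this sketch the ambiguous word "bank" contributes half a unit to each of its two tags when the corrective factor is on, so the unambiguous words "loan" and "credit" dominate the profile. Whether the 15% figure in the abstract means that 15% of the tags are removed or retained is not stated, so `keep_fraction` is left as a free parameter.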

2008-05-30

Romanian Academy

JRC40906

0035-3957

https://publications.jrc.ec.europa.eu/repository/handle/JRC40906

Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice