Robustness Issues in Text Mining

We extend the Forward Search approach for robust data analysis to address problems in text mining. In this domain, datasets are collections of an arbitrary number of documents, which are represented as vectors of thousands of elements according to the vector space model. When the number of variables v is this large and the dataset size n is smaller by orders of magnitude, the sample covariance matrix is singular, so the traditional Mahalanobis metric cannot be used as a distance between documents. We show that by monitoring the cosine (dis)similarity measure with the Forward Search approach it is possible to perform robust estimation for a document collection and to order the documents so that the most dissimilar ones (possibly outliers for that collection) are left at the end. We also show that the presence of several groups of documents in the collection is clearly detected with multiple starts of the Forward Search.
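The ordering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the subset centroid as the fitted "model", and the toy term-count matrix are all assumptions made for illustration, and the statistic monitored in the paper may differ. The sketch grows a subset one document at a time, always keeping the documents with the smallest cosine dissimilarity to the current subset centroid, and records the order in which documents first enter.

```python
import numpy as np

def cosine_dissimilarity(X, c):
    """Return 1 - cosine similarity between each row of X and the vector c."""
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(c)
    norms = np.where(norms == 0, 1.0, norms)  # guard against zero vectors
    return 1.0 - (X @ c) / norms

def forward_search_order(X, start):
    """Simplified forward search (illustrative assumption, not the paper's code).

    X     : (n, v) document-term matrix
    start : indices of the initial subset
    Returns the order of first entry and, for each step, the smallest
    dissimilarity among documents still outside the subset.
    """
    n = X.shape[0]
    subset = np.array(start)
    entered = [int(i) for i in subset]
    monitored = []
    for m in range(len(start), n):
        centroid = X[subset].mean(axis=0)
        d = cosine_dissimilarity(X, centroid)
        outside = np.setdiff1d(np.arange(n), subset)
        monitored.append(float(d[outside].min()))
        # The next subset holds the m + 1 documents closest to the centroid.
        subset = np.argsort(d, kind="stable")[: m + 1]
        for i in subset:
            if int(i) not in entered:
                entered.append(int(i))
    return entered, monitored

# Toy collection: five similar documents and one unrelated document (row 5).
docs = np.array([[5, 1, 0, 0],
                 [4, 1, 0, 0],
                 [5, 2, 0, 0],
                 [6, 1, 1, 0],
                 [4, 2, 0, 0],
                 [0, 0, 1, 7]], dtype=float)

order, monitored = forward_search_order(docs, start=[0, 1, 2])
print(order)      # the unrelated document enters last
print(monitored)  # the monitored value jumps at the final step
```

In this toy setting the unrelated document enters last and the monitored dissimilarity rises sharply at that step, which is the kind of signal the abstract refers to. Repeating the search from several different starting subsets would be the corresponding sketch of the multiple-start idea for detecting groups.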
Date: 2013-01-09
Publisher: Springer Verlag
JRC number: JRC70256
ISBN: 978-3-642-33041-4
URLs: http://rd.springer.com/chapter/10.1007/978-3-642-33042-1_29, https://publications.jrc.ec.europa.eu/repository/handle/JRC70256
DOI: 10.1007/978-3-642-33042-1_29
Items published in the JRC Publications Repository are protected by copyright, with all rights reserved, unless otherwise indicated. Additional information: https://ec.europa.eu/info/legal-notice_en#copyright-notice