Title: Fighting Web Spam
Authors: SYDOW Marcin; PISKORSKI Jakub; WEISS Dawid; CASTILLO Carlos
Publisher: IOS Press
Publication Year: 2008
JRC N°: JRC48839
ISBN: 978-1-58603-898-4
ISSN: 1874-6268
URI: http://www.iospress.nl/
http://publications.jrc.ec.europa.eu/repository/handle/JRC48839
Type: Articles in books
Abstract: High ranking of a Web site in search engines can be directly correlated to high revenues. This amplifies the phenomenon of Web spamming, which can be defined as preparing or manipulating any features of Web documents or hosts to mislead search engines' ranking algorithms and gain an undeservedly high position in search results. Web spam markedly degrades the quality of information available on the Web and thus affects the whole Web community, including search engines. The struggle between search engines and spammers is ongoing: both sides apply increasingly sophisticated techniques and counter-techniques against each other. In this paper, we first present general background on the Web spam phenomenon. We then explain why the machine learning approach is so attractive for combating Web spam. Finally, we report the results of experiments aimed at investigating certain open questions. We investigate the quality of the data provided as the Web Spam Reference Corpus, widely used by the research community as a benchmark, and propose some improvements. We also address the question of parameter tuning for cost-sensitive classifiers, and we explore the possibility of using linguistic features to distinguish spam from non-spam.
JRC Institute: Institute for the Protection and Security of the Citizen


