Data Mining Approaches for Web Spam Detection

K.M. Annammal, J. Sugunthan, A. Siva Sundari, N. Jaisankar

Abstract


Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. Specifically, we apply the Kullback–Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. The system first applies a hybrid clustering step that combines K-means and SVM, and then classifies pages with C4.5 using qualified link-based features and LM-based ones. The result is an accurate system for detecting Web spam using fewer features.
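As a rough illustration of the Kullback–Leibler step, the sketch below compares the language model of a source text (for example, the anchor text of a link) with that of the linked page. The abstract does not say which text sources, tokenization, or smoothing the system uses, so the unigram models, whitespace tokenization, add-one smoothing, and all function names here are assumptions made for the example, not the authors' implementation.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Smoothed unigram model over a shared vocabulary (alpha is an assumed add-alpha constant)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def page_divergence(source_text, target_text, alpha=1.0):
    """Divergence between the language models of two linked texts (hypothetical helper)."""
    src = source_text.lower().split()
    tgt = target_text.lower().split()
    vocab = set(src) | set(tgt)  # shared vocabulary keeps q nonzero wherever p is nonzero
    p = unigram_lm(src, vocab, alpha)
    q = unigram_lm(tgt, vocab, alpha)
    return kl_divergence(p, q)


# A link whose anchor text matches its target should diverge less
# than one pointing at unrelated content.
related = page_divergence("cheap flights to paris",
                          "book cheap flights to paris today")
unrelated = page_divergence("cheap flights to paris",
                            "buy pills casino poker bonus now")
print(related < unrelated)
```

A high divergence between a link's context and its target would be one candidate feature for the classifier; how such values are combined with the link-based features is not described in the abstract.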


Keywords


Content Analysis, Information Retrieval, Language Models (LMs), Link Integrity, Web Spam Detection



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.