Personalization and Clustering of Similar Web Pages

Smita Gupta; Anurag Malik

Abstract

Over the last decade, we have truly arrived in the proverbial information age. The amount of online resources "out there" has expanded enormously. Moreover, the evolution of the Internet into the Global Information Infrastructure, together with the massive popularity of the Web, has enabled the ordinary citizen to become not just a consumer of information but also a contributor to it. To spare users unnecessary trouble, their time and effort must be saved. Some means is therefore needed to deliver relevant information to the user quickly and to let the user manage the large volume of data without difficulty. In this paper, we use the tf-idf (term frequency-inverse document frequency) technique together with the concept of web mining to achieve this. Web mining is the application of data mining techniques to discover patterns from the Web. Of its branches (Web usage mining, Web content mining, and Web structure mining), this work addresses only Web content mining. A Windows application is developed that acts as a data analysis tool. The application uses the API of the Bing search engine. The proposed algorithm is applied to the snippets (the short descriptions shown below each search result) of web search results to find the web pages that contain the largest number of query words. It also aims to make the information easier to manage on the client's machine by using a simple grouping technique.
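As an illustration of the idea summarized in the abstract, the following minimal Python sketch ranks search-result snippets by the summed tf-idf weight of the query words they contain. It is not the authors' implementation: the function names `tokenize` and `rank_snippets`, the smoothed idf formula, and the hard-coded example snippets are all assumptions made for this example, and no call to the Bing API is shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_snippets(query, snippets):
    """Return (score, index) pairs, best first, scored by tf-idf over query words.

    Hypothetical helper: the weighting below is one common tf-idf variant,
    chosen for illustration only.
    """
    docs = [Counter(tokenize(s)) for s in snippets]
    n_docs = len(docs)
    query_terms = set(tokenize(query))

    # Document frequency: in how many snippets does each query word occur?
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}

    scored = []
    for index, counts in enumerate(docs):
        length = sum(counts.values()) or 1  # avoid division by zero
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # query word appears in no snippet
            tf = counts[term] / length
            # Smoothed idf (assumption) so common words keep a positive weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            score += tf * idf
        scored.append((score, index))

    # Highest score first; ties keep the original result order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    example_snippets = [
        "web mining discovers patterns from the web",
        "cooking recipes for pasta",
        "data mining techniques",
    ]
    for score, index in rank_snippets("web mining", example_snippets):
        print(f"{score:.3f}  {example_snippets[index]}")
```

Snippets that contain more of the query words, and rarer ones, score higher, so they can be shown first or grouped together on the client.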

Keywords

Term Frequency-Inverse Document Frequency (TF-IDF); Static Clustering; Mining Methods and Algorithms; Text Mining; Web Mining; Information Retrieval

This work is licensed under a Creative Commons Attribution 3.0 License.
