
Analysis of Heuristic Measures for Cluster Split in Bisecting K-means

Y. Sri Lalitha, A. Govardhan

Abstract


With the ever-increasing number of documents on the web and in other repositories, manually organizing and categorizing these documents to meet the diverse needs of users is a complicated task; hence clustering, a machine learning technique, is very useful. The work proposed in this paper is based on shared neighbors: two documents are said to be neighbors of each other when their similarity is greater than a threshold. We work with bisecting k-means, in which cluster quality depends on which cluster is chosen to be split at each step until k clusters are formed. Automatically selecting the cluster to split is difficult and time-consuming for text documents because of their high dimensionality. This paper implements bisecting k-means, a text document clustering technique, to analyze the best criteria for selecting the cluster to be split. We compared our results with those proposed in the literature and observed that our approach gave promising results when tested on real-life data sets.
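As an illustration of the procedure summarized above, the following is a minimal Python sketch of bisecting k-means with a pluggable rule for choosing the cluster to split, together with the threshold-based neighbor test. It is not the authors' implementation. The function names, the cosine similarity measure, the deterministic seeding, and the two split criteria shown ("size" and "cohesion") are assumptions made for illustration; the abstract does not state which criteria were evaluated.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def are_neighbors(a, b, theta):
    """Two documents are neighbors when their similarity exceeds theta."""
    return cosine(a, b) > theta

def centroid(cluster):
    """Component-wise mean of the document vectors in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def cohesion(cluster):
    """Average similarity of the documents to their cluster centroid."""
    c = centroid(cluster)
    return sum(cosine(d, c) for d in cluster) / len(cluster)

def bisect(cluster, iters=10):
    """Split one cluster (two or more documents) with a 2-means pass.

    Assumed seeding: the first document and the document least similar to it.
    """
    seed_a = cluster[0]
    seed_b = min(cluster, key=lambda d: cosine(d, seed_a))
    centers = [seed_a, seed_b]
    parts = [[], []]
    for _ in range(iters):
        parts = [[], []]
        for d in cluster:
            nearer = 0 if cosine(d, centers[0]) >= cosine(d, centers[1]) else 1
            parts[nearer].append(d)
        if not parts[0] or not parts[1]:
            # Degenerate case (e.g. identical documents): split in half.
            half = len(cluster) // 2
            return cluster[:half], cluster[half:]
        centers = [centroid(p) for p in parts]
    return parts[0], parts[1]

def bisecting_kmeans(docs, k, criterion="size"):
    """Repeatedly split one chosen cluster until k clusters exist.

    criterion="size" splits the largest cluster; criterion="cohesion"
    splits the cluster with the lowest average similarity to its centroid.
    Both criteria are illustrative assumptions.
    """
    clusters = [list(docs)]
    while len(clusters) < k:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not splittable:
            break  # every cluster is a single document
        if criterion == "size":
            i = max(splittable, key=lambda j: len(clusters[j]))
        else:
            i = min(splittable, key=lambda j: cohesion(clusters[j]))
        left, right = bisect(clusters.pop(i))
        clusters += [left, right]
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
two = bisecting_kmeans(docs, 2, criterion="size")
three = bisecting_kmeans(docs, 3, criterion="cohesion")
```

Only the selection step inside `bisecting_kmeans` changes between criteria, which is what makes it straightforward to compare several splitting heuristics on the same data.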


 


Keywords


Text Clustering, Similarity Measures, Coherent Clustering, Splitting Criteria.




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.