
An Efficient K-Means Clustering Algorithm for Large Data

K. Srinivasa Rao, K. Kiran Kumar, P. Srinivasa Rao

Abstract


Cluster analysis is one of the major data analysis methods for large data sets. It deals with organizing a collection of data objects into clusters based on some measure of similarity. k-means is one of the most popular data partitioning algorithms for solving the well-known clustering problem. The performance of k-means clustering depends greatly on the quality of the initial centroids. In the original k-means algorithm the initial centroids are typically chosen at random, so the clustering result may converge to a local optimum rather than the global optimum. Several improvements to the k-means algorithm have been proposed. This paper proposes an efficient k-means algorithm that finds better initial centroids and provides an efficient way of assigning data points to appropriate clusters. The proposed algorithm is tested on six benchmark data sets taken from the UCI Machine Learning Repository and is found to give better results than the existing algorithm.
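The dependence on initial centroids described above can be illustrated with a minimal sketch of the standard (Lloyd-style) k-means iteration. This is not the algorithm proposed in the paper, whose details are not given in the abstract; the function names (`k_means`, `sse`) and the four-point toy data set are assumptions chosen only for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def k_means(points, initial_centroids, max_iter=100):
    """Standard k-means started from caller-supplied initial centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Toy data (hypothetical): two tight pairs of points, far apart along x.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initial centroids inside each natural group reach the better partition.
good_centroids, good_labels = k_means(points, [(0, 0), (4, 0)])

# Initial centroids placed between the groups get stuck in a worse partition.
bad_centroids, bad_labels = k_means(points, [(2, 0), (2, 1)])

print(sse(points, good_centroids, good_labels))
print(sse(points, bad_centroids, bad_labels))
```

On this toy data both runs converge immediately, but the second start splits the points by their y-coordinate and stays there, giving a much larger sum of squared errors than the first. That is the local-optimum behaviour that better initialization schemes aim to avoid.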

Keywords


Clustering, Data Partitioning, Data Mining, Heuristic K-Means, K-Means Algorithm.





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.