Implementation of K-Modes Algorithm to Cluster Very Large Categorical Data Sets in Data Mining

K. Sujatha

Abstract

This paper concerns clustering in Data Mining. Partitioning a large set of objects into homogeneous groups, a process known as clustering, is a fundamental operation in Data Mining. The K-Means algorithm is commonly used to cluster large data sets, but it works efficiently only on numerical values. This limits its use in Data Mining, where data sets often contain categorical values. In this paper we present the K-Modes algorithm, which extends the K-Means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace the means of clusters with modes, and use a frequency-based method to update the modes during the clustering process. The K-Modes algorithm is implemented using the WEKA tool.
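The three ingredients named in the abstract (a dissimilarity measure for categorical objects, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal sketch. It is written in plain Python rather than WEKA, and it assumes simple matching (the count of mismatched attributes) as the dissimilarity measure. The function names `matching_dissimilarity`, `column_modes`, and `k_modes` are hypothetical, chosen for illustration only. This is not the paper's implementation.

```python
import random
from collections import Counter


def matching_dissimilarity(a, b):
    """Assumed measure: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Frequency-based update: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=100, seed=0):
    """Cluster categorical records (tuples) into k groups; returns (labels, modes)."""
    rng = random.Random(seed)
    distinct = list(dict.fromkeys(records))  # initial modes must be distinct
    if k > len(distinct):
        raise ValueError("k exceeds the number of distinct records")
    modes = rng.sample(distinct, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
        # Update step: recompute each cluster's mode from its members.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


if __name__ == "__main__":
    data = [("red", "small")] * 3 + [("blue", "large")] * 3
    labels, modes = k_modes(data, k=2)
    print(labels, modes)
```

Random initialisation from distinct records is only one possible choice. As with K-Means, the result can depend on the initial modes, so this sketch converges to a local optimum rather than a guaranteed global one.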

Keywords

Categorical Data, Clustering, Data Mining, Dissimilarity Measures, K-Means, K-Modes, Weka Tool

This work is licensed under a Creative Commons Attribution 3.0 License.
