
Efficient and Effortless Similarity Measures for Cluster Ensembles

R.J. Anandhi, Dr. Natarajan Subramaniyam

Abstract


Spatial data mining deals with the discovery of implicit knowledge in spatial data. With the tremendous rise in the accumulation of spatial data, new approaches to spatial data mining have become a critical requirement. The availability of many clustering algorithms and their derivatives, together with the success of bagging and boosting in classification, has brought cluster ensembles into the limelight over the last decade. Several ensemble techniques are available, including voting, graph-based, and information-theoretic approaches. In our work, we show that a guided approach to combining the outputs of the various clusterers reduces the intensive computation and also generates robust clusters. Cluster ensembles provide a tool for consolidating the results from a portfolio of individual clusterings. The major challenge in fusing ensembles is generating the voting matrix or proximity matrix, which is of order n², where n is the number of data points. For spatial datasets this is very expensive in both time and space. Instead, our method computes a symmetric clusterer compatibility matrix of order (m x m), where m is the number of clusterers and m << n, using the cumulative similarity between the clusters of the clusterers. This matrix is used to identify which two clusterers, if fused first, will provide the greater information gain. This paper discusses the need for simple, elegant, yet effective similarity measures for cluster mining. Because the underlying data structure is already known in the case of cluster ensembles, we use that knowledge to find the similarity between the probable clusterer merge points. We use a set theory approach and the Shannon partition entropy as the basis for our calculation of multiparty merge entropy.
The correctness and efficiency of the proposed cluster ensemble algorithm are demonstrated using various cluster validity metrics, namely accuracy, misclassification rate, Dunn indices, inter-cluster density, and intra-cluster density, measured on real-world datasets available in the University of California Irvine data repository.
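
The idea of an (m x m) clusterer compatibility matrix can be illustrated with a minimal Python sketch. The abstract does not give the exact similarity formula, so everything specific here is an assumption made for illustration: the Jaccard overlap stands in for the set-theoretic cluster similarity, "cumulative similarity" is taken as the sum of best-match overlaps averaged over both directions, and the pair with the highest score is treated as the first merge candidate. The function names (clusters_of, compatibility_matrix, partition_entropy, best_pair) are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import log2


def clusters_of(labels):
    """Turn a label vector into a list of clusters (sets of point indices)."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return list(groups.values())


def jaccard(a, b):
    """Set overlap |a & b| / |a | b| (assumed stand-in for the paper's measure)."""
    return len(a & b) / len(a | b)


def directed_similarity(labels_a, labels_b):
    """Sum, over clusters of A, of the best Jaccard match among clusters of B."""
    ca, cb = clusters_of(labels_a), clusters_of(labels_b)
    return sum(max(jaccard(x, y) for y in cb) for x in ca)


def compatibility_matrix(labelings):
    """Symmetric m x m matrix; m = number of clusterers, not data points."""
    m = len(labelings)
    mat = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        score = 0.5 * (directed_similarity(labelings[i], labelings[j])
                       + directed_similarity(labelings[j], labelings[i]))
        mat[i][j] = mat[j][i] = score
    return mat


def partition_entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution."""
    n = len(labels)
    return -sum((len(c) / n) * log2(len(c) / n) for c in clusters_of(labels))


def best_pair(mat):
    """Indices of the two clusterers with the highest compatibility score."""
    return max(combinations(range(len(mat)), 2),
               key=lambda p: mat[p[0]][p[1]])


# Three toy clusterers over six points; A and B agree up to relabeling.
labelings = [
    [0, 0, 0, 1, 1, 1],  # clusterer A
    [1, 1, 1, 0, 0, 0],  # clusterer B
    [0, 1, 0, 1, 0, 1],  # clusterer C
]
matrix = compatibility_matrix(labelings)
first_merge = best_pair(matrix)
```

Only m(m-1)/2 clusterer pairs are scored, so the matrix stays small even when n is large, which is the saving the abstract contrasts with an n x n voting or proximity matrix. Whether the first merge should be the most similar or the most complementary pair is not stated in the abstract; the sketch picks the most similar pair purely as an example.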

Keywords


Clustering Ensembles, Cluster Compatibility Matrix, Cluster Validity Metrics, Partition Entropy, Degree of Over Shadow


