Partitional Distance-based Projected Clustering Algorithm

P. Srilakshmi, T. Deepthi

Abstract


Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient when the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters are hidden in subspaces of very low dimensionality. These challenges motivate our proposal of a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their locations in each attribute. Starting from the results of the first phase, the second phase eliminates outliers, while the third phase discovers clusters in different subspaces. The clustering process is based on the K-means algorithm, with the distance computation restricted to the subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space, and it avoids computing the distance in the full-dimensional space.
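
The restricted distance computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the relevant (dense) attributes of each cluster are already known from the first phase, and the names projected_distance, assign_points, and relevant_dims_per_cluster are hypothetical. Dividing by the number of relevant attributes is likewise an assumption, made so that clusters with subspaces of different sizes stay comparable.

```python
import math


def projected_distance(point, center, relevant_dims):
    """Euclidean distance computed only over the attributes in relevant_dims.

    Assumption: the squared sum is divided by the number of relevant
    attributes so that subspaces of different sizes are comparable.
    """
    if not relevant_dims:
        raise ValueError("each cluster needs at least one relevant attribute")
    squared = sum((point[d] - center[d]) ** 2 for d in relevant_dims)
    return math.sqrt(squared / len(relevant_dims))


def assign_points(points, centers, relevant_dims_per_cluster):
    """Assignment step of a K-means-style loop using projected distances."""
    labels = []
    for p in points:
        dists = [
            projected_distance(p, c, dims)
            for c, dims in zip(centers, relevant_dims_per_cluster)
        ]
        labels.append(dists.index(min(dists)))  # index of the nearest center
    return labels


# Illustrative data (made up): cluster 0 is dense in attributes 0 and 1,
# cluster 1 is dense in attributes 0 and 2; the remaining attribute is noise.
points = [
    [1.0, 1.1, 9.0],
    [1.2, 0.9, -4.0],
    [5.0, 7.0, 3.0],
    [5.1, -2.0, 3.1],
]
centers = [[1.0, 1.0, 0.0], [5.0, 0.0, 3.0]]
relevant_dims_per_cluster = [[0, 1], [0, 2]]

labels = assign_points(points, centers, relevant_dims_per_cluster)
print(labels)
```

With this made-up data the first point is assigned to cluster 0, although a full-dimensional Euclidean distance would place it nearer to the second center because of its noisy third attribute. This is the effect the abstract attributes to restricting the distance to dense attributes.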

Keywords


Agglomerative Approach, Attribute Relevance Classification, Analysis Eliminating Outliers, Clustering, Clique



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.