Identifying Outliers in Datasets Using Outlier Removal Clustering (ORC) Algorithm

N. Nirmaladevi; R. Suresh Kumar

Abstract

Building on the objective function of general K-means, this work associates a weight vector with each cluster to indicate which dimensions are relevant to that cluster. To prevent the value of the objective function from decreasing merely because dimensions are eliminated, virtual dimensions are added to the objective function. The values of data points on the virtual dimensions are set artificially so that the objective function is minimized when the real subspace clusters, or the clusters in the original space, are found. In some cases the outlier detection problem is similar to the classification problem. For example, the main concern of clustering-based outlier detection algorithms is to find clusters and outliers; the outliers are often regarded as noise that should be removed to make the clustering more reliable. This research work presents an algorithm that performs outlier detection and data clustering simultaneously. The algorithm improves the estimation of the centroids of the generative distribution during the process of clustering and outlier discovery.
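The alternation between clustering and outlier removal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the removal criterion, so the sketch assumes a common one (a point is dropped when its distance to its centroid, divided by the largest such distance, exceeds a threshold). It also omits the per-cluster weight vectors and virtual dimensions. The names `kmeans`, `orc_sketch`, `threshold`, and `n_rounds` are hypothetical.

```python
import numpy as np


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
    # Final assignment against the updated centroids
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def orc_sketch(points, k, threshold=0.9, n_rounds=3, seed=0):
    """Alternate K-means with removal of the most distant points.

    Assumed criterion (not taken from the abstract): a point is removed when
    distance_to_centroid / max_distance > threshold. With threshold < 1 this
    removes at least the farthest point in every round.
    Returns the final centroids and a boolean mask of the points kept.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    # Initial centroids: k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_rounds):
        active = points[keep]
        centroids, labels = kmeans(active, centroids)
        dist = np.linalg.norm(active - centroids[labels], axis=1)
        d_max = dist.max()
        if d_max == 0:
            break  # every point sits exactly on its centroid
        outlyingness = dist / d_max
        kept_idx = np.flatnonzero(keep)
        keep[kept_idx[outlyingness > threshold]] = False
    # Re-estimate the centroids on the points that survived removal
    centroids, _ = kmeans(points[keep], centroids)
    return centroids, keep
```

For example, calling `orc_sketch(data, k=1, n_rounds=1)` on a tight group of points plus one distant point drops the distant point and re-estimates the centroid from the rest. The sketch does not guard against removal leaving fewer points than clusters, and both `threshold` and `n_rounds` would need tuning on real data.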

Keywords

Data Mining, Clustering, K-Means, High Dimensions, Outlier Removal Clustering (ORC) Algorithm.

This work is licensed under a Creative Commons Attribution 3.0 License.
