A Novel Algorithm for Clustering High Dimensional Data

Isa Inuwa-Dutse, Xinyue Liu, Dejan Milojicic

Abstract


The Challenges of Cluster Analysis and Related Work

K-means is one of the most commonly used clustering algorithms, but it does not perform well on data with outliers or with clusters of different sizes or non-globular shapes. The single-link agglomerative clustering method is the most suitable for capturing clusters with non-globular shapes, but it is very sensitive to noise and cannot handle clusters of varying density. Most clustering challenges, particularly those related to “quality” rather than computational resources, are the same ones that existed decades ago: how to find clusters of differing sizes, shapes, and densities; how to handle noise and outliers; and how to determine the number of clusters. The general idea of our novel subspace outlier model is to analyze, for each point, how well it fits the subspace spanned by a set of reference points. The experimental evaluation showed that the proposed method can find more interesting and more meaningful outliers in high-dimensional data, with higher accuracy than full-dimensional outlier models, at no additional computational cost.
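
The idea of scoring a point by how well it fits the subspace spanned by a set of reference points can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything below is an assumption made for illustration: the function name `subspace_outlier_score`, the threshold `alpha`, and the choice of an axis-parallel subspace defined by the dimensions in which the reference points have low variance. It is not the authors' implementation.

```python
import math


def subspace_outlier_score(point, reference, alpha=0.8):
    """Score how far `point` lies from the subspace suggested by `reference`.

    Assumption (not from the paper): the subspace is axis-parallel, and a
    dimension is "relevant" when the reference points vary little in it,
    i.e. its variance is below `alpha` times the average variance.
    """
    dims = len(point)
    n = len(reference)

    # Per-dimension mean and variance of the reference points.
    mean = [sum(r[d] for r in reference) / n for d in range(dims)]
    var = [sum((r[d] - mean[d]) ** 2 for r in reference) / n for d in range(dims)]
    avg_var = sum(var) / dims

    # Dimensions in which the reference set is tightly concentrated.
    relevant = [d for d in range(dims) if var[d] < alpha * avg_var]
    if not relevant:
        return 0.0  # no low-variance dimension, so no subspace to deviate from

    # Distance from the reference mean, measured only in relevant dimensions.
    dist = math.sqrt(sum((point[d] - mean[d]) ** 2 for d in relevant))
    return dist / len(relevant)


# Reference points spread along x but fixed at y = 1.
reference = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0), (6.0, 1.0)]

on_line = subspace_outlier_score((5.0, 1.0), reference)   # off-centre in x only
off_line = subspace_outlier_score((3.0, 4.0), reference)  # leaves the line y = 1
print(on_line, off_line)
```

In this toy data the point (5.0, 1.0) is some distance from the reference mean in the full space, yet it scores 0.0 because it stays on the line the reference points follow, while (3.0, 4.0) scores 3.0 because it departs from that line. A full-dimensional distance would not separate the two cases as cleanly.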


Keywords


Clustering, High-Dimensional, Nearest Neighbours, Data Points, Root Mapping.



