
Pearson Correlation Coefficient k-Nearest Neighbor Outlier Classification on Real-Time Data Set

Dr. D. Rajakumari, S. Karthika

Abstract


Detection and classification of data that do not meet the expected behavior (outliers) play a major role in a wide variety of applications, such as military surveillance, intrusion detection in cyber security, and fraud detection in online transactions. Accurate detection of outliers in high-dimensional data remains a major challenge, and outlier prediction and classification require a trade-off between high accuracy and low computation time. Large, diverse feature sets call for a reduction mechanism before classification. To this end, this paper proposes Distance-based Outlier Classification (DOC). The proposed work uses the Pearson Correlation Coefficient (PCC) to measure the correlation between data instances. Minimum instance learning through PCC estimation reduces the dimensionality. The proposed work is split into two phases, training and testing. During training, labeling the most frequent samples isolates them from the infrequent ones and reduces the data size effectively. The testing phase employs the k-Nearest Neighbor (k-NN) scheme to classify the frequent samples. The dimensionality and the value of k are inversely proportional to each other: in the proposed work, selecting a large value of k offers a significant reduction in dimensionality. The combination of PCC-based instance learning and a high value of k reduces the dimensionality and the noise, respectively. A comparative analysis of the proposed PCC-k-NN against conventional algorithms such as Decision Tree, Naïve Bayes, Instance-Based K-means (IBK), and Triangular Boundary-based Classification (TBC) in terms of sensitivity, specificity, accuracy, precision, and recall demonstrates its effectiveness in outlier classification (OC). In addition, experimental validation of the proposed PCC-k-NN against state-of-the-art methods in terms of execution time confirms the trade-off between low time consumption and high accuracy.
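The abstract does not spell out how the PCC estimate feeds the k-NN step, so the following is only a minimal sketch under stated assumptions: the dissimilarity between two instances is taken to be 1 - PCC, and the class is decided by a plain majority vote among the k nearest training instances. The function names (`pearson`, `knn_classify`), the toy data, and the labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length instances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant instance is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def knn_classify(query, train_x, train_y, k=3):
    """Label `query` by majority vote among its k most correlated neighbors.

    Assumption: the distance between instances is 1 - PCC, so highly
    correlated instances count as close. The paper may define it differently.
    """
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: 1.0 - pearson(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: rising profiles are "normal", falling ones "outlier".
train_x = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.2, 7.9],
    [1.5, 2.4, 3.6, 4.4],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 3.9, 2.0],
]
train_y = ["normal", "normal", "normal", "outlier", "outlier"]

if __name__ == "__main__":
    print(knn_classify([1.0, 2.1, 2.9, 4.2], train_x, train_y, k=3))
    print(knn_classify([5.0, 4.0, 2.0, 1.0], train_x, train_y, k=3))
```

A larger k averages the vote over more neighbors, which is the usual way k-NN damps the effect of noisy samples. The sketch does not attempt the instance-reduction step of the training phase, since the abstract gives no detail on it.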


Keywords


Data Mining, Distance-based Instance Learning, Outlier Detection, Outlier Classification, Pearson Correlation Coefficient, k-Nearest Neighbor.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.