
Updating Solving Set Algorithm of Outlier Detection to Reduce the Iterations for Large Data Sets and Its Application to Fault Diagnosis

P. S. Dhabe, A. S. Shingare, M. L. Dhore

Abstract


In this paper, the original solving set algorithm for detecting possible outliers is updated to require fewer iterations and therefore less time. The original algorithm selects the initial solving set randomly; we instead select this set carefully, using the standard deviation of each pattern with respect to the others. The proposed modification requires less time and fewer iterations than the original algorithm. Our experiments indicate that this modification requires around one half to two thirds of the patterns in the initial solving set to be those with maximum standard deviation. We compared the original and updated algorithms on a synthetic 2-dimensional data set, described in Section II, as well as on a fault diagnosis data set from NASA. We observed that the updated algorithm takes less time to detect outliers than the original one and exhibits a better outlier detection rate along with better cluster entropy. A better outlier detection rate, lower running time, and better cluster entropy are the key features of this modification, making it suitable for outlier detection in large data sets.
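The seeding idea described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define "standard deviation of each pattern with respect to each other", so the sketch assumes it means the standard deviation of a pattern's Euclidean distances to all other patterns. The names `distance_spread`, `initial_solving_set`, `size`, and `seeded_fraction` are hypothetical, and filling the remaining slots at random is likewise an assumption.

```python
import math
import random
import statistics


def distance_spread(points):
    """Per-pattern spread score (ASSUMED reading of the abstract):
    population standard deviation of the Euclidean distances from
    one pattern to every other pattern."""
    spreads = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        spreads.append(statistics.pstdev(dists))
    return spreads


def initial_solving_set(points, size, seeded_fraction=0.5, rng=None):
    """Return indices for an initial solving set of `size` patterns.

    A share `seeded_fraction` of the slots (the abstract suggests
    roughly 1/2 to 2/3) goes to the patterns with the largest spread
    score; the remaining slots are filled at random (an assumption).
    """
    if size > len(points):
        raise ValueError("size cannot exceed the number of patterns")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    spreads = distance_spread(points)
    # Rank pattern indices by decreasing spread score.
    ranked = sorted(range(len(points)), key=lambda i: spreads[i], reverse=True)
    n_seeded = min(size, math.ceil(seeded_fraction * size))
    seeded = ranked[:n_seeded]
    filler = rng.sample(ranked[n_seeded:], size - n_seeded)
    return seeded + filler


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0), (0.5, 0.5)]
    print(initial_solving_set(pts, size=4, seeded_fraction=0.5))
```

The returned indices would then replace the randomly drawn initial set in the first iteration of the solving set algorithm; the rest of the algorithm is unchanged in this sketch.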

Keywords


Data Mining, Distance-Based Outlier, Fault Diagnosis, Outlier Detection.


References


D. M. Hawkins, “Identification of Outliers”. Chapman and Hall, London, 1980.

E. Knorr and R. Ng, “Algorithms for mining distance based outliers in large datasets”, Proceedings of the 24th Conference on VLDB, New York, 1998, pp. 392-403.

Aggarwal, C. C., Yu, P. S., “An effective and efficient algorithm for high-dimensional outlier detection”, The VLDB Journal, 2005, vol. 14, pp. 211-221.

Victoria J. Hodge and Jim Austin, A Survey of Outlier Detection Methodologies, Artificial Intelligence Review, vol. 22, pp. 85-126, 2004.

Aggarwal, C. C. & Yu, P. S. (2001). Outlier Detection for High Dimensional Data. Proceedings of the ACM SIGMOD Conference 2001.

Jingke Xi, Outlier Detection Algorithms in Data Mining, Second International Symposium on Intelligent Information Technology Application, IITA'08, Volume 1, 20-22 Dec 2008, pp. 94-97.

Angiulli, F., Basta, S., and Pizzuti, C., Distance-based Detection and Prediction of Outliers, IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, New York, 2006, pp. 145-160.

Motaz K. Saad and Nabil M. Hewahi, A Comparative Study of Outlier Mining and Class Outlier Mining, ISSR Journals, Vol. 1 (1), June 2009.

Manzoor Elahi, Xinjie Lv, Wasif Nisar, Imran Ali Khan, Ying Qiao, Hongan Wang, DB-Outlier Detection Algorithm using Divide and Conquer approach over Dynamic DataStream, 2008 International Conference on Computer Science and Software Engineering, DOI 10.1109/CSSE.2008.1053.

Rolf Isermann, Fault-Diagnosis Systems, Springer-Verlag Berlin Heidelberg, 2006.

J. Han and M. Kamber, Data Mining, Concepts and Techniques, Morgan Kaufmann, 2001.

F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 203-215, Feb. 2005.

S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. Int’l Conf. Management of Data (SIGMOD ’00), pp. 427-438, 2000.

W. Jin, A.K.H. Tung, and J. Han, “Mining Top-n Local Outliers in Large Databases,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’01), 2001.

E. Knorr, R. Ng, and V. Tucakov, “Distance-Based Outlier: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237-253, 2000.

S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD’03), 2003.

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, Kluwer, 2002.

Available online: https://c3.ndc.nasa.gov/dl/data/disk-defect-data/




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.