An Optimized Approach to Record Deduplication

V. Nirmala; B. Rosiline Jeetha


Abstract

Record deduplication is a technique for eliminating duplicate copies of repeated records, and duplicate record detection is an important step in data preprocessing and cleaning. The increasing volume of information available in digital media poses a challenge for data administrators, because larger volumes also introduce redundant data into databases. A method for controlling redundancy and duplication is therefore essential. Databases are growing in size at an exponential rate and play an important role in every industry. In the IT industry, detecting duplicate records is necessary to obtain precise search results and to reduce storage requirements. This paper presents the problem of duplicate records and their detection. The proposed approach uses the BAT algorithm to generate an optimal similarity measure, learned from training datasets, that decides whether records are duplicates. The algorithm is initialized with a population of random solutions and searches for optima by updating successive generations of bats. We use synthetic datasets to analyze the proposed algorithm and compare its performance against a genetic programming technique using evaluation metrics. Our approach frees the user from the burden of having to choose and tune the similarity measure manually.
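
The search described above can be illustrated with a minimal sketch. The abstract does not give the similarity functions, the fitness function, or the parameter settings used in the paper, so everything below is an assumption made for illustration: the per-field similarity (Python's `difflib.SequenceMatcher`), the weighted-average decision rule with a threshold, the F1-based fitness, the toy labelled pairs, and all names and parameter values (`bat_search`, `f_min`, `f_max`, `alpha`, `gamma`). The update rules follow the commonly published form of the bat algorithm (frequency, velocity, loudness, and pulse rate), not necessarily the authors' variant.

```python
import math
import random
from difflib import SequenceMatcher

def field_sims(a, b):
    """Per-field string similarity in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]

def is_duplicate(params, sims):
    """Assumed rule: params = field weights followed by a decision threshold."""
    *weights, threshold = params
    total = sum(weights)
    if total <= 0:
        return False
    return sum(w * s for w, s in zip(weights, sims)) / total >= threshold

def f1_fitness(params, pairs):
    """F1 score of the rule on labelled (sims, label) training pairs."""
    tp = fp = fn = 0
    for sims, label in pairs:
        pred = is_duplicate(params, sims)
        if pred and label:
            tp += 1
        elif pred:
            fp += 1
        elif label:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def clip(x):
    """Keep every coordinate inside the search box [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in x]

def bat_search(pairs, dim, n_bats=20, n_gen=50, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, rate0=0.5, seed=0):
    """Maximize f1_fitness over [0, 1]^dim with a basic bat algorithm."""
    rng = random.Random(seed)
    # Random initial population, zero velocity, full loudness, zero pulse rate.
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats
    rate = [0.0] * n_bats
    fit = [f1_fitness(p, pairs) for p in pos]
    best_i = max(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]
    for t in range(1, n_gen + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: random frequency scales the offset from the best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])])
            # Local move: random walk around the current best solution.
            if rng.random() > rate[i]:
                cand = clip([b + 0.1 * rng.uniform(-1, 1) * mean_loud for b in best])
            cand_fit = f1_fitness(cand, pairs)
            # Accept with probability tied to loudness, then get quieter
            # and raise the pulse rate.
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = rate0 * (1 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

# Toy labelled record pairs (hypothetical data, not from the paper).
records = [
    (("John Smith", "12 Oak St"), ("Jon Smith", "12 Oak Street"), True),
    (("Mary Jones", "4 Elm Rd"), ("Mary Jones", "4 Elm Road"), True),
    (("John Smith", "12 Oak St"), ("Mary Jones", "4 Elm Rd"), False),
    (("Alan Brown", "9 Pine Ave"), ("Alana Browne", "77 Lake Dr"), False),
]
pairs = [(field_sims(a, b), label) for a, b, label in records]
best, score = bat_search(pairs, dim=3)  # two field weights plus one threshold
print("weights and threshold:", best, "training F1:", score)
```

In this sketch the learned vector holds both the field weights and the decision threshold, which is one way a search procedure could spare the user from tuning them by hand. A genetic programming baseline, as used for comparison in the paper, would instead evolve the form of the similarity expression itself.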

Keywords

BAT Algorithm, Data Preprocessing, Duplicate Detection, Data Duplication, Genetic Programming




This work is licensed under a Creative Commons Attribution 3.0 License.
