An Evaluation under Concealment of Duplication Entities in XML Documents

R. Thiyagarajan; S. Priyanka; T.K.P. Rajagopal

Abstract

Duplicate detection is a significant part of data cleaning: the task is to recognize multiple representations of the same real-world or business entity, which is necessary to improve data quality. A number of approaches exist for both relational and XML data. Because XML is widely used for data exchange and data publishing on the Web, algorithms that detect duplicates in XML documents are required. Data published on the Web in XML is prone to errors and noise, so it must be cleaned, which requires solutions for fuzzy duplicate detection in XML. The hierarchical, semi-structured nature of XML differs strongly from the flat, structured relational model, which has received most of the attention in duplicate detection so far. We consider the challenges of detecting duplicates in XML in order to develop useful, well-organized solutions to the problem. We present a comparison of algorithms that perform duplicate detection effectively for all kinds of XML objects, taking into account dependencies between different XML elements.
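To make the idea of fuzzy duplicate detection over hierarchical data concrete, the following is a minimal sketch and not one of the algorithms compared in the paper. It assumes a simple, hypothetical similarity measure: leaf elements are compared by string similarity of their text, and inner elements by the average of their best-matching children. The function names, the sample `movie` records, and the 0.9 threshold are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] of two strings, ignoring case and outer whitespace."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def element_similarity(x, y):
    """Hypothetical recursive measure for two XML elements.

    Leaves are compared by text; inner elements by the average of the
    best match each child of x finds among the children of y.
    """
    if x.tag != y.tag:
        return 0.0
    xs, ys = list(x), list(y)
    if not xs and not ys:
        return text_similarity(x.text, y.text)
    if not xs or not ys:
        return 0.0  # one side is a leaf, the other is not
    best = [max(element_similarity(cx, cy) for cy in ys) for cx in xs]
    # Dividing by the larger child count penalizes missing children.
    return sum(best) / max(len(xs), len(ys))


def is_duplicate(x, y, threshold=0.9):
    """Flag a pair as a fuzzy duplicate; the threshold is an assumed value."""
    return element_similarity(x, y) >= threshold


a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>The Matrx</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Toy Story</title><year>1995</year></movie>")

print(is_duplicate(a, b))  # records differing only by a typo
print(is_duplicate(a, c))  # records describing different movies
```

The measure is deliberately naive: it is not symmetric and compares every pair of children, so it illustrates why dependencies between XML elements matter rather than how to handle them efficiently.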

Keywords

Revelation of Duplication, Data Cleaning, XML Data, Similar Objects
