
Automation of Template and Data Extraction from Dynamic Web Documents

S. Pradeepa, K. Satheesbabu, K. Sabeetha

Abstract


Many websites contain large sets of pages generated by filling common templates with content. Because templates contain terms that are irrelevant to a page's content, they degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention recently as a way to improve the performance of search engines, clustering, and classification of web documents. They are also used to handle the content that templates duplicate across pages. In this paper, we present techniques that automatically cluster dynamically generated web documents based on Minimum Description Length (MDL) cost, so that search result records can be extracted from them, and that extract the data from the clustered documents using the Template Table Text Chunk Removal (TTTCR) algorithm. Data extraction is the process of retrieving data from a source so that it can be processed further. With this approach, no additional template extraction step is needed after clustering. Experimental results show that the proposed approach is feasible and effective for improving template and data extraction accuracy.

Keywords


Minimum Description Length (MDL), Template Extraction, Clustering, Template Table Text Chunk Removal (TTTCR).






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.