
A Novel Approach for Mining Web Documents Based on Bayesian Learning Classifier Systems

M. Deepa, P. Tamijeselvy

Abstract


Web mining is a comparatively new area of data mining. Because the web is one of the largest repositories of data, using data mining to analyze and explore regularities in web user behavior can improve system performance and enhance the quality and delivery of Internet information services to the end user. Clustering and classification are active areas of machine learning research that promise to help cope with the problem of information overload on the Internet. BIRCH is a clustering algorithm designed to operate under the assumption that "the amount of memory available is limited, whereas the dataset can be arbitrary large". The algorithm generates "a compact dataset summary" while minimizing the I/O cost involved. Noise and uncertainty are also major issues in web mining. Traditionally, probability is used to measure the uncertainty in a system. The Bayesian approach uses Bayes' theorem to combine existing beliefs with new evidence in order to form new beliefs, and Bayesian inference is regarded in the literature as a robust method for dealing with noise and uncertainty. We therefore propose a modification of UCS that uses a Bayesian update. This method achieves higher accuracy than UCS and requires only half of the learning time to converge. The algorithm thus reduces the outliers involved, and the summary contains enough information to apply the well-known SMOKA (smoothed k-means) clustering algorithm to the set of summaries and to generate partitions of the original dataset. We expect the proposed method to work more quickly because it reduces the time required to explore the search space and find a correct action for a condition.
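To make the idea of a Bayesian update concrete, the following minimal Python sketch tracks a belief about how often a single classifier rule predicts the correct class. It is an illustration only and not the authors' algorithm: the `BetaAccuracy` class, the uniform Beta(1, 1) prior, and the `frequency_accuracy` helper are all assumptions introduced here. The plain correct/total ratio is shown alongside for comparison.

```python
from dataclasses import dataclass


@dataclass
class BetaAccuracy:
    """Beta(alpha, beta) belief about a rule's probability of being correct.

    Hypothetical helper for illustration; the uniform prior is an assumption.
    """
    alpha: float = 1.0  # prior pseudo-count of correct predictions
    beta: float = 1.0   # prior pseudo-count of incorrect predictions

    def update(self, correct: bool) -> None:
        # With a Bernoulli likelihood and a Beta prior, Bayes' theorem
        # reduces to incrementing one of the two pseudo-counts.
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the rule's accuracy.
        return self.alpha / (self.alpha + self.beta)


def frequency_accuracy(outcomes):
    """Plain correct/total ratio, the non-Bayesian estimate for comparison."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Four observed predictions for one rule: three correct, one wrong.
outcomes = [True, True, False, True]
belief = BetaAccuracy()
for outcome in outcomes:
    belief.update(outcome)
```

With few observations the posterior mean stays closer to the prior than the raw ratio does, which is one way a Bayesian estimate can dampen the effect of noisy early evidence.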


Keywords


Algorithms: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), Bayes' Theorem, K-Means Algorithm, BCS.



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.