
Automatic Tamil Document Categorization Based on the Naive Bayes Algorithm

S. Kohilavani, T. Mala, T. V. Geetha

Abstract


This paper addresses the automatic classification of Tamil documents. Documents are repositories of knowledge, but they are so numerous that searching them effectively is time consuming. Document categorization simplifies document search and supports other applications such as event detection and tracking and document clustering and grouping. It is a challenging task that has recently become an active research topic in information retrieval. The objective of document categorization is to assign entries from a set of prespecified categories to each document. Traditionally, domain experts perform this task manually: an expert reads and comprehends each incoming document and then assigns it to a number of categories chosen from the prespecified set, which inevitably requires a large amount of manual effort. A promising alternative is to learn a categorization scheme automatically from training examples. In the training phase, a set of documents with class labels attached is given, and a classification system is built from it using a learning method. Once the categorization scheme is learned, it can be used to classify future documents. Various techniques can be used to determine a document's category. In this paper, Naive Bayes (NB), a statistical machine learning algorithm, is used to assign each Tamil document to one of a set of predefined categories. The Naive Bayes categorizer is evaluated experimentally on a data set of 50 documents per category. The experimental results show that the Naive Bayes classifier performs well, achieving 89.8% accuracy.
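
The train-then-classify procedure described in the abstract can be illustrated with a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. This is not the authors' implementation: the stopword list, the toy training documents, the category names "sports" and "cinema", and the whitespace tokenization are all assumptions made for illustration, since the abstract does not describe the paper's actual preprocessing or features.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list (assumption; the paper's list is not given here).
STOPWORDS = {"மற்றும்", "ஒரு", "இந்த"}

# Hypothetical toy training set of (text, label) pairs.
TRAINING_DOCS = [
    ("கிரிக்கெட் அணி வெற்றி", "sports"),
    ("இந்த போட்டி ஒரு வெற்றி", "sports"),
    ("திரைப்படம் நடிகர் பாடல்", "cinema"),
    ("இயக்குனர் மற்றும் நடிகர்", "cinema"),
]


def tokenize(text):
    """Split on whitespace and drop stopwords (assumed preprocessing)."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocabulary.add(tok)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior score."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    vocab_size = len(vocabulary)
    # Words never seen in training are ignored.
    tokens = [tok for tok in tokenize(text) if tok in vocabulary]
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_doc_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            score += math.log((word_counts[label][tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DOCS)
print(classify("கிரிக்கெட் போட்டி", model))
print(classify("நடிகர் பாடல்", model))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing term is one common choice among several; the paper may have used a different estimator.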

Keywords


Document Categorization, Naïve Bayes, Stopwords, Preprocessing, Classifier.


