A Survey on Pre-processing Techniques for Text Mining

Manthan J. Vyas, Sanjay D. Bhanderi


Text mining is the process of extracting interesting patterns or knowledge from unstructured text documents. Text is the most common type of data on the World Wide Web. Pre-processing is a very important phase of the text mining process. A text mining framework has two components: text refining and knowledge distillation. This paper surveys pre-processing for text mining in English and Gujarati. Very little work has been done on text mining for Gujarati. The task is challenging because Gujarati is morphologically rich, which gives rise to a very large number of word forms and a large feature space. Some pre-processing techniques for Gujarati are introduced in this paper.


Pre-Processing, Stop-Words, Stemming, Text Mining

