
Experiments with Different Indexing Techniques for Text Retrieval tasks on Gujarati Language using Bag-of-Words Approach

Dr. Jyoti Pareek, Hardik Joshi, Krunal Chauhan, Rushikesh Patel

Abstract


This paper presents the results of experiments carried out to improve the retrieval of Gujarati text documents. Text retrieval involves searching and ranking text documents for a given set of query terms. We tested several retrieval models that use the bag-of-words approach, a traditional approach still in use today in which a text document is represented as a collection of words. Measures such as frequency count and inverse document frequency are used to identify and rank the documents relevant to user queries. The ranking models were compared using mean average precision (MAP) as the metric of ranking performance. Because Gujarati is a morphologically rich language, we compared techniques such as stop word removal, stemming, and frequent case generation against a baseline to measure the improvement in information retrieval tasks. Most of these techniques are language dependent and require the development of language-specific tools. Using a plain, unprocessed word index as the baseline, we observed significant improvements in MAP values after applying the different indexing techniques.
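As an illustration of the ideas named in the abstract, the following is a minimal sketch of bag-of-words ranking with a frequency count weighted by inverse document frequency, followed by the mean average precision computation. It is not the retrieval models or the toolkit used in the experiments; the function names, the plain `tf * log(N / df)` weighting, the whitespace tokenizer, and the placeholder tokens are all assumptions made for this example, and stemming and frequent case generation are left out.

```python
import math
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Split on whitespace and drop stop words (no stemming in this sketch)."""
    return [tok for tok in text.split() if tok not in stop_words]


def build_index(docs, stop_words=frozenset()):
    """Return per-document term counts and the document frequency of each term."""
    term_counts = {
        doc_id: Counter(tokenize(text, stop_words)) for doc_id, text in docs.items()
    }
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return term_counts, doc_freq


def rank(query, term_counts, doc_freq, stop_words=frozenset()):
    """Rank documents by the sum of tf * idf over the query terms (assumed weighting)."""
    n_docs = len(term_counts)
    query_terms = tokenize(query, stop_words)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            if counts[term] > 0:
                score += counts[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document retrieved."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked document ids, set of relevant ids), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy collection with placeholder tokens standing in for Gujarati words.
docs = {"d1": "a b b", "d2": "a c", "d3": "c c d"}
term_counts, doc_freq = build_index(docs)
ranking = rank("b", term_counts, doc_freq)
```

In this sketch, stop word removal only changes which tokens reach the index, so the baseline and a processed index can be compared by building two indexes and computing MAP over the same set of queries and relevance judgments.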


Keywords


Information Retrieval (IR), Frequent Case Generation (FCG), Gujarati Language, Mean Average Precision (MAP), Stemming, Stop Words, Text Mining, Text Retrieval.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.