
A Probabilistic Smoothing Approach for Language Models Applied to Protein Sequence Data

Gopal Suresh, Chellapa Vijayalakshmi

Abstract


Modern statistical language-modeling techniques are widely applied in domains such as speech recognition, machine translation, and information retrieval. A language model is probabilistic at its core: it estimates a probability distribution over strings, typically sentences. One of the central problems a language model must address is smoothing, whose goal is to improve model accuracy by adjusting the maximum likelihood estimates of probabilities. To address this challenge, this paper applies a well-known smoothing technique, Good-Turing, to a bioinformatics task on protein sequences. The computational procedure uses an R program to estimate the bigram and trigram probabilities of the language models for the protein sequence. Experimental results show a good fit of exponential and linear smoothing curves over the bigram and trigram sequences respectively, with very high model accuracy.
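The abstract describes the Good-Turing adjustment of maximum likelihood counts; the standard formulation replaces a raw count c with c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times, reserving probability mass N_1 / N for unseen events. The paper's own computation was done in R; the following is only a minimal Python sketch of that general technique over a toy protein bigram model (the sequence and the function name are illustrative, not taken from the paper):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    # N_c: number of distinct n-grams observed exactly c times
    freq_of_freqs = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = freq_of_freqs[c]
        n_c_plus_1 = freq_of_freqs.get(c + 1, 0)
        # c* = (c + 1) * N_{c+1} / N_c; fall back to the raw count
        # when N_{c+1} is zero (common for the highest counts)
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 > 0 else c
    return adjusted

# Toy protein sequence (single-letter amino-acid codes, illustrative only)
seq = "MKVLAAKVLA"
bigrams = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
adjusted = good_turing_adjusted_counts(bigrams)

# Probability mass reserved for unseen bigrams: N_1 / N
total = sum(bigrams.values())
p_unseen = Counter(bigrams.values())[1] / total
```

Trigram counts can be fed through the same function unchanged; only the n-gram extraction window differs. In practice (as in the paper's curve fitting) the frequency-of-frequencies N_c are first smoothed, since raw N_c values are zero for many large c.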


Keywords


Bigram Model, Language Model, Smoothing, N-Gram Model, Trigram Model

Full Text:

PDF

References


Chengxiang Zhai and John Lafferty, “A study of smoothing methods for language models applied to Ad Hoc Information Retrieval”, Proceedings of the 24th Annual International SIGIR Conference on Research and Development in Information Retrieval, New York, USA, 2001.

Gerasimos Potamianos and Frederick Jelinek, “A study of n-gram and decision tree letter language modeling methods”, Speech Communication, Vol. 24, pp. 171-192, 1998.

I.J. Good, “The population frequencies of species and the estimation of population parameters”, Biometrika, Vol. 40, pp. 237-264, 1953.

Jozsef Domokos and Gavril Toderean, “Text conditioning and statistical language modeling aspects for Romanian language”, Acta Universitatis Sapientia, Vol. 1, pp. 187-197, 2009.

Lachlan Coin, Alex Bateman and Richard Durbin, “Enhanced protein domain discovery by using language modeling technique from speech recognition”, in Proceedings of the National Academy of Sciences of the United States of America, pp. 4516-4520, 2003.

Madhavi Ganapathiraju, Vijayalaxmi Manoharan and Judith Klein-Seetharaman, “BLMT: Statistical sequence analysis using N-grams”, Applied Bioinformatics, Vol. 3, pp. 193-200, 2004.

Qi-Wen Dong, Lei Lin, Xiao-Long Wang and Ming-Hui Li, “A pattern-based SVM for protein remote homology detection”, Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 2005.

Vesa Siivola, Teemu Hirsimaki and Sami Virpioja, “On growing and pruning Kneser-Ney smoothed N-gram models”, IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, 2007.

William A. Gale and Geoffrey Sampson, “Good-Turing frequency estimation without tears”, Journal of Quantitative Linguistics, Vol. 2, pp. 217-237, 1995.




This work is licensed under a Creative Commons Attribution 3.0 License.