Design and Implementation of Devnagari Spell Checker based on Soundex Phonetic Similarity Concepts

Shaikh Phiroj Chhaware; Dr. Mohammad Atique

Abstract

With the growth of information technology in India, where the majority of people speak Hindi, a reliable Devnagari spell checker is needed for word processing documents in Hindi. One of the challenges is implementing a spell checker for Hindi that can check printed documents the way Microsoft Word does for languages such as English. The proposed approach consists of a Hindi word database, built using the Unicode character encoding available for the Devnagari character set, and a spell-check engine that matches each word against this database. For a non-word, the engine presents a list of the most appropriate suggestions, limited to a threshold number, based on the Soundex phonetic string matching algorithm together with Levenshtein's edit distance. The Soundex phonetic string matching algorithm works on predefined rules under which the language's character set is divided into categories, with phonetically similar characters placed in the same category. Only the consonants and additional consonants are considered when forming the categories; vowels and special symbols are ignored. The limitation of the Soundex phonetic algorithm is addressed by applying Levenshtein's edit distance, which calculates the distance between two strings; candidates with the smallest distance are ranked first among the suggestions.
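The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The consonant categories in GROUPS, the sample LEXICON, and the function names phonetic_key, levenshtein, and suggest are all assumptions made for illustration, since the abstract does not give the actual category table or database. As a simplification, the sketch encodes every consonant, including the first character, and does not collapse repeated codes as classic Soundex does.

```python
# Hypothetical consonant categories (the paper's real table is not in the abstract).
GROUPS = [
    "कखगघङ",  # velar consonants
    "चछजझञ",  # palatal consonants
    "टठडढण",  # retroflex consonants
    "तथदधन",  # dental consonants
    "पफबभम",  # labial consonants
    "यरलव",   # semivowels
    "शषसह",   # sibilants and ह
]
CODE = {ch: str(i + 1) for i, group in enumerate(GROUPS) for ch in group}

# Hypothetical stand-in for the Hindi word database.
LEXICON = {"कमल", "कमाल", "कलम", "नमक"}


def phonetic_key(word):
    """Map each consonant to its category digit; vowels, matras and signs are skipped."""
    return "".join(CODE[ch] for ch in word if ch in CODE)


def levenshtein(a, b):
    """Edit distance over Unicode code points (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def suggest(word, lexicon, limit=5):
    """Return up to `limit` suggestions for a non-word, closest first.

    A word found in the lexicon is treated as correct and gets no suggestions.
    """
    if word in lexicon:
        return []
    key = phonetic_key(word)
    candidates = [w for w in lexicon if phonetic_key(w) == key]
    # Rank by edit distance; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda w: (levenshtein(word, w), w))[:limit]


print(suggest("खमल", LEXICON))
```

Here the phonetic key narrows the lexicon to words whose consonants fall in the same categories, and the edit distance then orders those candidates, which mirrors how the abstract says the distance step compensates for the coarseness of the phonetic match. The limit parameter plays the role of the threshold number of suggestions.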

Keywords

Devnagari Script, Levenshtein's Edit Distance, Soundex Phonetic String Matching Algorithm, Unicode Conventions, Suggestions Generation, Ranking Algorithms, Corpus Design.

This work is licensed under a Creative Commons Attribution 3.0 License.
