
Word Alignment to Encourage Outsized English-Hindi Parallel Corpus

Shweta Dubey, Tarun Dhar Diwan

Abstract


This work describes a methodology for aligning parallel English-Hindi sentences at the word level. Word alignment is a task in natural language processing (NLP), the branch of artificial intelligence (AI) concerned with making natural language understandable to machines. Many previous approaches ignore word identities and consider only sentence lengths, which is not sufficient to identify corresponding words exactly; the proposed system instead aligns a large parallel corpus by aligning the words within it. The methodology is the foundation for building a parallel English-Hindi word dictionary after syntactic and semantic analysis of the English-Hindi source text. The method is applied here to English and Hindi sentences, but it can also be used for other languages. Large English-Hindi parallel corpora are not widely available. The approach addresses this problem with two strategies: first, normalization of the tagged English and Hindi sentences; second, mapping of English-Hindi sentences using the parallel English-Hindi word dictionary. Fortunately, word alignment is a well-understood problem, and a few alignment algorithms are freely available.
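The two strategies named in the abstract (normalization, then dictionary-based mapping) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy dictionary `EN_HI_DICT`, the helper names `normalize` and `align`, the greedy matching order, and the definition of the mapping score as the fraction of English tokens that receive a link are all assumptions made for illustration. Part-of-speech tagging and local word grouping are omitted.

```python
import string

# Hypothetical toy dictionary (assumption); a real system would use a
# much larger English-Hindi lexicon built from analysed source text.
EN_HI_DICT = {
    "ram": {"राम"},
    "eats": {"खाता"},
    "mango": {"आम"},
}


def normalize(sentence):
    """Lowercase, drop ASCII punctuation and the Devanagari danda, then split."""
    drop = string.punctuation + "।"
    cleaned = sentence.lower().translate(str.maketrans("", "", drop))
    return cleaned.split()


def align(en_sentence, hi_sentence, dictionary=EN_HI_DICT):
    """Greedily link each English token to the first unused Hindi token
    that the dictionary lists as a translation.

    Returns (links, score): links is a list of (english_index, hindi_index)
    pairs; score is the share of English tokens that were linked
    (an assumed, simplified stand-in for a mapping score).
    """
    en_tokens = normalize(en_sentence)
    hi_tokens = normalize(hi_sentence)
    used = set()
    links = []
    for i, en_word in enumerate(en_tokens):
        candidates = dictionary.get(en_word, set())
        for j, hi_word in enumerate(hi_tokens):
            if j not in used and hi_word in candidates:
                links.append((i, j))
                used.add(j)
                break
    score = len(links) / len(en_tokens) if en_tokens else 0.0
    return links, score


links, score = align("Ram eats a mango.", "राम आम खाता है।")
print(links, score)
```

In this toy pair, "a" has no dictionary entry and "है" has no English counterpart, so both stay unaligned. A fuller system would presumably handle such function words through tagging or word grouping.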


Keywords


Tagging, Local Word Grouping, Word Mapping, Normalization, Part-of-Speech (POS) Tagging, Word Dictionary, Multi-Word Expressions, Mapping Score


