Noun Phrase Detection and its Challenges in Large-Scale Natural Language Data Processing

Lakhan Bhaskar Kadel; Deepak Kumar Soni; Ravinder Yadav

Abstract

Noun phrases are normally the main information carriers in a text document. Detecting them is therefore important in many information retrieval and extraction applications, such as a search engine collecting the documents relevant to a user's query. It is also useful in significant natural language processing (NLP) tasks such as parsing, word sense disambiguation, machine translation, and text summarization. Different approaches have been proposed for noun phrase detection. This paper presents a detailed review of those approaches and compares them in terms of accuracy and other parameters. It also presents the challenges of large-scale natural language data processing and suggests a method suitable for very large corpora in today's big data era.

Keywords

Big Data, Chunking, Hadoop, MapReduce, NLP, Noun Phrase Detection.
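
To make the task concrete, the following is a minimal Python sketch of noun phrase chunking over part-of-speech-tagged text, with the counting step written in a map/reduce style. It is an illustration only, not the method the paper proposes or any of the reviewed approaches. It assumes that the input is already tagged with Penn Treebank tags and that a noun phrase is a run of determiners or adjectives followed by one or more nouns. The names np_chunks, map_phase, reduce_phase, and count_noun_phrases are hypothetical.

```python
from collections import Counter
from functools import reduce

# Assumption: Penn Treebank tags; both tag sets are deliberately simplified.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PRE_TAGS = {"DT", "JJ"}  # determiners and adjectives that may precede the nouns


def np_chunks(tagged):
    """Return noun phrases from a list of (word, tag) pairs.

    A phrase is any run of PRE_TAGS words followed by one or more nouns.
    """
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            current.append(word)
            has_noun = True
        elif tag in PRE_TAGS and not has_noun:
            current.append(word)
        else:
            # The candidate ends here; keep it only if it contains a noun.
            if has_noun:
                chunks.append(" ".join(current))
            # A modifier that follows a noun starts a new candidate.
            current = [word] if tag in PRE_TAGS else []
            has_noun = False
    if has_noun:
        chunks.append(" ".join(current))
    return chunks


def map_phase(sentence):
    """Map step: count the lowercased noun phrases of one sentence."""
    return Counter(phrase.lower() for phrase in np_chunks(sentence))


def reduce_phase(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_noun_phrases(sentences):
    """Apply the map step per sentence, then fold the partial counts together."""
    return reduce(reduce_phase, map(map_phase, sentences), Counter())


if __name__ == "__main__":
    corpus = [
        [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
         ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")],
        [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")],
    ]
    for phrase, count in count_noun_phrases(corpus).most_common():
        print(count, phrase)
```

Because each sentence is processed independently in the map step and the partial counts are merged with an associative operation, the same structure could in principle be distributed across a cluster. Here everything runs in a single process.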
