Scalable Frequent Pattern Mining of Biological Sequences

E. Ramanujam

Scalable Frequent Pattern Mining of Biological Sequences

E. Ramanujam

Abstract

In the field of Bioinformatics, Frequent pattern mining is the challenging task of mining interesting or hidden information from DNA or Protein sequences. Frequent Pattern Mining is preferable for expressing the function and structure of two or more protein and DNA sequences. Significant numbers of algorithms have been proposed in finding frequently ordered arrangements of frequent patterns that are similar expression (Motif) of a group of genes. Unfortunately, the computation and the memory cost of the algorithms are expensive, when the size of the database is huge. The objective of this paper is to speed up the mining process in terms of time using dynamic parallelism. This paper proposes an algorithm, Scalable Frequent Pattern Mining with Constraints by parallelizing the Frequent Pattern Tree Growth using OpenMP. Experiments have been conducted based on various real time biological datasets which shows that the proposed algorithm outperforms the existing algorithms.

Keywords

Frequent Pattern Tree; Pattern Mining; Scalability; Constraints;

Full Text:

PDF

References

D.He, X.Zhu, X.Wu, “Approxiamte repeating pattern mining with gap requirements”, In Proc. Of 21st Int’l. Conf. on Tools with Artificial Intelligence, 2009, ICTAI’09, pp.17-24.

S.E.Rombo, “Optimal extraction of motif patterns in 2D,” Inf. Process. Lett. 109(2009), 1015-1020.

G.Pavesi, F.Zambelli, G.Pesole, “WeederH: An algorithm for finding conserved regulatory motifs and regions in homologous sequences”, BMC Bioinf. 8(2007), 1-14.

Data Mining Reference

R.Agarwal, T.Imielinski, R.Srikant, “Mining Association Rules between sets of items in large databases”, SIGMOD, May, 1983.

M.Baena-Garcia, R.Morales-Bueno, “Mining Interesting measures for string pattern mining”, Knowledge-Based Systesm, 25(1), pp.45-50, 2012.

Y.C. Li, J.S. Yeh, C.C. Chang, “Isolated Items discarding strategy for discovering high utility itemsets,” Data and Knowledge Enigneering, 64(2008), 198-217.

H. Xiong, S. Shekhar, P.N. Tan, V. Kumar, “Exploiting A Support-based Upper Bound of Pairs,” ACM SIGMOD, August 2004.

J. Chang, “Mining weighted sequential patterns in a sequence database with a time-interval weight,” Knowledge-Based Systems, 24(1), pp.1-9, 2011.

J.Pei, J.Han et.al., “Mining sequential Patterns by pattern-growth: the Prefixspan approach,” IEEE Transactions on Knowledge and Data Engineeering, 2004.

OpenMP C and C++ Application Programming Interface, OpenMP architecture Review Board.

Agarwal. R and Srikant. R, “Fast Algorithms for Mining Association Rules,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994

Agarwal. R and Srikant. R, “Mining Sequential Patterns,” Proc. 11th IEEE Int’l Conf. Data Eng. (ICDE), pp. 3-14, 1995.

R. Srikant, R. Agarwal, “Mining sequential Patterns: Generalization and Performance improvements,” In. Proc. Of 15th Int’l. Conf. on extending Database Technology,London:Springer-Verlag, 1996, pp.3-17.

Yan. X , Han.J and Afshar. R, “Clospan: Mining Closed Sequential Patterns in Large Datasets,” Proc. SIAM Int’l Conf. Data Mining (SDM), 2003

M.Zaki, “SPADE: An efficient Algorithm for mining frequent sequences”, Machine Learning, (2001). Pp. 31-60.

Han.J et. Al.,” FreeSpan: Frequent Pattern- Projected Sequential Pattern Mining,” Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.

Han.J et. Al.,” PrefixSpan: Mining Sequential Patterns efficiently by prefix projected pattern growth. In Proc. 2001 Int. Conf. Data Engineering (ICDE’ 01), pp. 215-224, Heidelberg, Germany, April 2001.

Zaki.M.J, “Sequence Mining in Categorical Domains: Incorporating Constraints,” Proc. Ninth Int’l Conf. Information and Knowledge Management (CIKM), pp.442-449, 2000.

M.J. Zaki, C.D. Carothers, B.K. Szymanski, “VOGUE: A variable order hidden Markov Models with duration based on frequents sequence mining,” Trans. Knowledge Discovery Data, 4(1), 2010.

R. Kolpakov, G. kucherov, “Finding Maximal Repetitions in a word in linear time,” In. Proc. Of 1999 Symposium on Foundations of Computer Science, Washington, 1999, pp. 596-604.

O. Delgrange, E.Rivals, “STAR: an algorithm to search for tandem approximate repeats,” Bioinformatics, 20(2004), pp. 2812-2820.

A.Krishnan, F.Tang, “Exhaustive Whole-genome tandem repeats search,” Bioinformatics, 20(2004),l 2702-2710.

S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J.Stoye, R. Giegerich, “REPuter: the main fold applications of repeated analysis of genomic scale,” Nucleic Acids Res. 29(2001), 4633- 4642.

G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic Acids Res. 27(1999), 573-580.

D.Wang, G.Wang, G.Q.Wu, B.Chen, “Finding LPRs in DNA sequence based on a new index SUA,” In. Proc. Of the IEEE Fifth Symposium on Bioinformatics and Bioengineering (BIBE 2005), Washington, 2005, pp. 281-284.

Y.Xiong, Y.zhu, “BioPM:an efficient algorithm for protein motif mining”, In. Proc. Of ICBBE’07, IEEE Press, 2007, pp.394-397.

Q.Zhou, Q.Jiang,S.Li, X. Xie, L.Lin, “An efficeint algorithm for protein sequence pattern mining,” In. Proc. Of 2010 Fifth Int’l. Conf. on Computer Science and Education (ICCSE), 2010, pp.1876-1881.

Agarwal.R, Aggarwal.C, Prasad.V.V.V, “A Tree projection of algorithm for generation of frequent item-sets,” Journal of Parallel and Distributed Computing, vol.61, no.3, pp. 350-371, 2001.

Han.J., Pei.J and Yin.Y., “Mining Frequent Patterns without Candidate generation, ACM SIGMD Record, vol. 29, no.2, pp. 1-12.

Grahne.G, Lakshmanan.L.V.S., Wang.X, “Efficient Mining of Constraint Correlated sets,” In. Proceedings of 16th Int’l. Conf. on Data Engineering (ICDE’00), IEEE Computer Society, pp. 512-521, 2000.

Han.J, Lakshmanan. L.V.S., Ng.R.T, “Constraint-Based Multidimensional Data Mining”, IEEE Computer Society Press, vol.32, no.8, pp46-50, 1999.

Han.J., Pei.J and Wang.W, “Mining Sequential Patterns with Constraints in Large Databases, “Proc. 11th Int’l Conf. Information and Knowledge Management (CIKM), pp. 18-25, 2002.

J.Han, J.Pei, “From Sequential Pattern Mining to structured Pattern Mining: a pattern growth approach,” J. Comp. Sci. Tech, 23, 2004, pp. 257-279.

http:.//pfam.sanger.ac.uk.

http://www.ncbi.nlm.nih.gov/nuccore/

L. Chen, W. Liu, “Frequent Patterns mining in multiple biological sequences,” Computers in Biology and Medicine, 43(2013), pp. 1444-1452.

J. Pan, P. Wang, W. Wang, B. shi, G. Yang, “Efficient algorithms for mining maximal frequent concatenate sequences in biologicail datasets,” In: proc. Of 5th Int’l. Conf. on computer and Information Technology (CIT’05), 2005, Shanghai, china, pp. 98-104.

Md. Rezaul Karim, Md. Mamumur Rashid, B. Jeong, H. choi, “An Efficeint Approach to Mining Maximal contiguous Frequent Pattens from Large DNA Sequence Databases,”, Genomics and Informatics, 10(1), 2012, pp. 51-57.

Refbacks

There are currently no refbacks.

This work is licensed under a Creative Commons Attribution 3.0 License.

Username
Password
Remember me