Analysis of Microarray Data using Data Mining Techniques

J. Jasmine Gabrie; P. Valarmathie

Abstract

Gene expression data are essential for understanding the cellular activities of organisms, identifying diseases, and discovering drugs. Such data often contain missing values caused by experimental errors during laboratory processing, inappropriate thresholds in preprocessing, insufficient resolution of the microarray, image corruption, or dust and scratches on the slide. Imputing missing values is generally preferred to removing the affected data, because it increases the effectiveness of the analysis algorithms. A better clustering algorithm is also needed to identify differentially expressed genes, yet choosing a suitable clustering method for a given experimental dataset is still not straightforward. In this paper we therefore propose the AVG imputation method for pre-processing and a hybrid clustering algorithm for post-processing. The hybrid clustering algorithm is tested on both the AVG-imputed data and the original data. The results show that the pre-processed data yield higher-quality clusters and a more appropriate number of clusters than the original data, as measured by the BIC value, log likelihood, and sum of squared error criteria.

Keywords

AVG-Imputation, Data Mining, Gene Expression Data, Hybrid Clustering Algorithm, K-Means Clustering Algorithm, Missing Value Analysis, Model based Clustering Algorithm.
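
The abstract does not define AVG imputation or the evaluation formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that AVG imputation replaces each missing value with the average of the observed values in the same gene (row), that the sum of squared error is measured against cluster centroids, and that BIC follows the common lower-is-better form k ln(n) - 2 ln(L). The function names `avg_impute`, `sum_squared_error`, and `bic` are hypothetical.

```python
import math


def avg_impute(matrix):
    """Replace None entries with the mean of the observed values in the same row.

    Assumption: each row is one gene and each column one experimental condition.
    """
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values falls back to 0.0.
        row_mean = sum(observed) / len(observed) if observed else 0.0
        imputed.append([row_mean if v is None else v for v in row])
    return imputed


def sum_squared_error(matrix, labels):
    """Sum of squared Euclidean distances from each row to its cluster centroid."""
    clusters = {}
    for row, label in zip(matrix, labels):
        clusters.setdefault(label, []).append(row)
    sse = 0.0
    for rows in clusters.values():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        for row in rows:
            sse += sum((x - c) ** 2 for x, c in zip(row, centroid))
    return sse


def bic(log_likelihood, n_params, n_samples):
    """BIC in the lower-is-better convention (an assumption; some tools negate it)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Toy expression matrix with missing entries marked as None (illustrative values only).
expression = [
    [1.0, None, 3.0],
    [2.0, 2.0, None],
    [9.0, 8.0, None],
]
imputed = avg_impute(expression)

# Hypothetical cluster assignment: the first two genes share a cluster.
labels = [0, 0, 1]
sse = sum_squared_error(imputed, labels)
```

In a fuller pipeline, the imputed matrix would be passed to the clustering step, and the three criteria would be compared between the imputed data and the original data.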

This work is licensed under a Creative Commons Attribution 3.0 License.
