
A Comprehensive Review on Class Imbalance Problem

S. Jahangeer Sidiq, Majid Zaman, Muheet Butt

Abstract


Classification of imbalanced data using standard learning algorithms, which assume relatively equal misclassification costs and a relatively balanced underlying class distribution, encounters serious drawbacks. This paper presents a comprehensive review of learning from class-imbalanced data. Our aim is to review the class imbalance problem, the state-of-the-art techniques, and the performance metrics used for evaluation under class imbalance scenarios. The class imbalance problem in the presence of multiple classes is also discussed.
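To illustrate the evaluation issue the abstract raises, the following sketch (not taken from the paper; the 95:5 class ratio and the trivial majority-class classifier are assumptions for illustration) shows why plain accuracy is a misleading metric under class imbalance, while a minority-focused metric such as recall exposes the failure:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN, treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# A 95:5 imbalanced ground truth; a trivial classifier predicts the
# majority class (0) for every example.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # 0.95 -- looks excellent
print(recall(y_true, y_pred, 1))  # 0.0  -- every minority case is missed
```

This is the motivation for the imbalance-aware measures (e.g. recall, F-measure, ROC analysis) surveyed in the paper: a classifier can score high accuracy while learning nothing about the minority class.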


Keywords


Class Imbalance, Classification, Multi-Class, Performance Measures.

Full Text:

PDF


