Big Data Analysis using Gradient Boosting Algorithm

Hojiwala Robin, Ridhdhi Naik

Abstract


Technology is changing rapidly, and ever more data are generated by devices and applications such as sensors, e-mail and mobile applications. These data must be stored every day so that they can be used for decision making, prediction and analysis for various purposes. Traditional storage systems are not sufficient for storing and processing data at this scale. Big data and cloud computing have become the mainstream approaches to data storage and data analysis in the IT field. Big data deals with analyzing and processing huge amounts of data, including real-time data, and big data analytics uses different algorithms for business and marketing analysis. Because the data move fast in real time, efficient algorithms are required, and the gradient boosting family of algorithms is suited to fast-moving data in big data analytics. Gradient boosting algorithms focus on accuracy and speed. The three algorithms of this family considered here are XGBoost, LightGBM and CatBoost. XGBoost is a scalable and reliable technique for machine learning challenges. LightGBM is an accurate algorithm that focuses on fast training performance. LightGBM uses selective sampling of high-gradient instances. CatBoost modifies the computation of gradients to avoid prediction shift and thereby improve the accuracy of the algorithm. This work presents a practical analysis of how these novel variants of gradient boosting perform in terms of training speed, generalization performance and hyper-parameter setup. In addition, a comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed using carefully tuned models as well as their default settings. The results of this comparison indicate that CatBoost obtains the best generalization accuracy and AUC on the studied datasets, although the differences are small. LightGBM is the fastest of all the methods but not the most accurate. XGBoost places second in both accuracy and training speed. Finally, an extensive analysis of the effect of hyper-parameter tuning in XGBoost, LightGBM and CatBoost is carried out using two novel proposed tools.
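
To make the shared idea behind the gradient boosting family concrete, the following is a minimal sketch, not the implementation used by XGBoost, LightGBM or CatBoost. It assumes squared-error loss, a single numeric feature and one-split regression stumps as the weak learners; all function names (fit_stump, fit_gradient_boosting, predict, mse) are hypothetical and chosen only for illustration. Each round fits a stump to the current residuals, which for squared loss equal the negative gradient, and adds a shrunken copy of it to the ensemble.

```python
from typing import List, Tuple

# Illustrative sketch only: plain gradient boosting with regression stumps
# on one feature. Names and defaults are assumptions, not taken from the paper.
Stump = Tuple[float, float, float]        # (threshold, left_value, right_value)
Model = Tuple[float, float, List[Stump]]  # (base_prediction, learning_rate, stumps)


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression stump to the residuals by least squares."""
    values = sorted(set(x))
    best = None
    best_err = float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (t, lv, rv)
    if best is None:
        # All feature values are identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (values[0], mean, mean)
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    t, lv, rv = stump
    return lv if xi <= t else rv


def fit_gradient_boosting(x: List[float], y: List[float],
                          n_rounds: int = 50,
                          learning_rate: float = 0.1) -> Model:
    """Boost stumps on the negative gradient of the squared-error loss."""
    base = sum(y) / len(y)  # initial constant model
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - f)^2 the negative gradient is the residual y - f.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return (base, learning_rate, stumps)


def predict(model: Model, xi: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(y_true: List[float], y_pred: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
```

Calling fit_gradient_boosting on a small list of feature values and targets, then comparing mse for the boosted predictions against the constant mean prediction, shows the training error falling as rounds are added. The three libraries compared in this work build on the same loop but differ in how trees are grown, how instances are sampled and how gradients are computed.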

Keywords


XGBoost, LightGBM, CatBoost, Weak Learner, TTF.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.