Short Notes on Boosting

Introduction

The following notes are based on the Coursera Practical Machine Learning course and are intended to help others understand the concepts and code behind the math. The basic idea behind boosting is to take a large number of possibly weak predictors, weight them, and add them up, thereby obtaining a stronger predictor. There are two clear steps:

  1. Start with a set of classifiers (h1, h2, …, hn). For example, these could be all possible trees, all possible regressions, etc.
  2. Create a classifier that combines classification functions

Several things we want to achieve with this process:

  • Minimize error on the training set
  • Iteratively select one h at each step
  • Calculate weights based on the errors
  • Up-weight the misclassified examples and select the next h

The most famous boosting algorithm is AdaBoost.
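
To make the weighting idea concrete, below is a minimal sketch of AdaBoost built from one-split rpart trees ("stumps") on a made-up toy data set. The data, the number of iterations, and the variable names are purely illustrative and are not part of the Wage example used later in these notes.

# Minimal AdaBoost sketch with decision stumps (illustrative toy data)
library(rpart)

set.seed(123)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- factor(ifelse(x1 + x2 > 1, 1, -1))       # toy binary outcome in {-1, 1}
dat <- data.frame(x1, x2, y)

w      <- rep(1 / n, n)   # start with equal weight on every observation
alpha  <- numeric(0)      # weight given to each weak classifier h
stumps <- list()

for (t in 1:10) {
  # Fit a weak classifier (a one-split tree) using the current observation weights
  stump <- rpart(y ~ x1 + x2, data = dat, weights = w,
                 control = rpart.control(maxdepth = 1))
  pred <- predict(stump, dat, type = "class")
  err  <- sum(w * (pred != y)) / sum(w)        # weighted training error
  err  <- min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
  a    <- log((1 - err) / err)                 # classifier weight (AdaBoost.M1)
  w    <- w * exp(a * (pred != y))             # up-weight the missed observations
  w    <- w / sum(w)
  stumps[[t]] <- stump
  alpha[t]    <- a
}

# Combined classifier: sign of the weighted sum of the weak classifiers
scores <- rowSums(sapply(seq_along(stumps), function(t)
  alpha[t] * ifelse(predict(stumps[[t]], dat, type = "class") == "1", 1, -1)))
mean(sign(scores) == ifelse(y == "1", 1, -1))  # training accuracy of the ensemble

The gbm models used later in these notes implement a different flavor (gradient boosting with trees), but the idea of combining many weak learners into one strong one is the same.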

Boosting in R

Boosting can be used with any subset of classifiers, and one large subclass is gradient boosting. R has multiple libraries for boosting, which differ in the choice of the basic classification function and of the combination rules.

  • gbm: boosting with trees
  • mboost: model based boosting
  • ada: statistical boosting based on additive logistic regression
  • gamBoost: for boosting generalized additive models

Most of these are already wrapped by the caret package, which makes the implementation relatively easy.
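
Because caret wraps these learners, each one is reached through a method string passed to train(). A quick way to check which tuning parameters a given method exposes is modelLookup(); the method names below are the usual caret ones, although they can vary between caret versions.

# Tuning parameters for some boosting methods available through caret
library(caret)
modelLookup("gbm")       # boosting with trees
modelLookup("ada")       # boosted classification trees (ada package)
modelLookup("gamboost")  # boosted generalized additive models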

Boosting Example

For our boosting example we will use the Wage data set included in the ISLR package. The idea is to predict wage from the many other predictors in the data.

Now that we have the data, we can create the training and testing sets. For the problem at hand no pre-processing is necessary, but for other types of prediction models, such as regression, some pre-processing may be required to handle non-numerical variables or predictors with almost no variability because they are dominated by zeros.

# Load the packages and the Wage data set
library(ISLR)
library(caret)
data(Wage)
# Create training and testing sets (drop logwage, which encodes the outcome)
wage <- subset(Wage, select = -c(logwage))
inTrain <- createDataPartition(y = wage$wage, p = 0.7, list = FALSE)
training <- wage[inTrain, ]
testing <- wage[-inTrain, ]
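
As a quick check of the low-variability point above, caret's nearZeroVar() flags predictors that are nearly constant. It is shown here only as a sanity check, since no pre-processing is applied in this example.

# Flag predictors with (near) zero variance in the training set, if any
nearZeroVar(training, saveMetrics = TRUE)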

Fitting a boosting model in R with the caret package is very easy. The syntax is much like that of any other train() call; just include the option method = "gbm".

# Fit a model using boosting
modFit <- train(wage ~ ., method = "gbm", data = training, verbose = FALSE)
print(modFit)
## Stochastic Gradient Boosting 
## 
## 2102 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared 
##   1                   50      34.39326  0.3213741
##   1                  100      33.85476  0.3322011
##   1                  150      33.79401  0.3341418
##   2                   50      33.85892  0.3324366
##   2                  100      33.77240  0.3351736
##   2                  150      33.87053  0.3330873
##   3                   50      33.73768  0.3365461
##   3                  100      33.89518  0.3324402
##   3                  150      34.09446  0.3263083
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 3, shrinkage = 0.1 and n.minobsinnode = 10.

A word of caution: boosting takes a bit more processing time on some computers, so you may have to wait a few seconds for the function to return the model object. If you run this code yourself you will also see several warning messages related to the lack of pre-processing, which is not necessary for this simple example. Looking at the output, the function fit a Stochastic Gradient Boosting model using 2102 samples and 10 predictors. Note that caret tuned the model by resampling from our training set with the bootstrap (25 reps); the summary of sample sizes shows 2102, the size of the training set, and we will keep that in mind when we validate and test the model.
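
If you want to inspect the tuning that caret settled on, the fitted train object carries that information; the calls below are standard for caret train objects (output omitted here).

# Tuning parameter combination selected by caret
modFit$bestTune

# Bootstrapped RMSE across the n.trees / interaction.depth grid
plot(modFit)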

Testing the Model

Remember that the bootstrap resamples were used only to tune the model on the training set, so the resampling results above are not an honest estimate of out-of-sample performance. What we will do now is use the testing data: we plot the real wages from the test set against the wage predictions our model makes for that same test data. These are numerical outcomes, a little harder to read off a confusion matrix but easy to see in a plot.

qplot(y = predict(modFit, testing), x = wage, data = testing, ylab = "Prediction of Wage", xlab = "Wage")
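
To put numbers next to the plot, caret's postResample() compares the predictions with the observed wages in the test set; the exact values depend on the random split, so none are shown here.

# Test-set RMSE and R-squared (and MAE in newer caret versions)
postResample(pred = predict(modFit, testing), obs = testing$wage)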

Looking at the plot, our model captures the lower wages quite well, but there is a group of higher wages that falls outside our prediction range. We can get a quick rundown of the predicted and real wage ranges using the cut function.

predWage <- cut(predict(modFit, testing), 5)
realWage <- cut(testing$wage, 5)

summary(predWage)
## (67.3,88.9]  (88.9,110]   (110,132]   (132,153]   (153,175] 
##         143         372         220         103          60
summary(realWage)
## (20.7,76.6]  (76.6,132]   (132,188]   (188,244]   (244,300] 
##         128         564         164          12          30

While the real wages in the test set span roughly 21 to 300, our predictions only range from about 67 to 175 and are concentrated in the second and third bins of the real wages.
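
One way to see this concentration directly is to cross-tabulate the two cut factors; the rows are the prediction bins and the columns the real-wage bins (the counts will vary with the random split).

# Where the binned predictions fall relative to the binned real wages
table(predWage, realWage)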

Conclusion

Despite having some problems with the higher salaries, the plot clearly shows a good amount of agreement between the real wages and our estimates. What is clear is the power of boosting: by iteratively reweighting and combining rather weak predictors, it produces a much more accurate one.