Using H20 for competitive data science
In this special H2O guest blog post, Gaston Besanson and Tim Kreienkamp talk about their experience using H2O for competitive data science. They are both students in the new Master of Data Science Program at the Barcelona Graduate School of Economics and used H2O in an in-class Kaggle competition for their Machine Learning class. Gaston’s team came in second, scoring 0.92838 in overall accuracy, slightly surpassed by Tim’s team with 0.92964, on a subset of the famous “Forest Cover” dataset.
What is your background prior to this challenge?
Tim: We both are students in the Master of Data Science at the Graduate School of Economics in Barcelona. I come from a business background. I took part in a few Kaggle challenges before, but didn’t have a formal machine learning background before this class.
Gaston: I have a mixed background in Economics, Finance and Law. With no prior experience on Kaggle or Machine Learning other than Andrew Ng’s online course :).
Could you give a brief introduction to the dataset and the challenges associated with it?
Tim: The good thing about this dataset is that it is relatively “clean” (no missing values etc) and small (7 mb of training data). This allows for fast iteration and testing out a couple of different methods and hunches relatively quickly (relatively – a classmate of ours spent $300 on AWS trying to train support vector machines). The main challenge I see in the multiclass nature – this always makes it harder as basically one has to train 7 models (due to the one-vs-all nature of multiclass classification).
Gaston: Yes, this dataset is a classic on Kaggle: Forest Cover Type Prediction. Which, as Tim said and adding to it, there are 7 types of trees and 54 features (10 quantitative variables, like Elevation, and 44 binary variables: 4 binary wilderness areas and 40 binary soil type variables). What come to our attention was the highly unbalanced that was the dataset. Class 1 and 2 represented 80% of the training data.
What feature engineering and preprocessing techniques did you use?
Gaston: Our team added an extra layer to this competition that was to predict as best as possible the type of tree in a region with the purpose of minimizing the ﬁres. Even though we used the same loss for each type of misclassiﬁcation – in other words, all trees are equally important -, we decided to create new features. We created six new variables to try to identify features important to ﬁre risk. And, we applied a normalization on both the training and the test sets to the 60 features.
Tim: We included some difference and interaction terms. However, we didn’t scale the numerical features or use any unsupervised dimension reduction techniques. I briefly tried to do supervised feature learning with H2O Deep Learning – it gave me really impressive results in cross-validation, but broke down on the test set.
Editor’s note: L1/L2/Dropout regularization or fewer neurons can help avoid overfitting
Which supervised learning algorithms did you try and to what success?
Tim: I tried H2O’s implementation of Gradient Boosting, Random Forest, Deep Learning (MLP with stochastic gradient descent), and the standard R implementation of SVM and k-NN. k-NN performed poorly, so did SVM – Deep Learning overfit, as I already mentioned. The tree based methods both performed very well in our initial tests. We finally settled for Random Forest, since it gave the best results and was faster to train than Gradient Boosting.
Gaston: We tried KNN, SVM, Random Forest all from different packages, with not that great results. And finally we used H2O’s implementation of GBM – we ended up using this model because it introduces a lot of freedom into the model design. The model we used had the following attributes: Number of trees: 250; Maximum Depth: 18; Minimum Rows: 10; Shrinkage: 0.1.
What feature selection techniques did you try?
Tim: We didn’t try anything fancy (like LASSO) for this challenge. Instead, we decided to take advantage of the fact that random forests can compute feature importances. I used this to code my own recursive elimination procedure. At each iteration, a random forest was trained and cross-validated (ten fold). The feature importances are computed, the worst two features are discarded, and the next iteration begins with the remaining features. The resulting cross validation errors at each stage made up a nice “textbook-like” curve, where the error first decreased with fewer features and at the end made a sharp increase again. We then chose the set of features that gave the second-best cross validation error, to not overfit by feature selection.
Gaston: Actually, we did not do any feature selection other than removing the variables that did have a variance, which if I am not mistaken was one in the original dataset (before feature creation). Neither turns the binary variables into one categorical (one for wilderness areas and one for soil type). We had a naïve approach of sticking with the story of fire risk no matter what; maybe next time we will change the approach.
Why did you use H2O and what were the major benefits?
Tim: We were constrained by our teachers in the sense that we could only use R – that forced me out of my scikit-learn comfort zone. So I looked for something as accurate and fast. As an occasional Kaggler, I am familiar with Arno’s forum post, and so I decided to give H2O a shot – and I didn’t regret it at all. Apart from the nice R interface, the major benefit is the strong parallelization – this way we were able to make the most of our AWS academic grants.
Gaston: I came across H2O just by searching the web and reading about alternatives within R possibilities after the GBM package proved really untestable. Just to add to what Tim said, I think H2O will be my weapon of choice in the near future.