#ICYMI on the BGSE Data Science blog: Randomized Numerical Linear Algebra (RandNLA) For Least Squares: A Brief Introduction

Dimensionality reduction is a topic that has governed our (the 2017 BGSE Data Science cohort) last three months. At the heart of topics such as penalized likelihood estimation (Lasso, Ridge, Elastic Net, etc.), principal component analysis and best subset selection lies the fundamental trade-off between complexity, generalizability and computational feasibility.

David Rossell taught us that even if we have found a methodology to compare across models, there is still the problem of enumerating all models to be compared… read the full post by Robert Lange ’17 on Barcelona GSE Data Scientists

#ICYMI on the BGSE Data Science blog: Covariance matrix estimation, a challenging research topic

Covariance matrix estimation represents a challenging topic for many research fields. The sample covariance matrix might perform poorly in many circumstances, especially when the number of variables is approximately equal to or greater than the number of observations. Moreover, when the precision matrix is the object of interest, the sample covariance matrix might not be positive definite and more robust estimators must be used.

With this article I will try to give a brief (and non-comprehensive) overview of some of the topics in this research field. In particular, I will describe Steinian shrinkage, covariance matrix selection through penalized likelihood and graphical lasso, supplementing the description with some potential extensions of these methodologies… Read the full post by Davide Viviano ’17 on Barcelona GSE Data Scientists

Can computers see?

Dario Garcia-Gasulla from the Barcelona Supercomputing Center introduced BGSE Data Science students to the Convolutional Neural Network (CNN) method of object recognition.

Computers today can process information from images, notably thanks to object recognition. This ability has improved greatly over the past few years and reached human levels on complex tasks in 2015. The algorithm that allows them to do this is called a Convolutional Neural Network (CNN). This method also enabled Google’s AI to beat one of the best Go players in the world and to build self-driving cars.

This week, the Renyi Hour invited Dario Garcia-Gasulla from the Barcelona Supercomputing Center who introduced the Data Science students to this method.

You can find the presentation slides here:

 

Repost from Barcelona GSE Data Scientists blog

Bayesian statistics applications in Physics

From the Barcelona GSE Data Science student blog:

Data Science students had a talk with Johannes Bergström, a postdoctoral researcher at Universitat de Barcelona, about physics and the implications of Bayesian statistics for the field. Read about the Renyi Hour talk and view the presentation slides on the BGSE Data Scientists blog.

Hackathon: 24 hours to predict 7-day churn for SocialPoint’s Dragoncity game

Over on the Barcelona GSE Data Science blog, you can read a post by Aimee Barciauskas about the Dragoncity hackathon that she and several other Data Science students participated in last week. Follow their 24-hour adventure and find out which Barcelona GSE personality was on-site at the Agbar tower to judge the results!

Data Science field trip to Spain’s most powerful computing cluster

Everyone needs to take a break once in a while. Unburying ourselves from a pile of Latin and Greek letters, most notably q’s and delta’s and little red g’s, the 2015-16 candidates for a Master’s in Data Science at the Barcelona Graduate School of Economics visited MareNostrum, Spain’s most powerful computer cluster, located on the western edge of inland Barcelona.

Read the full post on the Barcelona GSE Data Science blog

A Bayesian Search for the Needle in the Haystack

Master project by Timothée Stumpf-Fétizon. Barcelona GSE Master’s Degree in Data Science

Editor’s note: This post is part of a series showcasing Barcelona GSE master projects by students in the Class of 2015. The project is a required component of every master program.


Author: 
Timothée Stumpf-Fétizon

Master’s Program:
Data Science

Paper Abstract:

I develop an extension to Monte Carlo methods that sample from large and complex model spaces. I assess the extension using a new and fully functional module for Bayesian model choice. In standard conditions, my extension leads to an increase of around 30 percent in sampling efficiency.

Presentation Slides:

This is work in progress and there is no telling whether the rule works better in all situations!

If you’re interested in using BMA in practice, you can fork the software on my GitHub (working knowledge of Python required!)

Using H2O for competitive data science

Reposted from H2O and Barcelona GSE Data Scientists


In this special H2O guest blog post, Gaston Besanson and Tim Kreienkamp talk about their experience using H2O for competitive data science. They are both students in the new Master of Data Science Program at the Barcelona Graduate School of Economics and used H2O in an in-class Kaggle competition for their Machine Learning class. Gaston’s team came in second, scoring 0.92838 in overall accuracy, slightly surpassed by Tim’s team with 0.92964, on a subset of the famous “Forest Cover” dataset.

What is your background prior to this challenge?

Tim: We both are students in the Master of Data Science at the Graduate School of Economics in Barcelona. I come from a business background. I took part in a few Kaggle challenges before, but didn’t have a formal machine learning background before this class.

Gaston: I have a mixed background in Economics, Finance and Law. I had no prior experience with Kaggle or Machine Learning other than Andrew Ng’s online course :).

Could you give a brief introduction to the dataset and the challenges associated with it?

Tim: The good thing about this dataset is that it is relatively “clean” (no missing values etc.) and small (7 MB of training data). This allows for fast iteration and testing out a couple of different methods and hunches relatively quickly (relatively – a classmate of ours spent $300 on AWS trying to train support vector machines). The main challenge I see is the multiclass nature – this always makes it harder, as basically one has to train 7 models (due to the one-vs-all nature of multiclass classification).

Gaston: Yes, this dataset is a classic on Kaggle: Forest Cover Type Prediction. Adding to what Tim said, there are 7 types of trees and 54 features (10 quantitative variables, like Elevation, and 44 binary variables: 4 binary wilderness areas and 40 binary soil type variables). What caught our attention was how highly unbalanced the dataset was: classes 1 and 2 represented 80% of the training data.
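To make the one-vs-all idea and the class imbalance a bit more concrete, here is a minimal Python sketch. It only shows how a multiclass label vector is split into one binary target per class and how class shares can be checked; it does not train anything. The helper names and the toy labels are made up for illustration and are not taken from the teams’ code or from the Forest Cover data.

```python
from collections import Counter


def one_vs_rest_labels(labels):
    """Turn multiclass labels into one binary target vector per class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def class_shares(labels):
    """Fraction of the rows that falls into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in sorted(counts)}


# Toy labels, invented for illustration (not the Forest Cover data).
toy_labels = [1, 2, 2, 1, 3, 2, 1, 2, 7, 2]

# One binary problem per class: "is this row class c, or anything else?"
binary_targets = one_vs_rest_labels(toy_labels)

# A quick look at how unbalanced the classes are.
shares = class_shares(toy_labels)
```

With 7 cover types, `binary_targets` would hold 7 such vectors, one per binary model, which is where the extra training cost Tim mentions comes from.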

What feature engineering and preprocessing techniques did you use?

Gaston: Our team added an extra layer to this competition, which was to predict as accurately as possible the type of tree in a region with the purpose of minimizing fires. Even though we used the same loss for each type of misclassification (in other words, all trees are equally important), we decided to create new features. We created six new variables to try to identify features important to fire risk. We also applied a normalization to the 60 features on both the training and the test sets.

Tim: We included some difference and interaction terms. However, we didn’t scale the numerical features or use any unsupervised dimension reduction techniques. I briefly tried to do supervised feature learning with H2O Deep Learning – it gave me really impressive results in cross-validation, but broke down on the test set.

Editor’s note: L1/L2/Dropout regularization or fewer neurons can help avoid overfitting
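Gaston’s answer above mentions normalizing the features on both the training and the test sets. The post does not say which normalization was used, so the sketch below assumes plain z-score standardization, with the statistics computed on the training rows only and then reused for the test rows. The function names and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Compute per-column mean and standard deviation on the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for zero-variance columns to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply the training statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy rows: one quantitative column and one binary column (invented values).
train = [[2000.0, 0.0], [3000.0, 1.0], [2500.0, 1.0]]
test = [[2750.0, 0.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # same statistics as the training set
```

Reusing the training statistics on the test set keeps both sets on the same scale without letting test data influence the preprocessing.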

Which supervised learning algorithms did you try and to what success?

Tim: I tried H2O’s implementation of Gradient Boosting, Random Forest, Deep Learning (MLP with stochastic gradient descent), and the standard R implementation of SVM and k-NN. k-NN performed poorly, and so did SVM; Deep Learning overfit, as I already mentioned. The tree-based methods both performed very well in our initial tests. We finally settled on Random Forest, since it gave the best results and was faster to train than Gradient Boosting.

Gaston: We tried KNN, SVM, Random Forest all from different packages, with not that great results. And finally we used H2O’s implementation of GBM – we ended up using this model because it introduces a lot of freedom into the model design. The model we used had the following attributes: Number of trees: 250; Maximum Depth: 18; Minimum Rows: 10; Shrinkage: 0.1.

What feature selection techniques did you try?

Tim: We didn’t try anything fancy (like LASSO) for this challenge. Instead, we decided to take advantage of the fact that random forests can compute feature importances. I used this to code my own recursive elimination procedure. At each iteration, a random forest is trained and cross-validated (ten-fold). The feature importances are computed, the worst two features are discarded, and the next iteration begins with the remaining features. The resulting cross-validation errors at each stage made up a nice “textbook-like” curve, where the error first decreased with fewer features and at the end made a sharp increase again. We then chose the set of features that gave the second-best cross-validation error, so as not to overfit by feature selection.

Gaston: Actually, we did not do any feature selection other than removing the variables that did not have any variance, which if I am not mistaken was one in the original dataset (before feature creation). Nor did we turn the binary variables into categorical ones (one for wilderness areas and one for soil type). We had a naïve approach of sticking with the story of fire risk no matter what; maybe next time we will change the approach.
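The recursive elimination procedure Tim describes can be sketched roughly as follows. This is not his code: the `evaluate` callback is a hypothetical stand-in for “train a random forest on these features, cross-validate it, and return the error and the feature importances”, and the toy importances and error formula are invented so that the example runs without any machine learning library.

```python
def recursive_elimination(features, evaluate, drop_per_round=2, min_features=1):
    """Backward elimination driven by feature importances.

    `evaluate(features)` is a hypothetical callback that fits a model on the
    given features and returns (cv_error, {feature: importance}).
    """
    current = list(features)
    history = []  # one (feature subset, cross-validation error) pair per round
    while len(current) >= min_features:
        cv_error, importances = evaluate(current)
        history.append((list(current), cv_error))
        if len(current) - drop_per_round < min_features:
            break
        # Discard the least important features and continue with the rest.
        ranked = sorted(current, key=lambda f: importances[f])
        dropped = set(ranked[:drop_per_round])
        current = [f for f in current if f not in dropped]
    return history


def second_best(history):
    """Return the round with the second-lowest cross-validation error."""
    ordered = sorted(history, key=lambda item: item[1])
    return ordered[1] if len(ordered) > 1 else ordered[0]


# Toy stand-in for a cross-validated random forest (all numbers invented).
TOY_IMPORTANCE = {"a": 0.35, "b": 0.25, "c": 0.15, "d": 0.10,
                  "e": 0.06, "f": 0.05, "g": 0.02, "h": 0.02}


def toy_evaluate(features):
    signal = sum(TOY_IMPORTANCE[f] for f in features)
    penalty = 0.05 * len(features)  # pretend each extra feature adds noise
    return (1.0 - signal) + penalty, {f: TOY_IMPORTANCE[f] for f in features}


history = recursive_elimination(list(TOY_IMPORTANCE), toy_evaluate)
chosen_features, chosen_error = second_best(history)
```

In this toy run the error first falls and then rises as features are removed, and `second_best` picks the runner-up subset rather than the minimum, mirroring the choice described in the interview.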

Why did you use H2O and what were the major benefits?

Tim: We were constrained by our teachers in the sense that we could only use R – that forced me out of my scikit-learn comfort zone. So I looked for something as accurate and fast. As an occasional Kaggler, I am familiar with Arno’s forum post, and so I decided to give H2O a shot – and I didn’t regret it at all. Apart from the nice R interface, the major benefit is the strong parallelization – this way we were able to make the most of our AWS academic grants.

Gaston: I came across H2O just by searching the web and reading about alternatives available in R after the GBM package proved really untestable. Just to add to what Tim said, I think H2O will be my weapon of choice in the near future.

For a more detailed description of the methods used and results obtained, see the report of Gaston’s and Tim’s teams.

Data visualization: London property prices

Barcelona GSE Data Science student Stefano Costantini ’15 shares a data viz exercise that explores London property prices from 1995-2013.

Data Science student Stefano Costantini ’15 has posted this data viz project exploring London property prices on his website. Have a look and follow Stefano on Twitter @stefanoc.


 

London property prices: Visualising the evolution of the residential market (1995 to 2013)

The London residential property market has always been strong. However, it is only in the last twenty years or so that property prices have increased to such levels that previously “cheap” areas have now turned into prime locations. The gentrification process, together with an increase in population, has pushed up prices even in peripheral areas. The purpose of this exercise is to visualise these changes, covering the period 1995-2013.

Evolution of local average prices by quarter for the whole period 1995-2013

See more graphics, read about the methodology and tools used in the project, and download the code and the data from Stefano’s website.

Photo Diary: Exams Winter 2015

How master’s and PhD students are surviving finals this month…

Staking out a cozy corner in the library

 

It’s all about the snacks

 

Moments of Zen

 

A little help from our friends

 

Have a photo you’d like to share? Email it to thevoice@barcelonagse.eu or mention @barcelonagse on Twitter or Instagram