Solving data science problems with Record Matching

Presentation by Data Science alum Jordan McIver ’15


Every organisation needs to be able to properly connect disparate datasets to take full advantage of its data assets. Alchemmy held an event to discuss approaches and technologies for connecting datasets, and the pitfalls to watch for once they are connected.

Check out my talk here where we look at an approach that best enables data scientists by partnering them with the other staff who actually hold the context of the data:

Video summary

Most businesses have some or all of the following problems: not enough data science resources for the work required; a large community of data-adjacent staff who hold most of the context but are not contributing what they know in the right way; data science problems that lack that same context; and algorithms that cannot overcome poor data quality or a lack of training data. Jordan walks through the use of interactive dashboards in which users assess the quality of the data. Their assessments feed back into the data science process, which helps address these problems.


Jordan McIver ’15 is Head of Data Consulting at Alchemmy in London. He is an alum of the Barcelona GSE Master’s in Data Science.


Investigation of Sentiment Importance on Intraday Stock Returns

Data Science master project by Michele Costa, Alessandro De Sanctis, Laurits Marschall and S. Hamed Mirsadeghi ’18

Editor’s note: This post is part of a series showcasing Barcelona GSE master projects by students in the Class of 2018. The project is a required component of every master program.


Michele Costa, Alessandro De Sanctis, Laurits Marschall and S. Hamed Mirsadeghi

Master’s Program:

Data Science

Paper Abstract:

The main goal of our Master Project is to predict intraday stock market movements using two different kinds of input features: financial indicators and sentiments from news and tweets. While the former are part of the common technical analysis of financial econometric models, the sentiment extracted from news articles and tweets has also been shown to correlate with stock market movements. Our paper aims to contribute to the existing academic and professional knowledge in two main directions. First, we evaluate three different approaches to extracting sentiment from both social and mass media based on their forecasting power. Second, we deploy a battery of engineered features based on the sentiment, together with the financial indicators, in a machine learning model for a fine-grained minute-level forecasting exercise. In the end, two different classes of models are fitted to test the forecasting power of the combined input features: a classical ARIMA model and, as the machine learning algorithm, an XGBoost model. We collected data on the companies Apple, JPMorgan Chase, Exxon Mobil, and Boeing.

Figure: Exxon Mobil
The picture shows how sentiment towards Exxon Mobil moved over time. The two lines refer to two different methodologies: Loughran-McDonald is based on a financial dictionary, while SentiStrength was trained on social media such as MySpace.
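As a rough illustration of the dictionary-based approach mentioned in the caption, the sketch below scores a text by counting positive and negative words. The two word lists are tiny hypothetical stand-ins, not the actual Loughran-McDonald dictionary, and the scoring formula is an assumption rather than the one used in the project.

```python
import re

# Hypothetical mini word lists for illustration only; the real
# Loughran-McDonald dictionary is far larger.
POSITIVE = {"gain", "gains", "profit", "strong", "improve", "improved"}
NEGATIVE = {"loss", "losses", "decline", "weak", "lawsuit", "spill"}


def dictionary_sentiment(text):
    """Return (positive - negative) / (positive + negative), or 0.0 if no hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)
```

A score near 1 means mostly positive dictionary words, a score near -1 mostly negative ones. Averaging such scores per minute or per day would give a time series like the one in the figure.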

More about the Data Science Program at the Barcelona Graduate School of Economics

BGSE Data Talks: Professor Piotr Zwiernik

The Barcelona GSE Data Science student blog has a new post featuring an interview with Piotr Zwiernik (UPF and BGSE), Data Science researcher and professor in the BGSE Data Science Master’s Program:

Hello and welcome to the second edition of the “Data Talks” segment of the Data Science student blog. Today we have the honor of interviewing Piotr Zwiernik, who is an assistant professor at Universitat Pompeu Fabra. Professor Zwiernik was recently awarded the Beatriu de Pinós grant from the Catalan Agency for Management of University and Research Grants. In the Data Science Master’s Program he teaches the maths brush-up and the convex optimization part of the first-term class “Deterministic Models and Optimization”. Furthermore, he is one of the leading researchers in the field of Gaussian Graphical Models and algebraic statistics. We discuss his personal path, his fascination with algebraic statistics, and the epistemological question of low-dimensional structures in nature…

Read the full interview on the Barcelona GSE Data Scientists blog

BGSE represented by “Just Peanuts” at Data Science Game finals in Paris

Class of 2017 Data Science graduates Roger Garriga, Javier Mas, Saurav Poudel, and Jonas Paul Westermann qualified for the final round of the Data Science Game in Paris this fall. Here is their account of the event.

Data Science Game is an annual competition organized by an association of volunteers from France. After competing in a tough online qualification phase during the master’s, we qualified for the finals in Paris, where we would be presented with a new problem to solve in a two-day hackathon.

The hackathon was held at Les Fontaines, a palace owned by Capgemini. It was an amazing building that made the experience even better.

The problem presented was to estimate the demand for 1,500 different products in 4 different countries by forecasting the three subsequent months, using historic orders from 100,000 customers over the past 5 years. This was a well-defined challenge that could be tackled with a large variety of solutions. For us, the time constraint was one of the main challenges, especially since in the end we were only 3 instead of 4.

We started by exploring the data and realised that there were a lot of missing values, caused by a merge of databases done by the company that provided the data. So we spent some time cleaning up the data and filling in some of the missing values before applying our models. After all the cleaning, the key to solving the challenge was to engineer good features that represented the data well and then apply a simple model to predict the 3 months ahead.
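The team does not say which features or model they used, so the following is only a minimal sketch of the general idea: turn a monthly demand series into lag features, and use a moving average as a simple baseline for the next three months. The numbers, function names, and the choice of a moving average are all assumptions made for illustration.

```python
def make_lag_features(series, n_lags=3):
    """Turn a monthly demand series into (features, target) rows.

    Each row uses the previous `n_lags` months to describe the next month.
    """
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


def moving_average_forecast(series, horizon=3, window=3):
    """Forecast `horizon` months ahead, feeding each forecast back in."""
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        next_value = sum(history[-window:]) / window
        forecasts.append(next_value)
        history.append(next_value)
    return forecasts


# Hypothetical monthly order counts for one product in one country.
demand = [10, 12, 14, 16, 18, 20]
rows = make_lag_features(demand, n_lags=3)
forecast = moving_average_forecast(demand, horizon=3, window=3)
```

The rows produced by `make_lag_features` could be fed to any regression model. The moving average is just a baseline that any such model should beat.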

The hackathon can be summed up as a day and a half of coding, modeling and discussing without sleeping, surrounded by 76 other participants from all across the world who were doing basically the same thing, with short pauses to eat pizza, hamburgers and Indian food. So, a pretty good way to spend a weekend.

BGSE “Just Peanuts” qualifies for Data Science Game finals in Paris

A team of Barcelona GSE Data Science students from the Class of 2017 will compete in the final round of the Data Science Game in Paris at the end of September. 

Of the 400 international teams from 220 universities that participated in the first round, the BGSE team is one of the 20 that have qualified for the final. The team is called “Just Peanuts” and its members are Roger Garriga, Javier Mas, Saurav Poudel, and Jonas Paul Westermann.

In the following interview, they talk about the Data Science Game and their expectations for the final.

What is the Data Science Game?

The Data Science Game is an annual data science competition for university students organized by ENSAE (Paris). Teams of up to four people can participate and represent their university. There is a free-for-all online qualification round, and the top 20 teams are invited to the final in Paris.

Why did you decide to participate?

During the course we had already taken part in one data science challenge as part of the Computational Machine Learning course. That was quite fun, and we had generally been wanting to take part in Kaggle-like challenges throughout the year. On top of that, we of course need to represent the Barcelona GSE and get the word out about our amazing Master’s.

Can you explain the task your team had to perform in the first round of the game?

The challenge for the online qualification round was to predict users’ music preferences. Data was provided by Deezer, a music streaming service based in France. The training dataset consisted of 7+ million rows, each describing one user-song interaction: whether the user listened to the song (for longer than 30 seconds) or not, whether the song was suggested to the user by the streaming service, and further variables relating to the song and the user.

How/by whom was the first round judged/scored?

The online round was hosted on Kaggle, a common website for these kinds of data science prediction challenges. Scoring was done according to the ROC AUC metric (area under the receiver operating characteristic curve).
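For readers unfamiliar with the metric: ROC AUC equals the probability that a randomly chosen positive example (here, a song the user listened to) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a small illustrative implementation, not the scoring code used by Kaggle.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise version is O(n^2) and only
    meant for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. For millions of rows, a sort-based implementation such as the one in a standard machine learning library would be used instead.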

Was it difficult to combine participating in the game with your courses and assignments in the master program?

Because we only started investing serious time in the challenge quite late (about two weeks before the end), we spent a lot of time on it during the final days. The last 120 hours before submission were probably entirely spent on the challenge, which definitely cut into our normal working schedules. The last weekend before the deadline was especially intense and was spent mostly sitting shirtless at the table of a very overheated apartment, living off frozen pizza and chips.

What specifically from the master’s helped you succeed in the game?

Part of the final model we used, and what first got us a good score, was a library recommended by one of the PhD students who also lecture in our course. Beyond that, we used all kinds of background knowledge and experience gained from the course. A constant theme during the challenge was the difference in distribution and construction between the training and testing datasets. This gave misleadingly high cross-validation results and made it difficult to assess the quality of predictions.

Another issue was simply the size of the data, which meant that training and parameter tuning were extremely time-consuming and that we needed to expand our infrastructure beyond our own laptops. We had discussed possible solutions to both of those problems during the Master’s and applied combinations of them.
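The interview does not say how the team diagnosed the mismatch between training and testing data. One simple check, shown here purely as an illustrative assumption, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distribution functions of a feature in the two datasets.

```python
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Largest absolute gap between the two empirical CDFs (0 = same, 1 = disjoint)."""
    a_sorted = sorted(train_values)
    b_sorted = sorted(test_values)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

Computing this for each feature gives a quick ranking of which variables differ most between the two datasets. Features with a large statistic are candidates for the kind of overly optimistic cross-validation described above.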

What will you have to do for the final round? Can you tell us about your strategy or will that give too much information to the other teams?

The final round will be a two-day hackathon-like data science challenge on-site in Paris. No information has been shared with us on details of the challenge but we are thinking it might be something related to sound processing to continue the theme from part one.

How can we follow your progress in the competition?

We will surely be writing an update after the Paris trip and probably also give some social media updates during the event.

#ICYMI on the BGSE Data Science blog: Prediction as a Game

Prediction as a Game

by Davide Viviano ’17

In this article we provide a general understanding of sequential prediction, with a particular attention to adversarial models. The aim is to provide theoretical foundations to the problem and discuss real life applications…
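A classical algorithm for sequential prediction with expert advice is the exponentially weighted average forecaster. The sketch below is a generic textbook-style version and is not taken from the article itself. The squared loss and the learning rate `eta` are assumptions chosen for illustration.

```python
import math


def exponentially_weighted_forecast(expert_predictions, outcomes, eta=0.5):
    """Combine expert predictions round by round.

    expert_predictions[t][i] is expert i's prediction in round t.
    After each outcome is revealed, every expert's weight is multiplied
    by exp(-eta * squared_error), so poor experts lose influence.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    forecasts = []
    for predictions, outcome in zip(expert_predictions, outcomes):
        total = sum(weights)
        forecast = sum(w * p for w, p in zip(weights, predictions)) / total
        forecasts.append(forecast)
        weights = [
            w * math.exp(-eta * (p - outcome) ** 2)
            for w, p in zip(weights, predictions)
        ]
    return forecasts, weights


# Hypothetical example: expert 0 is always right, expert 1 always wrong.
rounds = [[1.0, 0.0]] * 5
forecasts, weights = exponentially_weighted_forecast(rounds, [1.0] * 5)
```

No assumption is made about how the outcomes are generated, which is what makes this kind of forecaster suitable for the adversarial setting.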

#ICYMI on the BGSE Data Science blog: RandNLA for LS (Part 2)

Randomized Numerical Linear Algebra for Least Squares – Part 2

by Robert Lange ’17

In today’s article we are going to introduce the Fast Johnson-Lindenstrauss Transform (FJLT). This result is going to be the foundation of two very important concepts which speed up the computation of an ε-approximation to the LS objective function and the target vector…
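To give a feel for the ingredients, here is a minimal sketch of a simplified relative of the FJLT, often called a subsampled randomized Hadamard transform: random sign flips, a normalised Walsh-Hadamard transform, then uniform sampling of `k` coordinates. The FJLT as originally proposed uses a sparse random projection in place of the uniform sampling step, so this is an assumption-laden illustration rather than the construction from the article.

```python
import math
import random


def fwht(values):
    """Normalised fast Walsh-Hadamard transform; length must be a power of two."""
    x = list(values)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def randomized_hadamard_sketch(vector, k, seed=0):
    """Flip signs at random, mix with Hadamard, then keep k rescaled entries."""
    rng = random.Random(seed)
    n = len(vector)
    signed = [v * rng.choice((-1.0, 1.0)) for v in vector]
    mixed = fwht(signed)
    kept = rng.sample(range(n), k)
    rescale = math.sqrt(n / k)
    return [mixed[i] * rescale for i in kept]
```

The sign flips and the Hadamard step preserve the Euclidean norm exactly and spread the vector's mass across coordinates, which is why sampling only a few coordinates afterwards tends to preserve the norm approximately.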

See also Part 1 of this post

Alum Charlie Thompson (ITFD ’14) uses data science to build a virtual Coachella experience

ITFD alum Charlie Thompson ’14 is an R enthusiast who enjoys “tapping into interesting data sets and creating interactive tools and visualizations.” His latest blog post explains how he used cluster analysis to build a Coachella playlist on Spotify:

“Coachella kicks off today, but since I’m not lucky enough to head off into the California desert this year, I did the next best thing: used R to scrape the lineup from the festival’s website and cluster the attending artists based on audio features of their top ten Spotify tracks!”
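Charlie's analysis was done in R and the excerpt does not say which clustering method he used. As a language-neutral illustration of the idea, the Python sketch below runs plain k-means on two audio features per artist. The feature values are made up, and the choice of k-means is an assumption.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means on lists of floats, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical (danceability, energy) values for six made-up artists.
features = [
    [0.90, 0.80], [0.20, 0.30], [0.85, 0.90],
    [0.15, 0.20], [0.80, 0.85], [0.25, 0.25],
]
labels, centroids = kmeans(features, k=2)
```

Artists that share a label have similar audio profiles, so each cluster can serve as the basis for one playlist.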

source: Charlie Thompson


Read the full blog post on his website

Charlie shares a bit of his background on his website:

Currently an Analytics Specialist at a tech startup called VideoBlocks, I create models of online customer behavior and manage our A/B testing infrastructure. I previously worked as a Senior Data Analyst for Booz Allen Hamilton, where I developed immigration forecasts for the Department of Homeland Security. I also built RShiny applications for various clients to visualize trends in global disease detection, explore NFL play calling, and cluster MLB pitchers. After grad school I worked as a Research Assistant in the Macroeconomics Department of Banc Sabadell in Spain, measuring price bubbles in the Colombian housing market.

I have an MS in International Trade, Finance, and Development from the Barcelona Graduate School of Economics and a BS in Economics from Gonzaga University. For my Master’s thesis I drafted a policy proposal on primary education reform in Argentina, using cluster analysis to determine the optimal regions to implement the program. I also conducted research in behavioral economics and experimental design, using original surveys and statistical modelling to estimate framing effects and the maximization of employee effort.

Read more about Charlie on his website

Statistical Racism

Nandan Rao ’17 (Data Science) has posted a simulation over on the BGSE Data Science blog to see if racial profiling really helps catch more criminals.

Source: Nandan Rao ’17

“In the real-life algorithms being implemented by police departments, as in our toy simulation, the data used to find criminals is not the data on crimes, but the data on crimes caught.”

Read the post and see the code he uses to produce the simulation and graphics over on the BGSE Data Science blog.


#ICYMI on the BGSE Data Science blog: Interview with Ioannis Kosmidis

In this series we interview professors and contributors to our fields of passion: computational statistics, data science and machine learning. The first interviewee is Dr. Ioannis Kosmidis. He is a Senior Lecturer at University College London, and during our first term he taught a workshop on GLMs and statistical learning in the n>>p setting. His research interests also include data-analytic applications for sport sciences and health… Read the full post by Robert Lange ’17 and Nandan Rao ’17 on Barcelona GSE Data Scientists