Recently, I was asked if what we do at The Analytics Store involves ‘Big Data’. In my mind I have been analysing big data for years, but it did get me thinking about the whole emergence of the ‘Big Data’ trend and where The Analytics Store fits into it. Around the same time, I was also asked to give a joint presentation, with a company I had worked with in a previous lifetime, on the use of ‘Big Data’ in Analytics. I agreed, but had my doubts.
In preparing for the presentation, I asked myself: what is ‘Big Data’? Is this just a good marketing tagline, or is there something more behind it? There isn’t a huge amount of meaningful information about ‘Big Data’ out there, beyond the marketing bumph. The Big Data trend emerged around early 2011 (see the Google Trends image below). Most of the material references Big Data as being multi-structured, complex data that is difficult to process using traditional methods. Big Data is generally described in terms of three dimensions: variety (number of different data sources and types), velocity (speed at which data is produced and processed), and volume (amount of data). After reading a whole slew of adjective-filled articles, however, I was still not convinced that there was anything new going on.
In that previous lifetime I mentioned, around 2008–2009, I was working for an online gaming company, happily analysing data. The analytics team was new at the time, and we started by looking at all of the data available for us to use. Most of the company and customer data was available, but one data source, containing all of the customer interactions, looked particularly useful. This data set, however, was simply too big for us to do any sort of meaningful analysis on within a reasonable timescale. You could have called it a ‘Big Data’ source. Being the data junkie that I am, I was disappointed, but we had lots of other data to work with, so we got busy with that. The company I was presenting with recently had been working on accessing and processing the very same data source that had been off limits for me. During their presentation it was revealed that, using new methodologies such as MapReduce, data processes that used to take a week to run were now completing in 15 minutes. The speed at which massive data volumes were accessed and processed was staggering. The numbers spoke for themselves.
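For readers unfamiliar with the MapReduce pattern mentioned above, here is a toy sketch in plain Python (the dataset and field names are invented for illustration; a real deployment would use a framework like Hadoop to run the map and reduce steps in parallel across a cluster). It counts interactions per customer by mapping each record to a key/value pair, grouping by key, and reducing each group to a total:

```python
from collections import defaultdict

# Hypothetical customer-interaction records (invented for illustration)
interactions = [
    {"customer": "A", "event": "login"},
    {"customer": "B", "event": "bet"},
    {"customer": "A", "event": "bet"},
    {"customer": "A", "event": "logout"},
]

# Map step: emit a (key, value) pair for each record
mapped = [(record["customer"], 1) for record in interactions]

# Shuffle step: group the emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce step: aggregate each group independently
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'A': 3, 'B': 1}
```

The point of the pattern is that the map and reduce steps touch each record or group independently, which is what lets a week-long job be spread across many machines.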
So, what has really changed with the emergence of ‘Big Data’ is the ability to access and manipulate really large datasets. Yes, I am convinced that this is more than just good marketing, but how does it impact how we work as data miners/scientists/statisticians? Do we need to develop new functions and algorithms to analyse all of this fabulous data, or can we simply apply our tried and tested techniques, such as Decision Trees, Neural Networks, Clustering, Association Analysis, and Regression, to these data sources? Only time will tell.