Stop Chasing Big Data; Mid Data Makes More Sense
[Re-posted from OdinText.com Blog]
After attending the American Marketing Association’s first conference on Big Data this week, I’m even more convinced of what I already suspected from speaking with hundreds of Fortune 1000 marketers over the last couple of years: extremely few are working with anything approaching what would be called “Big Data,” and I believe they don’t need to. Many, however, should start thinking about how to work with Mid Data!
“Big Data”, “Big Data”, “Big Data”. It seems like everyone is talking about it, but I find extremely few researchers are actually doing it. Should they be?
If you’re reading this, chances are that you’re a social scientist or business analyst working in consumer insights or a related field. I think it’s high time we narrowed the definition of “Big Data” a bit and introduced a new, more meaningful and realistic term, “MID DATA,” to describe what is really the beginning of Big Data.
If we introduce this new term, it only makes sense that we refer to everything that isn’t Big or Mid data as Small Data (I hope no one gets offended).
I’ve included a chart, and for simplicity I will think of size here as the number of records, or sample size if you prefer.
‘Small Data’ can include anything from a single individual interview in qualitative research to several thousand survey responses in longitudinal studies. At this size, quantitative and qualitative data can technically be lumped together, as neither fits the generally agreed upon (and admittedly loose) definition of what is currently “Big Data”. You see, rather than a specific size, the current definition of Big Data varies depending on the capabilities of the organization in question. The general rule is that Big Data is data which cannot be analyzed with commonly used software tools.
As you can imagine, this definition is an IT/hardware vendor’s dream, as it describes a situation where a firm does not have the resources to analyze (supposedly valuable) data without spending more on infrastructure, usually a lot more.
What then is Mid Data? At the beginning of Big Data, some of the same data sets we might call Small Data can quickly turn into Big Data. Consider, for instance, the 30,000-50,000 records from a customer satisfaction survey, which can usually be analyzed in commonly available analytical software like IBM SPSS without crashing. Add text comments to this same data set, however, and performance slows considerably; the analysis will now often take too long to process or, more typically, crash the software.
If these text comments are also coded, as is the case in text mining, the additional variables may increase the size of the dataset significantly. This is currently viewed as Big Data, where more powerful software will be needed. However, I believe a more accurate description would be Mid Data, as it is really just the beginning of Big Data, and there are many relatively affordable approaches to dealing with data of this size. But more about this in a bit…
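To make the point concrete, here is a minimal sketch of how coding open-ended comments widens a survey dataset. The respondent records, the keyword-based coding frame, and all field names are hypothetical (real text mining is far richer than keyword matching):

```python
# Hypothetical sketch: coding open-ended survey comments into 0/1
# variables widens each record. All names and data are made up.

responses = [
    {"id": 1, "satisfaction": 4, "comment": "great service but slow shipping"},
    {"id": 2, "satisfaction": 2, "comment": "price too high"},
    {"id": 3, "satisfaction": 5, "comment": "great service"},
]

# A toy coding frame: each code is triggered by a few keywords.
codes = {
    "service": ["service"],
    "shipping": ["shipping", "delivery"],
    "price": ["price", "expensive"],
}

for row in responses:
    for code, keywords in codes.items():
        # One new binary variable per code, per respondent.
        row[f"code_{code}"] = int(any(k in row["comment"] for k in keywords))

print(len(responses[0]))  # 3 original fields + 3 coded variables = 6
```

With a realistic coding frame of dozens or hundreds of codes, each of 30,000-50,000 records grows by that many variables, which is exactly how a manageable survey file tips into the Mid Data range.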
Now that we’ve taken a chunk out of Big Data and called it Mid Data, let’s redefine Big Data, or at least agree on where Mid Data ends and when ‘Really Big Data’ begins.
To understand the differences between Mid Data and Big Data we need to consider a few dimensions. Gartner analyst Doug Laney famously described Big Data as 3-Dimensional, that is, as having increasing volume, variety, and velocity (now commonly referred to as the 3V model).
To understand the difference between Mid Data and Big Data though, only two variables need to be considered, namely Cost and Value. Cost (whether in time or dollars) and expected value are of course what make up ROI. This could also be referred to as the practicality of Big Data Analytics.
While we know that some data is inherently more valuable than other data (100 customer complaints emailed to your office should be more relevant than 1,000 random tweets about your category), one thing is certain: data that is not analyzed has absolutely no value.
At the far right of the spectrum, past Mid Data, lies Really Big Data: the point beyond which an investment in analysis no longer makes sense, because the cost (which includes the risk of not finding insights worth more than the dollars invested) outweighs the expected value. Somewhere past Mid Data, big data analytics becomes impractical both theoretically and, for your firm, in very real economic terms.
Mid Data, on the other hand, can be viewed as the Sweet Spot of Big Data analysis: that which is currently possible, worthwhile, and within budget.
Mid Data is where many of us in market research have a great opportunity. It is where very real and attainable insight gains await.
Really Big Data, on the other hand, may be well past a point of diminishing returns.
On a recent business trip to Germany I had the pleasure of meeting a scientist working on a real Big Data project: the famous Large Hadron Collider at CERN. Unlike CERN, consumer goods firms will not fund the software and hardware needed to analyze that level of Big Data. Data magnitudes common at the Collider (150 million sensors delivering data 40 million times per second) are neither economically feasible for business nor needed. In fact, even the scientists at CERN do not analyze all of this Big Data. Instead, they filter out 99.999% of collisions, focusing on just 100 “Collisions of Interest” per second.
The good news for us in business is that, if we’re honest, customers really aren’t that difficult to understand. There are now many excellent and affordable Mid Data software tools available, for both data and text mining, that do not require exabytes of data or massively parallel software running on thousands of servers. While magazines and conference presenters like to reference Amazon, Google and Facebook, even these somewhat rare examples sound more like IT sales science fiction, and the presenters never mention the sampling of data that occurs even at these companies.
As the scientists at CERN have already discovered, it’s more important to properly analyze the fraction of the data that is important (“of interest”) than to process all of the data.
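The filter-first principle translates directly to business data. A minimal illustrative sketch, with an entirely hypothetical record format and relevance rate:

```python
import random

random.seed(7)

# Simulate one million incoming records (e.g. random tweets). The field
# name and the ~0.1% brand-relevance rate are hypothetical.
stream = ({"id": i, "mentions_brand": random.random() < 0.001}
          for i in range(1_000_000))

# Filter-first: keep only the tiny fraction "of interest", and spend
# the real analytical effort on that subset alone.
of_interest = [rec for rec in stream if rec["mentions_brand"]]

print(len(of_interest))  # roughly 1,000 of 1,000,000 records remain
```

The surviving records fit comfortably in ordinary analytical tools, which is the whole point: a well-chosen filter turns a Really Big Data problem into a Mid (or even Small) Data one.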
At this point some of you may be wondering: well, if Mid Data is more attractive than Big Data, isn’t Small Data even better?
The difference of course is that as data increases in size we can not only be more confident in the results, but we can also find relationships and patterns that would not have surfaced in traditional Small Data. In marketing research this may mean the difference between discovering a new niche product opportunity or quickly countering a competitor’s move. In pharma, it may mean discovering a link between a smaller population subgroup and a certain high cancer risk, thus saving lives!
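The statistical intuition behind the confidence claim can be sketched with the standard margin-of-error formula for a proportion; the sample sizes and the 2-point subgroup gap below are purely illustrative:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p observed in a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# A 2-point difference between subgroups (say 12% vs 10%) is invisible
# in a typical Small Data survey but resolvable at Mid Data scale.
for n in (400, 40_000):
    print(n, round(margin_of_error(0.10, n), 4))
# 400    -> 0.0294  (±2.9 points: the 2-point gap drowns in noise)
# 40000  -> 0.0029  (±0.3 points: the gap stands out clearly)
```

The same narrowing is what lets Mid Data surface small-subgroup patterns, like the pharma example above, that a few hundred interviews could never confirm.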
Mid Data could benefit from further definition and best practices. Ironically, some C-Suite executives are currently asking their IT people to “connect and analyze all our data” (specifically the “varied” data in the 3V model), and in the process they are attempting to create Really Big (often bigger than necessary) Data sets out of several Mid Data sets. This practice exemplifies the ROI problem I mentioned earlier. Chasing after a Big Data holy grail will not guarantee any significant advantage. Those of us who are skilled in the analysis of Small or Mid Data understand that conducting the same analysis across varied data is typically fruitless.
Comparing accounting data to consumer respondent data makes about as much sense as comparing apples to cows. Likewise, comparing your customers in Japan to your customers in the US makes no sense for various reasons, ranging from cultural differences to differences in very real tactical and operational options.
No, for most of us, Mid Data is where we need to be.
[Full Disclosure: Tom H. C. Anderson is Managing Partner of Anderson Analytics which develops and sells patent pending data mining and text analytics software platform OdinText]