Do we really need big data, or do we need nimble data?
At this week’s IAB Annual Leadership Meeting, Fidelity’s CMO Jim Speros outlined the challenges with “Big Data”: too much data, access to data, and lack of analytical capability, among others. A spirited discussion ensued, which has led me to follow up on my previous post on big data.
At the center of our discussion was the question of Big versus Nimble. I say “versus” because even if you could make your data both Big and Nimble, which is what a lot of companies are setting out to achieve, the reality is that this comes at a considerable cost in time, money and resources.
I used to say to my Amazon colleagues, “I would rather have data points on 1,000 random, representative customers with historical behavior data than have all of our hundreds of millions of customers with only one day’s worth of partial history.”
To an analyst, the beauty of sampling is that the insights you generate to inform business decisions would not be materially different whether you analyzed the sample or the whole population, as long as the sampling process is robust and statistical significance tests are done. Big Data proponents would have you believe that the priority for companies should be how to store and manipulate terabytes of data. A far cheaper and more attainable goal is to understand the minimal canonical set of data that needs to be stored (and accessed quickly) in order to achieve the business objectives that the data is supposed to serve.
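To make the sampling argument concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than real data: the population is synthetic (exponentially distributed "spend per customer" with a mean of about 50), the helper name `sample_mean_with_ci` is made up for this post, and the interval uses the usual normal approximation with z = 1.96. It estimates the population mean from 1,000 randomly chosen customers and reports a rough 95% margin of error.

```python
import math
import random
import statistics


def sample_mean_with_ci(population, n, seed=0, z=1.96):
    """Draw a simple random sample of size n (without replacement) and
    return (sample mean, half-width of an approximate 95% interval)."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * std_err


# Synthetic stand-in for "all customers": 200,000 spend values, mean ~50.
data_rng = random.Random(42)
population = [data_rng.expovariate(1 / 50.0) for _ in range(200_000)]

population_mean = statistics.fmean(population)
sample_mean, half_width = sample_mean_with_ci(population, n=1000, seed=7)

print(f"population mean: {population_mean:.2f}")
print(f"sample estimate: {sample_mean:.2f} +/- {half_width:.2f}")
```

The margin of error shrinks with the square root of the sample size, not with the size of the population, which is why a well-drawn sample of 1,000 can stand in for hundreds of thousands of records when the question is about an average. Rare events and small segments are a different matter and may need larger or stratified samples.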
In other words, companies should find the “kernel” of their Big Data.
The kernel is by definition a small yet central and essential part of the whole. I also like the term kernel because, in abstract algebra, the kernel has a precise meaning in the study of homomorphisms. As data scientists, our job of extending insights from one phenomenon (or subset of data) to others (or the whole universe of data) is eerily similar to finding an isomorphism between two groups.
And just how do we go about finding this kernel of Big Data?
In my experience, here are the three questions you need answers for before embarking on your Big Data journey:
- What business problems are you solving, and what questions do you need answered?
  - Start with the business questions you are addressing, not the data you have. Starting from the data will bias your view of what you should do. It’s not what you can do given the data you have; it’s what you should do given the problems you want to solve.
- What is the absolute minimum core set (the kernel) of data you need? Which data do you need to access constantly, which only periodically, and which is irrelevant to your business questions?
  - Prioritizing storage and processing avoids the hard work of doing due diligence on actual data needs by linking them to business problems. It simply postpones the pain by saying, “Let me gather the data and put it somewhere first, and worry about what to do with it later.” I recently made this mistake with the digital photos I have of my children, which will now need to wait until retirement to be sorted through. I wish I had been more systematic in determining which pictures were the most important to keep in an easily accessible location.
- What’s the plan to turn the data into answers to business questions? (i.e., where is the analytical capability?)
  - If there is no blueprint for turning data into actionable insights, then whatever decisions you make on Big Data processing or storage will be blind, and are often likely to hinder insight generation rather than facilitate it. Going back to my digital photos: if I had thought earlier about what I would do with my kids’ pictures (making photo books based on occasions, events and milestones), I could have designed a system that organizes the pictures differently.
  - Going through the analytics planning will greatly inform which core set of data you need agile access to.
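One way to work through the second question is to write the business questions down next to the data each one needs, and let the kernel fall out of that mapping. The sketch below is only an illustration of the idea: the questions, field names, and the `find_kernel` helper are all hypothetical, and the "constant" versus "periodic" labels are assumptions about how often each question needs answering.

```python
def find_kernel(questions, available_fields):
    """Split available fields into those needed constantly, those needed
    only periodically, and those no business question uses."""
    constant, periodic = set(), set()
    for cadence, fields in questions.values():
        target = constant if cadence == "constant" else periodic
        target.update(fields)
    # A field needed constantly by any question belongs in the fast tier.
    periodic -= constant
    irrelevant = set(available_fields) - constant - periodic
    return {"constant": constant, "periodic": periodic, "irrelevant": irrelevant}


# Hypothetical business questions: (how often asked, fields required).
questions = {
    "Which customers are at risk of churning this week?": (
        "constant", {"customer_id", "last_order_date", "order_count"}),
    "Which product categories grew last quarter?": (
        "periodic", {"customer_id", "category", "order_value"}),
}

# Hypothetical inventory of everything currently being collected.
available_fields = {
    "customer_id", "last_order_date", "order_count", "category",
    "order_value", "raw_clickstream", "session_user_agent",
}

kernel = find_kernel(questions, available_fields)
for tier in ("constant", "periodic", "irrelevant"):
    print(tier, sorted(kernel[tier]))
```

In this made-up example, the clickstream and user-agent fields serve no stated question, so they are candidates for cheap storage or for not being collected at all, while the three "constant" fields are the kernel that deserves fast, always-on access.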
With answers to these three questions, you will have found the kernel of your Big Data and be ready to begin your Big Data journey. And, for the analysts and data scientists out there, if you’d like to join our team, drop us a line.