Data can be used to describe the world around us, or more precisely, the part of the world that is relevant to the problem at hand. For example, we may want to understand who a person applying for a loan is, to assess their probability of default. Or what songs a person prefers, to keep them on a music platform. Or what is around a self-driving car, so that it doesn’t kill anyone!
In order to leverage analytics to solve problems like these, we need to describe the problems to a computer. And we do that using data — picking the factors we believe are most relevant for each problem.
When assessing the probability of default, we care about a borrower’s credit score and debt-to-income ratio, but not their favourite song. Music preferences, on the other hand, are gauged from previously played songs, not from one’s credit score. And a self-driving car cold-heartedly treats people as obstacles, not caring about their music taste or credit history at all!
These are just some examples of how data is being used to describe the world. Each of the data descriptions is built with a goal in mind and a problem at hand — because the context matters. Data scientists and whole companies are doing their best to build a perfect data representation of the problem they are working on.
The data representation is a critical input for a computer. But it doesn’t matter whom we provide the data to, be it a computer or a person. Either way, what we need to keep in mind (despite all our best efforts) is:
Data is an imperfect reflection of the real world.
Why imperfect? Well, in essence, data is a mirror of the world. Now, look at the image above. While the tree is crisp, with all the details sharp and clear, the reflection in the water is blurred, imperfect, simplified. It’s just a reflection in the water!
Similarly, data is a blurred, imperfect and simplified version of the world. Our world is simply too complex to be described by however many features we fit into a data store. And that is before we even ask about the quality of the features we do fit in, and how well they reflect what they represent.
Should we worry? Well, we could. But it’s not going to change anything. Perhaps, we should rather embrace this inherent ambiguity of data.
Once we start thinking about data as a reflection of the world, we realise that what we want to do with a complex, real business problem is to capture its essence in data. We do that because we can then use technology and analytics to help find a solution. We can play with scenarios and manipulate the problem in ways that are impossible in the real, physical world. And if our data reflection is correct, representative, and unbiased, we can then apply the solution we found in the data world back to our real-world problem.
And that’s the crux of it. It doesn’t matter if data is good, or bad. Right, or wrong. Accurate, or not. What matters is: how representative of the problem is the data? Will the (data-powered or data-informed) solution work in the real world?
That’s why we always need to start with the business problem! We need to make sure we captured the essence of the problem well. We need to think about all the important factors thoroughly. What are the main objects that play a role in our problem? What are their properties and the attributes relevant for the problem? How do the objects relate to and influence one another? Are there any external factors? What data can be used to reflect these? Is the data reliable? Is it representative enough?
The concept of data reflecting reality is incredibly useful for data scientists. But it is arguably even more important for non-data professionals. Let me give two reasons:
- It provides a powerful mental image that helps us embrace and adopt data, ultimately for our own benefit and that of our businesses.
- It prompts us to think critically about how well data represents our real-world situation, and to constantly look for both gaps and biases in the data.
We can’t rely on data to automatically give us all the answers. It’s always important to stop and think. Whenever we face a problem we aim to solve with data, we should be asking:
How representative of the real world is the data?
Because we need to make sure we have a simple (not simplistic!), unbiased, and accurate data representation of the real-world problem if we want to apply a data solution back to the real world and have a reasonable chance that it will work well.
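For readers who like to see an idea in code, here is a minimal sketch of one way to start asking that question: compare how often each group appears in our data with how often we believe it appears in the real world. Everything in it is an assumption made up for illustration, including the function name `share_gaps`, the age bands, the reference shares, and the 10-percentage-point threshold; none of it comes from a real data set.

```python
from collections import Counter


def share_gaps(sample, reference_shares):
    """For each group, return (share in the sample) - (share in the reference)."""
    if not sample:
        raise ValueError("sample must not be empty")
    counts = Counter(sample)
    total = len(sample)
    # Include groups that appear in only one of the two sources.
    groups = set(reference_shares) | set(counts)
    return {
        group: counts.get(group, 0) / total - reference_shares.get(group, 0.0)
        for group in groups
    }


# Hypothetical illustration: age bands of the loan applicants in our data...
sample = ["18-30"] * 10 + ["31-50"] * 70 + ["51+"] * 20
# ...and the shares we assume those bands have among real-world borrowers.
reference = {"18-30": 0.30, "31-50": 0.45, "51+": 0.25}

gaps = share_gaps(sample, reference)

# Flag groups whose share is off by more than 10 percentage points
# (an arbitrary threshold chosen for this example).
flagged = sorted(group for group, gap in gaps.items() if abs(gap) > 0.10)
print(flagged)
```

In this made-up case the youngest applicants are under-represented and the middle band is over-represented, so a solution built on this data may not transfer well to younger borrowers. A check like this covers only one attribute at a time; it is a prompt to stop and think, not a replacement for it.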