Keeping Up With Data — Week 39 Reading List
5 minutes for 5 hours’ worth of reading
What is the role of intuition in data science? Having studied general mathematics, I learned that one can’t emphasise definitions and theorems over understanding. This might not always be obvious in real life: knowing the ‘what’ and ‘how’ gets the job done most of the time, but without the ‘why’ we can easily get lost in more complex problems. I’d argue that a desire to become intimately familiar with various concepts by looking at them from different angles, visualising them, and practising them should be in the DNA of any data scientist. Just as the image above can help us develop stronger intuition for R².
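To make that R² intuition a little more concrete, here is a minimal sketch of my own (not taken from any of the linked articles): R² compares a model’s squared residuals to those of the simplest possible baseline, always predicting the mean.

```python
def r_squared(y, y_pred):
    """R² = 1 - SS_res / SS_tot: the share of variance explained,
    relative to a baseline that always predicts the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)                        # perfect fit -> 1.0
baseline = r_squared(y, [2.5, 2.5, 2.5, 2.5])    # mean-only baseline -> 0.0
```

Seen this way, R² is simply “how much better than the mean is my model?”, which is why a model can even score below zero when it predicts worse than the mean.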
So, let’s get into this week’s reading list. Because the intuition isn’t going to build itself!
- Is BI dead? ‘Original BI’ has been taken apart, and most of its functions are now supported by separate tools in the modern data stack. Today’s BI doesn’t worry about data ingestion, storage, or transformation. What’s left is data consumption, and that hasn’t changed much since the inception of BI. BI is not dead. Endangered? Maybe. It needs to evolve to escape extinction. Benn argues that modern BI should focus on consumption only. It shouldn’t worry about using data in operational ways, nor should it come with a bespoke data governance layer; it should be legless. However, BI should include all consumption: both self-serve consumption and ad-hoc analysis. Currently it covers only the first; the second is often done in SQL and in the Python notebooks of data analysts. Doing both in one place brings data professionals and businesspeople together. It helps build the notorious bridge between data and business. As Benn puts it: “So long as companies need dashboards and executives need reports to go spelunking through as they wait for the economy class passengers to board, we’ll need BI.” The question is: what will the BI of tomorrow look like? (Benn Stancil)
- All statistical models are wrong. Are any useful? Statistical models are often powerful in explaining real-world phenomena, and not only those governed by natural laws. They can also, for instance, estimate the probability of an experiment’s outcome for a given population. To do that, we randomly select individuals from the population to take part in the experiment. Our model (e.g., logistic regression) then takes the data from the experiment and provides conclusions about the population. However, statistical models come with a set of assumptions that are not always validated. Expecting the randomness used in the survey and experiment design to cover the randomness of the natural world is, let’s say, naive. But such is the convention in scientific practice. The consequence is that parameter estimates are often incorrect. What’s worse, they can be so incorrect that the true parameters are not even covered by the 95% confidence intervals. Should we worry? (arg min blog)
- What is an A/B Test? Let’s stay with the topic of evaluating the odds ratio. What is the difference between the odds of an outcome (e.g., playing a movie) for two randomly selected groups? How do we design the experiment? How do we select the groups so that conclusions generalise from the sampled groups to the whole population? And which metrics should we use to measure the impact? The article aims at building an intuition for tackling these questions. The next one on the table is how to evaluate the differences between the groups. With the previous article in mind, would you use logistic regression or simply calculate the odds ratio for the observed data? (Netflix Technology Blog)
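The coverage worry raised in the arg min item is easy to see for yourself with a small simulation. This is my own sketch, not the article’s code: it repeatedly draws Bernoulli samples and checks how often a textbook 95% Wald interval for a proportion actually contains the true parameter. (Here the model assumptions do hold, so coverage comes out near the nominal 95%; the article’s point is what happens when they don’t.)

```python
import math
import random

def wald_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a Bernoulli proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Simulate many experiments and measure empirical coverage.
random.seed(42)
p_true, n, reps = 0.3, 50, 2000
covered = 0
for _ in range(reps):
    successes = sum(random.random() < p_true for _ in range(n))
    lo, hi = wald_ci(successes, n)
    covered += lo <= p_true <= hi
coverage = covered / reps
print(f"empirical coverage: {coverage:.3f}")
```

When the sampling assumptions are violated (non-random selection, dependent observations), the same calculation happily reports 95% intervals whose real coverage is far lower, which is exactly the failure mode the article describes.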
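As for the Netflix item, the odds ratio it discusses can be computed directly from a 2×2 table of outcomes. A minimal sketch of my own, with made-up counts:

```python
def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Ratio of the odds of the outcome in group A to the odds in group B."""
    odds_a = events_a / non_events_a
    odds_b = events_b / non_events_b
    return odds_a / odds_b

# Hypothetical A/B test: 30 of 100 users in group A played the movie,
# versus 20 of 100 in group B.
ratio = odds_ratio(30, 70, 20, 80)
print(f"odds ratio: {ratio:.3f}")  # (30/70) / (20/80) ≈ 1.714
```

A logistic regression with a single group indicator would estimate the same quantity: the exponentiated coefficient on the indicator is this odds ratio, which is the connection the article invites you to notice.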
That’s it for this week. So long and thanks for all the fish!