Keeping Up With Data #79
5 minutes for 5 hours’ worth of reading
The image above comes from Bill Schmarzo’s article The Ugly Truth About Data Management. It is a sad truth that many data infrastructures resemble a Rube Goldberg machine: one that takes a relatively straightforward business reality and captures it in an incredibly complex (and complicated) way. Not only does this prevent an organisation from effectively solving its business problems, it also keeps many of its smartest people busy fighting to keep the system alive. To minimise the risk of making wrong and, worst of all, irreversible choices, we should keep learning from industry best practices and from the lessons others have shared.
And that’s the whole point of this weekly reading list.
- Recommender Systems, Not Just Recommender Models: The purpose of a recommender model is to score a user’s interest in an item. This is obviously very useful whenever companies want to serve personalised subsets of content, items, products etc. to their customers. But in real life, the model alone is not enough because there are many other challenges. There might be too many items to score them all, there might be items we don’t want to recommend, and there might be items we want to promote. Therefore, a complete recommender system goes well beyond the recommender model and consists of four main stages: Retrieval, Filtering, Scoring, and Ordering. (Even Oldridge @ NVIDIA Merlin)
- How to Measure and Mitigate Position Bias: “Position bias happens when higher positioned items are more likely to be seen and thus clicked regardless of their actual relevance. This leads to lesser engagement on lower ranked items.” This is a challenge for ML engineers because “training our models on biased historical data perpetuates the bias via a self-reinforcing feedback loop.” Luckily, there are ways to measure and mitigate position bias. Adding randomness is one of them. (Eugene Yan)
- Advanced exploratory data analysis (EDA) with Python: Every data project needs EDA. Sometimes a thorough one, sometimes a quick one. But we always need to get familiar with the data and inspect anything relevant to the problem at hand. I remember writing my own EDA package ten years ago to make this important step more productive. Nowadays, there are many packages and guides, such as this one, that help data analysts and data scientists spend more time exploring the data than writing code to explore the data. From time to time, I review data science test tasks as part of a recruitment process. My advice to candidates: please don’t rush into modelling too soon. In the end, it’s usually very inefficient. (Michael Notter @ EPFL Extension School)
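To make the four stages from the recommender systems item more concrete, here is a minimal sketch in plain Python. It is not taken from the article or from NVIDIA Merlin: the function names, the tag-based catalogue, and the Jaccard overlap that stands in for a trained recommender model are all assumptions made for illustration.

```python
def retrieve(user_tags, catalog, limit=100):
    """Retrieval: cheaply narrow the catalogue to items sharing a tag with the user."""
    candidates = [item for item in catalog if user_tags & item["tags"]]
    return candidates[:limit]


def filter_items(candidates, blocked_ids):
    """Filtering: drop items we must not recommend (hypothetical block list)."""
    return [item for item in candidates if item["id"] not in blocked_ids]


def score(user_tags, item):
    """Scoring: stand-in for the recommender model (Jaccard overlap of tags)."""
    union = user_tags | item["tags"]
    return len(user_tags & item["tags"]) / len(union) if union else 0.0


def order(scored, promoted_ids, k):
    """Ordering: promoted items first, then by descending score; keep the top k."""
    ranked = sorted(
        scored,
        key=lambda pair: (pair[0]["id"] not in promoted_ids, -pair[1]),
    )
    return [item["id"] for item, _ in ranked[:k]]


def recommend(user_tags, catalog, blocked_ids=frozenset(), promoted_ids=frozenset(), k=3):
    candidates = retrieve(user_tags, catalog)
    candidates = filter_items(candidates, blocked_ids)
    scored = [(item, score(user_tags, item)) for item in candidates]
    return order(scored, promoted_ids, k)


# Toy catalogue, invented for this example.
catalog = [
    {"id": "a", "tags": {"python", "ml"}},
    {"id": "b", "tags": {"python"}},
    {"id": "c", "tags": {"cooking"}},
    {"id": "d", "tags": {"ml", "stats"}},
    {"id": "e", "tags": {"python", "ml", "stats"}},
]
top = recommend({"python", "ml"}, catalog, blocked_ids={"b"}, promoted_ids={"d"}, k=3)
print(top)  # ['d', 'a', 'e']
```

Item "c" never reaches the model (retrieval), "b" is removed by a business rule (filtering), and "d" is shown first despite having the lowest score (ordering). That is the sense in which the model is only one part of the system.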
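The idea of measuring position bias by adding randomness can also be sketched with a small simulation. This is an illustration under stated assumptions, not the method from Eugene Yan’s article: it assumes a simple position-based click model (a click happens only if the slot is examined and the item is relevant), and the relevance and examination probabilities are made up. Because items are shuffled on every impression, each position sees the same average relevance, so the differences in click-through rate between positions reflect position alone.

```python
import random


def ctr_by_position(relevance, examine_prob, n_impressions, seed=0):
    """Simulate randomly ordered impressions and return the click-through rate per position.

    relevance:    dict item -> probability of a click once the item is examined (assumed)
    examine_prob: list of probabilities that the user examines each position (assumed)
    """
    rng = random.Random(seed)
    items = list(relevance)
    clicks = [0] * len(examine_prob)
    for _ in range(n_impressions):
        rng.shuffle(items)  # randomise the order to break the link between item and position
        for pos, p_examine in enumerate(examine_prob):
            examined = rng.random() < p_examine
            if examined and rng.random() < relevance[items[pos]]:
                clicks[pos] += 1
    return [c / n_impressions for c in clicks]


# Hypothetical numbers, chosen only for this example.
relevance = {"a": 0.5, "b": 0.3, "c": 0.2}
examine_prob = [1.0, 0.6, 0.3]

ctr = ctr_by_position(relevance, examine_prob, n_impressions=20_000)
# CTR relative to the top slot approximates the relative examination probability.
relative_bias = [c / ctr[0] for c in ctr]
print([round(b, 2) for b in relative_bias])
```

Estimates like `relative_bias` could then be used to reweight clicks from lower positions when training on historical logs. The cost is that randomised orderings are served to real users while the measurement runs.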
Brent Dykes wrote a piece for Forbes in which he advises companies not to let a misguided AI strategy sabotage their brand experience. He says that “by focusing exclusively on cost savings with your AI strategy, your organization could be sabotaging its own brand reputation.” I personally couldn’t agree more. Use technology to solve people’s problems. Not because it’s cool.