Keeping Up With Data — Week 35 Reading List
5 minutes for 5 hours’ worth of reading
Gartner’s hype cycle for data science and ML, shown above, brings plenty of terms we’ve been hearing for a while and a couple of new ones, too. Gartner often coins or popularises new terms. Some are understandable, like ‘small and wide data’; others need to be constantly googled (at least by me), like ‘citizen data science’ or ‘X analytics’. Another thing I find slightly confusing is the co-existence of ‘MLOps’ and ‘ModelOps’ in the picture. But I guess it says a lot that the ‘innovation trigger’ stage is full of terms, while the ‘plateau of productivity’ is not.
While sometimes thought leaders seem to be complicating simple things, data is generally about simplifying complex reality — as can be seen in the following articles.
- Simpson’s Paradox and Interpreting Data: Data is a finite representation of a very complex real world and will never be a perfect reflection of it. Developing an intuition for what is missing from the data (but should be included) is the art of data science. Simpson’s paradox describes a trend or result that is present when data is put into groups but reverses or disappears when the data is combined. The reason is so-called ‘lurking variables’, which split the data into multiple distributions. They are difficult to find, and the decision to look at the data together or by groups is entirely situational. People sometimes consider data an absolute truth. ‘Data don’t lie’, they say. Well, what if an important assumption is not met? Be careful when drawing conclusions about a complex reality based on findings from a simple reflection. (Tom Grigg @ TDS)
- The Role of AI in HR Decision Making: Is there anything more complex than people? In complex environments such as organisations, it’s difficult to imagine a fully autonomous AI making decisions. Luckily, that doesn’t mean HR can’t leverage AI for a wide range of decisions. Instead of automation, we should think of augmentation. Data-augmented decision making combines ‘could’, ‘should’ and ‘would’ questions. The first two can be answered with data: Could we fill a position with existing talent? Should we do it? The third type (Would the person be happy to transfer? Would it be a good fit?), not so much. But that’s the complexity of HR that we need to take into account. (myHRfuture)
- Pseudo-R²: A Metric for Quantifying Interestingness: For linear outcomes, the measure of interestingness commonly used by statisticians is ‘variance explained’, often described by R². But what to do with non-linear outcomes (e.g., “yes” or “no”)? For instance, which splits of an overall conversion rate do we consider most interesting? By device? By campaign? By country? By gender? And how can we quantify that? The suggestion is to use McFadden’s pseudo-R², mostly because it balances variation with composition. Pseudo-R² is low when the groups explain no variation (in conversion rates) and also when one of the groups is significantly larger than the others. This matches the intuition that the most interesting split is the one with proportionally sized groups and the largest differences between their conversion rates. (Heap blog)
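To make the Simpson’s paradox item above a bit more concrete, here is a minimal sketch in plain Python. The counts, the treatment labels "A" and "B", and the subgroup names are illustrative assumptions chosen for this sketch, not figures from the linked article. It shows one option winning inside every subgroup while losing once the subgroups are pooled, because the subgroup sizes (the lurking variable) differ between the two options.

```python
# Illustrative (successes, trials) counts per option and subgroup.
# These numbers are an assumption for the sketch, not taken from the article.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def pooled_rate(groups):
    """Success rate after combining all subgroups of one option."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(n for _, n in groups.values())
    return successes / trials


# Success rate of each option within each subgroup.
by_group = {
    option: {name: s / n for name, (s, n) in groups.items()}
    for option, groups in counts.items()
}

# Success rate of each option with the subgroups combined.
pooled = {option: pooled_rate(groups) for option, groups in counts.items()}

# A is better in every subgroup ...
a_wins_every_group = all(
    by_group["A"][name] > by_group["B"][name] for name in by_group["A"]
)
# ... yet B looks better once the subgroups are pooled.
b_wins_pooled = pooled["B"] > pooled["A"]

print(by_group)
print(pooled)
print(a_wins_every_group, b_wins_pooled)
```

The reversal comes purely from composition: option A is mostly applied to the harder "large" subgroup, so its pooled rate is dragged down even though it does better in both subgroups.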
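The pseudo-R² idea from the last item can also be sketched in a few lines. This assumes the standard definition of McFadden’s pseudo-R², one minus the ratio of the model log-likelihood to the null log-likelihood, where the "model" gives each group its own conversion rate and the "null" uses the overall rate. The function names and the example splits are hypothetical, and the linked article may set up the details differently.

```python
import math


def bernoulli_loglik(conversions, visitors, p):
    """Log-likelihood of the observed conversions at conversion rate p."""
    misses = visitors - conversions
    ll = 0.0
    if conversions > 0:
        ll += conversions * math.log(p)
    if misses > 0:
        ll += misses * math.log(1 - p)
    return ll


def mcfadden_pseudo_r2(groups):
    """groups maps a group name to (conversions, visitors)."""
    total_conv = sum(k for k, _ in groups.values())
    total_vis = sum(n for _, n in groups.values())
    overall = total_conv / total_vis
    # Null model: every group shares the overall conversion rate.
    ll_null = sum(bernoulli_loglik(k, n, overall) for k, n in groups.values())
    # Split model: each group gets its own observed conversion rate.
    ll_model = sum(bernoulli_loglik(k, n, k / n) for k, n in groups.values())
    if ll_null == 0:  # everyone or no one converted: nothing to explain
        return 0.0
    return 1 - ll_model / ll_null


# Hypothetical splits of a conversion rate, made up for illustration.
no_difference = {"a": (100, 1000), "b": (100, 1000)}   # 10% vs 10%
balanced_gap = {"a": (50, 1000), "b": (150, 1000)}     # 5% vs 15%, equal sizes
unbalanced_gap = {"a": (95, 1900), "b": (15, 100)}     # 5% vs 15%, one tiny group

r2_none = mcfadden_pseudo_r2(no_difference)
r2_balanced = mcfadden_pseudo_r2(balanced_gap)
r2_unbalanced = mcfadden_pseudo_r2(unbalanced_gap)
print(r2_none, r2_balanced, r2_unbalanced)
```

With these made-up numbers, the split with no difference scores zero, and the same 5% versus 15% gap scores lower when one group is much larger than the other, which is the balance between variation and composition described above.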
Apart from reading these (and many more) interesting articles, I’ve also published a piece about the challenges of data adoption, paired with tips on how to overcome them.
Until next week!