Keeping up with data — Week 47 reading list

5 minutes for 5 hours’ worth of reading


Recently, I updated my CV. For me, it is always an interesting exercise, one that requires looking at oneself from a distance. What am I good at? How do I convince others of it? What gaps should I focus on? Coding? Algorithms? Communication skills? Selling skills? What’s the best pathway?

Luckily, CVs don’t have to be updated too frequently, and the self-reflection can shift to self-improvement. Yet, of course, this means that the old CV will need to be updated again!

Anyway, without further ado, here are this week’s top reads:

  1. Estimating the Impact of Training Data with Reinforcement Learning: Not all data samples are equally useful for training deep-learning models. Knowing which ones are most valuable can improve the model performance and also suggest improvements for data collection. The authors suggest a novel approach — fitting a data value estimator together with the predictor in a reinforcement learning loop. (Google AI Blog)
  2. AI has cracked a key mathematical puzzle for understanding our world: Partial differential equations (PDEs) are very handy for describing the physical phenomena in our universe. But they are also notoriously hard to solve. The researchers at Caltech came up with a new deep-learning technique solving PDEs by fitting neural networks in Fourier space. An article that takes me back to varsity days. Not that I’m going to solve PDEs any time soon! (MIT Technology Review)
  3. SHAP values explained exactly how you wished someone explained to you: SHAP is an explainer of ML models based on game theory. It quantifies the contribution that each feature makes to the prediction of the model. The article explains the explainer so that black-box models are not explained by yet another black box. The exact method is computationally very demanding, requiring fitting 2^#features models, though usable approximations and sampling methods exist (e.g. here). It is exciting to see new methods for complex models and large datasets (like #1 above). (Towards Data Science)
  4. What I learned from looking at 200 machine learning tools: An industry analysis of AI/ML tools. Despite the boom of AI-powered start-ups, the number of tooling companies is low (see the list), though this might be changing with the recently growing popularity of MLOps. Major progress is still to be expected in the model deployment and serving space. The majority of the tools are open source. Not every company needs ML researchers, but many will need to serve ML models, so let’s keep an eye on the AI/ML tools. (Chip Huyen)
  5. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh: Making data architecture scalable by breaking it down into smaller, domain-specific components and treating data as a product seems to be the new trend. The ‘cooperation’ of the components is then ensured by globally governed interoperability and standardisation of communications. Just reading this brings back the nightmares of data management in corporates. (Zhamak Dehghani); see also What is a Data Mesh — and How Not to Mesh it Up: Arguably an easier-to-digest article on the same topic. (Towards Data Science)
  6. Data-science? Agile? Cycles? My method for managing data-science projects in the Hi-tech industry: Managing data science projects is notoriously demanding, largely because of the science component (requiring creativity, and often with uncertain timelines). Data science projects traditionally include six stages: 1) Literature review; 2) Data exploration; 3) Algorithm development; 4) Result analysis; 5) Review; and 6) Deployment. The advice is to state these stages explicitly in the project management tool and to move the project flexibly back and forth between them based on individual tasks, findings and approaching deadlines. Build the MVP quickly and iterate from there. (Towards Data Science)
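The Fourier-space idea behind #2 can be illustrated with a classical spectral method (not the neural operator from the paper itself): for the 1-D heat equation, differentiation becomes multiplication by ik in Fourier space, so each mode evolves independently and can be advanced exactly. A minimal NumPy sketch, where the function name and grid setup are my own assumptions:

```python
import numpy as np

def heat_spectral(u0, nu, t, length=2 * np.pi):
    """Solve u_t = nu * u_xx on a periodic domain by evolving each
    Fourier mode exactly: u_hat(k, t) = u_hat(k, 0) * exp(-nu * k^2 * t)."""
    n = len(u0)
    # Angular wavenumbers for a periodic grid of the given length.
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    u_hat = np.fft.fft(u0)
    u_hat *= np.exp(-nu * k**2 * t)  # exact decay of each mode
    return np.fft.ifft(u_hat).real
```

For an initial condition sin(x), this reproduces the analytical solution exp(-nu*t)*sin(x); the appeal of Fourier space is precisely that the PDE decouples into independent ODEs per mode.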
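To make the 2^#features cost mentioned in #3 concrete, here is a brute-force computation of exact Shapley values for a tiny model, where features outside a subset are replaced by baseline values. This is a sketch of the game-theoretic definition with hypothetical helper names, not the SHAP library’s API:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline, n_features):
    """Exact Shapley values by enumerating all 2^n feature subsets.
    `model` maps a feature vector to a scalar; features not in the
    subset are masked with the corresponding baseline value."""
    def value(subset):
        masked = [x[i] if i in subset else baseline[i] for i in range(n_features)]
        return model(masked)

    n = n_features
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for s in combinations(others, k):
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi += w * (value(set(s) | {i}) - value(set(s)))
        phis.append(phi)
    return phis
```

For a linear model the values simply recover coefficient * (feature − baseline), but the inner loop visits every subset, which is exactly why the approximations and sampling schemes mentioned above are needed in practice.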

The data mesh piece (see #5 above) got me thinking about the potential governance problems. Similarly to societies, data is powerless when isolated and difficult to govern when it gets too big. This area certainly requires additional consideration; if you have thoughts on this, please do add them in the comments below.

Thanks for reading!

Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium and LinkedIn.

Data scientist with corporate, consulting and start-up experience | avid cyclist | amateur pianist | CEO & co-founder at DataDiligence.com