Keeping up with data — Week 46 reading list
A curious mind wandering through the world of data
The world of data is not only about algorithms and technology. People are also part of the data value chain. My last gig gave me some insight into how people behave in the workplace and how habits are formed. Most importantly, it made me curious about these topics in connection with data.
This week’s list covers a mix of topics, so I hope there’s something for everyone.
- Boost Your Team’s Data Literacy: Companies lack data-driven problem-solving skills such as asking the right questions; testing hypotheses using A/B tests; understanding which data is relevant; interpreting data well enough to draw useful and meaningful conclusions; and telling a story that helps decision-makers see the big picture and act on the results of analysis. These ‘soft skills’ make a difference. The article’s suggestions are to (1) ensure people know how to use the tools; (2) set up a capability academy for data skills; (3) use examples and stories in awareness campaigns; and (4) bake data into all important decision-making. (HBR)
- Rethinking the build vs buy approach to talent: Hiring new employees to keep up with the rapid pace of technological, digital and data development is very expensive, if not impossible. More organisations are taking a hybrid approach and combining hiring with training. But L&D programmes aimed at developing new technical skills or data literacy need to look different from standard L&D solutions. They should be run by current practitioners and focus on projects and assignments tailored to the company’s data, tools and tech stack: ‘on the job’ data training. These, together with senior leadership leading by example, are important for making a company truly data-driven. (Josh Bersin)
- Models for integrating data science teams within organizations: Deploying data scientists in organisations is not easy. There are plenty of models, each with benefits and drawbacks, for example: a centre of excellence, data scientists as consultants, data scientists hired directly by product teams, and the product data science model, with a data scientist in each product team who reports into a central data science team. Each organisation is different, but in my experience the product data science model works well. As the number of products and the headcount grow, the CDO needs to find a way to scale it that is right for the organisation. (Pardis Noorzad @ Medium)
- Safely Rolling Out ML Models To Production: Best practices for CI/CD of ML systems. In the CI phase, one needs to perform not only data and model validation but also test assumptions about production data and stress test the model’s operational performance. For the CD phase, shadow evaluation, A/B tests and multi-armed bandits are discussed. Cool, cool, cool. But this was the candy in the article: “While, the CI/CD paradigms address the “what” and the “how” of new models roll-out, the “when” is covered by the CT (Continuous Training) paradigm.” (Oren Razon @ Towards Data Science)
- Bringing Personalized Search to Etsy: Etsy uses historical and contextual features to personalise user search results. Historical features describe users’ shopping habits and behaviours. Contextual features use textual descriptions (title, tags) and capture which items the user has interacted with in the context of all items (using e.g. TF-IDF). When a user enters a search query, the algorithm selects the 1,000 most relevant items (ignoring the personalisation features) and subsequently ranks them using the personalised historical and contextual features. It is a nice example of an ‘80–20’ approach, where you use a rough algorithm to quickly narrow down the list of possible solutions and then adopt a finer, more sophisticated approach to accurately select the best solution from the pre-selected list. And a reminder that every improvement step is increasingly demanding. (Etsy)
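To make the rough-then-fine idea from the Etsy item concrete, here is a minimal toy sketch of two-stage search. It is not Etsy's implementation: the function names, the tiny catalogue, the blending `weight` and the choice of cosine similarity over TF-IDF vectors are all my own assumptions for illustration. Stage 1 keeps the `k` items most similar to the query and ignores the user; stage 2 re-ranks only those candidates, mixing query relevance with similarity to the items the user has interacted with.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector (a dict) for one text; unknown terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def two_stage_search(query, items, interacted_ids, k=1000, weight=0.5):
    """Hypothetical two-stage search: rough retrieval, then personalised re-ranking.

    items: dict mapping item id -> text (e.g. title and tags).
    interacted_ids: ids of items the user has interacted with before.
    weight: assumed blend between query relevance and personal affinity.
    """
    idf = build_idf(list(items.values()))
    vectors = {item_id: tfidf(text, idf) for item_id, text in items.items()}
    query_vec = tfidf(query, idf)

    # Stage 1 (rough): top-k items by query similarity, no personalisation.
    relevance = {i: cosine(query_vec, v) for i, v in vectors.items()}
    candidates = sorted(relevance, key=relevance.get, reverse=True)[:k]

    # User profile: sum of the TF-IDF vectors of previously interacted items.
    profile = {}
    for i in interacted_ids:
        for term, w in vectors[i].items():
            profile[term] = profile.get(term, 0.0) + w

    # Stage 2 (fine): re-rank only the candidates with the personalised score.
    def score(i):
        return (1 - weight) * relevance[i] + weight * cosine(vectors[i], profile)

    return sorted(candidates, key=score, reverse=True)


catalogue = {
    "a": "blue ceramic mug",
    "b": "red ceramic mug",
    "c": "blue wool scarf",
    "d": "red wool hat",
}
# A user who previously interacted with the red wool hat searches for "mug".
print(two_stage_search("mug", catalogue, interacted_ids=["d"], k=2))
```

Both mugs are equally relevant to the query, so the first stage cannot separate them; the second stage is what puts the red mug ahead for a user with a history of red items. The expensive scoring only ever touches the `k` shortlisted candidates, which is the point of the split.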
I find the data literacy topic crucial. Unfortunately, most of the materials seem skewed towards data visualisation (see e.g. thedataliteracyproject.org). Given the progress of advanced analytics, people need to accept the probabilistic nature of these solutions and not assume that they have to work 100% of the time (and disregard them when they don’t).