Keeping up with data — Week 50 reading list

5 minutes for 5 hours’ worth of reading

Source: https://twitter.com/demartsc/status/1333815253184913410

While some people are spending their time creating a Tableau visualisation of Trevor Noah’s hoodie collection, I’m trying to achieve my cycling goal of climbing 200'000 m in 2020. It’s difficult to argue which is more beneficial for humankind!

This week’s list covers topics from data intuition to regular expressions. Here it comes.

  • Defining Data Intuition: The article defines data intuition as a resilience to misleading data and analyses. With data invading all our lives, it is important to develop a feeling for stinky data or suspicious methodology, and to decide carefully whether to trust the data and analysis at hand. I’m very pleased to read about data scientists at Mozilla designing trainings for non-data colleagues to build data intuition. It seems that the times when most data-literacy initiatives were limited to the ability to read charts are finally over. (Ryan T. Harter)
  • The Modern Data Stack: Past, Present, and Future: An exciting outlook for the modern data stack over the next five years. After a very innovative period in 2012–2016, there was a maturation phase until 2020. Are we now ready for the next big leap in the modern data stack? Are the solutions for data governance, real-time analytics, automated feedback into operations and democratised data exploration around the corner? Despite the technical-sounding title, the article is imho profoundly relevant even for top executives. (dbt blog)
  • Almost Everything You Need To Know on Data Discovery Platforms: Data discovery (aka data governance) platforms are gaining attention. With increasing volumes and sources of data, it is important not to let it all turn into one big mess. Data discovery platforms are supposed to guide you through the available data, navigate you to the data you’re looking for, and explain its meaning, source, recency and trustworthiness. So far, most of the solutions have been developed by big tech companies with urgent needs to keep their data under control. Luckily, a few of them are open source. (Eugene Yan)
  • The way we train AI is fundamentally flawed: The mismatch between training and real-life data is often cited as the reason ML models perform worse in production than in the data ‘laboratory’. This is especially tricky for models using unstructured data, like images and free text. A group of Google researchers now suggests that testing models on an unseen test set might not be enough. ML engineers need to formulate requirements for their models up front rather than just react to the models’ failures in real-life situations. (MIT Technology Review)
  • A simple intro to Regex with Python: Working on the Advent of Code reminded me how abysmal I am with regular expressions (btw. a surprisingly good resource). This intro to regex in Python has refreshed my memory and taught me new tricks too. Regex is such a powerful yet enigmatic tool in the data scientist’s toolbox. /^(\S)(?!\1)(\S)(\1\2)*$ 🤷‍♂️ (Tirthajyoti Sarkar @ Towards Data Science); see also Regular Expressions Demystified: RegEx isn’t as hard as it looks
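
For the curious, here is a minimal sketch of what the enigmatic pattern in the regex item above appears to do, tried out with Python's standard re module. It assumes the leading "/" is just a delimiter and not part of the pattern; the helper name is_alternating_pair is made up for illustration. Read this way, the pattern accepts strings built from two different non-whitespace characters in strict alternation, with an even total length.

```python
import re

# Pattern from the list above, minus the leading "/" (assumed to be a delimiter).
#   (\S)      first non-whitespace character      -> group 1
#   (?!\1)    the next character must differ from group 1
#   (\S)      second non-whitespace character     -> group 2
#   (\1\2)*   zero or more repeats of the same two characters, in order
PATTERN = re.compile(r"^(\S)(?!\1)(\S)(\1\2)*$")


def is_alternating_pair(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches the pattern."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["ab", "abab", "1212", "aa", "aba", "a b"]:
        print(f"{sample!r}: {is_alternating_pair(sample)}")
```

With this reading, "ab", "abab" and "1212" match, while "aa" (same character twice), "aba" (odd length) and "a b" (contains whitespace) do not.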

Tristan Handy’s article (#2 above) was the food for thought for me this week, mostly because it highlighted the need for a modern data stack to make data accessible and convenient for the masses. It is a call for a technical solution to a historic problem of a very non-technical nature.

Thanks for reading!

Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn and Twitter.

Data scientist with corporate, consulting and start-up experience | avid cyclist | amateur pianist | CEO & co-founder at DataDiligence.com