Keeping Up With Data — Week 27 Reading List

5 minutes for 5 hours’ worth of reading


Lately, I’ve been involved with time series forecasting. For me, this started at university with Holt-Winters and ARIMA models, but nowadays there are many more approaches to predicting future values of a time series, and many of them rely on machine learning or deep learning techniques. It’s also really easy: anyone can do it using one of the many available Python libraries. One of them, Prophet, is particularly user-friendly because it doesn’t require tweaking a lot of parameters. But just because it’s easy doesn’t mean it’s safe to use, and a lot can go wrong.

I’ve been directed to the articles below by a great ML engineer and friend of mine. Thanks to everyone who sends me reading tips every week!

  • Offline Policy Evaluation: Run fewer, better A/B tests: Companies are automating plenty of decisions using policies. A policy can be a simple rule (‘if the customer hasn’t used the product for 30 days, do …’), it can be powered by ML (‘if the propensity to leave score is above 0.9, do …’) or it can even be a complex recommender system. All these policies have parameters that can be tweaked. Any change in the parameters is often followed by an A/B test. But do we need to A/B test all the changes? Can’t we first estimate their impact using historical data? Yes, we can — using Offline Policy Evaluation methods. (Edoardo Conti @ Medium)
  • Fusing Elasticsearch with neural networks to identify data: Twitter has thousands of legacy datasets with millions of columns. How can these be quickly mapped to a taxonomy describing which columns contain, for instance, PII? Though the article is quite technical, the key message is clear. Create a high-quality training data set (manually tagged, in their case) and use it to train a machine learning model for automated tagging. Then build a feedback loop and re-train the model to keep improving it. Two things stand out for me: (1) they called the solution an ‘annotation recommendation service’, not an annotation service; and (2) they re-iterated the old adage that ‘while state-of-the-art deep learning can be impactful, applying simpler techniques in the right context can prove to be more beneficial than one often expects’. (Twitter Engineering)
  • Self-supervised learning: The dark matter of intelligence: Supervised learning is powerful, but it is limited by the need for large amounts of labeled training data. Self-supervised learning (SSL) might be the way to get around that. SSL obtains supervisory signals from the data itself, often leveraging the underlying structure in the data. Typically, it is trained to predict missing parts of the input (in text or audio). More generally, it is about assessing the compatibility of two inputs, such as a cow and a cow lying on the beach, or the beginning of a sentence and a possible ending. Because no manual labels are needed, it can learn from orders of magnitude more data than supervised learning. ‘Self-supervision is one step on the path to human-level intelligence, but there are surely many steps that lie behind this one,’ says none other than Yann LeCun. (Facebook AI)
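
To make the Offline Policy Evaluation idea from the first article a bit more concrete, here is a minimal sketch of one standard estimator, inverse propensity scoring (IPS). It is not necessarily the method the article uses. The log format, the `LoggedDecision` fields, the `threshold_policy` helper and the toy data are all assumptions made up for illustration. The sketch also assumes the historical (logging) policy chose actions at random with known probabilities, which IPS needs in order to re-weight the logged outcomes.

```python
from typing import Callable, List, NamedTuple


class LoggedDecision(NamedTuple):
    """One historical decision (hypothetical log format)."""
    score: float         # propensity-to-leave score seen at decision time
    action: str          # action the logging policy actually took
    reward: float        # observed outcome, e.g. 1.0 if the customer stayed
    logging_prob: float  # probability the logging policy gave to that action


def ips_estimate(
    logs: List[LoggedDecision],
    target_prob: Callable[[float, str], float],
) -> float:
    """Inverse propensity scoring estimate of the target policy's mean reward."""
    total = 0.0
    for d in logs:
        # Re-weight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_prob(d.score, d.action) / d.logging_prob
        total += weight * d.reward
    return total / len(logs)


def threshold_policy(threshold: float) -> Callable[[float, str], float]:
    """Deterministic rule: send an offer when the score exceeds the threshold."""
    def prob(score: float, action: str) -> float:
        chosen = "offer" if score > threshold else "no_offer"
        return 1.0 if action == chosen else 0.0
    return prob


# Toy logs: the logging policy flipped a fair coin between the two actions.
logs = [
    LoggedDecision(0.95, "offer", 1.0, 0.5),
    LoggedDecision(0.95, "no_offer", 0.0, 0.5),
    LoggedDecision(0.85, "offer", 1.0, 0.5),
    LoggedDecision(0.85, "no_offer", 0.0, 0.5),
    LoggedDecision(0.40, "offer", 1.0, 0.5),
    LoggedDecision(0.40, "no_offer", 1.0, 0.5),
]

# Compare two candidate thresholds using historical data only.
for threshold in (0.9, 0.8):
    print(threshold, ips_estimate(logs, threshold_policy(threshold)))
```

On real data the estimate can be very noisy when the logging probabilities are small, so a sketch like this is better suited to shortlisting candidate changes for an A/B test than to replacing the test.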

A cold and rainy week is ending, but the weekend’s forecast is looking good, so I can’t wait to go hiking in the Alps with the family!

Thanks for reading!

Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn and Twitter.

Data scientist with corporate, consulting and start-up experience | avid cyclist | amateur pianist | CEO & co-founder at