Keeping Up With Data #63
5 minutes for 5 hours’ worth of reading
Realising how much it matters to have a digital twin of the whole business was one of my key takeaways of 2021. Every company faces many problems (or opportunities), and data can often help solve them. However, not just any data will do. It has to be data that is representative of the business and its problems: the digital twin.
The digital twin can be used as the main ingredient when solving a business problem (see Use Cases in the image above) through smart ML features and accurate ML models. And if the digital twin is truly representative, the ML solution will be applicable to the real world.
Therefore, all businesses should strive for a representative digital twin, right?
Anyway, the final reading list of 2021 covers feature engineering, ML models, and the whole data-features-use case value topology.
- Feature Engineering on Time-Series Data for Human Activity Recognition: As data scientists, we often work with time series. We might be forecasting sales using past sales or classifying movement based on accelerometer data. When working on such tasks, we might want to engineer features for the machine learning models to maximise their performance. There are many sophisticated transformations, and even algorithms that don’t require advanced feature engineering, but if you are building your own feature space and are looking for inspiration, read this article. It covers statistical measures, the Fast Fourier transform, and indices. But please note that before the author started creating dozens of features, a diligent exploratory data analysis was performed. (Pratik Nabriya @ TDS)
- Forecasting with trees: The M5 forecasting challenge (see here and here) is the fifth edition of a competition, this time on forecasting daily sales of products at a few Walmart locations. As the paper points out, the M3 competition was ruled by classical forecasting techniques, M4 by deep learning, and the latest edition, M5, has been dominated by approaches based on gradient boosted trees. Why are these approaches, often in the form of LightGBM or XGBoost implementations, so successful in this competition? Read the paper by Amazon researchers to find the answer as well as possibilities for model extensions. Are you dealing with sales forecasting? Then this is a ‘must read’ for you. (Tim Januschowski et al. @ International Journal of Forecasting)
- Features Part 2: Clarifying the Data-Features-Use Case Value Topology: Bill Schmarzo explains the links between raw data, curated data, ML features, ML models, use cases (which impact business KPIs), and economic value creation. What imho makes the article powerful is the clear articulation of the following two points: (1) use cases are dependent on data; and (2) there is a whole value chain between data and use cases. What Bill calls ‘curated data’, I call a digital twin. I believe that having a digital twin that is reflective and representative of the business is incredibly powerful. Such a digital twin can be used to solve use cases targeting critical business problems and opportunities through relevant ML features and ML models. And at the same time, material imperfections of the digital twin dictate the need for new data sources. (Data Science Central)
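The first item in the list above mentions statistical measures and the Fast Fourier transform as sources of time-series features. As a rough illustration (this is not the code from the linked article), here is a minimal sketch of per-window feature extraction using only the Python standard library. The function names, the window size, and the synthetic 2 Hz signal are all assumptions made for the example, and a naive discrete Fourier transform stands in for a real FFT routine.

```python
import cmath
import math
import statistics


def dft_magnitudes(window):
    """Naive O(n^2) discrete Fourier transform.

    Returns the magnitudes for frequency bins 0..n//2.
    A real project would use an FFT library instead.
    """
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(window)
        )
        mags.append(abs(coeff))
    return mags


def window_features(window, sample_rate_hz):
    """Compute a few statistical and frequency-domain features for one window."""
    mags = dft_magnitudes(window)
    # Skip bin 0: it is the DC component, i.e. proportional to the mean.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return {
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),
        "min": min(window),
        "max": max(window),
        "dominant_freq_hz": peak_bin * sample_rate_hz / len(window),
    }


def sliding_windows(signal, size, step):
    """Yield fixed-size windows over the signal, moving by `step` samples."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


# Hypothetical input: a 2 Hz sine wave sampled at 50 Hz for 4 seconds,
# standing in for one accelerometer axis.
sample_rate_hz = 50
signal = [math.sin(2 * math.pi * 2.0 * t / sample_rate_hz) for t in range(200)]

# 2-second windows with 50% overlap.
features = [
    window_features(w, sample_rate_hz)
    for w in sliding_windows(signal, size=100, step=50)
]
```

Each dictionary in `features` becomes one row of the feature table that a classifier or a tree-based forecaster could consume. The window length and overlap are tuning choices, which is one more reason to do the exploratory analysis first.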
And that’s it for 2021. Happy 2022!
Here’s to representative digital twins and realised economic value of data! 🥂