Keeping Up With Data — Week 31 Reading List
AI has been used extensively in the fight against Covid. It was the perfect showcase for the latest algorithms and solutions. Yet, despite hundreds of AI tools being built, none of them worked. The main reason is, as is often the case, poor data quality. Data were patched together from multiple sources like Frankenstein’s monster and labelled according to radiologists’ opinions (leading to incorporation bias) rather than a definitive result (e.g., a PCR test). For us data scientists, it is yet another reminder that data can make or break our models.
Digital twins, the data quality of the most cited data sets, and the data-ink ratio are on the menu this week.
- Seeing double? Digital twins are a useful model: The idea of a digital twin is not new, but lately plenty of new applications have been emerging. A digital twin is no longer used only for physical assets. It can model whole businesses, their divisions, or customers. It is a “virtual replica of a digital environment, or a digital replica of something in the physical world”, which provides deep insight as well as a new dimension to prescriptive analytics: what to do next. Designing a digital twin is not easy, though. One needs to keep the focus on the most important factors of the real world that is about to be turned digital. The words twin, replica and doppelganger suggest it must be a perfect copy. Don’t let that scare you off. (Mail & Guardian)
- Error-riddled data sets are warping our sense of how good AI really is: Data is arguably the most important ingredient for an AI system. A study by MIT uncovered that the top 10 most cited data sets (like ImageNet, QuickDraw, or MNIST) contain a meaningful number of wrong labels. It is important to pay attention to the data quality of these blockbuster data sets, since they are often used to train and select models for real-life use. They also provide a benchmark for the progress of the field. Both uses can be undermined by issues in the data. (MIT Technology Review)
- Little Known Ways to Make your Data Visualization Awesome: Have you ever heard of the term ‘data-ink ratio’? I haven’t. The term was coined by Edward R. Tufte, who defined data-ink as the “non-erasable core of the graphic”. He also introduced a couple of principles, but their essence is removing everything that doesn’t add anything new to the graphic. I know that data visualisation can be very subjective, but from my perspective, maximising the data-ink ratio is a great way to convey a message and win over your audience. (Mala Deep @ TDS)
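Returning to the mislabelled benchmark data sets above, here is a minimal sketch of how label errors can distort model selection. Everything in it is an illustrative assumption rather than something taken from the MIT study: the 5% error rate, the ten classes, and the two hypothetical models (A always predicts the true class, B simply reproduces the benchmark labels, mistakes included).

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def simulate(n=10_000, n_classes=10, noise_rate=0.05, seed=0):
    """Score two hypothetical models on a benchmark with label errors."""
    rng = random.Random(seed)
    classes = list(range(n_classes))

    # Ground truth, which nobody observes directly in practice.
    true_labels = [rng.choice(classes) for _ in range(n)]

    # Benchmark labels: roughly `noise_rate` of them are flipped to a wrong class.
    benchmark_labels = [
        rng.choice([c for c in classes if c != y]) if rng.random() < noise_rate else y
        for y in true_labels
    ]

    # Model A (hypothetical): always predicts the true class.
    model_a = list(true_labels)
    # Model B (hypothetical): reproduces the benchmark labels, errors included.
    model_b = list(benchmark_labels)

    return {
        name: {
            "benchmark": accuracy(preds, benchmark_labels),
            "true": accuracy(preds, true_labels),
        }
        for name, preds in (("A", model_a), ("B", model_b))
    }


if __name__ == "__main__":
    for name, scores in simulate().items():
        print(name, scores)
```

Under these assumptions, model B looks perfect on the benchmark while model A scores around 95%, yet measured against the true labels the ranking is reversed. This is an extreme toy case, but it shows why a flawed benchmark can mislead both model selection and our sense of progress.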
Two weeks ago, I wrote about JupyterLite. But there are so many Python notebooks out there for data scientists. Today, I came across a website covering twenty of the most popular ones. I knew just six of them!