Data Science in a Post-Corona World

The recent Corona pandemic is a world-wide catastrophe, impacting every aspect of our lives. As a PhD student in Statistics, I want to share some thoughts on how this crisis could substantially change the way we do Data Science in a post-Corona world. While these aspects are minuscule in comparison to the humanitarian crisis we are facing, I nevertheless want to add this fragment to the many ways in which COVID-19 influences us.

Data Science has two main goals. First, we want to explain the past data we have observed: Why are people switching from product A to product B? How are the votes in the recent election and the Brexit referendum related? All these questions relate to the data we have seen. The second goal of Data Science is to predict the future: How many units of product A will be sold next week? How will people vote in the next election? Here, we utilise the data we have observed to make “educated guesses” about the future.

Usually, the more data is available, the better our predictions become. This mechanism works well when the future is rather similar to the past. However, we may observe some events that don’t fit in – so-called “outliers”. For example, when looking at stock return data, the financial crisis of 2007/08 is clearly visible. However, despite this big shock, most stocks returned to their normal levels a year or two later.

Return of the Dow Jones index. Picture generated from Yahoo Finance.

Life will not be the same after Corona, and neither will Data Science.

Virtually any data is influenced by the current Corona pandemic. The stock market is hit heavily. Unemployment rates sky-rocket. Sales of most products go down, while they soar for toilet paper and pasta. CO2 emissions are at a historic low. The impact of the pandemic on any data collection gives rise to new challenges for Data Scientists. Corona changes how humans buy, eat, work – well, live. These changes don’t just last for a few days, but for a substantial amount of time. As restrictions will only be lifted gradually, it might take a while until we are “back to normal”.

Besides, we don’t even know if and how that “normal” will be similar to what we were used to before 2020. Maybe we’ll spend more time with our family and friends instead of going on the fifth city break this year. Maybe we will buy more regional products, having realised how important a strong local economy is. While this is only speculative, it is essential to realise that we just don’t know what the future will look like compared to a pre-COVID past.

Both issues, a long abnormal period and a shift of what is “normal”, constitute special challenges for Data Science. From now on, anyone doing data analysis will have to deal with those questions. We can’t just exclude the first half of 2020, as it would leave too big a gap in our analysis. Additionally, past data might not be particularly useful to predict the future anyway. For example, we wouldn’t use data from 2010 to predict today’s internet usage. In the same spirit, data collected before the pandemic might not be that informative for predicting the post-Corona world.

Luckily, statisticians are not new to these questions.

So-called “Change Point Models” are a well-established method focusing on spotting changes in a time series and estimating what happens before and after these change points. From now on, any time series model has to work with data influenced by the Corona pandemic, and Change Point Models are an elegant way of dealing with the resulting issues.
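To make the idea a bit more concrete, here is a minimal sketch of the simplest possible Change Point Model: a single shift in the mean of a series, located by least squares. Everything in it is an assumption made for illustration. The function name find_single_change_point, the simulated "weekly sales" numbers and the size of the drop are all made up, and real applications would typically allow for several change points, trends and seasonality.

```python
import random
from statistics import fmean


def squared_error(segment):
    """Sum of squared deviations of a segment from its own mean."""
    m = fmean(segment)
    return sum((x - m) ** 2 for x in segment)


def find_single_change_point(series):
    """Find the split index k that best separates the series into two
    segments with different means (least-squares criterion).

    Returns (k, mean_before, mean_after), where series[:k] is the
    'before' segment and series[k:] is the 'after' segment.
    """
    y = list(series)
    if len(y) < 2:
        raise ValueError("need at least two observations")
    # Try every possible split and keep the one with the smallest total error.
    best_k = min(
        range(1, len(y)),
        key=lambda k: squared_error(y[:k]) + squared_error(y[k:]),
    )
    return best_k, fmean(y[:best_k]), fmean(y[best_k:])


# Hypothetical example: 60 "normal" weeks followed by 40 weeks at a lower level.
rng = random.Random(0)
sales = [rng.gauss(100, 5) for _ in range(60)] + [rng.gauss(70, 5) for _ in range(40)]

k, mean_before, mean_after = find_single_change_point(sales)
print(f"estimated change point at week {k}")
print(f"mean before: {mean_before:.1f}, mean after: {mean_after:.1f}")
```

The sketch estimates both where the change happened and what the level was on either side, which is exactly the "before and after" description above. It does not say anything about why the change occurred.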

This relates both to explaining past data and predicting new data. Change Point Models could do a good job at capturing this extraordinary period. However, predictions as to what the future holds could be less certain, as past data (i.e. pre-Corona) becomes less relevant. Any predictions we make will be based on less – or less relevant – data.
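A rough back-of-the-envelope calculation shows why fewer relevant observations mean more uncertainty. Assuming independent observations with a known noise level (both simplifying assumptions, and all numbers below are made up), the standard error of an estimated mean shrinks with the square root of the number of observations, so discarding most of the history widens the margin considerably.

```python
import math


def standard_error_of_mean(std_dev, n_obs):
    """Standard error of a sample mean for n_obs independent observations."""
    if n_obs < 1:
        raise ValueError("need at least one observation")
    return std_dev / math.sqrt(n_obs)


noise = 5.0  # assumed standard deviation of a single observation

# Hypothetical scenarios: five years of weekly data vs. ten weeks after a change.
for label, n in [("full history", 250), ("post-change data only", 10)]:
    margin = 1.96 * standard_error_of_mean(noise, n)  # approximate 95% margin
    print(f"{label}: n = {n}, margin of error = +/- {margin:.2f}")
```

With 25 times fewer observations, the margin is five times wider. Real forecasts involve more than an estimated mean, but the same logic applies.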

The Corona crisis substantially changes the way we analyse data, as virtually any data set will be affected by the pandemic. Fortunately, statisticians have already developed tools to deal with some of these issues. At the same time, older historical data might become worthless if it is no longer relevant for what the future holds. Generally, this leads to more uncertainty in our predictions. Back in 2017, The Economist proclaimed that data would be the new oil. Well, relevant data might be the new diamond – and data scientists its diamond cutters.