Using classic k-fold CV to evaluate an ML model on time series data can produce incorrect results due to look-ahead bias. Instead, consider walk-forward validation, which is (probably) the safest evaluation method for time series data. However, the optimal choice will depend on your data and model (you may want to check the article below).
Data scientists cross-validate everything. This is just the natural order of things. Unless you are dealing with neural networks, which are notoriously difficult to fit even once (let alone k times), the chances are that you use CV 10 times out of 10.
The good old k-fold CV is great. It's intuitive. It has a solid 50-year track record. It's implemented in every mainstream language. It's backed up by theory: it's guaranteed to give a consistent estimate of an out-of-sample error with lower variance than a single train-test split. But there is a catch: most theoretical results regarding CV assume that our data is i.i.d. (for example, see [Bayle2020]). What if you're dealing with time series data?
Time series data is not i.i.d. Data samples from different time periods may produce different model weights during training and show different model errors during testing. In particular, if you train on data from a later period ("the future") to evaluate model performance on data from an earlier period ("the present"), you may see an optimistic (= unrealistically small) out-of-sample error estimate. This phenomenon is called look-ahead bias. Classic k-fold CV is prone to this problem.
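To see the leak concretely, here is a minimal sketch (using scikit-learn's `KFold`, which I'm assuming for illustration) that checks, for each fold of a classic shuffled k-fold split, whether the training set contains indices later than the validation set:

```python
import numpy as np
from sklearn.model_selection import KFold

# 12 observations, ordered in time (index = time step)
X = np.arange(12).reshape(-1, 1)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
leaky_folds = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # With shuffling, the training set contains indices *later* than some
    # validation indices, i.e. the model gets to peek into the "future".
    leaks = bool(train_idx.max() > val_idx.min())
    leaky_folds.append(leaks)
    print(f"fold {fold}: validate on {sorted(val_idx.tolist())}, future in train: {leaks}")
```

For almost every fold, the model trains on data from after the validation period, which is exactly the look-ahead setup described above.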
But does the future always hide something of particular value? In other words, when does look-ahead bias actually matter? To answer that, we need to dive deeper into the world of probability and statistics.
A data generation process is usually assumed to follow some sort of rule regarding how the data distribution can change over time. Without this assumption, we have no hope of drawing any meaningful conclusions about the future. Imagine that a vitamin D supplement that boosts a patient's immune system can suddenly become completely ineffective with no observable change in the patient's biochemistry, their intake of other drugs, their environment, etc. In this case, it would be impossible to accurately evaluate its impact on the patient's health today and determine the optimal daily dosage for the next month.
One common assumption is that a data generation process is (weakly) stationary - i.e., its mean is constant over time and the covariance between two data points depends only on the time interval between them. Speaking broadly, a stationary process can only evolve over time in a very limited way. Common examples are white noise and the ARMA process.
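For intuition, a weakly stationary process is easy to simulate. A minimal numpy sketch of an AR(1) process (the parameter value 0.7 is purely illustrative; any |phi| < 1 gives a stationary process):

```python
import numpy as np

rng = np.random.default_rng(42)
phi, n = 0.7, 5000  # |phi| < 1  ->  weakly stationary AR(1)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# For a stationary process, the autocovariance depends only on the lag,
# not on t; the lag-1 autocorrelation should be close to phi.
lag1_corr = np.corrcoef(x[1:], x[:-1])[0, 1]
print(lag1_corr)
```

No matter which window of this series you look at, its statistical behavior is the same, which is precisely what makes error estimates from different folds comparable.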
There are many theoretical results related to such processes. In particular, stationary processes seem to be less affected by look-ahead bias. [Bergmeir2012] numerically compares different model error evaluation methods on stationary data and concludes that k-fold CV is a viable option to use for out-of-sample error evaluation and hyper-parameter tuning of ML models. [Bergmeir2018] theoretically proves that this is the case for autoregressive models.
On the other hand, [Schnaubelt2019] numerically demonstrates that k-fold CV can be an inadequate choice for non-stationary processes. Okay, so what can we use instead?
Note: Sometimes we can transform data to make a process stationary. Two common cases are removal of seasonality and trend, which we won't cover here.
Surprise, surprise: I wasn't the first to think about the potential issue of look-ahead bias for time series data! People invented a bunch of different evaluation methods. Here are some of the most popular ones:
All these methods split time series data without shuffling. Where applicable, the data points before and after a split point are referred to as the "past" and the "future", respectively.
1) Single split (a.k.a. last block validation) divides the time series into two parts: the "past" (used for training) and the "future" (used for validation).
2.a) Walk-forward validation makes several splits of the time series into the "past" and the "future", progressively moving the split point from earlier to later dates. The most "classical" approach uses an expanding window for the training set and a fixed-size window for the validation set, but other schemes are used as well. Unlike CV, WF validation never uses data from the "future" for training. There are several modifications of WF validation: see, for example, [Tashman2000].
2.b) Moving-origin forward validation is a modification of WF validation that uses an expanding window for both training and validation sets (i.e., all data is used in every split).
Pros (w.r.t. WF validation):
Cons (w.r.t. WF validation):
3.a) Blocked k-fold CV is almost like classic k-fold CV, but the data is not shuffled. Instead, it is split into k contiguous blocks. One block is held out for validation while the rest are used for training. This allows us to partially preserve the sequential nature of time series.
3.b) Combinatorial CV (the term coined in the book [Lopez2018]) is a more "sophisticated" cousin of blocked k-fold CV. Like before, the data is split into k contiguous blocks, but p blocks (1≤p≤k) can be held out for testing at once. This produces more data splits, which may be useful for algorithmic strategy backtesting, as we can generate several backtest paths at once.
Pros (w.r.t. blocked k-fold CV):
Cons (w.r.t. blocked k-fold CV):
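The two blocked schemes above boil down to how the held-out blocks are chosen, so they can be sketched as simple index generators (the function names below are my own, not a library API):

```python
import numpy as np
from itertools import combinations

def blocked_kfold(n_samples, k):
    """Split [0, n) into k contiguous blocks; hold out one block per split."""
    blocks = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val = blocks[i]
        train = np.concatenate([b for j, b in enumerate(blocks) if j != i])
        yield train, val

def combinatorial_cv(n_samples, k, p):
    """Hold out every combination of p out of the k contiguous blocks."""
    blocks = np.array_split(np.arange(n_samples), k)
    for held_out in combinations(range(k), p):
        val = np.concatenate([blocks[j] for j in held_out])
        train = np.concatenate([b for j, b in enumerate(blocks) if j not in held_out])
        yield train, val

# Blocked k-fold is just the p = 1 special case of combinatorial CV.
n_paths = sum(1 for _ in combinatorial_cv(100, k=5, p=2))
print(n_paths)  # C(5, 2) = 10 splits
```

Note that neither generator ever shuffles within a block, so the sequential structure of the series is preserved inside every block.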
Good, we have plenty of validation methods to choose from. So, which one should we pick? Well, we could rely on conclusions from the papers referenced above or... we could test them ourselves! I made a Jupyter notebook comparing different validation methods on synthetic time series data. I tested linear and non-linear models on stationary and non-stationary time series. Since I generated the data myself, I could run multiple simulations to obtain the mean and standard deviation of the model error estimates.
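The skeleton of such a study is short. A minimal sketch (not the notebook itself): simulate many AR(1) paths, evaluate a one-step-ahead Linear Regression with WF validation (via scikit-learn's `TimeSeriesSplit`), and aggregate the per-fold MAE across simulations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

def simulate_ar1(n, phi=0.7):
    """Illustrative stationary AR(1) path with unit-variance noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

n_sims, n_splits = 50, 5
maes = np.zeros((n_sims, n_splits))
for s in range(n_sims):
    x = simulate_ar1(500)
    X, y = x[:-1].reshape(-1, 1), x[1:]  # predict the next value from the last one
    for f, (tr, va) in enumerate(TimeSeriesSplit(n_splits=n_splits).split(X)):
        model = LinearRegression().fit(X[tr], y[tr])
        maes[s, f] = mean_absolute_error(y[va], model.predict(X[va]))

# mean and standard deviation of the MAE estimate, per fold
print(maes.mean(axis=0).round(3))
print(maes.std(axis=0).round(3))
```

Swapping `TimeSeriesSplit` for one of the blocked splitters, or the AR(1) generator for a drifting process, gives the other cells of the comparison grid.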
Here are some of the findings.
1) For stationary processes, blocked k-fold CV is generally superior to WF validation, showing smaller bias and variance for earlier folds. The images below show estimated model MAE (= mean absolute error) by fold using CV (left) and WF validation (right) for Linear Regression on an ARMA time series.
2) For non-stationary processes with a stochastic mean drift, blocked k-fold CV produces an interesting result: the error estimate is lower in the inner folds. This effect is especially noticeable for models with an intercept (this includes Gradient Boosted Trees, which have an implicit "intercept" term). Intuitively, the easiest fold to predict (if any) should be either the first one (because our pesky drift has impacted it the least) or the last one (thanks to the longest available history to learn from). So why the middle? The reason is that the intercept term can absorb the information about the realized drift more efficiently in the middle folds, where prediction becomes a problem of interpolation rather than extrapolation. Hence, we get an optimistic bias in the model error estimate. The images below show estimated model MAE by fold using CV for Linear Regression with an intercept (left) and Gradient Boosted Trees (right) on an ARMA time series with a stochastic trend.
3) Moving-origin forward validation often has a significantly higher model error estimate variance for earlier folds compared to WF validation. Therefore, its usability is doubtful. The images below show estimated model MAE by fold using moving-origin forward validation (left) and WF validation (right) for Linear Regression with an intercept on an ARMA time series with a stochastic trend.
4) Combinatorial CV produces either exactly the same or very similar results compared to blocked k-fold CV. So you can use it instead of blocked k-fold CV if you want to see multiple backtest paths for your trading strategy.
There is one more thing to consider. Let's take MSE as our model error. Then the expected squared difference between the real MSE and our estimate (for example, computed via CV) decomposes as

E[(Err_est − Err_real)²] = Bias(Err_est)² + Var(Err_est)

Look-ahead bias contributes to the bias term. Now, why do you need to estimate out-of-sample model error? If the goal is...
Thus, it may be okay to have a look-ahead bias for model hyper-parameter tuning. There, we can focus on the variance of our estimate. Concerning our empirical study, this means that CV might be a preferable method for hyper-parameter tuning even when we see that the error estimate in the inner folds is optimistic. However, as [Schnaubelt2019] shows, this is not always the case.
Dealing with time series data is tough. If you have a limited amount of data and don't use it efficiently, you risk getting a very volatile estimate. On the other hand, if you aren't careful you can be impacted by look-ahead bias and get a rather optimistic estimate. So selecting the right evaluation method is essential. My default method is WF validation, which should be a solid choice for most situations.
Still, the optimal choice depends on your model and data. Here is my humble advice based on the findings above:
Keep in mind that evaluation is only one of the many parts of your data analysis. If you strive for robust and realistic results, you may want to check...
That's it for today! Evaluate your models responsibly. And if you have any comments or questions, please don't hesitate to drop me a message.
I strongly suspect that this will never be done. Yet, it'd be good to expand my experimental study and...