Transfer Learning for Time Series Forecasting and Classification

A brief history: ImageNet was first published in 2009, and over the next four years it went on to form the bedrock of most computer vision models. To this day, whether you are training a model to detect pneumonia or to classify car models, you will probably start with a model pre-trained on ImageNet or some other large, general image dataset.

More recently, papers like ELMo and BERT (2018) leveraged transfer learning to improve performance on several NLP tasks. These models create effective context-dependent representations of words, which can then be used for a variety of tasks such as question answering and named entity recognition.

Moreover, at a macro level, transfer learning has paved the way for progress in areas with limited data. It has helped to democratize deep learning by allowing research groups and companies with limited data to use it effectively. Being able to leverage transfer learning in the time series domain, where many events have only a limited temporal history, is therefore crucial.

What about time series?

Currently, there is no standard pre-trained model or central place to go for transfer learning for time series. Moreover, research on the topic is relatively sparse. A paper by Fawaz et al. discussed transfer learning for time series classification. They concluded:

“These experiments revealed that transfer learning can improve or degrade the model’s predictions depending on the dataset used for transfer.”

From this, we learn that for time series the similarity between the source and target datasets is in many ways more important than in CV or NLP. The authors then develop a technique that forms time series representations in order to find the most similar time series to use for transfer. Although this paper is an interesting initial exploration, it still leaves many questions unanswered. What about the multivariate time series case (the authors only look at the univariate case)? Would a different architecture help facilitate transfer even among dissimilar time series? A few other articles explore limited cases where transfer can be effective for time series; however, none propose a general framework for transfer learning, particularly in the multivariate case.

How transfer learning works in other domains

Before diving into the challenges of transfer learning for time series forecasting, let’s look at how it works in other domains. In computer vision, transfer learning generally works because the model learns in a hierarchical fashion: the “earlier” layers learn more general patterns (e.g. shapes, outlines, edges), whereas the later layers learn more task-specific features (the whiskers on a cat or the shape of a car’s headlights). This property has led to success even when models pre-trained on ImageNet are used to help with medical diagnosis and staging.

This has generally held true in NLP as well; however, it required a different architecture. Specifically, models like BERT and ELMo paved the way for transfer learning on sequence data. The transformer architecture in particular has worked well for transfer learning. It stands to reason that the same would likely hold true for other sequence problems such as time series.

Time series forecasting specific challenges

There are several core challenges that are specific to time series forecasting. The biggest one is that with time series it is harder to find a useful hierarchy or set of intermediate representations that generalize across different problems. We do have certain components that people traditionally decompose time series into, such as seasonality, trend, and remainder. However, developing a model that effectively learns decoupled intermediate representations of these remains elusive. The authors of “Reconstruction and Regression Loss for Time-Series Transfer Learning” explore a specialized loss function that helps to facilitate positive transfer through a decoupling process. They propose using an initial model to extract general features (in conjunction with the reconstruction loss) before using a time-series-specific model for forecasting. This technique seems to help improve performance, though the paper is limited to the univariate time series forecasting use case.
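To make the seasonality, trend, and remainder components mentioned above concrete, here is a minimal sketch of a classical additive decomposition using a centered moving average. It is not the method from the paper cited above and is not part of any library; the function name `decompose_additive` and the toy series are assumptions made purely for illustration.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and remainder parts (additive model)."""
    n = len(series)
    half = period // 2

    # Trend: centered moving average; None where the window does not fit.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: the window has period + 1 points, so the two
            # end points each get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
            trend[t] = total / period
        else:
            trend[t] = mean(window)

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so that the seasonal effects average to zero.
    detrended = [(t % period, series[t] - trend[t])
                 for t in range(n) if trend[t] is not None]
    by_phase = [mean(v for p, v in detrended if p == phase)
                for phase in range(period)]
    offset = mean(by_phase)
    seasonal = [by_phase[t % period] - offset for t in range(n)]

    # Remainder: whatever trend and seasonality do not explain.
    remainder = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder


# Toy series (hypothetical): a linear trend plus a repeating 4-step pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [0.5 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, remainder = decompose_additive(series, period=4)
```

A hand-written decomposition like this is fixed in advance. The open problem described above is getting a neural model to learn comparably decoupled representations on its own, so that they can be reused across datasets.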

A second challenge, specific to multivariate time series forecasting, is that different problems often have a different number of feature time series. For example, for COVID-19 we might have mobility data (3 features), new infections (1 feature), and weather (3 features), for a total of 7 features. However, for something like flu forecasting we might only have new infections and weather data, for a total of 4 features (e.g. mobility data was not collected for the flu). In our experiments we have generally found it helpful to use a model-specific initial “embedding_layer” followed by transferable middle layers.
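The idea of a dataset-specific input layer feeding shared middle layers can be sketched in plain Python. This is only an illustration under stated assumptions: `make_embedding`, `embed`, `D_MODEL`, and the sample values are hypothetical and are not taken from Flow Forecast, where the embedding layer would be an ordinary learned linear layer.

```python
import random


def make_embedding(n_features, d_model, seed=0):
    """Create a dataset-specific linear layer mapping n_features -> d_model."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
              for _ in range(d_model)]
    bias = [0.0] * d_model
    return weight, bias


def embed(step, layer):
    """Project one time step (a list of n_features values) to d_model values."""
    weight, bias = layer
    if len(step) != len(weight[0]):
        raise ValueError("feature count does not match this embedding layer")
    return [sum(w * x for w, x in zip(row, step)) + b
            for row, b in zip(weight, bias)]


D_MODEL = 8  # width expected by the shared (transferable) middle layers

# One embedding per dataset, because the feature counts differ.
covid_embedding = make_embedding(n_features=7, d_model=D_MODEL)  # mobility + infections + weather
flu_embedding = make_embedding(n_features=4, d_model=D_MODEL)    # infections + weather

covid_step = [0.2, 0.1, 0.4, 120.0, 15.0, 0.6, 3.0]  # made-up values
flu_step = [35.0, 12.0, 0.7, 2.5]                    # made-up values

covid_hidden = embed(covid_step, covid_embedding)
flu_hidden = embed(flu_step, flu_embedding)
# Both vectors have D_MODEL entries, so the same middle layers can consume either.
```

Because every dataset is projected to the same width, only the small embedding layer has to be re-initialized when moving to a dataset with a different number of features.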

Figure: Our modified transformer architecture for time series transfer learning. The transformer layers are transferable, whereas the initial embedding layer is specific to each dataset. For more information on these experiments, see our Weights and Biases report.

How to use Flow Forecast for transfer learning

Flow Forecast is an open-source deep learning framework for time series.

To facilitate transfer learning for time series forecasting, Flow Forecast has several features that make it easy to pre-train time series models and to reuse them. In the model parameters section you can use a parameter called excluded_layers. When you load a pre-trained model, the weights for these layers are not loaded; instead, the layers are instantiated fresh (if they exist in the new model at all).

"excluded_layers":["embedding_layer.weight", "embedding_layer.bias", "dense_shape.weight", "dense_shape.bias"]

This makes it easy to leverage weights from models where several layers might not be present or might not match the shape. See this notebook for a real world example.
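As a rough sketch of what excluding layers amounts to, the snippet below filters a pre-trained weight dictionary before merging it into a freshly initialized one. Plain dictionaries of lists stand in for PyTorch state dicts, and `filter_pretrained_weights` and the `encoder.*` layer names are hypothetical; this is not Flow Forecast's actual loading code.

```python
def filter_pretrained_weights(pretrained_state, target_state, excluded_layers):
    """Return the pre-trained entries that should be copied into the new model."""
    excluded = set(excluded_layers)
    kept = {}
    for name, values in pretrained_state.items():
        if name in excluded:
            continue  # explicitly excluded: keep the new model's fresh weights
        if name not in target_state:
            continue  # layer does not exist in the new model
        kept[name] = values
    return kept


# Pre-trained model: 7 input features. New model: 4 input features.
pretrained_state = {
    "embedding_layer.weight": [[0.1] * 7 for _ in range(2)],
    "embedding_layer.bias": [0.0, 0.0],
    "encoder.weight": [[0.5, -0.5], [0.25, 0.75]],
    "encoder.bias": [0.1, 0.2],
    "dense_shape.weight": [[1.0, 1.0]],
    "dense_shape.bias": [0.0],
}
target_state = {
    "embedding_layer.weight": [[0.0] * 4 for _ in range(2)],
    "embedding_layer.bias": [0.0, 0.0],
    "encoder.weight": [[0.0, 0.0], [0.0, 0.0]],
    "encoder.bias": [0.0, 0.0],
    "dense_shape.weight": [[0.0, 0.0]],
    "dense_shape.bias": [0.0],
}
excluded_layers = ["embedding_layer.weight", "embedding_layer.bias",
                   "dense_shape.weight", "dense_shape.bias"]

transferred = filter_pretrained_weights(pretrained_state, target_state, excluded_layers)
# Fresh weights everywhere, overwritten only by the transferred entries.
new_state = {**target_state, **transferred}
```

With real PyTorch modules, a filtered dictionary like `transferred` could be passed to `model.load_state_dict(transferred, strict=False)`, which tolerates the keys that were left out.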

Secondly, Flow Forecast makes it easy to track prior pre-training datasets, so you can keep a complete history of the other time series data your model was trained on. This can help in finding the best pre-training datasets. To include this in your config, you would typically just add a running list as a parameter (see full example here).

the_config3 = { "model_name": "DARNN", "pretrained_rivers": pretrained, "model_type": "PyTorch", "early_stopping":{"patience":3
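One way to maintain such a running list is sketched below. The helper `with_pretraining_history` and the dataset names are hypothetical and not part of Flow Forecast; only the "pretrained_rivers" key comes from the config fragment above.

```python
import copy


def with_pretraining_history(config, dataset_name, history_key="pretrained_rivers"):
    """Return a copy of config whose history list also records dataset_name."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    history = list(updated.get(history_key, []))
    if dataset_name not in history:
        history.append(dataset_name)
    updated[history_key] = history
    return updated


base_config = {"model_name": "DARNN", "model_type": "PyTorch", "pretrained_rivers": []}

# Each pre-training run appends the dataset it used (names are made up).
after_first = with_pretraining_history(base_config, "river_gauge_a")
after_second = with_pretraining_history(after_first, "river_gauge_b")
```

Storing the list in the config means it is logged with every run, so the pre-training lineage of any saved model can be read back later.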

Finally, we are working on additional features, such as making it easy to apply different learning rates and selective freezing to different layers, and designing auto-encoder modules to find the most similar temporal datasets. We consider easy transfer learning a first-class feature of our framework and prioritize it highly.

What we have found in our research and in using Flow Forecast

So far we have found generalized transfer learning useful for small datasets like our COVID-19 forecasts. We haven’t tested it extensively enough on large datasets to draw conclusions at this point. We also believe transfer learning can be very effective when it comes to incorporating meta-data into forecasts. For example, models need to see many different types of meta-data (more on this in another article) and temporal data to learn how to merge them effectively. We also designed a sort of transfer learning protocol in which we first sweep to find the best static hyper-parameters. We then pre-train the model with similar parameters (e.g. forecast length, number of layers) before running a final hyper-parameter sweep on the non-static parameters (e.g. batch size, learning rate).
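The three-step protocol described above can be sketched as follows. Everything here is an assumption made for illustration: `transfer_protocol`, `best_params`, the search spaces, and especially `toy_evaluate` and `toy_pretrain`, which are stand-ins for real training and evaluation runs rather than anything from Flow Forecast.

```python
from itertools import product


def grid(space):
    """Yield every combination of a {name: [values]} search space as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def best_params(space, fixed, evaluate):
    """Return the combination from space with the lowest evaluate({**fixed, **combo})."""
    return min(grid(space), key=lambda combo: evaluate({**fixed, **combo}))


def transfer_protocol(static_space, tunable_space, default_tunable, evaluate, pretrain):
    # Step 1: sweep the static parameters on the target data.
    static = best_params(static_space, default_tunable, evaluate)
    # Step 2: pre-train a model that shares those static parameters.
    weights = pretrain(static)
    # Step 3: sweep the non-static parameters, starting from the pre-trained weights.
    fixed = {**static, "pretrained_weights": weights}
    tunable = best_params(tunable_space, fixed, evaluate)
    return {**static, **tunable}, weights


def toy_evaluate(params):
    """Stand-in for a validation loss; a real run would train and evaluate a model."""
    loss = abs(params["n_layers"] - 2) + abs(params["forecast_length"] - 7)
    loss += abs(params["learning_rate"] - 0.01) * 100
    loss += abs(params["batch_size"] - 32) / 32
    return loss


def toy_pretrain(static):
    """Stand-in for pre-training; returns a placeholder for the learned weights."""
    return {"trained_with": dict(static)}


static_space = {"n_layers": [1, 2, 3], "forecast_length": [7, 14]}
tunable_space = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32]}
default_tunable = {"learning_rate": 0.01, "batch_size": 32}

params, weights = transfer_protocol(
    static_space, tunable_space, default_tunable, toy_evaluate, toy_pretrain
)
```

Fixing the static parameters before pre-training matters because they determine the shapes of the transferable layers; the non-static parameters can be tuned afterwards without invalidating the pre-trained weights.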


Conclusion

Transfer learning for time series has seen some limited progress; however, it has not been widely used. This is likely due to problems with differing numbers of features, the usefulness of intermediate representations, and differences in seasonality (which can lead to more negative transfer). With Flow Forecast we aim to provide easy-to-use modules that make it simple to leverage transfer learning successfully in the temporal domain. We believe transfer learning will come to play a larger role in time series.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.
