
Although unsupervised pre-training on time series data has been widely explored, yielding breakthroughs based on masked modeling (Zerveas et al., 2021) and contrastive learning (Woo et al., 2022), fundamental issues remain unsolved in the development of LTSMs. Firstly, the dataset
infrastructure and unified treatment for heterogeneous time
series are lagging behind other fields. As a result, prior unsu-
pervised pre-training methods are typically constrained to a
small scale and primarily focus on in-dataset transfer (Zhang
et al., 2022; Nie et al., 2022). Secondly, the architecture of
scalable large models remains underexplored in the field of
time series. It is observed that non-autoregressive structures,
which are prevalent and effective in small time series models,
may not be suitable for LTSMs. Thirdly, existing large-scale pre-trained models (Woo et al., 2023; Das et al., 2023b) have primarily concentrated on a single task (e.g., forecasting) and scarcely addressed task unification. Consequently, the applicability of LTSMs remains limited.
In this paper, we dive into the pre-training and adaptation of large time series models. By aggregating publicly available time series datasets and applying curated data processing, we construct the Unified Time Series Dataset (UTSD) with hierarchical capacities to facilitate research on the scalability of LTSMs. To pre-train large models on heterogeneous time series data, we propose the single-series sequence (S3) format, which converts multivariate series into unified token sequences while preserving their patterns. For better generalization and versatility, we adopt the GPT-style objective that predicts the next token (Bengio et al., 2000). Finally, we present Timer, a large-scale pre-trained Time Series Transformer.
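To make the pre-training setup concrete, the following is a minimal sketch of the S3 format and the next-token objective; the notation (token length $S$, token count $N$, model $f_\theta$) is introduced here for illustration rather than taken from a formal definition. A multivariate series is split channel-wise into univariate series, each of which is normalized and segmented into $N$ non-overlapping tokens of length $S$:
$$\mathbf{x} = (s_1, \dots, s_N), \qquad s_i \in \mathbb{R}^{S}.$$
The model is then supervised at every position to predict the next token from its preceding context, e.g. with a per-token regression loss
$$\mathcal{L} = \frac{1}{N-1} \sum_{i=1}^{N-1} \big\| f_\theta(s_1, \dots, s_i) - s_{i+1} \big\|_2^2 .$$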
Unlike prevalent encoder-only architectures (Nie et al., 2022; Wu et al., 2022; Das et al., 2023a), Timer exhibits characteristics similar to large language models, such as flexible context length and autoregressive generation. It also presents
notable few-shot generalization, scalability, and task gen-
erality, outperforming state-of-the-art task-specific models
on forecasting, imputation, and anomaly detection. Overall,
our contributions can be summarized as follows:
• We delve into LTSM development by curating large-scale datasets comprising 1B time points, proposing a unified sequence format to cope with data heterogeneity, and presenting Timer, a generative pre-trained Transformer for general time series analysis.
• We apply Timer to various tasks, all realized within our unified generative approach. Timer exhibits notable feasibility and generalization in each task, achieving state-of-the-art performance with only a few samples.
• By pre-training on increasingly available time series data, Timer exhibits zero-shot forecasting capability. Quantitative evaluations and quality assessments are provided in comparison with concurrent large time series models.
2. Related Work
2.1. Unsupervised Pre-training on Sequences
Unsupervised pre-training on large-scale data is an essential step toward modality understanding for downstream applications, and it has achieved substantial success on sequential modalities, covering natural language (Radford et al., 2021), patch-level images (Bao et al., 2021), and video (Yan et al., 2021). Supported by powerful backbones (Vaswani et al., 2017) for sequential modeling, the paradigms of unsupervised pre-training on sequences have been extensively studied in recent years, and can be categorized into masked modeling (Devlin et al., 2018), contrastive learning (Chen et al., 2020), and generative pre-training (Radford et al., 2018).
Inspired by significant progress achieved in relevant fields,
masked modeling and contrastive learning have been well-
developed for time series. TST (Zerveas et al., 2021) and PatchTST (Nie et al., 2022) adopt BERT-style masked pre-training to reconstruct masked time points and patches, respectively. LaST (Wang et al., 2022b) proposes to learn representations of decomposed time series based on variational inference. Contrastive learning is also well incorporated in prior works (Woo et al., 2022; Yue et al., 2022). TF-C (Zhang et al., 2022) enforces time-frequency consistency between temporal variations and frequency spectra. SimMTM (Dong et al., 2023) combines masked modeling and contrastive learning within the neighborhood of time series.
However, generative pre-training has received relatively less attention in the field of time series, despite its prevalence in the development of large language models (Touvron et al., 2023; OpenAI, 2023). Most large language models are generatively pre-trained (Zhao et al., 2023) with token-level
supervision, where each token is generated based on the pre-
vious context and independently supervised (Bengio et al.,
2000). Consequently, they are not constrained by specific
input and output lengths and excel at multi-step generation.
Furthermore, prior studies (Wang et al., 2022a; Dai et al., 2022) have demonstrated that scalability and generalization largely stem from generative pre-training, which requires more training data than other pre-training paradigms. Thus, our work aims to investigate and revitalize generative pre-training towards LTSMs, facilitated by extensive time series data and carefully designed adaptation to downstream tasks.
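For reference, the token-level objective underlying this paradigm (Radford et al., 2018) maximizes the likelihood of each token $u_i$ conditioned on its preceding context,
$$\max_\theta \sum_{i=1}^{N} \log p_\theta(u_i \mid u_1, \dots, u_{i-1}),$$
which is why such models are not tied to fixed input and output lengths and can generate arbitrarily long sequences by iteratively feeding predictions back as context.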
2.2. Large Time Series Models
Pre-trained models with scalability can evolve into large foundation models (Bommasani et al., 2021), characterized by increasing model capacity and pre-training scale to handle diverse data and tasks. Large language models even demonstrate advanced capabilities such as in-context learning and emergent abilities (Wei et al., 2022). At present, research on large time series models remains at a nascent stage. Ex-