Quick Start

torchcast is a Python package for time-series forecasting in PyTorch. Its focus is on training and forecasting with batches of time series, rather than training a separate model for each series. In addition, it provides robust support for multivariate time series, where multiple correlated measures are forecasted jointly.

To briefly provide an overview of these features, we’ll use a dataset from the UCI Machine Learning Repository. It includes data on air pollutants and weather from 12 monitoring stations in Beijing, here aggregated to weekly averages.
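
This walkthrough starts at cell [2]; the setup cell is not shown. Below is a minimal sketch of the imports it is assumed to contain (module paths follow the torchcast docs, but may vary across versions):

import torch
import numpy as np

# torchcast classes and utilities used in the cells below
# (paths assumed from the torchcast docs; check your installed version):
from torchcast.kalman_filter import KalmanFilter
from torchcast.process import LocalTrend, Season
from torchcast.utils.data import TimeSeriesDataset
from torchcast.utils.datasets import load_air_quality_data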

[2]:
df_aq = load_air_quality_data('weekly')

df_aq
loading from gh...
[2]:
week station PM2p5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN WSPM
0 2013-02-25 Aotizhongxin 38.263889 57.791667 36.541667 56.750000 958.236111 37.583333 2.525000 1022.777778 -15.666667 0.0 2.130556
1 2013-02-25 Changping 31.986111 47.152778 31.870773 44.969203 870.355072 45.925725 1.915278 1019.633333 -16.231944 0.0 1.475000
2 2013-02-25 Dingling 28.083333 37.816919 15.955314 34.916667 627.838164 49.222222 1.915278 1019.633333 -16.231944 0.0 1.475000
3 2013-02-25 Dongsi 45.083333 60.680556 31.687198 60.160024 1165.036836 43.635870 2.268056 1024.697222 -16.913889 0.0 1.775000
4 2013-02-25 Guanyuan 38.472222 55.208333 35.467995 62.855676 1075.746981 38.277778 2.525000 1022.777778 -15.666667 0.0 2.130556
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2515 2017-02-27 Nongzhanguan 46.807359 76.195652 13.304348 56.108696 1032.608696 41.152174 9.647917 1016.014583 -9.964583 0.0 1.787500
2516 2017-02-27 Shunyi 47.020833 58.625000 13.750000 61.791667 1033.333333 40.750000 8.289583 1016.695833 -8.525000 0.0 1.827083
2517 2017-02-27 Tiantan 43.662500 70.429167 10.577273 64.052381 1012.954545 37.800000 9.647917 1016.014583 -9.964583 0.0 1.787500
2518 2017-02-27 Wanliu 36.979167 60.750000 12.663949 65.371377 1061.322464 35.964015 8.868750 1014.450000 -9.300000 0.0 1.558333
2519 2017-02-27 Wanshouxigong 39.550595 55.199405 9.835404 57.981366 993.788820 37.146998 9.647917 1016.014583 -9.964583 0.0 1.787500

2520 rows × 13 columns

Prepare our Dataset

In torchcast we set up our data and model with the following:

  • The groups, which define separate time series. Here we have multiple stations. Groups are not necessarily simultaneous with each other (e.g. we could have time series of product purchases, with products having varying release dates), and correlations across these groups are not modeled.

  • The measures, which define the separate metrics being measured simultaneously within each group. Here we have the two kinds of particulate matter (PM2.5 and PM10).

The TimeSeriesDataset is similar to PyTorch’s native TensorDataset, but with useful metadata on the batch of time series (the station names, and the dates for each).

For a quick example, we’ll focus on predicting particulate matter (PM2.5 and PM10). We’ll log-transform these measures, since they are strictly positive.

[3]:
# log10-transform the strictly positive PM measures:
df_aq['PM2p5_log10'] = np.log10(df_aq['PM2p5'])
df_aq['PM10_log10'] = np.log10(df_aq['PM10'])

# create a dataset:
dataset_all = TimeSeriesDataset.from_dataframe(
    dataframe=df_aq,
    dt_unit='W',
    measure_colnames=['PM2p5_log10', 'PM10_log10'],
    group_colname='station',
    time_colname='week'
)

# Split out training period:
SPLIT_DT = np.datetime64('2016-02-22')
dataset_train, _ = dataset_all.train_val_split(dt=SPLIT_DT)
dataset_train
[3]:
TimeSeriesDataset(sizes=[torch.Size([12, 156, 2])], measures=(('PM2p5_log10', 'PM10_log10'),))
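
The repr shows the shape of the underlying tensor: 12 groups (stations) × 156 weekly timesteps × 2 measures. The dataset’s metadata is available as attributes; a quick sketch (start_datetimes and tensors are used below; group_names is an assumed attribute name per the torchcast docs):

dataset_train.group_names       # the 12 station names
dataset_train.start_datetimes   # one start date per group
dataset_train.tensors[0].shape  # torch.Size([12, 156, 2])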

Specify our Model

In torchcast our forecasting model is defined by measures and processes. The processes give rise to the measurable behavior. Here we’ll specify a random-walk/trend component and a yearly seasonal component for each pollutant.

[4]:
processes = []
for m in dataset_train.measures[0]:
    processes.extend([
        # random-walk/trend component:
        LocalTrend(id=f'{m}_trend', measure=m),
        # yearly seasonality; period is in weeks (365.25 / 7 ~= 52.18), with K=3 Fourier terms:
        Season(id=f'{m}_day_in_year', period=365.25 / 7, dt_unit='W', K=3, measure=m, fixed=True)
    ])
kf_first = KalmanFilter(measures=dataset_train.measures[0], processes=processes)

Train our Model

The KalmanFilter class provides a convenient fit() method that avoids the standard boilerplate of a full-batch training loop:

[5]:
kf_first.fit(
    dataset_train.tensors[0],
    start_offsets=dataset_train.start_datetimes
)
Initializing PM2p5_log10_trend.position to 1.8530830144882202
Initializing PM10_log10_trend.position to 1.9869805574417114
[5]:
KalmanFilter(processes=[LocalTrend(id='PM2p5_log10_trend'), Season(id='PM2p5_log10_day_in_year'), LocalTrend(id='PM10_log10_trend'), Season(id='PM10_log10_day_in_year')], measures=['PM2p5_log10', 'PM10_log10'])

Calling forward() on our KalmanFilter produces a Predictions object. If you’re writing your own training loop, you’d simply use the (negated) log_prob() method as the loss function:

[6]:
pred = kf_first(
    dataset_train.tensors[0],
    start_offsets=dataset_train.start_datetimes,
    # forecast out to the end of the full dataset, beyond the training period:
    out_timesteps=dataset_all.tensors[0].shape[1]
)

# negative log-likelihood of the training data:
loss = -pred.log_prob(dataset_train.tensors[0]).mean()
print(loss)
tensor(-0.7233, grad_fn=<NegBackward0>)
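
For reference, here is a minimal sketch of such a custom loop (the optimizer choice, learning rate, and epoch count are illustrative, not torchcast defaults):

# KalmanFilter is a torch.nn.Module, so its parameters can be optimized directly:
optimizer = torch.optim.Adam(kf_first.parameters(), lr=0.05)  # illustrative settings
for epoch in range(100):
    optimizer.zero_grad()
    pred = kf_first(
        dataset_train.tensors[0],
        start_offsets=dataset_train.start_datetimes
    )
    # negative log-likelihood as the loss:
    loss = -pred.log_prob(dataset_train.tensors[0]).mean()
    loss.backward()
    optimizer.step()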

Inspect & Visualize our Output

Predictions can easily be converted to pandas DataFrames, for inspecting the predictions, comparing them to actuals, and visualizing:

[7]:
df_pred = pred.to_dataframe(dataset_all)
# back-transform the predictions and actuals from log10 to the original scale:
df_pred[['actual','mean','upper','lower']] = 10 ** df_pred[['actual','mean','upper','lower']]
df_pred
[7]:
group time measure mean lower upper actual
0 Aotizhongxin 2013-02-25 PM2p5_log10 91.807892 35.214321 239.354019 38.263885
1 Aotizhongxin 2013-03-04 PM2p5_log10 87.755508 34.788483 221.367203 139.428558
2 Aotizhongxin 2013-03-11 PM2p5_log10 85.758369 34.170952 215.226624 157.071411
3 Aotizhongxin 2013-03-18 PM2p5_log10 86.671127 34.581505 217.222595 67.321434
4 Aotizhongxin 2013-03-25 PM2p5_log10 84.763466 33.794220 212.605728 107.333313
... ... ... ... ... ... ... ...
5035 Wanshouxigong 2017-01-30 PM10_log10 89.650490 39.856392 201.654251 119.059532
5036 Wanshouxigong 2017-02-06 PM10_log10 91.290504 40.491997 205.817383 61.144417
5037 Wanshouxigong 2017-02-13 PM10_log10 93.597961 41.411308 211.550446 139.613068
5038 Wanshouxigong 2017-02-20 PM10_log10 96.385437 42.535732 218.408188 47.162453
5039 Wanshouxigong 2017-02-27 PM10_log10 99.412216 43.764740 225.816269 55.199409

5040 rows × 7 columns

[8]:
df_pred['percent_error'] = np.abs(df_pred['mean'] - df_pred['actual']) / df_pred['actual']
# evaluate only on the held-out validation period:
print("Percent Error: {:.1%}".format(df_pred.query("time>@SPLIT_DT")['percent_error'].mean()))
Percent Error: 35.4%
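
Since the dataframe is in long format with a measure column, it’s just as easy to break this error out, e.g. per measure:

# mean percent error per measure, over the validation period:
print(df_pred.query("time>@SPLIT_DT").groupby('measure')['percent_error'].mean())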

The Predictions class comes with a plot classmethod for getting simple plots of forecasts vs. actuals:

[9]:
pred.plot(df_pred.query("group=='Changping'"), split_dt=SPLIT_DT, time_colname='time', group_colname='group')
_images/quick_start_16_0.png

Finally, you can produce dataframes that decompose the predictions into the underlying processes that generated them:

[10]:
pred.plot(
    pred.to_dataframe(dataset_all, type='components').query("group=='Changping'"), split_dt=SPLIT_DT,
    time_colname='time', group_colname='group'
)
_images/quick_start_18_0.png