This module includes:
Data-utils, such as those for converting time-series data from a Pandas DataFrame into a PyTorch torch.utils.data.Dataset and/or torch.utils.data.DataLoader, as well as a function for handling implicit missing data.
A function for adding calendar-features: i.e. weekly/daily/yearly season dummy-features for usage as predictors.
A function for creating a simple baseline model, against which to compare more sophisticated forecasting models.
Simple trainer classes for PyTorch models, with specialized subclasses for torchcast’s model-classes, as well as a special class for training neural networks to embed complex seasonal patterns into lower dimensional embeddings.
A ‘Stopping’ class for controlling convergence/stopping for the fit() method in state-space models.
—
This trainer is designed to train a torch.nn.Module to embed complex seasonal patterns (e.g. cycles on the yearly, weekly, daily level) into a lower-dimensional space. See Using NN’s for Long-Range Forecasts: Electricity Data for an example.
Since this requires passing y, it’s not really useful for genuine prediction, but is primarily for visualizing/sanity-checking outputs after/during training.
A simple trainer for a standard nn.Module (not a state-space model). Note: this is meant to be helpful for quick development; it is not meant to replace better tools (e.g. PyTorch Lightning) in more complex settings.
Usage:
from torch import nn
from torch.utils.data import DataLoader
dataloader = DataLoader(my_data, batch_size=32)
trainer = SimpleTrainer(module=nn.Linear(10, 1))
for loss in trainer(dataloader):
    # log the loss, early-stopping, etc.
    print(loss)
A trainer for a StateSpaceModel. This is for usage in contexts where the data are too large for StateSpaceModel.fit() to be practical. Rather than the base DataLoader, this class takes a torchcast.utils.TimeSeriesDataLoader.
Usage:
from torchcast.kalman_filter import KalmanFilter
from torchcast.utils import StateSpaceTrainer, TimeSeriesDataLoader
from torchcast.process import LocalTrend
my_dl = TimeSeriesDataLoader.from_dataframe(my_df)
my_model = KalmanFilter(processes=[LocalTrend(id='trend')])
my_trainer = StateSpaceTrainer(module=my_model)
for loss in my_trainer(my_dl, forward_kwargs={'n_step': 14*7*24, 'every_step': False}):
    # log the loss, early-stopping, etc.
    print(loss)
module – A StateSpaceModel instance (e.g. KalmanFilter or ExpSmoother).
dataset_to_kwargs – A callable that takes a torchcast.utils.TimeSeriesDataset and returns a dictionary of keyword-arguments to pass to each call of the module’s forward method. If left unspecified and the batch has a 2nd tensor, will pass tensor[1] as the X keyword.
optimizer – An optimizer (or a class to instantiate an optimizer). Default is torch.optim.Adam.
Allows control of convergence/stopping for the fit() method in state-space models.
abstol – The absolute tolerance.
patience – How many iterations monitored metrics need to change less than abstol before we stop.
monitor_loss – Should loss be monitored as part of convergence checks?
monitor_params – If True, all parameters will be monitored. If a list of names, only those parameters will be monitored. Can also be a function that takes a param-name and returns True/False.
max_iter – The maximum number of iterations after which training will be stopped regardless of convergence.
module – The module whose parameters are being optimized. Required if monitor_params is not False.
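How abstol, patience, and max_iter interact can be illustrated with a minimal sketch. The class below, SimpleStopping, is hypothetical: it monitors only the loss and is not torchcast's implementation, just one plausible reading of the patience rule described above.

```python
class SimpleStopping:
    """Hypothetical stand-in for illustration; not torchcast's Stopping class."""

    def __init__(self, abstol=1e-4, patience=2, max_iter=300):
        self.abstol = abstol
        self.patience = patience
        self.max_iter = max_iter
        self.prev_loss = None
        self.small_changes = 0  # consecutive iterations with change < abstol
        self.n_iter = 0

    def should_stop(self, loss):
        """Return True once the loss has converged or max_iter is reached."""
        self.n_iter += 1
        if self.prev_loss is not None and abs(loss - self.prev_loss) < self.abstol:
            self.small_changes += 1
        else:
            self.small_changes = 0  # a large change resets the patience counter
        self.prev_loss = loss
        return self.small_changes >= self.patience or self.n_iter >= self.max_iter


stopper = SimpleStopping(abstol=0.01, patience=2, max_iter=100)
losses = [1.0, 0.5, 0.3, 0.295, 0.294, 0.2939]
stopped_at = None
for i, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = i
        break
```

With patience=2, the loop stops only after two consecutive iterations in which the loss moved by less than abstol, so a single flat step does not end training.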
This is a convenience wrapper around DataLoader(collate_fn=TimeSeriesDataset.make_collate_fn()). Additionally, it provides a from_dataframe() classmethod so that the data-loader can be created directly from a pandas dataframe. This can be more memory-efficient than the alternative route of first creating a TimeSeriesDataset from a dataframe, and then passing that object to a data-loader.
dataframe – A pandas DataFrame
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them and so get trailing missing values. Setting pad_X allows you to select the padding value for these. Default is 0-padding.
kwargs – Other arguments to pass to TimeSeriesDataset.from_dataframe().
An iterable that yields TimeSeriesDataset.
TimeSeriesDataset includes additional information about each of the Tensors’ dimensions: the name for each group in the first dimension, the start (date)time (and optionally datetime-unit) for the second dimension, and the name of the measures for the third dimension.
Note that unlike torch.utils.data.TensorDataset, indexing a TimeSeriesDataset returns another TimeSeriesDataset, not a tuple of tensors. So when using TimeSeriesDataset, use TimeSeriesDataLoader (equivalent to DataLoader(collate_fn=TimeSeriesDataset.collate)).
dataframe – A pandas DataFrame
group_colname – Name of the group column.
time_colname – Name of the time column.
dt_unit – A numpy.timedelta64 (or string that will be converted to one) that indicates the time-units used – i.e., how far we advance with every timestep. Can be None if the data are in arbitrary (non-datetime) units.
measure_colnames – A list of names of columns that include the actual time-series data in the dataframe. Optional if X_colnames and y_colnames are passed.
X_colnames – In many settings we have a set of columns corresponding to predictors and a set of columns corresponding to the actual time-series data. The former should be passed as X_colnames and the latter as y_colnames.
y_colnames – See above.
pad_X – When stacking time series of unequal length, we left-align them and so get trailing missing values. Setting pad_X allows you to select the padding value for these. Default is 0-padding.
kwargs – The dtype and/or the device.
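The left-alignment and trailing padding described for pad_X can be sketched with plain NumPy. The helper stack_left_aligned is hypothetical and not part of torchcast; the nan-padding of the measures shown here is an assumption based on the nan-padding mentioned elsewhere in this reference.

```python
import numpy as np


def stack_left_aligned(series, pad_value=np.nan):
    """Stack 1D sequences of unequal length into a (groups, timesteps) array.

    Hypothetical helper for illustration; not torchcast's implementation.
    """
    max_len = max(len(s) for s in series)
    out = np.full((len(series), max_len), pad_value, dtype=float)
    for i, s in enumerate(series):
        out[i, :len(s)] = s  # left-align: data first, padding trails
    return out


# measures: trailing positions are missing (nan)
y = stack_left_aligned([[1.0, 2.0, 3.0], [4.0]])
# predictors: trailing positions get the chosen padding value (here 0)
X = stack_left_aligned([[0.1, 0.2, 0.3], [0.4]], pad_value=0.0)
```

Padding predictors with a finite value rather than nan keeps downstream computations on X free of nan, while the nan in y still marks those timesteps as unobserved.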
Get an array (aligned with self.group_names) with the ‘duration’ of each group, defined as the number of timesteps until the last measurement (i.e. the last timestep after which all measures are nan).
Since TimeSeriesDatasets are padded, this can be a helpful way to get the length of each time-series.
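The definition of duration can be made concrete with a small NumPy sketch. The function durations_from_padded is hypothetical and operates on a plain array of shape (groups, timesteps, measures); it illustrates the definition rather than reproducing torchcast's code.

```python
import numpy as np


def durations_from_padded(tensor):
    """Timesteps up to and including the last one with any non-nan measure."""
    any_measured = ~np.isnan(tensor).all(axis=2)  # (groups, timesteps)
    n_timesteps = tensor.shape[1]
    # argmax on the reversed time axis finds the last measured timestep
    last_plus_one = n_timesteps - np.argmax(any_measured[:, ::-1], axis=1)
    # groups with no measurements at all get a duration of 0
    return np.where(any_measured.any(axis=1), last_plus_one, 0)


x = np.full((2, 5, 1), np.nan)
x[0, :, 0] = 1.0        # first group is fully observed
x[1, [0, 2], 0] = 1.0   # second group: gap at t=1, last measurement at t=2
durations = durations_from_padded(x)
```

Note that the gap at t=1 in the second group still counts toward its duration; only the trailing padding is excluded.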
Get the subset of the batch corresponding to groups. Note that the ordering in the output will match the original ordering (not that of groups), and that duplicates will be dropped.
Take a dataset and split it into a dataset with multiple tensors.
measure_groups – Each argument should be a list of measure-names.
A TimeSeriesDataset, now with multiple tensors for the measure-groups.
A 2D array of datetimes (or integers if dt_unit is None) for this dataset.
which – If this dataset has multiple tensors of different number of timesteps, which should be used for constructing the times array? Defaults to the one with the most timesteps.
A 2D numpy array of datetimes (or integers if dt_unit is None).
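As a rough illustration of what such a 2D array looks like, the following sketch builds one from per-group start times and a time unit using NumPy broadcasting. The variable names are illustrative assumptions, not attributes of TimeSeriesDataset.

```python
import numpy as np

# one start time per group, and the step between consecutive timesteps
start_times = np.array(["2024-01-01T00", "2024-01-03T12"], dtype="datetime64[h]")
dt_unit = np.timedelta64(1, "h")
n_timesteps = 3

# broadcast: (groups, 1) + (timesteps,) -> (groups, timesteps)
times = start_times[:, None] + np.arange(n_timesteps) * dt_unit
```

Each row holds the datetimes for one group, so groups with different start times get different rows even though they share the same number of columns.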
train_frac – The proportion of the data to keep for training. This is calculated on a per-group basis, by taking the last observation for each group (i.e., the last observation that has a non-nan value on any measure). If neither train_frac nor dt are passed, train_frac=.75 is used.
dt – A datetime to use in dividing train/validation (first datetime for validation), or a dictionary of group-names : date-times.
quiet – If True, will not emit a warning for groups having only nan after dt.
Two TimeSeriesDatasets, one with data before the split, the other with data at or after the split.
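A per-group split by train_frac can be sketched as follows. The helper split_indices is hypothetical, and rounding the split point down is an assumption; the source only states that the fraction is computed per group from each group's last observation.

```python
import numpy as np


def split_indices(durations, train_frac=0.75):
    """Per-group index of the first validation timestep (floor is an assumption)."""
    return np.floor(np.asarray(durations) * train_frac).astype(int)


x = np.arange(12, dtype=float).reshape(2, 6)  # (groups, timesteps)
x[1, 4:] = np.nan                             # second group ends after 4 timesteps
durations = np.array([6, 4])                  # last observation per group

idx = split_indices(durations)
train = [x[g, :i] for g, i in enumerate(idx)]
val = [x[g, i:d] for g, (i, d) in enumerate(zip(idx, durations))]
```

Because the split point depends on each group's own duration, a short series contributes validation data too, instead of falling entirely before a single global cutoff.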
Return a new TimeSeriesDataset with unneeded padding removed. This is useful if we’ve subset a dataset and the remaining time series are all shorter than the previous longest time series’ length.
For example, this method combined with get_durations() can be helpful for splitting a single dataset with time-series of heterogeneous lengths into multiple datasets:
>>> ds_all = TimeSeriesDataset(x, group_names=group_names, start_times=start_dts)
>>> durations = ds_all.get_durations()
>>> ds_long = ds_all[durations >= 8784] # >= a year
>>> ds_short = ds_all[durations < 8784].trimmed() # shorter than a year
>>> assert ds_short.tensors[0].shape[1] < ds_long.tensors[0].shape[1]
Subset a TimeSeriesDataset so that some/all of the groups have later start times.
start_times – An array/list of new datetimes, or a single datetime that will be used for all groups.
n_timesteps – The number of timesteps in the output (nan-padded).
quiet – If True, will not emit a warning for groups having only nan after the start-time.
A new TimeSeriesDataset.
Create a new Batch with a different Tensor, but all other attributes the same.
Add season features to data by taking a date[time]-column and passing it through a fourier-transform.
data – A dataframe with a date[time] column.
K – The degrees of freedom for the fourier transform. Higher K means more flexible seasons can be captured.
period – Either a np.timedelta64, or one of {‘weekly’, ‘yearly’, ‘daily’}.
time_colname – The name of the date[time] column. Default is to try and guess with the following (in order): ‘datetime’, ‘date’, ‘timestamp’, ‘time’.
A copy of the original dataframe, now with K*2 additional columns capturing the seasonal pattern.
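The idea of K sine/cosine pairs yielding K*2 columns can be illustrated with a minimal sketch. The function fourier_features, its column names, and the use of the Unix epoch as time zero are all assumptions made for illustration; they are not torchcast's actual implementation or naming.

```python
import numpy as np
import pandas as pd


def fourier_features(times, K, period):
    """Return a DataFrame with K sine and K cosine columns (2*K in total)."""
    # elapsed time since the Unix epoch, measured in units of `period`
    t = (times - pd.Timestamp("1970-01-01")) / pd.Timedelta(period)
    out = {}
    for k in range(1, K + 1):
        # higher k captures faster oscillations within one period
        out[f"season_sin{k}"] = np.sin(2 * np.pi * k * t)
        out[f"season_cos{k}"] = np.cos(2 * np.pi * k * t)
    return pd.DataFrame(out)


df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=14, freq="D")})
feats = fourier_features(df["date"], K=2, period=np.timedelta64(7, "D"))
df_with_season = pd.concat([df, feats], axis=1)
```

With a weekly period, rows that are exactly seven days apart receive the same feature values, which is what lets a model treat them as the same point in the season.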
Given a dataframe of time series, convert implicit missings within each time series to explicit missings.
data – A pandas dataframe.
group_colnames – The column name(s) for the groups.
time_colname – The column name for the times. Will attempt to guess based on common labels.
dt_unit – Passed to pandas.date_range. If not passed, will attempt to guess based on the minimum difference between times.
max_dt_colname – Optional, a column-name that indicates the maximum time for each group. If not supplied, the actual maximum time for each group will be used.
A dataframe where implicit missings are converted to explicit missings, but the min/max time for each group is preserved.
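The distinction between implicit and explicit missings can be shown with a short pandas sketch. The helper fill_implicit_missing is hypothetical and only mirrors the behaviour described above (a regular grid between each group's own min and max time); it is not torchcast's implementation.

```python
import pandas as pd


def fill_implicit_missing(df, group_colname, time_colname, freq):
    """Reindex each group onto a regular time grid between its own min and max."""
    pieces = []
    for group, df_g in df.groupby(group_colname):
        grid = pd.date_range(df_g[time_colname].min(), df_g[time_colname].max(), freq=freq)
        df_g = df_g.set_index(time_colname).reindex(grid)
        df_g.index.name = time_colname
        df_g[group_colname] = group  # the group label is known even for new rows
        pieces.append(df_g.reset_index())
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-10", "2024-01-11"]
    ),
    "y": [1.0, 2.0, 4.0, 5.0, 6.0],
})
completed = fill_implicit_missing(df, "group", "time", freq="D")
```

Group "a" has no row for 2024-01-03 in the input (an implicit missing); in the output that row exists with y set to NaN (an explicit missing), while group "b" is not extended beyond its own time range.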
Generate a dataframe with a baseline forecast for each group in the data. The baseline forecast is generated by taking the lagged value of the target, possibly smoothed first.
data – The data to generate the baseline forecast for.
group_colnames – The column name(s) for the groups.
time_colname – The column name for the times.
value_colname – The column name for the target values.
is_train – A boolean column (or a function that takes the data and returns a boolean column) indicating whether each row belongs to the training set.
lag – The number of time-steps to lag the target by. Note that currently each row is assumed to be a timestep, with no implicit (or explicit) missings.
smooth – The window size to use for smoothing the target. Set to 0 to disable smoothing.
A dataframe with group_colnames, time_colname, and baseline columns.
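A lagged baseline of this kind can be sketched in a few lines of pandas. The function lagged_baseline is hypothetical; using a trailing rolling mean as the smoother is an assumption, and the is_train handling is omitted for brevity. As noted for lag, each row is treated as one timestep.

```python
import pandas as pd


def lagged_baseline(df, group_colname, value_colname, lag, smooth=0):
    """Target shifted by `lag` rows within each group, optionally smoothed first."""
    def _one_group(y):
        if smooth > 0:
            # trailing rolling mean; the smoother choice is an assumption
            y = y.rolling(smooth, min_periods=1).mean()
        return y.shift(lag)

    return df.groupby(group_colname)[value_colname].transform(_one_group)


df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "time": list(pd.date_range("2024-01-01", periods=4, freq="D")) * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
df["baseline"] = lagged_baseline(df, "group", "y", lag=2)
```

Shifting within each group prevents the first rows of one group from borrowing values from the end of another, which is why the first lag rows of every group have no baseline.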