Strong RL


Strong-RL applications are run as pipelines of configured components, imported and extended to solve the particulars of your reinforcement learning problem.

Here we describe these components and the concepts that bind them together and make them suitable to solve a host of action recommendation problems.


Although we refer primarily to the most common (“batch”) mode, Strong-RL can operate in two modes.

Batch mode acts on targets on an approximately-fixed interval (e.g., daily, weekly). This is the simplest mode in which to develop because interval duration is constant and thus should not directly affect observed reward. It also allows for simple, serial pipeline execution.

Streaming mode acts on targets at variable intervals. Targeting could be driven by the occurrence of a specific event (e.g., entering a store) or a statistical calculation (e.g., average spending decreasing below some threshold). Streaming RL applications are much more complicated to build because they require continuous pipeline execution. They also require considering how to handle the relationship between interval and reward, which, if ignored, can strongly bias any recommendations.

The mode you choose to operate in is defined by the components that you import and wire to your application.



Applications bind together all of the components in your application, allowing them to share a common application configuration. As each component is defined, it is passed the Application object and binds itself to it, such that Application.component references the component and the component, in turn, references the Application.


The time cursor (implemented in strong_rl.base.timecursor.TimeCursor) determines the time before which event data will be modelled, and at which action recommendations will be made.

Although time is always important (and tricky) in any software application, time is managed here in its own dedicated component because the concept of time is fundamental to building and appropriately testing any reinforcement learning application.

In this context, the time cursor is a boundary such that:

  • When we model data, we look for all events before this time.

  • When we write actions, we write them at this time.

With this implementation, we can think of actions as being recommended at the juncture between successive targetings of a single target. For example, imagine an application that delivers hourly next-best-action recommendations:

  • 12am-1am: Target engages in behavior, emitting event-level data.

  • 1am: Target is targeted by Strong-RL and processed by the Actor, which recommends some action.

  • 1am-2am: Target engages in more behavior, emitting more event-level data.

  • 2am: Target is targeted by Strong-RL. The agent learns from the events from 1am-2am about the quality of recommendations at 1am in order to make better recommendations following future events from 12am-1am. (In reinforcement learning language, the 1am-2am period will be summarized as state-prime while 12am-1am will be summarized as the state. We act at the juncture between them.)

Even if running the application at 1am takes longer than a microsecond, all of the actions will be written at exactly 1am because the time cursor is fixed at this point (and likewise for 2am). This makes it clear that the actions were chosen with only the data preceding this point in hand.

It is expected that the TimeCursor will be explicitly set before running the pipeline, establishing a single timepoint at which all components will be run.
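The boundary semantics above can be sketched with a minimal stand-in class. This is purely illustrative: the real component lives in strong_rl.base.timecursor.TimeCursor, and its actual interface may differ from the `set`, `events_before`, and `stamp_action` methods invented here.

```python
from datetime import datetime

class SimpleTimeCursor:
    """Illustrative sketch of the time-cursor boundary, not the real API."""

    def __init__(self):
        self.now = None

    def set(self, timepoint):
        # Fix the cursor once, before the pipeline runs.
        self.now = timepoint

    def events_before(self, events):
        # Modelling only sees events strictly before the cursor.
        return [e for e in events if e["timestamp"] < self.now]

    def stamp_action(self, action):
        # Every action is written at exactly the cursor time,
        # regardless of how long the pipeline takes to run.
        return {**action, "timestamp": self.now}

cursor = SimpleTimeCursor()
cursor.set(datetime(2024, 1, 1, 1, 0))  # the 1am run

events = [
    {"name": "slot_pull", "timestamp": datetime(2024, 1, 1, 0, 30)},
    {"name": "slot_pull", "timestamp": datetime(2024, 1, 1, 1, 15)},
]
visible = cursor.events_before(events)            # only the 12:30am event
action = cursor.stamp_action({"name": "send_coupon"})
```

Note that the 1:15am event is invisible to this run; it will be modelled at the next targeting, as state-prime.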


The Datalog (implemented in strong_rl.base.datalog.DataLog, extended by strong_rl.batch.datalog.BatchDataLog) provides write/read access to stored event-level data. Although it is not required to write data to your datastore using the Datalog (you can write data using third-party ETLs, for example), it provides a .write() method which can be passed any iterable of Event objects and will write them to the appropriate date partition. Of course, it also exposes a .read() method for reading those event datasets as Spark DataFrames.
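The write/read contract can be illustrated with an in-memory stand-in. This is a sketch only: the real BatchDataLog persists events to date-partitioned storage and .read() returns Spark DataFrames rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime

class InMemoryDataLog:
    """Illustrative stand-in for the Datalog's write/read contract."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def write(self, events):
        # Route each event to the partition for its date.
        for event in events:
            self.partitions[event["timestamp"].date()].append(event)

    def read(self, date):
        return list(self.partitions[date])

log = InMemoryDataLog()
log.write([
    {"name": "entry", "timestamp": datetime(2024, 1, 1, 9, 0), "player_id": 1},
    {"name": "slot_pull", "timestamp": datetime(2024, 1, 1, 9, 5), "player_id": 1},
])
same_day = log.read(datetime(2024, 1, 1).date())
```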


Your application’s Datamodeler (implemented in strong_rl.base.datamodeler.DataModeler, extended by strong_rl.batch.datamodeler.BatchDataModeler), unsurprisingly, models your data. In this context, modeling refers to building Spark DataFrames (queryable as views) from schemas and/or SQL queries.

Specifically, it builds two kinds of models:

  • Event models, built from each Event-inheriting class passed to the Application in its EventSet.

  • Miscellaneous higher-level models, built from each Model-inheriting class passed to the Application in its ModelSet. At minimum, there must be at least one model in this set, the Target model: a query which yields one-target-per-record data describing each target (e.g., customer) that you want to potentially act on.

For example, let’s say you had an application designed to send promotional coupons to players at a casino. You have two kinds of events:

  • EntryEvent: A player swipes their card upon arrival to the casino

  • SlotEvent: A player plays at a slot machine

You also have two kinds of models:

  • VisitModel: Aggregates SlotEvent and EntryEvent data to create a one-visit-per-row DataFrame summarizing player visits.

  • PlayerModel (our Target model): Aggregates VisitModel records to create a one-player-per-row DataFrame summarising players.

The datamodeler would first build views for each of your events — an entries view and a slots view. It would then query those views using the query defined in the VisitModel to create a new view — visits. Finally, it would query the visits view using the query defined in the PlayerModel to create the final dataset.
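That layering of views can be sketched with plain SQL. The example below uses SQLite purely for illustration; the real DataModeler registers Spark temporary views, and the table and column names here are invented for the casino scenario.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Event views, one per Event class (entries and slots).
    CREATE TABLE entries (player_id INTEGER, ts TEXT);
    CREATE TABLE slots (player_id INTEGER, ts TEXT, bet REAL);

    INSERT INTO entries VALUES (1, '2024-01-01'), (1, '2024-01-02');
    INSERT INTO slots VALUES (1, '2024-01-01', 10.0), (1, '2024-01-01', 20.0),
                             (1, '2024-01-02', 5.0);

    -- VisitModel: one row per player-visit, built from the event views.
    CREATE VIEW visits AS
        SELECT e.player_id, e.ts, SUM(s.bet) AS total_bet
        FROM entries e
        JOIN slots s ON s.player_id = e.player_id AND s.ts = e.ts
        GROUP BY e.player_id, e.ts;

    -- PlayerModel (the Target model): one row per player, built from visits.
    CREATE VIEW players AS
        SELECT player_id, COUNT(*) AS num_visits, AVG(total_bet) AS avg_visit_bet
        FROM visits
        GROUP BY player_id;
""")
row = conn.execute("SELECT num_visits, avg_visit_bet FROM players").fetchone()
```

Each layer queries only the views beneath it, which is why the event views must be built before the VisitModel, and the VisitModel before the PlayerModel.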

Except for previous actions (defined as standard Action events), only the results in the Target model are passed on to the pipeline. All other events, models, and their corresponding views are left behind after the target models are created.


The Targeter (implemented in strong_rl.base.targeter.Targeter, extended by strong_rl.batch.targeter.Targeter) wraps a single query that selects only those target models you would like to pass to your Actor for consideration.

In most cases, everyone from whom you are collecting data (and thus modelling) will be targeted. This means that this query will likely be: SELECT * FROM target_model_name.

However, you may want to target only people who meet some criteria. These criteria must be columns in the target models themselves, allowing you to query, e.g., SELECT * FROM players WHERE last_visit >= NOW() - INTERVAL '30' DAY.


The target models from the Targeter are retrieved by the Actor, which passes them to your agent(s) to generate action recommendations.

The Actor’s workflow is the most complex of any component in the pipeline:

  1. Retrieves the targets from the Targeter.

  2. Merges these targets with previous action recommendations, if such actions exist.

  3. Splits targets (now merged with their action histories) into batches.

  4. Passes each batch sequentially to each of the agents defined in your application’s AgentSet.

  5. From each agent, it gathers any recommended actions and then executes them: passing them to the datalog (for persistence, if enabled) and to the Environment currently associated with your application (for execution).

  6. After each agent processes a batch, the agent’s .update() and .save() methods are called for incremental training and saving of their updated state.

By design, while the Actor uses Spark, the agents with which it communicates are standard Python and receive only native, standard Python objects. While agents could use Spark for distributed processing, they could also use any other means of distribution including Dask, RLlib, or Celery. Or they could process in standard single-process fashion.
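The workflow above can be condensed into a toy control-flow sketch. Everything here (ToyAgent, the batch size, the action format, the list standing in for the Environment) is invented for illustration; the real Actor merges action histories, persists via the datalog, and hands agents plain Python objects derived from Spark rows.

```python
class ToyAgent:
    """Hypothetical agent illustrating the act/update/save lifecycle."""

    def __init__(self):
        self.updates = 0

    def act(self, batch):
        # Recommend one action per target in the batch.
        return [{"target_id": t["id"], "action": "send_coupon"} for t in batch]

    def update(self):
        self.updates += 1   # incremental training would happen here

    def save(self):
        pass                # persist updated agent state


def run_actor(targets, agents, environment, batch_size=2):
    actions = []
    # Step 3: split targets into batches.
    batches = [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]
    for batch in batches:
        # Step 4: pass each batch sequentially to each agent.
        for agent in agents:
            recommended = agent.act(batch)
            # Step 5: execute actions (a list stands in for Environment.act()).
            environment.extend(recommended)
            actions.extend(recommended)
            # Step 6: incremental training and saving after each batch.
            agent.update()
            agent.save()
    return actions

env = []
agent = ToyAgent()
acts = run_actor([{"id": 1}, {"id": 2}, {"id": 3}], [agent], env)
```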


Environments (implemented in strong_rl.base.environment.Environment) are one of the keys to Strong-RL’s success as a robust framework for developing and deploying reinforcement learning applications.

Environment-inheriting classes implement only a single .act() method, which receives targets and their corresponding recommended actions. What they do with those actions is entirely up to the environment, but in most cases an application will include several environments that can be swapped into your Application for various purposes:

  • A default, null environment (i.e., just Environment, itself) that simply ignores recommended actions. You might use this when running unit tests.

  • A simulation environment that maintains its own latent state and, when receiving actions, updates that state before emitting new events to the DataLog. An entire simulated world can be implemented as an Environment, including simulated customers who receive actions and then engage in behaviors (which then causes them to receive new actions, emit new behaviors, etc. in a dynamic loop).

  • A live environment that receives recommended actions and then pushes them for execution. For example, it might receive the recommendation to send an email to a customer and then translate this into a request to MailChimp’s API. Alternatively, it might simply export these recommendations to a dataset that can be reviewed before execution in “human-in-the-loop”-style implementation.
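The swap-in pattern can be sketched as follows. The .act() signature and the recording environment are assumptions made for illustration; a real live environment would translate each action into an external request (e.g., an email-service API call) rather than appending to a list.

```python
class NullEnvironment:
    """Default-style environment: deliberately ignores recommended actions."""

    def act(self, targets_and_actions):
        pass  # useful under unit tests: no side effects

class RecordingEnvironment:
    """Stand-in for a live environment; records instead of executing."""

    def __init__(self):
        self.executed = []

    def act(self, targets_and_actions):
        # A live environment would push each action for execution here;
        # we just record it, which also suits human-in-the-loop review.
        self.executed.extend(targets_and_actions)

recommendations = [({"player_id": 1}, {"action": "send_coupon"})]
NullEnvironment().act(recommendations)   # silently dropped
live = RecordingEnvironment()
live.act(recommendations)                # captured for execution/review
```

Because both classes expose the same .act() interface, the Application can switch between them without any change to the Actor or agents.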


TRAART is an acronym that stands for:

  • **T**arget

  • **R**ecommended **A**ction

  • **A**ction

  • **R**eward

  • **T**arget-Prime

TRAART data is the standard kind of data stored and learned from by agents. It is related to the concept of a SARS’ tuple in standard reinforcement learning vocabulary, with two key differences.

First, by storing Target models and not states, we have the benefit of accessing richer data about a target than merely that which comprised their state (which is perhaps unintelligible due to various transformations).

Second, we not only store the recommended action but also the observed action; unlike in video games, we can’t always be sure that what was recommended was executed. By storing both, we leave it up to the agent what to do with these samples.

Although agents inheriting Agent have the flexibility to capture and store any data from targets, the default Strong-RL components and tools (i.e., labs datasets) all assume that agents will capture and learn from TRAART data. Strong-RL includes a means of efficiently storing TRAART data as gzip-compressed Parquet files (strong_rl.internals.traart) and a default, extendable agent class (strong_rl.algorithms.agents.traart_base.BaseTRAART) that observes TRAART data.
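The shape of a single TRAART sample can be illustrated as a plain tuple. The field names below are invented for the example; the real storage layer (strong_rl.internals.traart) writes these samples as gzip-compressed Parquet rather than Python objects.

```python
from typing import NamedTuple, Optional

class TRAART(NamedTuple):
    """Illustrative TRAART sample: Target, Recommended Action, Action,
    Reward, Target-prime."""
    target: dict               # full Target model at recommendation time
    recommended_action: str    # what the agent asked for
    action: Optional[str]      # what was actually observed/executed
    reward: float              # reward observed over the following interval
    target_prime: dict         # full Target model at the next targeting

sample = TRAART(
    target={"player_id": 1, "avg_daily_bet": 12.0},
    recommended_action="send_coupon",
    action=None,               # recommendation was never executed
    reward=0.0,
    target_prime={"player_id": 1, "avg_daily_bet": 11.5},
)
```

Storing both recommended_action and action is what lets an agent decide, per sample, whether to learn from an unexecuted recommendation.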


Events are the only source of input data that a Strong-RL application can receive.

Events can be either custom (may change with each implementation) or native (standard, required by each implementation).

Events are defined by creating a class that inherits from Event, for example:

class SlotEvent(Event):
    name = "slot_pull"

    name = DataField(StringType(), False)
    timestamp = DataField(TimestampType(), False)
    player_id = DataField(LongType(), False)
    bet = DataField(FloatType(), False)

By default, all fields other than the name of an event are “fillable”; that is, they are expected to be passed to create an object representing a particular event occurrence:

occurrences = [SlotEvent(timestamp=datetime.now(), player_id=1, bet=100.0)]

In cases where your event has more static, unfillable attributes, you can specify exactly which fields are fillable like so:

class OtherEvent(Event):
    name = "other"
    version = "v1"

    name = DataField(StringType(), False)
    version = DataField(StringType(), False)
    timestamp = DataField(TimestampType(), False)
    player_id = DataField(LongType(), False)

    fillable = ['timestamp', 'player_id']

occurrences = [OtherEvent(timestamp=datetime.now(), player_id=1)]
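One way the fillable contract might be enforced is sketched below. This is illustrative only: Strong-RL’s actual Event implementation may enforce it differently, and the ToyEvent classes here are invented for the example.

```python
class ToyEvent:
    """Illustrative base class: only fields listed in `fillable`
    may be passed at construction time."""
    fillable = []

    def __init__(self, **kwargs):
        unexpected = set(kwargs) - set(self.fillable)
        if unexpected:
            raise TypeError(f"not fillable: {sorted(unexpected)}")
        for field, value in kwargs.items():
            setattr(self, field, value)

class ToyOtherEvent(ToyEvent):
    name = "other"
    version = "v1"
    fillable = ["timestamp", "player_id"]

ok = ToyOtherEvent(timestamp="2024-01-01T00:00:00", player_id=1)
try:
    ToyOtherEvent(version="v2")   # static attribute: rejected
    rejected = False
except TypeError:
    rejected = True
```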


A target is an individual (potential) recipient of recommended actions, e.g., a customer. In practice, a target is an object of a Model-inheriting class with attributes that fit the target model’s schema. A simple target model may look like:

class Player(Model):
    name = "player"

    key = "player_id"

    query = """
        SELECT player_id,
               COUNT(DISTINCT slot_date) AS num_active_days,
               SUM(num_slot_pulls) AS num_slot_pulls,
               AVG(avg_bet) AS avg_daily_bet,
               LAST(avg_bet) AS last_daily_bet
        FROM slot_days
        GROUP BY player_id
    """

    # query is validated against desired schema
    player_id = DataField(LongType(), False)
    num_active_days = DataField(LongType(), False)
    num_slot_pulls = DataField(LongType(), False)
    avg_daily_bet = DataField(DoubleType(), False)
    last_daily_bet = DataField(DoubleType(), False)

Each target in this case would have player_id, num_active_days, num_slot_pulls, avg_daily_bet, and last_daily_bet attributes that can be used to generate a state (in the RL sense, see below) and make action recommendations.


States are quantified transformations of a target’s current attributes, as emitted by the targeter. Although, in some cases, the data in a target model is appropriate/sufficient for submission to a quantitative model predicting action rewards, most applications will use a preprocessor (see below) to transform a target model’s “raw” data into a format appropriate for a statistical model. For example, this may include anomaly detection, standardization/normalization, aggregation, feature selection, or other standard data engineering tools.

Typically, we refer to a target’s processed data at the time of action recommendation as its “state”, and the processed data at their next action recommendation as the “state-prime”.
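A minimal preprocessor might look like the sketch below: it turns a target’s raw attributes into a normalized state vector. The attribute names match the Player model above, but the function and its normalization constants are invented for the example.

```python
def preprocess(target, max_active_days=365.0, max_daily_bet=1000.0):
    """Hypothetical preprocessor: scale raw target attributes into [0, 1]."""
    return [
        min(target["num_active_days"] / max_active_days, 1.0),
        min(target["avg_daily_bet"] / max_daily_bet, 1.0),
        min(target["last_daily_bet"] / max_daily_bet, 1.0),
    ]

state = preprocess({"num_active_days": 73,
                    "avg_daily_bet": 250.0,
                    "last_daily_bet": 500.0})
```

Running the same function on the target’s attributes at its next targeting yields the state-prime.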


Rewards are the sole measure of the desirability of a target’s behavior following an action recommendation. Except in the most trivial cases, when a single attribute of a target can be treated as the reward, rewards are calculated via a “reward” function that converts event-level data into a single metric per target that agents will attempt to maximize through optimal action recommendations.

Reward functions can be implemented in many ways, but we suggest one of the following:

  • A User-Defined SQL Function mapped to target data in the target model query, receiving target attributes as a struct and yielding a new reward attribute.

  • A method in your agent’s preprocessor, receiving an iterable of target data models and yielding a np.array of rewards.

When implemented properly, the reward function in your application will be highest when a target is doing exactly what you want them to do (e.g., purchasing your most expensive product as frequently as possible), lowest when a target does exactly what you don’t want them to do (e.g., unsubscribing from all future communications and never visiting your store again), and smooth in between (e.g., monotonically increasing as they approach the desired behavior).
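A toy reward function with these three properties might look like the following. The event names and scaling constant are assumptions for illustration; in the casino example, spend is the behavior being encouraged and unsubscribing the behavior being discouraged.

```python
def reward(target_events):
    """Hypothetical reward: lowest for churn, highest for spend,
    and smooth (monotonically increasing) in between."""
    if any(e["name"] == "unsubscribe" for e in target_events):
        return -1.0                       # worst case: target opts out
    spend = sum(e.get("bet", 0.0) for e in target_events)
    return spend / (spend + 100.0)        # smooth, monotonic, bounded in [0, 1)

r_churn = reward([{"name": "unsubscribe"}])
r_low = reward([{"name": "slot_pull", "bet": 10.0}])
r_high = reward([{"name": "slot_pull", "bet": 900.0}])
```

The saturating form keeps the reward bounded, so an agent is not driven to chase unbounded spikes from a single extreme target.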