Strong RL


An agent (implemented in strong_rl.algorithms.agents.agent.Agent) is a collection of lower-level components that, when combined, enable the intelligent selection of next best actions for targets.

These components underlie the only two methods through which the actor interfaces with the agent:

  • .observe(): Observe a set of targets at a point in time to learn about the effects of past actions.

  • .act(): Act on a set of targets at a point in time to make new recommendations.
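The interaction can be sketched as follows. Only .observe() and .act() come from the Agent interface described above; ToyAgent and run_batch are hypothetical stand-ins, not strong_rl code:

```python
# Hypothetical sketch of an actor driving an agent. Only .observe() and
# .act() mirror the Agent interface; everything else is illustrative.
class ToyAgent:
    def observe(self, targets, now):
        # A real agent would complete pending observations and learn here.
        self.last_observed = (list(targets), now)

    def act(self, targets, now):
        # A real agent consults its policy and models; this stub simply
        # recommends a fixed action for every target.
        return {target: "noop" for target in targets}

def run_batch(agent, targets, now):
    agent.observe(targets, now)     # learn from the effects of past actions
    return agent.act(targets, now)  # make new recommendations

recommendations = run_batch(ToyAgent(), ["target_a", "target_b"], now=0)
```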

Each major lower-level component is described below.


Preprocessor

Preprocessors transform a target model’s current data into a quantified “state” vector (see above). Preprocessors vary depending on the nature of the other components in the agent (most notably, the statistical models employed as value estimators). Typically, however, they perform feature selection and standardization/normalization, as well as aggregation of historical features to allay concerns of partial observability. We recommend autoencoders or other dimensionality-reduction techniques so that the preprocessor yields a state that balances information content and learnability.
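A minimal preprocessor sketch, assuming hypothetical feature names and precomputed population statistics (none of these come from strong_rl):

```python
from statistics import fmean

def preprocess(history):
    """Map a target's raw history (list of dicts, oldest first) to a
    state vector. Feature names and statistics here are hypothetical."""
    latest = history[-1]
    # Feature selection + standardization with (assumed) precomputed
    # population statistics for each selected feature.
    visits_z = (latest["visits"] - 10.0) / 5.0
    spend_z = (latest["spend"] - 50.0) / 25.0
    # Aggregate history to allay partial observability: trailing mean spend.
    trailing_spend = fmean(h["spend"] for h in history)
    return [visits_z, spend_z, trailing_spend]
```

A real preprocessor would replace the hard-coded statistics with fitted ones (or an autoencoder) and return a vector sized for the downstream model.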


Memory

An agent’s memory serves two purposes. First, agents often benefit from maintaining a “memory” of previous targets, actions, and rewards that they can re-sample and learn from at future timepoints. In its simplest form, this memory is an experience replay buffer; however, more complex sampling strategies may also be implemented. Second, agents encode pending observations of targets in memory. For example, upon first acting on a target, an agent will encode the target data model and the action it recommended. The next time it sees the target to act on it, the agent can “complete” the observation by encoding the observed action, the reward, and the target-prime (i.e., where the target ended up after the action). We allow agents to encode their own pending observations like this to give them flexibility in storing whatever data they would like, at whatever intervals they would like.
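The two roles can be sketched together. This is an illustrative stand-in, not strong_rl's memory API; all method and attribute names are hypothetical:

```python
import random
from collections import deque

class Memory:
    """Sketch of the two memory roles: a replay buffer of completed
    transitions, plus a pending-observation store keyed by target."""

    def __init__(self, capacity=10_000):
        self.replay = deque(maxlen=capacity)  # completed transitions
        self.pending = {}                     # target_id -> (state, action)

    def start_observation(self, target_id, state, recommended_action):
        # Encoded when the agent first acts on a target.
        self.pending[target_id] = (state, recommended_action)

    def complete_observation(self, target_id, observed_action, reward, state_prime):
        # Completed the next time the agent sees the target.
        state, _recommended = self.pending.pop(target_id)
        self.replay.append((state, observed_action, reward, state_prime))

    def sample(self, k):
        # Simplest strategy: uniform experience replay.
        return random.sample(list(self.replay), min(k, len(self.replay)))
```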


Policy

Broadly speaking, agents are designed to learn an optimal policy that drives targets toward higher reward. Here, however, we also use the term policy to refer to the component of an agent that takes model outputs (e.g., action value estimates, policy probabilities) and selects recommended next actions. For example, an epsilon-greedy policy would take action value estimates and recommend the action with the highest estimate, except in cases where an annealed random 0-1 float is less than epsilon, in which case it would select a random possible action (to encourage exploration). Alternatively, a policy can stochastically sample from a normalized probability distribution. Constraints (see action-constraints) may be utilized within a policy to limit the selected actions.
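The epsilon-greedy case reduces to a few lines. This standalone sketch shows the selection rule only, not strong_rl's Policy class (annealing of epsilon is left to the caller):

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Select an action from {action: value estimate}.

    With probability epsilon, explore (random action); otherwise exploit
    the highest estimate. `rng` is injectable for testing.
    """
    if rng.random() < epsilon:
        # Explore: pick a random possible action.
        return rng.choice(sorted(action_values))
    # Exploit: pick the action with the highest value estimate.
    return max(action_values, key=action_values.get)
```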


Models

Unless you are implementing a random or rule-based agent (both are possible and very useful in some cases), your agent will likely have one or more statistical models underlying its policy choices. For example, an agent may use a Deep Q-Network to estimate the value of possible actions.
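The model's job, whatever its form, is to map states to action-value estimates that the policy can consume. A tabular stand-in keeps this sketch dependency-free; a DQN would play the same role with a neural network, and all names here are illustrative:

```python
from collections import defaultdict

class TabularQ:
    """Illustrative value estimator: maps (hashable) states to per-action
    value estimates, updated toward observed TD targets."""

    def __init__(self, actions, lr=0.1):
        self.lr = lr
        self.q = defaultdict(lambda: {a: 0.0 for a in actions})

    def estimate(self, state):
        # The policy (e.g., epsilon-greedy) selects among these estimates.
        return dict(self.q[state])

    def update(self, state, action, td_target):
        # Nudge the estimate toward the observed target value.
        self.q[state][action] += self.lr * (td_target - self.q[state][action])
```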

Agent State

In order for an agent to learn, it and its sub-components must maintain “state” — for example, model weights, training samples, epsilon values, and other hyperparameters.

Agents maintain state by creating/updating/retrieving objects via an AgentState object. This object essentially wraps a cache, exposing the same API as a standard implementation of the Cache interface.

When agents are loaded (they are auto-loaded upon initialization, and can otherwise be loaded via .load()), the state is injected into each Agent sub-component that inherits from StatefulMixin. This mixin adds a .set_state() method to the sub-component and, when the state is injected, adds self.state to the sub-component’s attributes.

With the state injected, stateful objects can be manipulated at will:

my_list = self.state.list("list_name")

Typically, these stateful objects will be initialized in the .load() method of the sub-component, such that they can be created after the state is injected and loaded into memory.
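Putting the pieces together, the injection-then-load lifecycle looks roughly like this. StatefulMixin here is a sketch of the contract described above, and InMemoryAgentState is a hypothetical stand-in for an AgentState wrapping a LocalCache, not strong_rl's implementations:

```python
class StatefulMixin:
    """Sketch of the mixin contract: set_state() injects the shared
    state so the sub-component can use self.state afterwards."""
    def set_state(self, state):
        self.state = state

class InMemoryAgentState:
    """Hypothetical stand-in for AgentState wrapping a local cache."""
    def __init__(self):
        self._store = {}

    def list(self, name):
        # Create-or-retrieve a named list, mirroring the .list() call above.
        return self._store.setdefault(name, [])

class ReplayMemory(StatefulMixin):
    def load(self):
        # Stateful objects are created here, after the state is injected.
        self.buffer = self.state.list("replay_buffer")

memory = ReplayMemory()
memory.set_state(InMemoryAgentState())  # injection happens on agent load
memory.load()                           # now self.state is available
memory.buffer.append(("state", "action", 1.0))
```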

You likely don’t need to worry about manually saving state. In cases where you are using an in-memory store like Redis, state is automatically saved with each update. In cases where state exists in-memory in Python, as in the LocalCache, the actor saves the agent after every batch is processed by calling .save() on the agent.