Train a reinforcement learning agent within a specified environment
trainStats = train(agent,env,trainOpts) trains a reinforcement learning agent with a specified environment. After each training episode, train updates the parameters of agent to maximize the expected long-term reward of the environment. When training terminates, the agent reflects the state of training at termination. Use the training options trainOpts to specify training parameters such as the criteria for termination of training, when to save agents, the maximum number of episodes to train, and the maximum number of steps per episode.
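For reference, a minimal call might look like the following sketch (it assumes agent and env already exist in the workspace; the option value is illustrative):

trainOpts = rlTrainingOptions("MaxEpisodes",500);  % configure training options
trainStats = train(agent,env,trainOpts);           % train and return episode data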
Configure the training parameters and train a reinforcement learning agent. Typically, before training, you must configure your environment and agent. For this example, load an environment and agent that are already configured. The environment is a discrete cart-pole environment created with
rlPredefinedEnv. The agent is a Policy Gradient (
rlPGAgent) agent. For more information about the environment and agent used in this example, see Train PG Agent to Balance Cart-Pole System.
rng(0) % for reproducibility
load RLTrainExample.mat
env
env = 
  CartPoleDiscreteAction with properties:

                  Gravity: 9.8000
                 MassCart: 1
                 MassPole: 0.1000
                   Length: 0.5000
                 MaxForce: 10
                       Ts: 0.0200
    ThetaThresholdRadians: 0.2094
               XThreshold: 2.4000
      RewardForNotFalling: 1
        PenaltyForFalling: -5
                    State: [4×1 double]
agent = 
  rlPGAgent with properties:

    AgentOptions: [1×1 rl.option.rlPGAgentOptions]
To train this agent, you must first specify training parameters using
rlTrainingOptions. These parameters include the maximum number of episodes to train, the maximum steps per episode, and the conditions for terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode. Instruct the training to stop when the average reward over the previous five episodes reaches 500. Create a default options set and use dot notation to change some of the parameter values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;
During training, the train command can save candidate agents that give good results. Further configure the training options to save an agent when the episode reward exceeds 500. Save the agent to a folder called savedAgents.
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";
Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so you can observe the training progress visually.
trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";
You are now ready to train the PG agent. For the predefined cart-pole environment used in this example, you can use
plot to generate a visualization of the cart-pole system.
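For example:

plot(env)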
When you run this example, both this visualization and the Reinforcement Learning Episode Manager update with each training episode. Place them side by side on your screen to observe the progress, and train the agent. (This computation can take 20 minutes or more.)
trainingInfo = train(agent,env,trainOpts);
The Episode Manager shows that the training successfully reaches the termination condition of a reward of 500 averaged over the previous five episodes. At each training episode, train updates agent with the parameters learned in the previous episode. When training terminates, you can simulate the environment with the trained agent to evaluate its performance. The environment plot updates during simulation as it did during training.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
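As a quick check of the result, you can, for instance, total the simulated reward (this assumes experience.Reward is a timeseries, as sim returns for this environment):

totalReward = sum(experience.Reward.Data)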
During training, train saves to disk any agents that meet the condition specified with trainOpts.SaveAgentCriteria and trainOpts.SaveAgentValue. To test the performance of any of those agents, you can load the data from the data files in the folder you specified using trainOpts.SaveAgentDirectory, and simulate the environment with that agent.
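A minimal sketch of that workflow follows; the MAT-file name and the saved_agent variable name are assumptions, since the actual file names depend on the episodes at which agents were saved:

load("savedAgents/Agent500.mat","saved_agent")   % hypothetical saved-agent file
experience = sim(env,saved_agent,simOptions);    % evaluate the loaded agent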
agent— Agent to train

Agent to train, specified as a reinforcement learning agent object, such as an
rlDDPGAgent object, or a custom agent. Before training, you
must configure the actor and critic representations of the agent. For more
information about how to create and configure agents for reinforcement
learning, see Reinforcement Learning Agents.
env— Environment in which the agent acts

Environment in which the agent acts, specified as a reinforcement learning environment object, such as an environment created using rlPredefinedEnv. For more information about creating and configuring environments, see the documentation on creating MATLAB and Simulink environments for reinforcement learning. If env is a Simulink environment, calling train compiles and simulates the model associated with the environment.
trainOpts— Training parameters and options

Training parameters and options, specified as an rlTrainingOptions object. Use this argument to specify parameters such as the criteria for termination of training, when to save agents, the maximum number of episodes to train, and the maximum number of steps per episode.
trainStats— Training episode data
Training episode data, returned as a structure containing the following fields.
EpisodeIndex— Episode numbers
Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode, as in the following sketch.
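For example, assuming a trainStats structure returned by train:

plot(trainStats.EpisodeIndex,trainStats.EpisodeReward)
xlabel("Episode")
ylabel("Episode Reward")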
EpisodeReward— Reward for each episode
Reward for each episode, returned in a column vector of length
N. Each entry contains the reward for the corresponding episode.
EpisodeSteps— Number of steps in each episode
Number of steps in each episode, returned in a column vector of length
N. Each entry contains the number
of steps in the corresponding episode.
AverageReward— Average reward over the averaging window
Average reward over the averaging window specified in
trainOpts, returned as a column vector of length
N. Each entry contains the average
reward computed at the end of the corresponding episode.
TotalAgentSteps— Total number of steps
Total number of agent steps in training, returned as a column
vector of length
N. Each entry contains the
cumulative sum of the entries in
EpisodeSteps up to that point.
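As a sketch of this relationship (assuming a trainStats structure from a completed run):

isequal(trainStats.TotalAgentSteps,cumsum(trainStats.EpisodeSteps))   % returns true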
EpisodeQ0— Critic estimate of long-term reward for each episode
Critic estimate of long-term reward using the current agent
and the environment initial conditions, returned as a column
vector of length
N. Each entry is the critic
estimate (Q0) for the
agent of the corresponding episode. This field is present only
for agents that have critics, such as
rlDDPGAgent.
SimulationInfo— Information collected during simulation
Information collected during the simulations performed for training, returned as:
For training in MATLAB environments, a structure containing the
field SimulationError. This field is
a column vector with one entry per episode. When the
StopOnError option of
rlTrainingOptions is "off", each entry contains any
errors that occurred during the corresponding episode.
For training in Simulink environments, a vector of
Simulink.SimulationOutput objects containing simulation data recorded during the
corresponding episode. Recorded data for an episode
includes any signals and states that the model is
configured to log, simulation metadata, and any errors
that occurred during the corresponding episode.
train updates the agent as training progresses. To
preserve the original agent parameters for later use, save the agent to a
MAT-file.
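For example (the file name is illustrative):

save("preTrainedAgent.mat","agent")   % snapshot the agent before calling train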
By default, calling
train opens the Reinforcement
Learning Episode Manager, which lets you visualize the progress of the training.
The Episode Manager plot shows the reward for each episode, a running average
reward value, and the critic estimate
Q0 (for agents that have critics).
The Episode Manager also displays various episode and training statistics. To
turn off the Reinforcement Learning Episode Manager, set the
Plots option of
rlTrainingOptions to "none".
If you use a predefined environment for which there is a visualization, you
can use plot(env) to visualize the environment. If you call
plot(env) before training, then the visualization updates
during training to allow you to visualize the progress of each episode. (For
custom environments, you must implement your own plot function.)
Training terminates when the conditions specified in
trainOpts are satisfied. To terminate training in
progress, in the Reinforcement Learning Episode Manager, click Stop
Training.
Because train updates the agent at
each episode, you can resume training by calling
train(agent,env,trainOpts) again, without losing the
trained parameters learned during the first call to
train.
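For instance (the output variable name is illustrative):

trainStatsResumed = train(agent,env,trainOpts);   % continues from the current agent parameters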
During training, you can save candidate agents that meet conditions you
specify with trainOpts. For instance, you can save any
agent whose episode reward exceeds a certain value, even if the overall
condition for terminating training is not yet satisfied.
train stores saved agents in a MAT-file in the folder
you specify with
trainOpts. Saved agents can be useful, for
instance, to allow you to test candidate agents generated during a long-running
training process. For details about saving criteria and saving location, see
rlTrainingOptions.
train performs the following iterative steps, summarized in the sketch after the list:
For each episode:
Reset the environment.
Get the initial observation s0 from the environment.
Compute the initial action a0 = μ(s0), where μ(s) is the current policy.
Set the current action to the initial action (a←a0) and set the current observation to the initial observation (s←s0).
While the episode is not finished or terminated:
Step the environment with action a to obtain the next observation s' and the reward r.
Learn from the experience set (s,a,r,s').
Compute the next action a' = μ(s').
Update the current action with the next action (a←a') and update the current observation with the next observation (s←s').
Break if the episode termination conditions defined in the environment are met.
If the training termination condition defined by
trainOpts is met, terminate training. Otherwise, begin
the next episode.
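The following MATLAB sketch mirrors these steps for a MATLAB environment. It is schematic rather than the toolbox implementation: reset, step, and getAction are toolbox functions, but the learning update is internal to the agent and appears here only as a comment.

for episode = 1:trainOpts.MaxEpisodes
    obs = reset(env);                              % reset environment, get initial observation s0
    action = getAction(agent,{obs});               % initial action a0 = mu(s0)
    for stepCount = 1:trainOpts.MaxStepsPerEpisode
        [obs,reward,isDone] = step(env,action{1}); % apply a, observe s' and reward r
        % (internal) the agent learns from the experience (s,a,r,s')
        action = getAction(agent,{obs});           % next action a' = mu(s')
        if isDone
            break                                  % environment episode termination
        end
    end
    % train then checks the termination criteria specified in trainOpts
end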
The specifics of how
train performs these computations depend on
your configuration of the agent and environment. For instance, resetting the environment
at the start of each episode can include randomizing initial state values, if you
configure your environment to do so.
To train in parallel, set the UseParallel and
ParallelizationOptions options in the option set
trainOpts. For more information, see rlTrainingOptions.
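For example, a minimal sketch (it assumes Parallel Computing Toolbox is available; the option names are from rlTrainingOptions):

trainOpts.UseParallel = true;                    % distribute episode simulations across workers
trainOpts.ParallelizationOptions.Mode = "sync";  % workers send data synchronously
trainingStats = train(agent,env,trainOpts);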