Train AC Agent to Balance Cart-Pole System Using Parallel Computing
This example shows how to train an actor-critic (AC) agent to balance a cart-pole system modeled in MATLAB® by using asynchronous parallel training. For an example that shows how to train the agent without using parallel training, see Train AC Agent to Balance Cart-Pole System.
Actor Parallel Training
When you use parallel computing with AC agents, each worker generates experiences from its copy of the agent and the environment. After every N steps, the worker computes gradients from the experiences and sends the computed gradients back to the client agent (the agent associated with the MATLAB® process that starts the training). The client agent updates its parameters as follows.
For asynchronous training, the client agent applies the received gradients without waiting for all workers to send gradients, and sends the updated parameters back to the worker that provided the gradients. Then, the worker continues to generate experiences from its environment using the updated parameters.
For synchronous training, the client agent waits to receive gradients from all of the workers and updates its parameters using these gradients. The client then sends updated parameters to all the workers at the same time. Then, all workers continue to generate experiences using the updated parameters.
For more information about synchronous versus asynchronous parallelization, see Train Agents Using Parallel Computing and GPUs.
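As a rough illustration of the difference, the following sketch (plain MATLAB, not Reinforcement Learning Toolbox code) applies stand-in worker gradients to a parameter vector. The names theta, alpha, and gradients are illustrative only and are not part of the toolbox API.
% Conceptual sketch only: theta stands in for the client parameters,
% gradients for gradient vectors arriving from three workers, and
% alpha for a learning rate.
theta = zeros(4,1);
alpha = 0.01;
gradients = {randn(4,1),randn(4,1),randn(4,1)};

% Asynchronous update: apply each gradient as soon as it arrives and
% return the updated parameters to that worker only.
for w = 1:numel(gradients)
    theta = theta - alpha*gradients{w};
end

% Synchronous update: wait for all gradients, apply their average, and
% broadcast the same parameters to every worker.
thetaSync = zeros(4,1) - alpha*mean(cat(2,gradients{:}),2);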
Create Cart-Pole MATLAB Environment Interface
Create a predefined environment interface for the cart-pole system. For more information on this environment, see Load Predefined Control System Environments.
env = rlPredefinedEnv("CartPole-Discrete");
env.PenaltyForFalling = -10;
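Optionally, you can check the environment interface before using it. The call below is a hedged sketch that assumes the validateEnvironment function from Reinforcement Learning Toolbox is available; it resets the environment and runs a short simulation to confirm consistency with the observation and action specifications.
% Optional check: reset the environment and run a brief simulation
% against its observation and action specifications.
validateEnvironment(env)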
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
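If you want to confirm the sizes used when building the networks below, you can display the specification properties. For this environment the observation is a 4-element vector and the discrete action set contains the two force values –10 and 10.
% Display the observation dimension and the discrete action set.
obsInfo.Dimension
actInfo.Elements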
Fix the random generator seed for reproducibility.
rng(0)
Create AC Agent
An AC agent approximates the long-term reward, given observations, using a critic value function representation. To create the critic, first create a deep neural network with one input (the observation) and one output (the state value). The input size of the critic network is 4 since the environment provides 4 observations. For more information on creating a deep neural network value function representation, see Create Policies and Value Functions.
criticNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','CriticFC')];

criticOpts = rlOptimizerOptions('LearnRate',1e-2,'GradientThreshold',1);

critic = rlValueFunction(criticNetwork,obsInfo);
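As an optional sanity check (not part of the original example), you can evaluate the untrained critic on a random observation with getValue; it should return a single scalar state value.
% Evaluate the untrained critic for a random observation.
v = getValue(critic,{rand(obsInfo.Dimension)})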
An AC agent decides which action to take, given observations, using an actor representation. To create the actor, create a deep neural network with one input (the observation) and one output (the action). The output size of the actor network is 2 since the agent can apply 2 force values to the environment, –10 and 10.
actorNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','ActorStateFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','action')];

actorOpts = rlOptimizerOptions('LearnRate',1e-2,'GradientThreshold',1);

actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo);
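Similarly, you can ask the untrained actor for an action given a random observation using getAction. The result is a cell array whose element should be one of the two allowed force values.
% Get an action from the untrained actor for a random observation.
a = getAction(actor,{rand(obsInfo.Dimension)});
a{1}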
To create the AC agent, first specify the AC agent options using rlACAgentOptions.
agentOpts = rlACAgentOptions(...
    'ActorOptimizerOptions',actorOpts,...
    'CriticOptimizerOptions',criticOpts,...
    'EntropyLossWeight',0.01);
Then create the agent using the actor, the critic, and the agent options. For more information, see rlACAgent.
agent = rlACAgent(actor,critic,agentOpts);
Parallel Training Options
To train the agent, first specify the training options. For this example, use the following options.
- Run each training for at most 1000 episodes, with each episode lasting at most 500 time steps.
- Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option).
- Stop training when the agent receives an average cumulative reward greater than 500 over 10 consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',500,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',500,...
    'ScoreAveragingWindowLength',10);
You can visualize the cart-pole system during training or simulation using the plot function.
plot(env)
To train the agent using parallel computing, specify the following training options.
- Set the UseParallel option to true.
- Train the agent in parallel asynchronously by setting the ParallelizationOptions.Mode option to "async".
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
For more information, see rlTrainingOptions.
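If no parallel pool is open when you call train, one is typically started using your default cluster profile. To control the number of workers explicitly (for example, six, to match the training run discussed below), you can open the pool yourself beforehand. This sketch assumes Parallel Computing Toolbox is available.
% Optional: open a parallel pool with six workers before training
% (requires Parallel Computing Toolbox).
if isempty(gcp('nocreate'))
    parpool(6);
end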
Train Agent
Train the agent using the train function. Training the agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to the randomness of asynchronous parallel training, you can expect your training results to differ from the training plot that follows, which shows the result of training with six workers.
doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABCartpoleParAC.mat','agent');
end
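If you ran the training yourself, you can inspect the returned statistics. The sketch below assumes the result exposes EpisodeIndex, EpisodeReward, and AverageReward fields, which you can plot to reproduce a basic learning curve.
% Optional: plot a basic learning curve from the training statistics
% (assumes EpisodeIndex, EpisodeReward, and AverageReward fields).
if doTraining
    plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
    hold on
    plot(trainingStats.EpisodeIndex,trainingStats.AverageReward)
    hold off
    xlabel("Episode")
    ylabel("Reward")
    legend("Episode reward","Average reward")
end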
Simulate AC Agent
You can visualize the cart-pole system with the plot function during simulation.
plot(env)
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
Related Examples
- Train AC Agent to Balance Cart-Pole System
- Train DQN Agent for Lane Keeping Assist Using Parallel Computing
- Train Biped Robot to Walk Using Reinforcement Learning Agents