Train DDPG Agent for PMSM Control

This example demonstrates speed control of a permanent magnet synchronous motor (PMSM) using a deep deterministic policy gradient (DDPG) agent.

The goal of this example is to show that you can use reinforcement learning as an alternative to linear controllers, such as PID controllers, in speed control of PMSM systems. Outside their regions of linearity, linear controllers often do not produce good tracking performance. In such cases, reinforcement learning provides a nonlinear control alternative.

Load the parameters for this example.

sim_data

Open the Simulink model.

mdl = 'mcb_pmsm_foc_sim_RL';
open_system(mdl)

In a linear control version of this example, you can use PI controllers in both the speed and current control loops. An outer-loop PI controller can control the speed while two inner-loop PI controllers can control the d-axis and q-axis currents. The overall goal is to track the reference speeds in the Speed_Ref signal. This example uses a reinforcement learning agent to control the currents in the inner control loop while a PI controller controls the outer loop.
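For comparison, the following minimal sketch shows the form of a discrete PI update of the kind used in such inner current loops. The gains, sample time, and signal values are placeholders for illustration only; the model implements its PI controllers with Simulink blocks.

% Minimal discrete PI update for one current-control step (illustration
% only; gains and signals are placeholders, not values from the model).
Kp = 1.0;                 % proportional gain (placeholder)
Ki = 200;                 % integral gain (placeholder)
Ts = 2e-4;                % sample time, in seconds
iq_ref   = 1.0;           % q-axis reference current (placeholder)
iq_meas  = 0.8;           % q-axis measured current (placeholder)
intState = 0;             % integrator state carried between steps

err = iq_ref - iq_meas;               % current error
intState = intState + Ki*err*Ts;      % integrate the error
vq = Kp*err + intState                % PI voltage command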

In this simulation, the reference speeds are represented by the speed_ref_rpm signal, which is the Speed_Ref signal expressed in rpm units.
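As a quick check of the units, the rpm values quoted in this example correspond to per-unit values scaled by a base speed. The base speed below is inferred from the 0.3 pu = 1119 rpm pairing used later in this example, so treat it as an assumption rather than a model parameter.

% Per-unit to rpm conversion (base speed inferred from 0.3 pu = 1119 rpm).
base_speed_rpm = 1119/0.3;                    % approximately 3730 rpm (assumption)
speed_ref_pu   = [0.3 0.4 0.5];               % per-unit reference speeds
speed_ref_rpm  = speed_ref_pu*base_speed_rpm  % 1119, 1492, 1865 rpm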

Create Environment Interface

The environment in this example consists of the PMSM system, excluding the inner-loop current controller, whose role is taken over by the reinforcement learning agent. To view the interface between the reinforcement learning agent and the environment, open the Closed Loop Control subsystem.

open_system('mcb_pmsm_foc_sim_RL/Current Control/Control_System/Closed Loop Control')

The Reinforcement Learning subsystem contains the RL Agent block, the construction of the observation vector, and the reward calculation.

For this environment:

  • The observations are the d-axis and q-axis current errors iderror and iqerror, and their time derivatives and integrals.

  • The actions from the agent are the voltages vd_rl and vq_rl.

  • The sample time of the agent is 2e-4 seconds. The inner-loop control occurs at a different sample time than the outer loop.

  • The reward at each time step is:

r_t = -(Q_1 \, iderror^2 + Q_2 \, iqerror^2 + R \sum_j u_{j,t-1}^2)

Here, Q_1 = Q_2 = 0.1 and R = 0.05 are constants, iderror is the d-axis current error, iqerror is the q-axis current error, and u_{j,t-1} are the actions from the previous time step.
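
As a concrete illustration, the following sketch evaluates this reward for one time step. It is not the implementation inside the Reinforcement Learning subsystem (the model computes the reward with Simulink blocks), and the example error and action values are placeholders.

% Illustrative reward computation for one time step.
Q1 = 0.1;  Q2 = 0.1;  R = 0.05;
iderror = 0.02;            % example d-axis current error (placeholder)
iqerror = -0.05;           % example q-axis current error (placeholder)
uPrev   = [0.4; -0.1];     % actions (vd_rl, vq_rl) from the previous time step (placeholder)
reward  = -(Q1*iderror^2 + Q2*iqerror^2 + R*sum(uPrev.^2))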

Create the observation and action specifications for the environment. For information on creating continuous specifications, see rlNumericSpec.

% Create observation specifications.
numObservations = 6;
observationInfo = rlNumericSpec([numObservations 1]);
observationInfo.Name = 'observations';
observationInfo.Description = 'Information on error and reference signal';

% Create action specifications.
numActions = 2;
actionInfo = rlNumericSpec([numActions 1]); 
actionInfo.Name = 'vqdRef';

Create the Simulink environment interface using the observation and action specifications. For more information on Simulink environments, see rlSimulinkEnv.

agentblk = 'mcb_pmsm_foc_sim_RL/Current Control/Control_System/Closed Loop Control/Reinforcement Learning/RL Agent';
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);

Provide a reset function for this environment using the ResetFcn parameter. At the beginning of each training episode, the resetPMSM function randomly sets the final value of the step reference speed in the SpeedRef block to 1119 rpm (0.3 pu), 1492 rpm (0.4 pu), or 1865 rpm (0.5 pu).

env.ResetFcn = @resetPMSM;
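
The resetPMSM function is provided with the example files. A minimal sketch of what such a reset function could look like is shown below; it assumes the reference step is set through a block parameter, and the block path and parameter name are placeholders, not necessarily those used by the shipped function.

function in = resetPMSM(in)
% Illustrative reset function: randomly pick the final reference speed
% (in per-unit) for each training episode. The block path and parameter
% name are placeholders; the shipped resetPMSM may differ.
    speeds = [0.3 0.4 0.5];                        % 1119, 1492, 1865 rpm
    finalSpeed = speeds(randi(numel(speeds)));
    in = setBlockParameter(in, ...
        'mcb_pmsm_foc_sim_RL/SpeedRef', ...        % placeholder block path
        'After', num2str(finalSpeed));             % final value of the step
end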

Create Agent

The agent used in this example is a deep deterministic policy gradient (DDPG) agent. A DDPG agent approximates the long-term reward given the observations and actions using a critic value function. For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents.
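
As background (this is the general DDPG update rule, not something specific to this example), the critic Q(s,a) is trained by regressing toward the one-step bootstrapped target

y_t = r_t + \gamma \, Q'\big(s_{t+1}, \mu'(s_{t+1})\big)

where Q' and \mu' are time-delayed target copies of the critic and actor, and \gamma is the discount factor (0.9995 in this example).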

To create the critic, first create a deep neural network with two inputs (the observation and action) and one output. For more information on creating a neural network value function representation, see Create Policy and Value Function Representations.

rng(0)  % fix the random seed

statePath = [featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(64,'Name','fc1')];
actionPath = [featureInputLayer(numActions, 'Normalization', 'none', 'Name','Action')
    fullyConnectedLayer(64, 'Name','fc2')];
commonPath = [additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(32, 'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(16, 'Name','fc4')
    fullyConnectedLayer(1, 'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'fc1','add/in1');
criticNetwork = connectLayers(criticNetwork,'fc2','add/in2');

Create the critic representation using the specified neural network and options. You must also specify the action and observation specification for the critic. For more information, see rlQValueRepresentation.

criticOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,observationInfo,actionInfo,...
    'Observation',{'State'},'Action',{'Action'},criticOptions);

A DDPG agent decides which action to take given the observations using an actor representation. To create the actor, first create a deep neural network with one input (the observation) and one output (the action). Construct the actor in a similar manner to the critic. For more information, see rlDeterministicActorRepresentation.

actorNetwork = [featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(64, 'Name','actorFC1')
    tanhLayer('Name','tanh1')
    fullyConnectedLayer(32, 'Name','actorFC2')
    tanhLayer('Name','tanh2')
    fullyConnectedLayer(numActions,'Name','Action')
    tanhLayer('Name','tanh3')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,observationInfo,actionInfo,...
    'Observation',{'State'},'Action',{'tanh3'},actorOptions);

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions. The agent trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 512. The discount factor of 0.9995 favors long-term rewards. DDPG agents maintain time-delayed copies of the actor and critic networks known as targets. The target networks are updated every 20 agent steps during training.

Ts_agent = 2e-04;
agentOptions = rlDDPGAgentOptions("SampleTime",Ts_agent, ...
    "DiscountFactor", .9995, ...
    "ExperienceBufferLength",1e6, ...
    "MiniBatchSize",512, ...
    "TargetUpdateFrequency",20);

During training, the agent explores the action space using an Ornstein-Uhlenbeck (OU) noise model. Set the noise options using the NoiseOptions property. The noise variance decays at the rate of 1e-6, which favors exploration toward the beginning of training and exploitation in later stages.

agentOptions.NoiseOptions.MeanAttractionConstant = 0.5;
agentOptions.NoiseOptions.Variance = 0.15;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
agentOptions.NoiseOptions.VarianceMin = 0.01;
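
To get a feel for how quickly exploration decays under these settings, the following sketch approximates the variance schedule, assuming the variance is multiplied by (1 - VarianceDecayRate) at each agent sample time until it reaches VarianceMin. This is an approximation for plotting only, not a toolbox API call.

% Approximate the exploration variance over the course of training,
% assuming a multiplicative decay of (1 - VarianceDecayRate) per agent step.
decayRate = 1e-6;
v0        = 0.15;
vMin      = 0.01;
maxAgentSteps = 2000*ceil(1/2e-4);       % episodes x steps per 1 s episode
k = linspace(0,maxAgentSteps,1000)';
variance = max(vMin, v0*(1 - decayRate).^k);
plot(k,variance)
xlabel('Agent steps'), ylabel('Noise variance')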

Create the agent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Train Agent

To train the agent, first specify the training options using rlTrainingOptions. For this example, use the following options:

  • Run the training for at most 2000 episodes, with each episode lasting at most ceil(T/Ts_agent) time steps.

  • Stop training when the agent receives an average cumulative reward greater than -80 over 100 consecutive episodes. At this point, the agent can track the reference speeds.

T = 1;
maxepisodes = 2000;
maxsteps = ceil(T/Ts_agent); 
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes, ...
    'MaxStepsPerEpisode',maxsteps, ...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-80,... 
    'ScoreAveragingWindowLength',100);

Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

During training, disable the test reference signals by setting testReferences to 0, so that the agent trains on the SpeedRef reference signal only.

doTraining = false;
if doTraining
    % Turn off test references during training.
    testReferences = 0;
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts);
else
    % Load the pretrained agent for the example.
    load('rlPMSMAgent.mat')       
end

A snapshot of training progress is shown in the following figure. You can expect different results due to randomness in the training process.

Simulate Agent

To validate the performance of the trained agent, simulate the model. To simulate the model with different reference speeds, adjust the final values of the step references under the mcb_pmsm_foc_sim_RL/TestRef subsystem.

For this simulation, enable the test reference step signals by setting testReferences to 1.

testReferences = 1;
sim(mdl);
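
If you want to examine the tracking response programmatically rather than in the Simulink scopes, a sketch like the following can plot logged speed signals. The logged signal names here are assumptions; check the logging configuration of the model and adjust them accordingly.

% Illustrative post-processing sketch; the signal names are assumptions.
out  = sim(mdl);
logs = out.logsout;
refSig  = getElement(logs,'Speed_Ref');    % assumed logged signal name
measSig = getElement(logs,'Speed_fb');     % assumed logged signal name
plot(refSig.Values.Time,  refSig.Values.Data, ...
     measSig.Values.Time, measSig.Values.Data)
xlabel('Time (s)'), ylabel('Speed'), legend('Reference','Measured')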

In this simulation, the reference speed steps through values of 1119 rpm (0.3 pu), 3357 rpm (0.9 pu), and 2611 rpm (0.7 pu), with step times of 0.1, 2, and 4 seconds, respectively. The PI and reinforcement learning controllers track the reference signal changes within 0.5 to 0.8 seconds.

The reinforcement learning controller was trained to track the reference speed of 1119 rpm (0.3 pu) but is able to generalize well across other speeds.