Reinforcement Learning DQN Training Convergence Problem

Hi everyone,
I am designing an energy management system for a vehicle and using DQN to optimize fuel consumption. Here are the relevant lines from my code.
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
nI = obsInfo.Dimension(1);
nL = 24;
nO = numel(actInfo.Elements);
dnn = [
featureInputLayer(nI,'Name','state','Normalization','none')
fullyConnectedLayer(nL,'Name','fc1')
reluLayer('Name','relu1')
fullyConnectedLayer(nL,'Name','fc2')
reluLayer('Name','relu2')
fullyConnectedLayer(nO,'Name','output')];
criticOpts = rlRepresentationOptions('LearnRate',0.00025,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);
agentOpts = rlDQNAgentOptions(...
'UseDoubleDQN',false, ...
'TargetUpdateMethod',"periodic", ...
'TargetUpdateFrequency',4, ...
'ExperienceBufferLength',1000, ...
'DiscountFactor',0.99, ...
'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0050;
agentObj = rlDQNAgent(critic,agentOpts)
maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainingOpts = rlTrainingOptions('MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeReward',...
'StopTrainingValue',0);
trainingStats = train(agentObj,env,trainingOpts)
The problem is that, after training, the episode rewards do not converge. Moreover, the long-term estimated cumulative reward Q0 diverges. I have already read some posts on this topic here and normalized my action and observation spaces, which did not help. I also tried adding a scaling layer right before the last fullyConnectedLayer, which did not help either. You can find my training progress curves in the attachment.
So, what else can I try so that Q0 stops diverging and the episode rewards converge?
Also, I would really like to know how Q0 is calculated. It is not possible for my model to have such large long-term estimated rewards.
Best Regards,
Gülin

Answers (1)

Darshak on 29 Apr 2025
Hello Gülin Sayal,
I understand that the rewards do not converge, and the diverging “Q₀” estimate points to the likely cause.
“Q₀” is the initial state-action value estimate computed from the target critic network at the beginning of each episode (typically at time step t=0).
Mathematically, for the initial state “s₀”, the agent computes:
Q₀ = max_a Q_target(s₀, a)
Where:
  • Q_target is the target critic network, updated periodically.
  • max_a means that the maximum Q-value over all possible actions in that state is taken.
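If you want to check this value yourself, a minimal sketch along these lines can help (obs0 is an illustrative name for the initial observation of an episode, and the critic returned by getCritic is used here as a stand-in for the target critic):
% Minimal sketch: query the critic for the initial observation (obs0 is illustrative)
criticRep = getCritic(agentObj);          % extract the critic from the agent
qValues = getValue(criticRep,{obs0});     % Q(s0,a) for every discrete action
Q0 = max(qValues)                         % greedy value, comparable to the Episode Q0 curve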
If Q₀ diverges, it indicates instability in the value estimates that might happen due to:
  • High learning rate
  • Poor network structure
  • Unnormalized input/output
  • Unstable reward scale
  • Improper target network update frequency
To resolve diverging Q₀ values and non-converging episode rewards in DQN training, you can try the steps below; a consolidated sketch combining several of them follows the list:
1. Scale rewards in the environment (e.g., divide by a constant) to keep them within a range like [-1, 1].
reward = rawReward / 100;
2. Use Double DQN to reduce Q-value overestimation.
agentOpts.UseDoubleDQN = true;   % or pass 'UseDoubleDQN',true when constructing rlDQNAgentOptions
You can refer to the following documentation to gain a further understanding of the “rlDQNAgentOptions” function: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rldqnagentoptions.html
3. Add a tanhLayer after the output to bound Q-values between -1 and 1.
dnn = [
featureInputLayer(nI,'Name','state','Normalization','none')
fullyConnectedLayer(64,'Name','fc1')
reluLayer
fullyConnectedLayer(64,'Name','fc2')
reluLayer
fullyConnectedLayer(nO,'Name','output')
tanhLayer('Name','tanhOut')];
You can refer to the following documentation for more information on the “tanhLayer” function: https://www.mathworks.com/help/releases/R2021a/deeplearning/ref/nnet.cnn.layer.tanhlayer.html
4. Reduce the critic's learning rate for stable updates.
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);
5. Make target updates less frequent to reduce instability.
agentOpts.TargetUpdateFrequency = 20;
6. Use wider or deeper hidden layers to improve approximation capability.
% Instead of two hidden layers of width nL = 24, use wider layers, for example:
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
7. Ensure exploration reduces over time:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
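For reference, a consolidated sketch that applies steps 2, 4, 5, and 7 on top of the original settings might look like the following; the specific values are starting points to tune rather than verified settings for this model, and dnn can be the wider network from steps 3 and 6:
% Consolidated sketch (illustrative values; original buffer and batch sizes kept)
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);   % step 4
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);
agentOpts = rlDQNAgentOptions(...
'UseDoubleDQN',true, ...                 % step 2
'TargetUpdateMethod',"periodic", ...
'TargetUpdateFrequency',20, ...          % step 5
'ExperienceBufferLength',1000, ...
'DiscountFactor',0.99, ...
'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;        % step 7
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
agentObj = rlDQNAgent(critic,agentOpts);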
I hope this resolves the issue.
