
Cumulative reward in RL Agent block is 100 times bigger than it should be

I am working on a reinforcement learning project with an RL Agent block in Simulink. I am using a DDPG agent.
  1. I noticed that the cumulative reward of every episode (displayed in the verbose output, in the training progress plot, and at the output of the RL Agent block) is exactly 100 times bigger than the cumulative reward I can produce by integrating the reward signal (the sketch at the end of this question shows how I compare the two).
  2. Furthermore, the Q0 value is converging (in my specific problem) to a value that is 2e-2 times the cumulative reward.
I did not find anything about this in the documentation.
I wonder if this is normal, if it affects the training, or if it depends on the agent options.
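
For reference, this is roughly how I compare the two quantities. The model, signal, and variable names below are just placeholders for my setup, and the logged reward is assumed to be saved as a timeseries called rewardSignal:

% Run the model and read the logged reward signal (placeholder names)
simOut = sim("myRLModel");
r = simOut.rewardSignal.Data;        % reward samples
t = simOut.rewardSignal.Time;        % corresponding time stamps

% Cumulative reward obtained by integrating the reward signal over time
integratedReward = trapz(t, r);

% Cumulative reward reported by training, e.g. from
% trainingStats = train(agent, env, trainOpts);
reportedReward = trainingStats.EpisodeReward(end);

fprintf("integrated: %g   reported: %g   ratio: %g\n", ...
    integratedReward, reportedReward, reportedReward/integratedReward);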

Accepted Answer

Yatharth on 4 Sep 2023
Hi Enrico,
I understand that you observed that the cumulative reward displayed in the verbose output is 100 times larger than the value you obtain by integrating the reward signal.
The difference in scale between the cumulative reward shown in the verbose output, the performance plot, and the RL Agent block output and the cumulative reward obtained by integrating the reward signal is likely due to a scaling factor applied by the DDPG agent.
DDPG agents often use a scaling factor to normalize the rewards during training. This scaling factor is applied to ensure stable learning and to prevent the agent from being overly sensitive to the magnitude of the rewards. As a result, the displayed cumulative reward may be scaled up or down compared to the raw rewards.
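As a generic illustration (not specific to how the Reinforcement Learning Toolbox implements DDPG), reward scaling usually means either multiplying the raw reward by a constant or normalizing it by a running estimate of its magnitude before it is used for learning:

% Illustrative reward scaling on a made-up reward trace (example values only)
rawRewards  = [5 10 -2 8 3];                 % raw per-step rewards
rewardScale = 0.01;                          % constant scale factor
scaledRewards = rewardScale * rawRewards;    % constant rescaling

% Alternative: normalize by a running estimate of the reward magnitude
runningScale = 1;                            % initial scale estimate
alpha = 0.99;                                % smoothing factor
normalizedRewards = zeros(size(rawRewards));
for k = 1:numel(rawRewards)
    runningScale = alpha*runningScale + (1 - alpha)*abs(rawRewards(k));
    normalizedRewards(k) = rawRewards(k) / max(runningScale, eps);
end

Either way, the learner sees rescaled rewards while the reward signal in the model keeps its original magnitude.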
Regarding the convergence of the Q0 value: it is expected that Q0 differs from the cumulative reward. Q0 is the critic's estimate of the discounted long-term reward starting from the initial observation of the episode, while the cumulative reward is the (undiscounted) sum of the rewards obtained during the episode, so the two values can have different scales and interpretations.
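As a quick sketch with made-up numbers, this is the difference between the undiscounted cumulative reward of an episode and the discounted return that Q0 tries to estimate:

gamma   = 0.99;                 % discount factor (example value)
rewards = ones(1, 500);         % made-up per-step rewards of one episode

cumulativeReward = sum(rewards);                                  % episode reward: 500
discountedReturn = sum(gamma.^(0:numel(rewards)-1) .* rewards);   % what Q0 estimates: ~99.3

fprintf("cumulative: %g   discounted return: %g\n", ...
    cumulativeReward, discountedReturn);

For long episodes and a discount factor below 1, the discounted return can be much smaller than the raw cumulative reward, which is one reason the two quantities should not be compared directly.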
In general, the scaling of the cumulative reward and the convergence of the Q0 value should not significantly affect the training process as long as the agent is able to learn and improve its policy based on the rewards and the Q-values. It is important to focus on the relative improvement of the agent's performance over time rather than the absolute values of the rewards or Q-values.
I am attaching a few links that provide insight into the implementation details of DDPG and related algorithms, including the use of reward scaling to stabilize training. While they may not specifically mention the scaling factor you observed, they discuss the general concept of scaling rewards in RL training to ensure stability and prevent sensitivity to reward magnitudes.
  1. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... & Wierstra, D. (2016). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Available at: https://arxiv.org/abs/1509.02971
  2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Available at: https://arxiv.org/abs/1707.06347
  3. OpenAI Spinning Up in Deep RL: Reward Scaling. Available at: https://spinningup.openai.com/en/latest/algorithms/ddpg.html#reward-scaling
I hope this helps.

More Answers (0)

Release

R2023a
