“TD3 performs well, then abruptly saturates at bad actions (Simulink RL) — why?”

Hello everyone,
I'm training a controller with TD3 because I want a deterministic policy. It performs well at first, but then it suddenly gets stuck producing very poor actions with no warning. This seems odd for a greedy policy: the agent keeps replaying the same fixed trajectory even though it earns a very poor reward. I've tried different exploration settings and input normalization, but the problem keeps returning.
Has anyone seen this on the MathWorks platform? What could cause TD3 to collapse like this, and how can I prevent it? Training log below (note the sudden drop at episode 17 and the identical rewards from episode 26 onward):
Episode: 1/200 | Episode reward: -150144.31 | Episode steps: 250 | Average reward: -150144.31 | Step Count: 250 | Episode Q0: -0.06
Episode: 2/200 | Episode reward: -142781.27 | Episode steps: 250 | Average reward: -146462.79 | Step Count: 500 | Episode Q0: -0.06
Episode: 3/200 | Episode reward: -146085.77 | Episode steps: 250 | Average reward: -146337.12 | Step Count: 750 | Episode Q0: -0.07
Episode: 4/200 | Episode reward: -127403.65 | Episode steps: 250 | Average reward: -141603.75 | Step Count: 1000 | Episode Q0: -0.14
Episode: 5/200 | Episode reward: -90967.66 | Episode steps: 250 | Average reward: -131476.53 | Step Count: 1250 | Episode Q0: -0.24
Episode: 6/200 | Episode reward: -58398.87 | Episode steps: 250 | Average reward: -119296.92 | Step Count: 1500 | Episode Q0: -0.28
Episode: 7/200 | Episode reward: -35903.63 | Episode steps: 250 | Average reward: -107383.59 | Step Count: 1750 | Episode Q0: -0.28
Episode: 8/200 | Episode reward: -10701.52 | Episode steps: 250 | Average reward: -95298.34 | Step Count: 2000 | Episode Q0: -0.06
Episode: 9/200 | Episode reward: -9437.55 | Episode steps: 250 | Average reward: -85758.25 | Step Count: 2250 | Episode Q0: 0.89
Episode: 10/200 | Episode reward: -17715.87 | Episode steps: 250 | Average reward: -78954.01 | Step Count: 2500 | Episode Q0: 2.43
Episode: 11/200 | Episode reward: -34624.05 | Episode steps: 250 | Average reward: -67401.98 | Step Count: 2750 | Episode Q0: 4.24
Episode: 12/200 | Episode reward: -40353.72 | Episode steps: 250 | Average reward: -57159.23 | Step Count: 3000 | Episode Q0: 6.50
Episode: 13/200 | Episode reward: -42417.75 | Episode steps: 250 | Average reward: -46792.43 | Step Count: 3250 | Episode Q0: 7.25
Episode: 14/200 | Episode reward: -43329.38 | Episode steps: 250 | Average reward: -38385.00 | Step Count: 3500 | Episode Q0: 8.65
Episode: 15/200 | Episode reward: -21137.36 | Episode steps: 250 | Average reward: -31401.97 | Step Count: 3750 | Episode Q0: 10.46
Episode: 16/200 | Episode reward: -20629.98 | Episode steps: 250 | Average reward: -27625.08 | Step Count: 4000 | Episode Q0: 12.07
Episode: 17/200 | Episode reward: -190383.39 | Episode steps: 250 | Average reward: -43073.06 | Step Count: 4250 | Episode Q0: 15.93
Episode: 18/200 | Episode reward: -188099.18 | Episode steps: 250 | Average reward: -60812.82 | Step Count: 4500 | Episode Q0: 16.85
Episode: 19/200 | Episode reward: -188479.67 | Episode steps: 250 | Average reward: -78717.04 | Step Count: 4750 | Episode Q0: 16.95
Episode: 20/200 | Episode reward: -189525.03 | Episode steps: 250 | Average reward: -95897.95 | Step Count: 5000 | Episode Q0: 18.68
Episode: 21/200 | Episode reward: -189286.51 | Episode steps: 250 | Average reward: -111364.20 | Step Count: 5250 | Episode Q0: 18.17
Episode: 22/200 | Episode reward: -190229.29 | Episode steps: 250 | Average reward: -126351.75 | Step Count: 5500 | Episode Q0: 19.29
Episode: 23/200 | Episode reward: -188722.45 | Episode steps: 250 | Average reward: -140982.22 | Step Count: 5750 | Episode Q0: 20.18
Episode: 24/200 | Episode reward: -189155.54 | Episode steps: 250 | Average reward: -155564.84 | Step Count: 6000 | Episode Q0: 21.27
Episode: 25/200 | Episode reward: -187477.81 | Episode steps: 250 | Average reward: -172198.88 | Step Count: 6250 | Episode Q0: 21.23
Episode: 26/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -188844.49 | Step Count: 6500 | Episode Q0: 23.44
Episode: 27/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -188514.75 | Step Count: 6750 | Episode Q0: 25.41
Episode: 28/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -188413.43 | Step Count: 7000 | Episode Q0: 33.53
Episode: 29/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -188274.06 | Step Count: 7250 | Episode Q0: 43.19
Episode: 30/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -188030.16 | Step Count: 7500 | Episode Q0: 48.98
Episode: 31/200 | Episode reward: -187086.00 | Episode steps: 250 | Average reward: -187810.11 | Step Count: 7750 | Episode Q0: 69.61

Answers (1)

Satyam on 17 Oct 2025 at 6:54
This TD3 “collapse” usually happens because of critic divergence or loss of exploration. You can troubleshoot with the following steps:
  1. Critic instability: steadily growing Q0 values (here climbing from about 0 to 70 while episode rewards worsen) indicate a diverging critic. Reduce the learning rates (e.g., actor 1e-4, critic 1e-3) and keep the target smoothing factor small (τ ≈ 0.005); see the sketch after this list.
  2. Unscaled rewards/observations: episode rewards on the order of -1e5 destabilize training because the critic must fit huge targets. Normalize inputs/outputs or scale the reward (see Normalize Data in RL Agents and the reward-scaling sketch below).
  3. Exploration loss: keep non-zero Gaussian exploration noise and avoid decaying it too early; a noise floor prevents the deterministic policy from locking onto a single trajectory.
  4. Monitor critic loss and Q-values: use the Episode Manager to confirm exactly when the instability starts.
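As a concrete starting point, here is a minimal sketch of a more conservative TD3 configuration covering points 1, 3, and 4. The specific values and the sample-time variable Ts are illustrative assumptions to tune for your own model:

% Conservative optimizers with gradient clipping to tame the critic.
actorOpts  = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-3, GradientThreshold=1);

agentOpts = rlTD3AgentOptions( ...
    SampleTime=Ts, ...                  % Ts = your model's sample time
    TargetSmoothFactor=5e-3, ...        % small tau, slow target updates
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts);

% Keep a floor under the Gaussian exploration noise so it never decays to zero.
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
agentOpts.ExplorationModel.StandardDeviationMin = 0.05;

% Episode Manager plots make it easy to spot when Q0 starts to diverge.
trainOpts = rlTrainingOptions(Plots="training-progress", Verbose=true);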
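For point 2, with a Simulink environment the simplest fix is to scale the reward inside the model itself, either with a Gain block on the reward signal or in a MATLAB Function block. A hypothetical sketch, where the cost terms and the 1/1000 factor are placeholders for your own reward:

function r = scaledReward(e, u)
% Hypothetical reward in a MATLAB Function block: scale the raw cost so
% per-step rewards are O(1) instead of O(1e3).
rawCost = e^2 + 0.01*u^2;   % placeholder tracking-error and control-effort terms
r = -rawCost/1000;          % illustrative scale factor
end

Rescaling the reward shrinks the Q-targets the critic has to fit, which is often enough to stop the divergence on its own.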
I hope this fixes your issue.

Release

R2024b
