Why is my DDPG agent converging to a state where it gets continuous penalization, while a state with 0 penalization is reachable?

I am training a Reinforcement Learning DDPG agent to drive a vehicle to a reference.
The vehicle dynamics are:
  • x_dot = v*cos(psi);
  • y_dot = v*sin(psi);
  • psi_dot = w;
  • v_dot = a;
With observations obs = [e_x, e_y, e_psi, e_v] and actions u = [w (psi_dot); a (v_dot)], my DDPG agent fails to reach the reference with 0 error.
Reward at each step: rwd = -(x^T*Q*x + u^T*R*u) (I used the same form as the LQR cost function to have a comparison).
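A minimal sketch of one environment step with these dynamics and this reward, assuming forward Euler integration with a sample time dt (the function name stepVehicle and the variables state, ref, Q, R are placeholders for my actual setup, and x in the reward is taken to be the error vector):

function [obs, rwd, state] = stepVehicle(state, u, ref, dt, Q, R)
    % state = [x; y; psi; v], u = [w; a] (yaw rate and acceleration commands)
    x = state(1); y = state(2); psi = state(3); v = state(4);
    w = u(1); a = u(2);
    % Forward Euler integration of the kinematic model
    x   = x   + dt*v*cos(psi);
    y   = y   + dt*v*sin(psi);
    psi = psi + dt*w;
    v   = v   + dt*a;
    state = [x; y; psi; v];
    obs = ref - state;                 % [e_x; e_y; e_psi; e_v]
    rwd = -(obs.'*Q*obs + u.'*R*u);    % negative LQR-style quadratic cost
end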
No matter how I tune the hyper-parameters or make my actor and critic networks more or less complex, the gap is always there.
For a reason I can't fully explain, it occurred to me to remove all biases from the neurons of my networks, building actor and critic networks that only have weights, and that actually solved the problem. All the trainings, with different hyperparameters, drove the error to 0.
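For reference, this is roughly how the biases can be removed, assuming the networks are built from Deep Learning Toolbox layers (the layer sizes and names below are placeholders; freezing the bias learn rate and initializing the biases to zero effectively removes the bias terms):

% Helper that builds a fully connected layer whose bias stays fixed at zero
noBiasFC = @(n, name) fullyConnectedLayer(n, 'Name', name, ...
    'BiasInitializer', 'zeros', 'BiasLearnRateFactor', 0);

actorLayers = [
    featureInputLayer(4, 'Name', 'obs')   % [e_x e_y e_psi e_v]
    noBiasFC(64, 'fc1')
    reluLayer('Name', 'relu1')
    noBiasFC(64, 'fc2')
    reluLayer('Name', 'relu2')
    noBiasFC(2, 'fc_out')                 % [w a]
    tanhLayer('Name', 'tanh_out')];

The critic is built the same way, using the noBiasFC helper for its hidden and output layers.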
What I wanted to ask is:
1 - Why does removing the biases solve the problem?
2 - Despite driving the errors to 0, removing the bias terms resulted in degraded performance compared to an agent with bias terms that I luckily got by stopping the training at a moment when the weights happened to drive the error to 0 (if I had let the training run for 10 more episodes the gap would have appeared again, which is why I can't use that agent; there is no consistency). How can I get an agent with bias terms to drive the error to 0?
I would really appreciate it if anyone could answer me, because I can't seem to find an explanation for this.

Answers (1)

Emmanouil Tzorakoleftherakis
My guess is that this happens due to the specifics of the problem. You want to build a controller that generates 'zeros' when the error inputs are zero. Removing the biases happens to make this much easier, assuming your actor is a feedforward net: think of Y = W*X + B. If X is close to zero, Y will be close to zero even if W is not perfectly optimized; B, however, shifts the whole signal.
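A small numeric illustration of this point (the numbers are arbitrary):

W = randn(2, 4);      % actor weights, not perfectly tuned
B = [0.3; -0.2];      % bias terms
x = zeros(4, 1);      % zero tracking error at the reference
yNoBias = W*x         % = [0; 0], zero action without any extra tuning
yBias   = W*x + B     % = B, a nonzero action unless B is trained to zero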
By the way, your reference here is constant; it would be much harder to achieve the same result with a time-varying reference. In general, it is much harder to consistently achieve zero tracking error with RL than with a more traditional controller, because you would need to do a lot of training on 'low-error' inputs.
