Does the reward failing to converge to a stable value mean that the RL agent is not learning anything?
My results show that on every training run the agent makes different choices; it does not carry over anything useful from the previous run.
Even when a run gets a good reward, the agent does not stick with those good choices on the next run: it tries other actions and ends up with a bad result.
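To make the question concrete, here is a minimal sketch of the kind of training loop I mean (the toy chain environment, tabular Q-learning, and epsilon-greedy schedule are placeholders, not my actual setup). Even in a loop like this, the per-episode reward jumps around while the exploration rate is still high:

```python
import numpy as np

# Toy 1-D chain environment (placeholder): states 0..4, start at 0,
# reward 1 for reaching state 4. Actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.99
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995  # decaying exploration rate

for episode in range(500):
    state, done, total_reward = 0, False, 0.0
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if np.random.rand() < epsilon:
            action = np.random.randint(N_ACTIONS)
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done = step(state, action)
        # Standard Q-learning update.
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state]) - q[state, action])
        state, total_reward = next_state, total_reward + reward
    epsilon = max(eps_min, epsilon * eps_decay)  # anneal exploration over time
    if (episode + 1) % 100 == 0:
        print(f"episode {episode + 1}: reward={total_reward}, epsilon={epsilon:.2f}")
```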
How can I deal with this problem?
Thanks for any help.