1. NaN Values During Training
- Maybe the gradients are too large and are exploding; gradient clipping can help (see the sketch after this list).
- If the loss contains a logarithmic function, inputs near zero are problematic: log(0) is undefined, and the gradient of log(x) is 1/x, which blows up as x approaches zero. Clamping the argument away from zero helps (also shown in the sketch below).
- I had this issue during training of MADDPG; I trained the critic first and then the policy network (not at the same time), and it started working.
- Check the learning rate; sometimes you may be using the wrong one due to a mistake in the code.
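
A minimal PyTorch sketch of the first two fixes, assuming a generic model and a binary cross-entropy-style loss (the network, data, and `eps` value are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

# Hypothetical tiny network and batch, just for illustration.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.rand(32, 1)

pred = torch.sigmoid(model(x))

# Guard the logarithm: clamp its argument away from zero so that
# log(0) = -inf never appears in the loss.
eps = 1e-8
loss = -(y * torch.log(pred.clamp(min=eps))
         + (1 - y) * torch.log((1 - pred).clamp(min=eps))).mean()

optimizer.zero_grad()
loss.backward()

# Guard against exploding gradients: rescale the global gradient
# norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```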
2. Policy does not get better
- Check the exploration strategy; maybe you are not exploring properly. For example, for continuous action spaces, you can explore by adding noise to the policy net output (like in the DDPG paper). A minimal sketch follows.
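
A sketch of that exploration idea. Note that the DDPG paper uses temporally correlated Ornstein-Uhlenbeck noise; plain Gaussian noise is a common, simpler substitute, and the function name and parameters here (`noisy_action`, `noise_std`, `low`, `high`) are hypothetical:

```python
import numpy as np

def noisy_action(policy_action, noise_std=0.1, low=-1.0, high=1.0):
    """Add Gaussian noise to a deterministic policy output for exploration,
    then clip back into the valid action range."""
    noise = np.random.normal(0.0, noise_std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, low, high)

# Example: a hypothetical 2-D continuous action from the policy network.
action = noisy_action(np.array([0.3, -0.7]), noise_std=0.2)
```

Decaying `noise_std` over the course of training is a common way to shift from exploration to exploitation.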