Add an entropy term to the loss to encourage exploration (sketch below)
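A minimal sketch of what this could look like, assuming a PyTorch setup with a categorical policy; the loss structure and the `ENTROPY_COEF` value are illustrative, not the actual code:

```python
import torch
from torch.distributions import Categorical

ENTROPY_COEF = 0.01  # illustrative value, would need tuning

def actor_loss(logits, actions, advantages):
    """Policy-gradient loss with an entropy bonus that discourages
    premature collapse to a deterministic policy.
    advantages is assumed to be detached from the graph."""
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    pg_loss = -(log_probs * advantages).mean()
    entropy = dist.entropy().mean()
    # Subtracting the entropy term rewards keeping the policy stochastic.
    return pg_loss - ENTROPY_COEF * entropy
```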
GAE (Generalized Advantage Estimation)
Distributional RL (predict a return distribution instead of a point value; sketch below)
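A rough sketch of the distributional idea, assuming a C51-style categorical distribution over a fixed support; the atom count and value range are illustrative:

```python
import torch
import torch.nn as nn

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0  # illustrative support

class DistributionalValueHead(nn.Module):
    """Predicts a categorical distribution over returns instead of a
    single scalar value; its expectation recovers a point estimate."""
    def __init__(self, in_features):
        super().__init__()
        self.logits = nn.Linear(in_features, N_ATOMS)
        self.register_buffer("atoms", torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, features):
        probs = torch.softmax(self.logits(features), dim=-1)
        expected_value = (probs * self.atoms).sum(dim=-1)
        return probs, expected_value
```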
Other environments
Bigger -> slower nets
The exploration noise causes NaN gradients, and therefore NaN outputs
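One common cause of this, assuming Gaussian exploration noise, is the standard deviation collapsing toward zero, which sends log-probabilities to -inf and the gradients to NaN. The guard below is a sketch of the usual mitigation, not a confirmed diagnosis of this particular bug; the clamp bounds are illustrative:

```python
import torch
from torch.distributions import Normal

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # illustrative bounds

def safe_gaussian_log_prob(mean, log_std, action):
    """Clamp the log-std so the std can never reach 0, which would make
    log_prob -> -inf and poison the backward pass with NaNs."""
    log_std = log_std.clamp(LOG_STD_MIN, LOG_STD_MAX)
    dist = Normal(mean, log_std.exp())
    return dist.log_prob(action).sum(dim=-1)
```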
Need experience replay: the agent is clearly forgetting what it learned from past experience (sketch below).
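A minimal replay-buffer sketch (uniform sampling, fixed FIFO capacity); the class name and capacity are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer; sampling uniformly from it mixes old and
    new transitions, so past experience is not immediately overwritten."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

One caveat: replayed transitions are off-policy, so a plain policy-gradient update would need importance weighting (or an off-policy learner) to use them correctly.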
Use the OpenAI examples as a reference
Combined the two nets into one -> works -> seems to learn a bit slower (sketch below)
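Assuming the two nets are the actor and the critic, the combined version is presumably a shared trunk with two heads, roughly like this (layer sizes illustrative):

```python
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One shared feature trunk with two small heads, replacing two
    fully separate networks."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor
        self.value_head = nn.Linear(hidden, 1)           # critic

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)
```

A shared trunk makes the policy and value gradients compete for the same features, which could account for the slightly slower learning.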
Tuned hyperparameters, specifically rollout length, number of updates, and batch size
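The knobs in question, as a config sketch; the values below are placeholders, not the actual tuned settings:

```python
from dataclasses import dataclass

@dataclass
class Hyperparams:
    rollout_length: int = 128  # steps collected per rollout (placeholder)
    num_updates: int = 4       # gradient updates per rollout (placeholder)
    batch_size: int = 64       # minibatch size per update (placeholder)
```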
Next step -> try GAE (sketch below)
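A sketch of the GAE computation from Schulman et al. (2015): the TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is accumulated backwards with decay gamma * lambda. Variable names are illustrative:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    rewards, values, dones: arrays of length T; last_value: V(s_T)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # targets for the value function
    return advantages, returns
```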
After that -> train in a distributed setting on harder environments (worker sketch below)
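A minimal sketch of one way to parallelize experience collection with multiprocessing, assuming a gym-style env whose step() returns (obs, reward, done, info); `make_env` is an assumed factory, and real distributed training would also need gradient or parameter sharing on top of this:

```python
import multiprocessing as mp

def env_worker(conn, make_env):
    """Runs one environment in its own process; steps on command."""
    env = make_env()
    obs = env.reset()
    while True:
        cmd, action = conn.recv()
        if cmd == "step":
            obs, reward, done, info = env.step(action)
            if done:
                obs = env.reset()
            conn.send((obs, reward, done))
        elif cmd == "close":
            conn.close()
            break

class ParallelEnvs:
    # NOTE: construct this under `if __name__ == "__main__":` on
    # platforms that spawn subprocesses (e.g. Windows).
    def __init__(self, make_env, n_workers):
        self.conns, self.procs = [], []
        for _ in range(n_workers):
            parent, child = mp.Pipe()
            proc = mp.Process(target=env_worker, args=(child, make_env))
            proc.start()
            self.conns.append(parent)
            self.procs.append(proc)

    def step(self, actions):
        for conn, action in zip(self.conns, actions):
            conn.send(("step", action))
        return [conn.recv() for conn in self.conns]

    def close(self):
        for conn in self.conns:
            conn.send(("close", None))
        for proc in self.procs:
            proc.join()
```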
Compare to OpenAI baseline
Incorporate into StarCraft