Abstract
Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and that it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show that the combination of DQN and Mellowmax, but without a target network, outperforms DQN with a target network.
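For context, Mellowmax is commonly defined as follows; the temperature symbol $\omega$ is a notational assumption here, since the abstract itself does not fix notation:

$$
\mathrm{mm}_{\omega}(\mathbf{x}) \;=\; \frac{\log\!\left(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\right)}{\omega}, \qquad \omega > 0
$$

As $\omega \to \infty$ this operator recovers $\max_i x_i$, and as $\omega \to 0$ it recovers the mean, so it interpolates smoothly between the two.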