Deep Recurrent Q-Network
This is a demo of DRQN, using MazeWorld as the environment.
Basic maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col)
reward: -5 for every step before reaching OUT, 100 for reaching OUT.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Bonus maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col, bonus_bit). bonus_bit indicates whether the bonus
has been collected. The agent can only observe (row, col). The bonus is at
a fixed position (3, 4).
reward: -5 for every step before reaching OUT, 100 for reaching OUT, 100 for collecting the bonus.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Partial-info bonus maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col, bonus_bit). bonus_bit indicates whether the bonus
has been collected. The agent can only observe the row. The bonus is at
a fixed position (3, 4).
reward: -5 for every step before reaching OUT, 100 for reaching OUT, 100 for collecting the bonus.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Deep Q-Network (DQN)
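The repo's exact architecture isn't described here; below is a minimal PyTorch sketch of a DQN head for this task (layer count and sizes are assumptions):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Feed-forward Q-network: observation in, one Q-value per action out."""
    def __init__(self, obs_dim, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):              # obs: (batch, obs_dim)
        return self.net(obs)             # -> (batch, n_actions)
```

Acting greedily is then `q(obs).argmax(dim=-1)`, which is why the learned policy is a deterministic function of the observation.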
Deep Recurrent Q-Network (DRQN)
The network structure was proposed by Matthew Hausknecht and Peter Stone in
"Deep Recurrent Q-Learning for Partially Observable MDPs". The network uses
an RNN (vanilla RNN, GRU, or LSTM) to "remember" or "forget" the states the
agent has visited, and each new action is chosen based on this "memory".
Deep Recurrent Q-Network with actions (DRQNA, a name I coined myself)
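The notes don't spell out how actions enter the network; the natural reading of "with actions" is that the previous action joins the observation at the RNN input, as sketched here (the one-hot encoding and sizes are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRQNA(nn.Module):
    """DRQN variant: previous action (one-hot) is concatenated with the obs."""
    def __init__(self, obs_dim, n_actions=4, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.rnn = nn.LSTM(obs_dim + n_actions, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, prev_act_seq, h=None):
        # prev_act_seq: (batch, seq_len) LongTensor of previous action ids
        a = F.one_hot(prev_act_seq, self.n_actions).float()
        out, h = self.rnn(torch.cat([obs_seq, a], dim=-1), h)
        return self.head(out), h
```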
Basic maze
Bonus maze
For a DQN agent, it can only choose to go "RIGHT", because the policy has to be DETERMINISTIC: the bonus_bit is hidden, so the agent must take the same action at a given (row, col) both before and after collecting the bonus.
For a DRQN agent, it can learn to act "UP" the first time it reaches (3, 4), and then go "RIGHT".
For a DRQNA agent, the result is the same as DRQN.
Partial-info bonus maze
For a DQN agent, it can't converge (it fails to find OUT).
For a DRQN agent, it can't converge either.
For a DRQNA agent, it can find the optimal path: the history of its own actions lets it infer the hidden column that the row-only observation leaves out.