Deep Recurrent Q-Network
This is a demo of DRQN, using MazeWorld as the environment.
Basic maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col)
reward: -5 for every step before reaching OUT, 100 for reaching OUT.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Bonus maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col, bonus_bit). bonus_bit indicates whether the bonus
has been collected. The agent can only observe (row, col). The bonus is at
a fixed position (3, 4).
reward: -5 for every step before reaching OUT, 100 for reaching OUT, 100 for collecting the bonus.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Partial-info bonus maze:
IN: (0, 0), OUT: (2, 5)
state: (row, col, bonus_bit). bonus_bit indicates whether the bonus
has been collected. The agent can only observe the row. The bonus is at
a fixed position (3, 4).
reward: -5 for every step before reaching OUT, 100 for reaching OUT, 100 for collecting the bonus.
action: "UP", "DOWN", "LEFT", "RIGHT", encoded as 0, 1, 2, 3.
Deep Q-Network (DQN)
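The repo's exact architecture isn't described here; below is a minimal PyTorch sketch of a DQN head for this task (layer count and sizes are assumptions):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Feed-forward Q-network: observation in, one Q-value per action out."""
    def __init__(self, obs_dim, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):              # obs: (batch, obs_dim)
        return self.net(obs)             # -> (batch, n_actions)
```

Acting greedily is then `q(obs).argmax(dim=-1)`, which is why the learned policy is a deterministic function of the observation.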
Deep Recurrent Q-Network (DRQN)
The network structure was proposed by Matthew Hausknecht and Peter Stone in
"Deep Recurrent Q-Learning for Partially Observable MDPs". The network uses
an RNN (vanilla RNN, GRU, or LSTM) to "remember" or "forget" the states the
agent has visited, and each new action is chosen based on this "memory".
Deep Recurrent Q-Network with actions (DRQNA, a name I coined myself)
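The notes don't spell out how actions enter the network; the natural reading of "with actions" is that the previous action joins the observation at the RNN input, as sketched here (the one-hot encoding and sizes are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRQNA(nn.Module):
    """DRQN variant: previous action (one-hot) is concatenated with the obs."""
    def __init__(self, obs_dim, n_actions=4, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.rnn = nn.LSTM(obs_dim + n_actions, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, prev_act_seq, h=None):
        # prev_act_seq: (batch, seq_len) LongTensor of previous action ids
        a = F.one_hot(prev_act_seq, self.n_actions).float()
        out, h = self.rnn(torch.cat([obs_seq, a], dim=-1), h)
        return self.head(out), h
```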
Basic maze
Bonus maze
For a DQN agent, it can only choose to go "RIGHT", because the policy has to be DETERMINISTIC: the bonus_bit is hidden, so the agent must take the same action at a given (row, col) both before and after collecting the bonus.
For a DRQN agent, it can learn to act "UP" the first time it reaches (3, 4), and then go "RIGHT".
For a DRQNA agent, the result is the same as DRQN.
Partial-info bonus maze
For a DQN agent, it can't converge (it fails to find OUT).
For a DRQN agent, it can't converge either.
For a DRQNA agent, it can find the optimal path: the history of its own actions lets it infer the hidden column that the row-only observation leaves out.