DeepRL-P2-Continuous-Control

Udacity Deep Reinforcement Learning Nanodegree

Project 2: Continuous Control

Train a Set of Robotic Arms

Photo credit: Google AI Blog

 

Background

Reinforcement Learning agents such as the one created in this project are used in many real-world applications. In particular, industrial control applications benefit greatly from the continuous control aspects like those implemented in this project. This Medium blog post describes several potential applications of this technology, including:

  • Robotic Control Systems: Very similar to the Reacher environment in this project!

  • Automotive Control Systems: DDPG-like algorithms can help control throttle commands, air-to-fuel ratios, and ignition timing, improving fuel economy and reducing emissions.

  • Automatic Calibration: DDPG networks are used to calibrate industrial equipment and optimize performance. They can also help detect anomalies and issue alerts to avoid disasters.

  • HVAC Control Systems: Heating, ventilation, and air conditioning systems are everywhere, and improvements to them can yield significant energy savings. In fact, Google turned over control of its data center cooling systems to an AI system, reducing cooling costs by 40%. A July 2018 paper by Li et al. addresses a similar issue for the National Super Computing Center (NSCC); their algorithm uses an actor-critic model similar to the one I used in this project and reduced NSCC electricity costs by 15%.

Goal

In this project, I build a reinforcement learning (RL) agent that controls a robotic arm within Unity's Reacher environment. The goal is to get 20 different robotic arms to maintain contact with the green spheres.

A reward of +0.1 is provided for each timestep that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many timesteps as possible.

In order to solve the environment, the agent must achieve an average score of +30 over 100 consecutive episodes, where each episode's score is itself the mean over all 20 agents.
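For concreteness, here is a minimal sketch (my own illustration, not part of the project code) of how that solving criterion can be checked. It assumes a hypothetical list episode_scores whose entries are per-episode scores already averaged across the 20 agents:

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)                     # rolling window of the last 100 episode scores

for i, episode_score in enumerate(episode_scores, 1): # episode_score = mean over the 20 agents
    scores_window.append(episode_score)
    if len(scores_window) == 100 and np.mean(scores_window) >= 30.0:
        print('Environment solved in {} episodes!'.format(i))
        break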

Trained Agent

Summary of Environment

  • Set-up: Double-jointed arm which can move to target locations.

  • Goal: Each agent must move its hand to the goal location, and keep it there.

  • Agents: The environment contains 20 agents linked to a single Brain.

  • Agent Reward Function (independent):

    • +0.1 for each timestep agent's hand is in goal location.

  • Brains: One Brain with the following observation/action space.

    • Vector Observation space: 33 variables corresponding to position, rotation, velocity, and angular velocities of the two arm Rigidbodies.

    • Vector Action space: (Continuous) Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.

    • Visual Observations: None.

  • Reset Parameters: Two, corresponding to goal size, and goal movement speed.

  • Benchmark Mean Reward: 30

 

Approach

Here are the high-level steps taken in building an agent that solves this environment.

  1. Evaluate the state and action space.

  2. Establish performance baseline using a random action policy.

  3. Select an appropriate algorithm and begin implementing it.

  4. Run experiments, make revisions, and retrain the agent until the performance threshold is reached.

 

1. Evaluate State & Action Space

The state space has 33 dimensions corresponding to the position, rotation, velocity, and angular velocities of the robotic arm. There are two sections of the arm — analogous to those connecting the shoulder and elbow (i.e., the humerus), and the elbow to the wrist (i.e., the forearm) on a human body.

Each action is a vector with four numbers, corresponding to the torque applied to the two joints (shoulder and elbow). Every element in the action vector must be a number between -1 and 1, making the action space continuous.
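These sizes can be read straight off the environment. Below is a hedged sketch using the ml-agents API from the course starter code; it assumes env, brain_name, and brain have already been created as in the project notebook:

env_info = env.reset(train_mode=True)[brain_name]   # reset the environment
num_agents = len(env_info.agents)                   # 20 agents in this version
states = env_info.vector_observations               # array of shape (num_agents, 33)
state_size = states.shape[1]                        # 33 observation dimensions
action_size = brain.vector_action_space_size        # 4 continuous torque values per agent
print(num_agents, state_size, action_size)          # expected output: 20 33 4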

 

2. Establish Baseline

Before building an agent that learns, I started by testing an agent that selects actions (uniformly) at random at each time step.

import numpy as np

env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Running this agent a few times resulted in scores from 0.03 to 0.09. Obviously, if the agent needs to achieve an average score of 30 over 100 consecutive episodes, then choosing actions at random won't work.

 

3. Implement Learning Algorithm

To get started, there are a few high-level architecture decisions we need to make. First, we need to determine which types of algorithms are most suitable for the Reacher environment. Second, we need to determine how many "brains" we want controlling the actions of our agents.

Policy-based vs Value-based Methods

There are two key differences in the Reacher environment compared to the previous 'Navigation' project:

  1. Multiple agents — The version of the environment I'm tackling in this project has 20 different agents, whereas the Navigation project had only a single agent. To keep things simple, I decided to use a single brain to control all 20 agents, rather than training 20 individual brains. Training separate brains seemed unnecessary since all of the agents perform essentially the same task under the same conditions, and training 20 brains would take far longer.

  2. Continuous action space — The action space is now continuous, which allows each agent to execute more complex and precise movements. Essentially, there's an unlimited range of possible action values to control the robotic arm, whereas the agent in the Navigation project was limited to four discrete actions: left, right, forward, backward.

Given the additional complexity of this environment, the value-based method we used for the last project, the Deep Q-Network (DQN) algorithm, is not suitable. Most importantly, we need an algorithm that allows the robotic arm to utilize its full range of movement. For this, we'll need to explore a different class of algorithms called policy-based methods.

Here are some advantages of policy-based methods:

  • Continuous action spaces — Policy-based methods are well-suited for continuous action spaces.

  • Stochastic policies — Both value-based and policy-based methods can learn deterministic policies. However, policy-based methods can also learn true stochastic policies.

  • Simplicity — Policy-based methods directly learn the optimal policy, without having to maintain a separate value function estimate. With value-based methods, the agent uses its experience with the environment to maintain an estimate of the optimal action-value function, from which an optimal policy is derived. This intermediate step requires the storage of lots of additional data since you need to account for all possible action values. Even if you discretize the action space, the number of possible actions can be quite high. For example, if we assumed only 10 degrees of freedom for both joints of our robotic arm, we'd have 1,024 unique actions (2^10), as the quick count below illustrates. Using DQN to determine the action that maximizes the action-value function within a continuous or high-dimensional space requires a complex optimization process at every timestep.
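A quick back-of-the-envelope count (my own illustration, using an assumed discretization) shows how fast this blows up:

# Binarizing 10 degrees of freedom, as in the example above:
print(2 ** 10)        # 1024 unique discrete actions

# A finer (hypothetical) discretization of this project's 4 torque dimensions,
# with 25 levels per dimension:
print(25 ** 4)        # 390625 discrete actions to search over at every timestep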

Deep Deterministic Policy Gradient (DDPG)

The algorithm I chose to model my project on is outlined in the paper Continuous Control with Deep Reinforcement Learning, by researchers at Google DeepMind. In this paper, the authors present "a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces." They highlight that DDPG can be viewed as an extension of Deep Q-learning to continuous tasks.

I used this vanilla, single-agent DDPG as a template. I further experimented with the DDPG algorithm based on other concepts covered in Udacity's classroom and lessons. My understanding and implementation of this algorithm (including various customizations) are discussed below.

Actor-Critic Method

Actor-critic methods leverage the strengths of both policy-based and value-based methods.

Using a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. Meanwhile, employing a value-based approach, the agent (critic) learns how to estimate the value (i.e., the future cumulative reward) of different state-action pairs. Actor-critic methods combine these two approaches in order to accelerate the learning process. Actor-critic agents are also more stable than value-based agents, while requiring fewer training samples than policy-based agents.

You can find the actor-critic logic implemented as part of the Agent() class here in ddpg_agent.py of the source code. The actor-critic models can be found via their respective Actor() and Critic() classes here in models.py.
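For orientation, here is a hedged, minimal sketch of what such Actor and Critic networks can look like. The hidden-layer sizes (400/300, following the DDPG paper) and the point where the action is concatenated into the critic are assumptions, not a copy of models.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps states to deterministic actions in [-1, 1] (hidden sizes are assumptions)."""
    def __init__(self, state_size, action_size, seed, fc1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))      # tanh keeps each torque in [-1, 1]

class Critic(nn.Module):
    """Maps (state, action) pairs to scalar Q-values (hidden sizes are assumptions)."""
    def __init__(self, state_size, action_size, seed, fcs1_units=400, fc2_units=300):
        super().__init__()
        self.seed = torch.manual_seed(seed)
        self.fcs1 = nn.Linear(state_size, fcs1_units)
        self.fc2 = nn.Linear(fcs1_units + action_size, fc2_units)  # action enters at the 2nd layer
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.relu(self.fcs1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)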

Note: As we did with Double Q-Learning in the last project, we're again leveraging local and target networks to improve stability. This is where one set of parameters w is used to select the best action, and another set of parameters w' is used to evaluate that action. In this project, local and target networks are implemented separately for both the actor and the critic.

# Actor Network (w/ Target Network)
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
self.actor_target = Actor(state_size, action_size, random_seed).to(device)
self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

# Critic Network (w/ Target Network)
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
self.critic_target = Critic(state_size, action_size, random_seed).to(device)
self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
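These four networks come together in the agent's learn step. The sketch below shows the core DDPG update in hedged form; the GAMMA and TAU values are assumptions, F refers to torch.nn.functional, and the exact code in ddpg_agent.py may differ:

GAMMA = 0.99   # discount factor (assumed value)
TAU = 1e-3     # soft-update interpolation factor (assumed value)

def learn(self, experiences):
    """One DDPG update from a sampled replay batch (illustrative sketch)."""
    states, actions, rewards, next_states, dones = experiences  # dones assumed to be 0/1 float tensors

    # ---- update critic: minimize TD error against the target networks ----
    next_actions = self.actor_target(next_states)
    Q_targets_next = self.critic_target(next_states, next_actions)
    Q_targets = rewards + GAMMA * Q_targets_next * (1 - dones)
    Q_expected = self.critic_local(states, actions)
    critic_loss = F.mse_loss(Q_expected, Q_targets)
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # ---- update actor: gradient ascent on the critic's value of its actions ----
    actions_pred = self.actor_local(states)
    actor_loss = -self.critic_local(states, actions_pred).mean()
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # ---- soft-update the target networks toward the local networks ----
    for target, local in ((self.critic_target, self.critic_local),
                          (self.actor_target, self.actor_local)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(TAU * l_param.data + (1.0 - TAU) * t_param.data)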

Exploration vs Exploitation

One challenge is choosing which action to take while the agent is still learning the optimal policy. Should the agent choose an action based on the rewards observed thus far? Or, should the agent try a new action in hopes of earning a higher reward? This is known as the exploration vs. exploitation dilemma.

In the Navigation project, I addressed this by implementing an ε-greedy algorithm.
