
PCL_RL


Code for several RL algorithms used in the following papers:

* "Improving Policy Gradient by Exploring Under-appreciated Rewards" by Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans.
* "Bridging the Gap Between Value and Policy Based Reinforcement Learning" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.
* "Trust-PCL: An Off-Policy Trust Region Method for Continuous Control" by Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans.

Available algorithms:

* Actor Critic
* TRPO
* PCL
* Unified PCL
* Trust-PCL
* PCL + Constraint Trust Region (unpublished)
* REINFORCE
* UREX

Requirements:

* TensorFlow (see http://www.tensorflow.org for how to install/upgrade)
* OpenAI Gym (see http://gym.openai.com/docs)
* NumPy (see http://www.numpy.org/)
* SciPy (see http://www.scipy.org/)
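As a minimal setup sketch, the dependencies can be installed with pip. The TensorFlow 1.x pin below is an assumption based on the age of this code (it is not stated in this document); adjust it to match your environment:

pip install "tensorflow<2" gym numpy scipy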

Quick Start:

Run UREX on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.1 --clip_norm=50 \
  --num_samples=10 --objective=urex

Run REINFORCE on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.01 --clip_norm=50 \
  --num_samples=10 --objective=reinforce

Run PCL on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl

Run PCL with expert trajectories on a simple environment:

python trainer.py --logtostderr --batch_size=400 --env=DuplicatedInput-v0 \
  --validation_frequency=25 --tau=0.025 --rollout=10 --critic_weight=1.0 \
  --gamma=0.9 --clip_norm=10 --replay_buffer_freq=1 --objective=pcl \
  --num_expert_paths=10

Run a MuJoCo task with TRPO:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --rollout=10 --gamma=0.995 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=trpo --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.05 --value_opt=best_fit --critic_weight=0.0

To run a MuJoCo task with Trust-PCL (off-policy), use the command below. It should work well across all environments, provided you search sufficiently over the following (a minimal sweep sketch is given after the command):

(1) max_divergence (0.001, 0.0005, and 0.002 are good values),

(2) rollout (1, 5, and 10 are good values),

(3) tf_seed (results need to be averaged over enough random seeds).

python trainer.py --logtostderr --batch_size=1 --env=HalfCheetah-v1 \
  --validation_frequency=250 --rollout=1 --critic_weight=1.0 --gamma=0.995 \
  --clip_norm=40 --learning_rate=0.0001 --replay_buffer_freq=1 \
  --replay_buffer_size=5000 --replay_buffer_alpha=0.001 --norecurrent \
  --objective=pcl --max_step=10 --cutoff_agent=1000 --tau=0.0 --eviction=fifo \
  --max_divergence=0.001 --internal_dim=256 --replay_batch_size=64 \
  --nouse_online_batch --batch_by_steps --value_hidden_layers=2 \
  --update_eps_lambda --nounify_episodes --target_network_lag=0.99 \
  --sample_from=online --clip_adv=1 --prioritize_by=step --num_steps=1000000 \
  --noinput_prev_actions --use_target_values --tf_seed=57
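A minimal sketch of that search, assuming a bash shell: the grid values come from points (1)-(3) above, while the seed range and the per-run log file names are arbitrary choices of this sketch. Runs are launched sequentially only to keep the example short; in practice you would likely launch them in parallel.

#!/bin/bash
# Sweep Trust-PCL over max_divergence, rollout, and random seed,
# keeping all other flags identical to the command above.
for div in 0.0005 0.001 0.002; do
  for roll in 1 5 10; do
    for seed in 1 2 3 4 5; do
      python trainer.py --logtostderr --batch_size=1 --env=HalfCheetah-v1 \
        --validation_frequency=250 --critic_weight=1.0 --gamma=0.995 \
        --clip_norm=40 --learning_rate=0.0001 --replay_buffer_freq=1 \
        --replay_buffer_size=5000 --replay_buffer_alpha=0.001 --norecurrent \
        --objective=pcl --max_step=10 --cutoff_agent=1000 --tau=0.0 --eviction=fifo \
        --internal_dim=256 --replay_batch_size=64 \
        --nouse_online_batch --batch_by_steps --value_hidden_layers=2 \
        --update_eps_lambda --nounify_episodes --target_network_lag=0.99 \
        --sample_from=online --clip_adv=1 --prioritize_by=step --num_steps=1000000 \
        --noinput_prev_actions --use_target_values \
        --max_divergence="$div" --rollout="$roll" --tf_seed="$seed" \
        > "trust_pcl_div${div}_roll${roll}_seed${seed}.log" 2>&1
    done
  done
done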

Run a MuJoCo task with PCL constraint trust region:

python trainer.py --logtostderr --batch_size=25 --env=HalfCheetah-v1 \
  --validation_frequency=5 --tau=0.001 --rollout=50 --gamma=0.99 \
  --max_step=1000 --cutoff_agent=1000 \
  --objective=pcl --norecurrent --internal_dim=64 --trust_region_p \
  --max_divergence=0.01 --value_opt=best_fit --critic_weight=0.0 \
  --tau_decay=0.1 --tau_start=0.1

Maintained by Ofir Nachum (ofirnachum).
