Agents¶

Agent interfaces¶

class chainerrl.agent.Agent[source]¶

Abstract agent class.

act(obs)[source]¶

Select an action for evaluation.

Returns:	action
Return type:	object

act_and_train(obs, reward)[source]¶

Select an action for training.

Returns:	action
Return type:	object

get_statistics()[source]¶

Get statistics of the agent.

Returns:	List of two-item tuples. The first item of each tuple is a str naming the statistic, and the second item is the value to be recorded. Example: [('average_loss', 0), ('average_value', 1), …]

load(dirname)[source]¶

Load internal states.

Returns:	None

save(dirname)[source]¶

Save internal states.

Returns:	None

stop_episode()[source]¶

Prepare for a new episode.

Returns:	None

stop_episode_and_train(state, reward, done=False)[source]¶

Observe consequences and prepare for a new episode.

Returns:	None
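The methods above are typically called in a fixed order during training. The following is a minimal sketch of that call order, not ChainerRL code: RandomAgent and CountdownEnv are hypothetical stand-ins that only mirror the interface so the loop runs without dependencies, and the env's step() returning (obs, reward, done) is an assumption.

```python
import random


class RandomAgent:
    """Hypothetical stand-in that mirrors the Agent interface above.

    It is NOT a ChainerRL class; it only reuses the method names.
    """

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.total_reward = 0.0
        self.n_episodes = 0

    def act_and_train(self, obs, reward):
        # `reward` is the reward received for the previous action.
        self.total_reward += reward
        return self.rng.randrange(self.n_actions)

    def stop_episode_and_train(self, state, reward, done=False):
        # Observe the final reward, then get ready for the next episode.
        self.total_reward += reward
        self.n_episodes += 1

    def get_statistics(self):
        return [("total_reward", self.total_reward),
                ("n_episodes", self.n_episodes)]


class CountdownEnv:
    """Hypothetical toy environment: `length` steps, reward 1.0 per step."""

    def __init__(self, length=3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length


def train(agent, env, n_episodes):
    for _ in range(n_episodes):
        obs = env.reset()
        reward, done = 0.0, False
        while not done:
            action = agent.act_and_train(obs, reward)
            obs, reward, done = env.step(action)
        # The last reward is reported here, not through act_and_train.
        agent.stop_episode_and_train(obs, reward, done)


agent = RandomAgent(n_actions=2)
train(agent, CountdownEnv(length=3), n_episodes=4)
stats = dict(agent.get_statistics())  # list of (name, value) tuples to dict
```

For evaluation, the same loop would call act(obs) in place of act_and_train and stop_episode() in place of stop_episode_and_train.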

Agent implementations¶

class chainerrl.agents.A2C(model, optimizer, gamma, num_processes, gpu=None, update_steps=5, phi=<function A2C.<lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, entropy_coeff=0.01, use_gae=False, tau=0.95, act_deterministically=False, average_actor_loss_decay=0.999, average_entropy_decay=0.999, average_value_decay=0.999, batch_states=<function batch_states>)[source]¶

A2C: Advantage Actor-Critic.

A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor-Critic (A3C).

See https://arxiv.org/abs/1708.05144

Parameters:

model (A2CModel) – Model to train
optimizer (chainer.Optimizer) – optimizer used to train the model
gamma (float) – Discount factor [0,1]
num_processes (int) – The number of processes
gpu (int) – GPU device id if not None nor negative.
update_steps (int) – The number of update steps
phi (callable) – Feature extractor function
pi_loss_coef (float) – Weight coefficient for the loss of the policy
v_loss_coef (float) – Weight coefficient for the loss of the value function
entropy_coeff (float) – Weight coefficient for the loss of the entropy
use_gae (bool) – Use generalized advantage estimation (GAE)
tau (float) – GAE parameter
average_actor_loss_decay (float) – Decay rate of average actor loss. Used only to record statistics.
average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
average_value_decay (float) – Decay rate of average value. Used only to record statistics.
act_deterministically (bool) – If set true, choose most probable actions in act method.
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
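As an illustration of what use_gae and tau control, here is a minimal sketch of generalized advantage estimation over one rollout. It assumes tau plays the role of the lambda parameter in the GAE paper; gae_advantages is a hypothetical helper, not a ChainerRL function.

```python
def gae_advantages(rewards, values, last_value, gamma, tau):
    """Compute GAE advantages for one rollout.

    rewards[t] and values[t] cover steps t = 0..T-1; last_value is the
    value estimate of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the TD errors from t onward.
        running = delta + gamma * tau * running
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0,
                     gamma=0.5, tau=0.5)
```

With tau=0 each advantage reduces to the one-step TD error; with tau=1 it becomes the discounted return minus the value estimate.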

class chainerrl.agents.A3C(model, optimizer, t_max, gamma, beta=0.01, process_idx=0, phi=<function A3C.<lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, keep_loss_scale_same=False, normalize_grad_by_t_max=False, use_average_reward=False, average_reward_tau=0.01, act_deterministically=False, average_entropy_decay=0.999, average_value_decay=0.999, batch_states=<function batch_states>)[source]¶

A3C: Asynchronous Advantage Actor-Critic.

See http://arxiv.org/abs/1602.01783

Parameters:

model (A3CModel) – Model to train
optimizer (chainer.Optimizer) – optimizer used to train the model
t_max (int) – The model is updated after every t_max local steps
gamma (float) – Discount factor [0,1]
beta (float) – Weight coefficient for the entropy regularization term.
process_idx (int) – Index of the process.
phi (callable) – Feature extractor function
pi_loss_coef (float) – Weight coefficient for the loss of the policy
v_loss_coef (float) – Weight coefficient for the loss of the value function
act_deterministically (bool) – If set true, choose most probable actions in act method.
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states

class chainerrl.agents.ACER(model, optimizer, t_max, gamma, replay_buffer, beta=0.01, phi=<function ACER.<lambda>>, pi_loss_coef=1.0, Q_loss_coef=0.5, use_trust_region=True, trust_region_alpha=0.99, trust_region_delta=1, truncation_threshold=10, disable_online_update=False, n_times_replay=8, replay_start_size=10000, normalize_loss_by_steps=True, act_deterministically=False, use_Q_opc=False, average_entropy_decay=0.999, average_value_decay=0.999, average_kl_decay=0.999, logger=None)[source]¶

ACER (Actor-Critic with Experience Replay).

See http://arxiv.org/abs/1611.01224

Parameters:

model (ACERModel) – Model to train. It must be a callable that accepts observations as input and returns three values: action distributions (Distribution), Q values (ActionValue) and state values (chainer.Variable).
optimizer (chainer.Optimizer) – optimizer used to train the model
t_max (int) – The model is updated after every t_max local steps
gamma (float) – Discount factor [0,1]
replay_buffer (EpisodicReplayBuffer) – Replay buffer to use. If set None, this agent won’t use experience replay.
beta (float) – Weight coefficient for the entropy regularization term.
phi (callable) – Feature extractor function
pi_loss_coef (float) – Weight coefficient for the loss of the policy
Q_loss_coef (float) – Weight coefficient for the loss of the value function
use_trust_region (bool) – If set true, use efficient TRPO.
trust_region_alpha (float) – Decay rate of the average model used for efficient TRPO.
trust_region_delta (float) – Threshold used for efficient TRPO.
truncation_threshold (float or None) – Threshold used to truncate larger importance weights. If set None, importance weights are not truncated.
disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
act_deterministically (bool) – If set true, choose most probable actions in act method.
use_Q_opc (bool) – If set true, Q_opc, a Q-value estimate without importance sampling, is used to compute advantage values for policy gradients. The original paper recommends it for continuous actions.
average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
average_value_decay (float) – Decay rate of average value. Used only to record statistics.
average_kl_decay (float) – Decay rate of kl value. Used only to record statistics.

class chainerrl.agents.AL(*args, **kwargs)[source]¶

Advantage Learning.

See: http://arxiv.org/abs/1512.04860.

Parameters:	alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1).

For other arguments, see DQN.

class chainerrl.agents.CategoricalDoubleDQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function DQN.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator='mean', episodic_update_len=None, logger=<Logger chainerrl.agents.dqn (WARNING)>, batch_states=<function batch_states>, recurrent=False)[source]¶

Categorical Double DQN.

class chainerrl.agents.CategoricalDQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function DQN.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator='mean', episodic_update_len=None, logger=<Logger chainerrl.agents.dqn (WARNING)>, batch_states=<function batch_states>, recurrent=False)[source]¶

Categorical DQN.

See https://arxiv.org/abs/1707.06887.

Arguments are the same as those of DQN except q_function must return DistributionalDiscreteActionValue and clip_delta is ignored.

class chainerrl.agents.DDPG(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function DDPG.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, episodic_update=False, episodic_update_len=None, logger=<Logger chainerrl.agents.ddpg (WARNING)>, batch_states=<function batch_states>, burnin_action_func=None)[source]¶

Deep Deterministic Policy Gradients.

This can be used as SVG(0) by specifying a Gaussian policy instead of a deterministic policy.

Parameters:

model (DDPGModel) – DDPG model that contains both a policy and a Q-function
actor_optimizer (Optimizer) – Optimizer setup with the policy
critic_optimizer (Optimizer) – Optimizer setup with the Q-function
replay_buffer (ReplayBuffer) – Replay buffer
gamma (float) – Discount factor
explorer (Explorer) – Explorer that specifies an exploration strategy.
gpu (int) – GPU device id if not None nor negative.
replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
minibatch_size (int) – Minibatch size
update_interval (int) – Model update interval in step
target_update_interval (int) – Target model update interval in step
phi (callable) – Feature extractor applied to observations
target_update_method (str) – ‘hard’ or ‘soft’.
soft_update_tau (float) – Tau of soft target update.
n_times_update (int) – Number of times the update is repeated
average_q_decay (float) – Decay rate of average Q, only used for recording statistics
average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
batch_accumulator (str) – ‘mean’ or ‘sum’
episodic_update (bool) – Use full episodes for update if set True
episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
logger (Logger) – Logger used
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
burnin_action_func (callable or None) – If not None, this callable object is used to select actions before the model is updated one or more times during training.

class chainerrl.agents.DoubleDQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function DQN.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator='mean', episodic_update_len=None, logger=<Logger chainerrl.agents.dqn (WARNING)>, batch_states=<function batch_states>, recurrent=False)[source]¶

Double DQN.

See: http://arxiv.org/abs/1509.06461.
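The idea behind Double DQN is to select the next action with the online Q-function and evaluate it with the target Q-function. Below is a minimal sketch of that target computation on plain lists; double_dqn_targets is a hypothetical helper written for illustration and is not part of ChainerRL.

```python
def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma):
    """Return one bootstrapped target per transition.

    next_q_online[i] and next_q_target[i] hold the Q-values of every
    action in the next state, from the online and target networks.
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones,
                                   next_q_online, next_q_target):
        # Select the greedy action with the online network...
        best = max(range(len(q_on)), key=lambda a: q_on[a])
        # ...but evaluate it with the target network.
        bootstrap = 0.0 if done else gamma * q_tg[best]
        targets.append(r + bootstrap)
    return targets


targets = double_dqn_targets(
    rewards=[1.0, 2.0],
    dones=[False, True],
    next_q_online=[[0.1, 0.9], [0.5, 0.2]],
    next_q_target=[[5.0, 3.0], [7.0, 8.0]],
    gamma=0.5,
)
```

For the first transition, plain DQN would bootstrap from max(next_q_target) = 5.0 and give 3.5, while the double estimator uses the target value of the online network's preferred action (3.0) and gives 2.5.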

class chainerrl.agents.DoublePAL(*args, **kwargs)[source]¶

class chainerrl.agents.DPP(*args, **kwargs)[source]¶

Dynamic Policy Programming with softmax operator.

Parameters:	eta (float) – Positive constant.

For other arguments, see DQN.

class chainerrl.agents.DQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function DQN.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator='mean', episodic_update_len=None, logger=<Logger chainerrl.agents.dqn (WARNING)>, batch_states=<function batch_states>, recurrent=False)[source]¶

Deep Q-Network algorithm.

Parameters:

q_function (StateQFunction) – Q-function
optimizer (Optimizer) – Optimizer that is already setup
replay_buffer (ReplayBuffer) – Replay buffer
gamma (float) – Discount factor
explorer (Explorer) – Explorer that specifies an exploration strategy.
gpu (int) – GPU device id if not None nor negative.
replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
minibatch_size (int) – Minibatch size
update_interval (int) – Model update interval in step
target_update_interval (int) – Target model update interval in step
clip_delta (bool) – Clip delta if set True
phi (callable) – Feature extractor applied to observations
target_update_method (str) – ‘hard’ or ‘soft’.
soft_update_tau (float) – Tau of soft target update.
n_times_update (int) – Number of times the update is repeated
average_q_decay (float) – Decay rate of average Q, only used for recording statistics
average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
batch_accumulator (str) – ‘mean’ or ‘sum’
episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
logger (Logger) – Logger used
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
recurrent (bool) – If set to True, model is assumed to implement chainerrl.links.StatelessRecurrent and is updated in a recurrent manner.
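To make target_update_method and soft_update_tau concrete, here is a minimal sketch of the two update styles on a dict of scalar parameters. It assumes the common convention that tau weights the online parameters in a soft update; update_target is a hypothetical helper, not the library's implementation.

```python
def update_target(target, source, method="hard", tau=0.01):
    """Update `target` parameters in place from `source` parameters."""
    if method == "hard":
        # Copy the online parameters over the target parameters.
        for name, value in source.items():
            target[name] = value
    elif method == "soft":
        # Move each target parameter a fraction tau toward the online one.
        for name, value in source.items():
            target[name] = (1.0 - tau) * target[name] + tau * value
    else:
        raise ValueError("method must be 'hard' or 'soft'")
    return target


soft = update_target({"w": 0.0}, {"w": 1.0}, method="soft", tau=0.25)
hard = update_target({"w": 0.0}, {"w": 1.0}, method="hard")
```

A hard update is usually paired with a large target_update_interval, while a soft update is applied frequently with a small tau.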

class chainerrl.agents.IQN(*args, **kwargs)[source]¶

Implicit Quantile Networks.

See https://arxiv.org/abs/1806.06923.

Parameters:

quantile_thresholds_N (int) – Number of quantile thresholds used in quantile regression.
quantile_thresholds_N_prime (int) – Number of quantile thresholds used to sample from the return distribution at the next state.
quantile_thresholds_K (int) – Number of quantile thresholds used to compute greedy actions.
act_deterministically (bool) – IQN’s action selection is by default stochastic as it samples quantile thresholds every time it acts, even for evaluation. If this option is set to True, it uses equally spaced quantile thresholds instead of randomly sampled ones for evaluation, making its action selection deterministic.

For other arguments, see chainerrl.agents.DQN.

class chainerrl.agents.NSQ(q_function, optimizer, t_max, gamma, i_target, explorer, phi=<function NSQ.<lambda>>, average_q_decay=0.999, logger=<Logger chainerrl.agents.nsq (WARNING)>, batch_states=<function batch_states>)[source]¶

Asynchronous N-step Q-Learning.

See http://arxiv.org/abs/1602.01783

Parameters:

q_function (A3CModel) – Model to train
optimizer (chainer.Optimizer) – optimizer used to train the model
t_max (int) – The model is updated after every t_max local steps
gamma (float) – Discount factor [0,1]
i_target (int) – The target model is updated after every i_target global steps
explorer (Explorer) – Explorer to use in training
phi (callable) – Feature extractor function
average_q_decay (float) – Decay rate of average Q, only used for recording statistics
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states

class chainerrl.agents.PAL(*args, **kwargs)[source]¶

Persistent Advantage Learning.

See: http://arxiv.org/abs/1512.04860.

Parameters:	alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1).

For other arguments, see DQN.
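The role of alpha in AL and PAL can be sketched with the operators from the referenced paper. AL subtracts alpha times the action gap from the Bellman target, and PAL takes the maximum of that value and a target that repeats the same action in the next state. The helpers below are hypothetical, work on single transitions, and do not show which network (online or target) the library uses for each Q-value.

```python
def al_target(reward, q_s, q_next, action, gamma, alpha):
    """Advantage Learning target for one transition.

    q_s and q_next list the Q-values of all actions in the current and
    next state; `action` is the index of the action taken.
    """
    bellman = reward + gamma * max(q_next)
    action_gap = max(q_s) - q_s[action]
    return bellman - alpha * action_gap


def pal_target(reward, q_s, q_next, action, gamma, alpha):
    """Persistent Advantage Learning target for one transition."""
    persistent = reward + gamma * q_next[action]
    return max(al_target(reward, q_s, q_next, action, gamma, alpha),
               persistent)


al = al_target(1.0, [1.0, 3.0], [4.0, 2.0], action=0, gamma=0.5, alpha=0.5)
pal = pal_target(1.0, [1.0, 3.0], [4.0, 2.0], action=0, gamma=0.5, alpha=0.5)
```

With alpha=0 the AL target reduces to the ordinary Bellman target.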

class chainerrl.agents.PCL(model, optimizer, replay_buffer=None, t_max=None, gamma=0.99, tau=0.01, phi=<function PCL.<lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, rollout_len=10, batchsize=1, disable_online_update=False, n_times_replay=1, replay_start_size=100, normalize_loss_by_steps=True, act_deterministically=False, average_loss_decay=0.999, average_entropy_decay=0.999, average_value_decay=0.999, explorer=None, logger=None, batch_states=<function batch_states>, backprop_future_values=True, train_async=False)[source]¶

PCL (Path Consistency Learning).

Not only the batch PCL algorithm proposed in the paper but also its asynchronous variant is implemented.

See https://arxiv.org/abs/1702.08892

Parameters:

model (chainer.Link) –
Model to train. It must be a callable that accepts a batch of observations as input and returns two values:
- action distributions (Distribution)
- state values (chainer.Variable)
optimizer (chainer.Optimizer) – optimizer used to train the model
t_max (int or None) – The model is updated after every t_max local steps. If set None, the model is updated after every episode.
gamma (float) – Discount factor [0,1]
tau (float) – Weight coefficient for the entropy regularization term.
phi (callable) – Feature extractor function
pi_loss_coef (float) – Weight coefficient for the loss of the policy
v_loss_coef (float) – Weight coefficient for the loss of the value function
rollout_len (int) – Number of rollout steps
batchsize (int) – Number of episodes or sub-trajectories used for an update. The total number of transitions used will be (batchsize x t_max).
disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
act_deterministically (bool) – If set true, choose most probable actions in act method.
average_loss_decay (float) – Decay rate of average loss. Used only to record statistics.
average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
average_value_decay (float) – Decay rate of average value. Used only to record statistics.
explorer (Explorer or None) – If not None, this explorer is used for selecting actions.
logger (None or Logger) – Logger to be used
batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
backprop_future_values (bool) – If set True, value gradients are computed not only wrt V(s_t) but also V(s_{t+d}).
train_async (bool) – If set True, use a process-local model to compute gradients and update the globally shared model.

class chainerrl.agents.PGT(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, beta=0.01, act_deterministically=False, gpu=-1, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function PGT.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, logger=<Logger chainerrl.agents.pgt (WARNING)>, batch_states=<function batch_states>)[source]¶

Policy Gradient Theorem with an approximate policy and a Q-function.

This agent is almost the same as DDPG, except that it uses likelihood-ratio gradient estimation instead of value gradients.

Parameters:

model (chainer.Chain) – Chain that contains both a policy and a Q-function
actor_optimizer (Optimizer) – Optimizer setup with the policy
critic_optimizer (Optimizer) – Optimizer setup with the Q-function
replay_buffer (ReplayBuffer) – Replay buffer
gamma (float) – Discount factor
explorer (Explorer) – Explorer that specifies an exploration strategy.
gpu (int) – GPU device id. -1 for CPU.
replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
minibatch_size (int) – Minibatch size
update_interval (int) – Model update interval in step
target_update_interval (int) – Target model update interval in step
phi (callable) – Feature extractor applied to observations
target_update_method (str) – ‘hard’ or ‘soft’.
soft_update_tau (float) – Tau of soft target update.
n_times_update (int) – Number of times the update is repeated
average_q_decay (float) – Decay rate of average Q, only used for recording statistics
average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
batch_accumulator (str) – ‘mean’ or ‘sum’
logger (Logger) – Logger used
beta (float) – Coefficient for entropy regularization
act_deterministically (bool) – Act deterministically by selecting the most probable actions at test time
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states

class chainerrl.agents.PPO(model, optimizer, obs_normalizer=None, gpu=None, gamma=0.99, lambd=0.95, phi=<function PPO.<lambda>>, value_func_coef=1.0, entropy_coef=0.01, update_interval=2048, minibatch_size=64, epochs=10, clip_eps=0.2, clip_eps_vf=None, standardize_advantages=True, batch_states=<function batch_states>, recurrent=False, max_recurrent_sequence_len=None, act_deterministically=False, value_stats_window=1000, entropy_stats_window=1000, value_loss_stats_window=100, policy_loss_stats_window=100)[source]¶

Proximal Policy Optimization

See https://arxiv.org/abs/1707.06347

Parameters:

model (A3CModel) – Model to train. Recurrent models are not supported. state s |-> (pi(s, _), v(s))
optimizer (chainer.Optimizer) – Optimizer used to train the model
gpu (int) – GPU device id if not None nor negative
gamma (float) – Discount factor [0, 1]
lambd (float) – Lambda-return factor [0, 1]
phi (callable) – Feature extractor function
value_func_coef (float) – Weight coefficient for loss of value function (0, inf)
entropy_coef (float) – Weight coefficient for entropy bonus [0, inf)
update_interval (int) – Model update interval in step
minibatch_size (int) – Minibatch size
epochs (int) – Training epochs in an update
clip_eps (float) – Epsilon for pessimistic clipping of likelihood ratio to update policy
clip_eps_vf (float) – Epsilon for pessimistic clipping of value to update value function. If it is None, value function is not clipped on updates.
standardize_advantages (bool) – Use standardized advantages on updates
recurrent (bool) – If set to True, model is assumed to implement chainerrl.links.StatelessRecurrent and update in a recurrent manner.
max_recurrent_sequence_len (int) – Maximum length of consecutive sequences of transitions in a minibatch for updating the model. This value is used only when recurrent is True. A smaller value will encourage a minibatch to contain more and shorter sequences.
act_deterministically (bool) – If set to True, choose most probable actions in the act method instead of sampling from distributions.
value_stats_window (int) – Window size used to compute statistics of value predictions.
entropy_stats_window (int) – Window size used to compute statistics of entropy of action distributions.
value_loss_stats_window (int) – Window size used to compute statistics of loss values regarding the value function.
policy_loss_stats_window (int) – Window size used to compute statistics of loss values regarding the policy.

Statistics:

average_value: Average of value predictions on non-terminal states. It’s updated on (batch_)act_and_train.
average_entropy: Average of entropy of action distributions on non-terminal states. It’s updated on (batch_)act_and_train.
average_value_loss: Average of losses regarding the value function. It’s updated after the model is updated.
average_policy_loss: Average of losses regarding the policy. It’s updated after the model is updated.
n_updates: Number of model updates so far.
explained_variance: Explained variance computed from the last batch.
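The effect of clip_eps on the policy update can be illustrated with the clipped surrogate objective from the PPO paper. This is a minimal sketch on plain lists; clipped_surrogate is a hypothetical helper, and the value and entropy terms that the agent also optimizes are left out.

```python
def clipped_surrogate(ratios, advantages, clip_eps):
    """Mean clipped surrogate objective (to be maximized).

    ratios[i] is pi_new(a|s) / pi_old(a|s) for sample i.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: keep the smaller of the two objectives.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


objective = clipped_surrogate(ratios=[1.5, 0.5],
                              advantages=[1.0, -1.0],
                              clip_eps=0.25)
```

In this example both samples are clipped: the first contributes 1.25 instead of 1.5, and the second contributes -0.75 instead of -0.5, so the policy gains nothing from moving the ratio further outside [1 - clip_eps, 1 + clip_eps].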

class chainerrl.agents.REINFORCE(model, optimizer, beta=0, phi=<function REINFORCE.<lambda>>, batchsize=1, act_deterministically=False, average_entropy_decay=0.999, backward_separately=False, batch_states=<function batch_states>, logger=None)[source]¶

Williams's episodic REINFORCE.

Parameters:

model (Policy) – Model to train. It must be a callable that accepts observations as input and returns action distributions (Distribution).
optimizer (chainer.Optimizer) – optimizer used to train the model
beta (float) – Weight coefficient for the entropy regularization term.
normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
act_deterministically (bool) – If set true, choose most probable actions in act method.
batchsize (int) – Number of episodes used for each update
backward_separately (bool) – If set true, call backward separately for each episode and accumulate only gradients.
average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states
logger (logging.Logger) – Logger to be used.

class chainerrl.agents.ResidualDQN(*args, **kwargs)[source]¶

DQN that allows maxQ to also backpropagate gradients.

class chainerrl.agents.SARSA(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function DQN.<lambda>>, target_update_method='hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator='mean', episodic_update_len=None, logger=<Logger chainerrl.agents.dqn (WARNING)>, batch_states=<function batch_states>, recurrent=False)[source]¶

Off-policy SARSA.

This agent learns the Q-function of a behavior policy defined via the given explorer, instead of learning the Q-function of the optimal policy.

class chainerrl.agents.SoftActorCritic(policy, q_func1, q_func2, policy_optimizer, q_func1_optimizer, q_func2_optimizer, replay_buffer, gamma, gpu=None, replay_start_size=10000, minibatch_size=100, update_interval=1, phi=<function SoftActorCritic.<lambda>>, soft_update_tau=0.005, logger=<Logger chainerrl.agents.soft_actor_critic (WARNING)>, batch_states=<function batch_states>, burnin_action_func=None, initial_temperature=1.0, entropy_target=None, temperature_optimizer=None, act_deterministically=True)[source]¶

Soft Actor-Critic (SAC).

See https://arxiv.org/abs/1812.05905

Parameters:

policy (Policy) – Policy.
q_func1 (Link) – First Q-function that takes state-action pairs as input and outputs predicted Q-values.
q_func2 (Link) – Second Q-function that takes state-action pairs as input and outputs predicted Q-values.
policy_optimizer (Optimizer) – Optimizer setup with the policy
q_func1_optimizer (Optimizer) – Optimizer setup with the first Q-function.
q_func2_optimizer (Optimizer) – Optimizer setup with the second Q-function.
replay_buffer (ReplayBuffer) – Replay buffer
gamma (float) – Discount factor
gpu (int) – GPU device id if not None nor negative.
replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
minibatch_size (int) – Minibatch size
update_interval (int) – Model update interval in step
phi (callable) – Feature extractor applied to observations
soft_update_tau (float) – Tau of soft target update.
logger (Logger) – Logger used
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
burnin_action_func (callable or None) – If not None, this callable object is used to select actions before the model is updated one or more times during training.
initial_temperature (float) – Initial temperature value. If entropy_target is set to None, the temperature is fixed to it.
entropy_target (float or None) – If set to a float, the temperature is adjusted during training to match the policy’s entropy to it.
temperature_optimizer (Optimizer or None) – Optimizer used to optimize the temperature. If set to None, Adam with default hyperparameters is used.
act_deterministically (bool) – If set to True, choose most probable actions in the act method instead of sampling from distributions.

class chainerrl.agents.TD3(policy, q_func1, q_func2, policy_optimizer, q_func1_optimizer, q_func2_optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=10000, minibatch_size=100, update_interval=1, phi=<function TD3.<lambda>>, soft_update_tau=0.005, n_times_update=1, logger=<Logger chainerrl.agents.td3 (WARNING)>, batch_states=<function batch_states>, burnin_action_func=None, policy_update_delay=2, target_policy_smoothing_func=<function default_target_policy_smoothing_func>)[source]¶

Twin Delayed Deep Deterministic Policy Gradients (TD3).

See http://arxiv.org/abs/1802.09477

Parameters:

policy (Policy) – Policy.
q_func1 (Link) – First Q-function that takes state-action pairs as input and outputs predicted Q-values.
q_func2 (Link) – Second Q-function that takes state-action pairs as input and outputs predicted Q-values.
policy_optimizer (Optimizer) – Optimizer setup with the policy
q_func1_optimizer (Optimizer) – Optimizer setup with the first Q-function.
q_func2_optimizer (Optimizer) – Optimizer setup with the second Q-function.
replay_buffer (ReplayBuffer) – Replay buffer
gamma (float) – Discount factor
explorer (Explorer) – Explorer that specifies an exploration strategy.
gpu (int) – GPU device id if not None nor negative.
replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
minibatch_size (int) – Minibatch size
update_interval (int) – Model update interval in step
phi (callable) – Feature extractor applied to observations
soft_update_tau (float) – Tau of soft target update.
logger (Logger) – Logger used
batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
burnin_action_func (callable or None) – If not None, this callable object is used to select actions before the model is updated one or more times during training.
policy_update_delay (int) – Delay of policy updates. Policy is updated once in policy_update_delay times of Q-function updates.
target_policy_smoothing_func (callable) – Callable that takes a batch of actions as input and outputs a noisy version of it. It is used for target policy smoothing when computing target Q-values.
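A custom target_policy_smoothing_func might look like the sketch below, which adds clipped Gaussian noise and then clips the result to the action bounds. The factory name, its defaults (sigma=0.2, noise_clip=0.5, bounds of -1 and 1, following the TD3 paper) and the list-of-lists batch format are assumptions for illustration; the real callable would receive and return array batches.

```python
import random


def make_target_policy_smoothing_func(sigma=0.2, noise_clip=0.5,
                                      low=-1.0, high=1.0, seed=None):
    """Build a hypothetical smoothing function for batches of actions."""
    rng = random.Random(seed)

    def smoothing_func(batch_action):
        noisy_batch = []
        for action in batch_action:
            noisy_action = []
            for a in action:
                # Gaussian noise, clipped so the perturbation stays small.
                noise = rng.gauss(0.0, sigma)
                noise = max(-noise_clip, min(noise_clip, noise))
                # Keep the perturbed action inside the valid range.
                noisy_action.append(max(low, min(high, a + noise)))
            noisy_batch.append(noisy_action)
        return noisy_batch

    return smoothing_func


smooth = make_target_policy_smoothing_func(seed=0)
batch = [[0.0, 0.9], [-1.0, 0.3]]
noisy = smooth(batch)
```

The output has the same shape as the input, and each component differs from the original by at most noise_clip.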

class chainerrl.agents.TRPO(policy, vf, vf_optimizer, obs_normalizer=None, gamma=0.99, lambd=0.95, phi=<function TRPO.<lambda>>, entropy_coef=0.01, update_interval=2048, max_kl=0.01, vf_epochs=3, vf_batch_size=64, standardize_advantages=True, batch_states=<function batch_states>, recurrent=False, max_recurrent_sequence_len=None, line_search_max_backtrack=10, conjugate_gradient_max_iter=10, conjugate_gradient_damping=0.01, act_deterministically=False, value_stats_window=1000, entropy_stats_window=1000, kl_stats_window=100, policy_step_size_stats_window=100, logger=<Logger chainerrl.agents.trpo (WARNING)>)[source]¶

Trust Region Policy Optimization.

A given stochastic policy is optimized by the TRPO algorithm. A given value function is trained with the TD(lambda) algorithm and used for Generalized Advantage Estimation (GAE).

Since the policy is optimized via the conjugate gradient method and line search while the value function is optimized via SGD, these two models should be separate.

Since TRPO requires second-order derivatives to compute Hessian-vector products, Chainer v3.0.0 or newer is required. In addition, your policy must contain only functions that support second-order derivatives.

See https://arxiv.org/abs/1502.05477 for TRPO. See https://arxiv.org/abs/1506.02438 for GAE.

Parameters:

policy (Policy) – Stochastic policy. Its forward computation must contain only functions that support second-order derivatives. Recurrent models are not supported.
vf (ValueFunction) – Value function. Recurrent models are not supported.
vf_optimizer (chainer.Optimizer) – Optimizer for the value function.
obs_normalizer (chainerrl.links.EmpiricalNormalization or None) – If set to chainerrl.links.EmpiricalNormalization, it is used to normalize observations based on the empirical mean and standard deviation of observations. These statistics are updated after computing advantages and target values and before updating the policy and the value function.
gamma (float) – Discount factor [0, 1]
lambd (float) – Lambda-return factor [0, 1]
phi (callable) – Feature extractor function
entropy_coef (float) – Weight coefficient for entropy bonus [0, inf)
update_interval (int) – Interval, in steps, between TRPO iterations. After every update_interval steps, this agent updates the policy and the value function using data from these steps.
vf_epochs (int) – Number of epochs for which the value function is trained on each TRPO iteration.
vf_batch_size (int) – Batch size of SGD for the value function.
standardize_advantages (bool) – Use standardized advantages on updates
line_search_max_backtrack (int) – Maximum number of backtracking in line search to tune step sizes of policy updates.
conjugate_gradient_max_iter (int) – Maximum number of iterations in the conjugate gradient method.
conjugate_gradient_damping (float) – Damping factor used in the conjugate gradient method.
act_deterministically (bool) – If set to True, choose most probable actions in the act method instead of sampling from distributions.
value_stats_window (int) – Window size used to compute statistics of value predictions.
entropy_stats_window (int) – Window size used to compute statistics of entropy of action distributions.
kl_stats_window (int) – Window size used to compute statistics of KL divergence between old and new policies.
policy_step_size_stats_window (int) – Window size used to compute statistics of step sizes of policy updates.

Statistics:

average_value: Average of value predictions on non-terminal states. It’s updated before the value function is updated.
average_entropy: Average of entropy of action distributions on non-terminal states. It’s updated on act_and_train.
average_kl: Average of KL divergence between old and new policies. It’s updated after the policy is updated.
average_policy_step_size: Average of step sizes of policy updates. It’s updated after the policy is updated.
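The parameters conjugate_gradient_max_iter and conjugate_gradient_damping refer to the solver that turns Hessian-vector products into a policy step direction. The following is a minimal sketch of a damped conjugate gradient solver on plain lists. It assumes damping is added as damping * v to each Hessian-vector product, which is a common choice but is not confirmed here as the library's exact behavior; conjugate_gradient, dot and hvp are hypothetical names.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(hvp, b, max_iter=10, damping=0.01, tol=1e-10):
    """Approximately solve (A + damping * I) x = b.

    hvp(v) must return the product A v for a symmetric positive
    definite A; A itself is never formed.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual for the initial guess x = 0
    p = list(b)  # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        Ap = [h + damping * pi for h, pi in zip(hvp(p), p)]
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]


def hvp(v):
    # Stand-in for a Hessian-vector product computed by backpropagation.
    return [dot(row, v) for row in A]


x_plain = conjugate_gradient(hvp, [1.0, 2.0], damping=0.0)
x_damped = conjugate_gradient(hvp, [1.0, 2.0], damping=0.01)
```

Damping keeps the system well conditioned when the estimated Hessian is close to singular, at the cost of a slightly biased step direction.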