Agents

Agent interfaces

class chainerrl.agent.Agent[source]

Abstract agent class.

act(obs)[source]

Select an action for evaluation.

Returns: action
Return type: object
act_and_train(obs, reward)[source]

Select an action for training.

Returns: action
Return type: object
get_statistics()[source]

Get statistics of the agent.

Returns: List of two-item tuples. The first item of each tuple is a str that gives the name of the statistic, and the second item is the value to be recorded.

Example: [('average_loss', 0), ('average_value', 1), ...]

load(dirname)[source]

Load internal states.

Returns: None
save(dirname)[source]

Save internal states.

Returns: None
stop_episode()[source]

Prepare for a new episode.

Returns: None
stop_episode_and_train(state, reward, done=False)[source]

Observe consequences and prepare for a new episode.

Returns: None
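
Example usage of the interface (an illustrative sketch, not part of the API): the loop below assumes an OpenAI Gym environment and an already-constructed agent; make_agent is a hypothetical placeholder for any concrete agent construction, such as the DQN example further down.

    import gym

    env = gym.make('CartPole-v0')
    agent = make_agent()  # hypothetical helper returning a chainerrl.agent.Agent

    for episode in range(100):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            # Training-time action selection; the last reward is fed back here.
            action = agent.act_and_train(obs, reward)
            obs, reward, done, _ = env.step(action)
        # Observe the final transition and reset per-episode internal state.
        agent.stop_episode_and_train(obs, reward, done)
        print(agent.get_statistics())  # e.g. [('average_loss', 0.1), ...]

    # Evaluation uses act/stop_episode, which do not update the model.
    obs = env.reset()
    done = False
    while not done:
        obs, _, done, _ = env.step(agent.act(obs))
    agent.stop_episode()

    agent.save('agent_dir')  # write internal states to a directory
    agent.load('agent_dir')  # restore them later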

Agent implementations

class chainerrl.agents.A3C(model, optimizer, t_max, gamma, beta=0.01, process_idx=0, phi=<function <lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, keep_loss_scale_same=False, normalize_grad_by_t_max=False, use_average_reward=False, average_reward_tau=0.01, act_deterministically=False, average_entropy_decay=0.999, average_value_decay=0.999, batch_states=<function batch_states>)[source]

A3C: Asynchronous Advantage Actor-Critic.

See http://arxiv.org/abs/1602.01783

Parameters:
  • model (A3CModel) – Model to train
  • optimizer (chainer.Optimizer) – optimizer used to train the model
  • t_max (int) – The model is updated after every t_max local steps
  • gamma (float) – Discount factor [0,1]
  • beta (float) – Weight coefficient for the entropy regularization term.
  • process_idx (int) – Index of the process.
  • phi (callable) – Feature extractor function
  • pi_loss_coef (float) – Weight coefficient for the loss of the policy
  • v_loss_coef (float) – Weight coefficient for the loss of the value function
  • act_deterministically (bool) – If set true, choose most probable actions in act method.
  • batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
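
A rough construction sketch (assumptions: the model class below follows the pi_and_v convention of A3CModel, chainerrl.links.MLP and chainerrl.policies.SoftmaxPolicy exist in your version, and obs_size=4 / n_actions=2 are placeholders for a CartPole-like task). In practice A3C is trained asynchronously with a shared model and optimizer; this sketch only shows how the constructor arguments fit together.

    import chainer
    import chainerrl
    from chainerrl.agents import a3c

    class A3CFFSoftmax(chainer.ChainList, a3c.A3CModel):
        """Small feed-forward softmax policy plus a state-value function."""

        def __init__(self, obs_size, n_actions, hidden_sizes=(64, 64)):
            self.pi = chainerrl.policies.SoftmaxPolicy(
                model=chainerrl.links.MLP(obs_size, n_actions, hidden_sizes))
            self.v = chainerrl.links.MLP(obs_size, 1, hidden_sizes)
            super().__init__(self.pi, self.v)

        def pi_and_v(self, state):
            return self.pi(state), self.v(state)

    model = A3CFFSoftmax(obs_size=4, n_actions=2)
    optimizer = chainer.optimizers.Adam()
    optimizer.setup(model)

    agent = chainerrl.agents.A3C(model, optimizer, t_max=5, gamma=0.99, beta=1e-2)
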
class chainerrl.agents.ACER(model, optimizer, t_max, gamma, replay_buffer, beta=0.01, phi=<function <lambda>>, pi_loss_coef=1.0, Q_loss_coef=0.5, use_trust_region=True, trust_region_alpha=0.99, trust_region_delta=1, truncation_threshold=10, disable_online_update=False, n_times_replay=8, replay_start_size=10000, normalize_loss_by_steps=True, act_deterministically=False, use_Q_opc=False, average_entropy_decay=0.999, average_value_decay=0.999, average_kl_decay=0.999, logger=None)[source]

ACER (Actor-Critic with Experience Replay).

See http://arxiv.org/abs/1611.01224

Parameters:
  • model (ACERModel) – Model to train. It must be a callable that accepts observations as input and returns three values: action distributions (Distribution), Q values (ActionValue) and state values (chainer.Variable).
  • optimizer (chainer.Optimizer) – optimizer used to train the model
  • t_max (int) – The model is updated after every t_max local steps
  • gamma (float) – Discount factor [0,1]
  • replay_buffer (EpisodicReplayBuffer) – Replay buffer to use. If set None, this agent won’t use experience replay.
  • beta (float) – Weight coefficient for the entropy regularization term.
  • phi (callable) – Feature extractor function
  • pi_loss_coef (float) – Weight coefficient for the loss of the policy
  • Q_loss_coef (float) – Weight coefficient for the loss of the value function
  • use_trust_region (bool) – If set true, use efficient TRPO.
  • trust_region_alpha (float) – Decay rate of the average model used for efficient TRPO.
  • trust_region_delta (float) – Threshold used for efficient TRPO.
  • truncation_threshold (float or None) – Threshold used to truncate larger importance weights. If set None, importance weights are not truncated.
  • disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
  • n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
  • replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
  • normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
  • act_deterministically (bool) – If set true, choose most probable actions in act method.
  • use_Q_opc (bool) – If set true, Q_opc, a Q-value estimate without importance sampling, is used to compute advantage values for policy gradients. The original paper recommends using it for continuous-action problems.
  • average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
  • average_value_decay (float) – Decay rate of average value. Used only to record statistics.
  • average_kl_decay (float) – Decay rate of the average KL divergence. Used only to record statistics.
class chainerrl.agents.AL(*args, **kwargs)[source]

Advantage Learning.

See: http://arxiv.org/abs/1512.04860.

Parameters:alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1).

For other arguments, see DQN.

class chainerrl.agents.DDPG(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

Deep Deterministic Policy Gradients.

This can be used as SVG(0) by specifying a Gaussian policy instead of a deterministic policy.

Parameters:
  • model (DDPGModel) – DDPG model that contains both a policy and a Q-function
  • actor_optimizer (Optimizer) – Optimizer setup with the policy
  • critic_optimizer (Optimizer) – Optimizer setup with the Q-function
  • replay_buffer (ReplayBuffer) – Replay buffer
  • gamma (float) – Discount factor
  • explorer (Explorer) – Explorer that specifies an exploration strategy.
  • gpu (int) – GPU device id if not None nor negative.
  • replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
  • minibatch_size (int) – Minibatch size
  • update_interval (int) – Model update interval in step
  • target_update_interval (int) – Target model update interval in step
  • phi (callable) – Feature extractor applied to observations
  • target_update_method (str) – ‘hard’ or ‘soft’.
  • soft_update_tau (float) – Tau of soft target update.
  • n_times_update (int) – Number of repetition of update
  • average_q_decay (float) – Decay rate of average Q, only used for recording statistics
  • average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
  • batch_accumulator (str) – ‘mean’ or ‘sum’
  • episodic_update (bool) – Use full episodes for update if set True
  • episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
  • logger (Logger) – Logger used
  • batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
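
A rough construction sketch for a continuous-action task. The policy, Q-function, model wrapper and explorer classes named here (FCDeterministicPolicy, FCSAQFunction, DDPGModel, AdditiveOU) and their exact signatures are assumed from the library's examples and may differ between versions; hyperparameters are placeholders.

    import chainer
    import gym
    import chainerrl
    from chainerrl.agents.ddpg import DDPG, DDPGModel

    env = gym.make('Pendulum-v0')
    obs_size = env.observation_space.low.size
    action_size = env.action_space.low.size

    # Deterministic policy pi(s) and Q-function Q(s, a) (assumed signatures).
    pi = chainerrl.policies.FCDeterministicPolicy(
        obs_size, action_size=action_size,
        n_hidden_channels=64, n_hidden_layers=2,
        min_action=env.action_space.low, max_action=env.action_space.high,
        bound_action=True)
    q_func = chainerrl.q_functions.FCSAQFunction(
        obs_size, action_size, n_hidden_channels=64, n_hidden_layers=2)
    model = DDPGModel(policy=pi, q_func=q_func)

    actor_opt = chainer.optimizers.Adam(alpha=1e-4)
    actor_opt.setup(pi)
    critic_opt = chainer.optimizers.Adam(alpha=1e-3)
    critic_opt.setup(q_func)

    # Ornstein-Uhlenbeck noise is a common exploration choice for DDPG.
    explorer = chainerrl.explorers.AdditiveOU(sigma=0.2)
    replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

    agent = DDPG(model, actor_opt, critic_opt, replay_buffer,
                 gamma=0.99, explorer=explorer, replay_start_size=5000,
                 target_update_method='soft', soft_update_tau=1e-2)
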
class chainerrl.agents.DoubleDQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

Double DQN.

See: http://arxiv.org/abs/1509.06461.

class chainerrl.agents.DoublePAL(*args, **kwargs)[source]
class chainerrl.agents.DPP(*args, **kwargs)[source]

Dynamic Policy Programming with softmax operator.

Parameters:eta (float) – Positive constant.

For other arguments, see DQN.

class chainerrl.agents.DQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

Deep Q-Network algorithm.

Parameters:
  • q_function (StateQFunction) – Q-function
  • optimizer (Optimizer) – Optimizer that is already setup
  • replay_buffer (ReplayBuffer) – Replay buffer
  • gamma (float) – Discount factor
  • explorer (Explorer) – Explorer that specifies an exploration strategy.
  • gpu (int) – GPU device id if not None nor negative.
  • replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
  • minibatch_size (int) – Minibatch size
  • update_interval (int) – Model update interval in step
  • target_update_interval (int) – Target model update interval in step
  • clip_delta (bool) – Clip delta if set True
  • phi (callable) – Feature extractor applied to observations
  • target_update_method (str) – ‘hard’ or ‘soft’.
  • soft_update_tau (float) – Tau of soft target update.
  • n_times_update (int) – Number of repetition of update
  • average_q_decay (float) – Decay rate of average Q, only used for recording statistics
  • average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
  • batch_accumulator (str) – ‘mean’ or ‘sum’
  • episodic_update (bool) – Use full episodes for update if set True
  • episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
  • logger (Logger) – Logger used
  • batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
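
A minimal construction sketch for a discrete-action task, assuming the helper classes named below (FCStateQFunctionWithDiscreteAction, ConstantEpsilonGreedy, ReplayBuffer) exist under these names in your version; hyperparameters are placeholders. DoubleDQN and SARSA share this constructor, while AL, DoublePAL, DPP and PAL add their own extra arguments on top of it.

    import chainer
    import gym
    import numpy as np
    import chainerrl

    env = gym.make('CartPole-v0')
    obs_size = env.observation_space.low.size
    n_actions = env.action_space.n

    # Fully-connected Q-function over state inputs with discrete actions.
    q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
        obs_size, n_actions, n_hidden_channels=64, n_hidden_layers=2)

    optimizer = chainer.optimizers.Adam(eps=1e-2)
    optimizer.setup(q_func)

    replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 5)
    explorer = chainerrl.explorers.ConstantEpsilonGreedy(
        epsilon=0.1, random_action_func=env.action_space.sample)

    agent = chainerrl.agents.DQN(
        q_func, optimizer, replay_buffer, gamma=0.99, explorer=explorer,
        replay_start_size=500, update_interval=1, target_update_interval=100,
        phi=lambda x: x.astype(np.float32, copy=False))
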
class chainerrl.agents.NSQ(q_function, optimizer, t_max, gamma, i_target, explorer, phi=<function <lambda>>, average_q_decay=0.999, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

Asynchronous N-step Q-Learning.

See http://arxiv.org/abs/1602.01783

Parameters:
  • q_function (StateQFunction) – Q-function to train
  • optimizer (chainer.Optimizer) – optimizer used to train the model
  • t_max (int) – The model is updated after every t_max local steps
  • gamma (float) – Discount factor [0,1]
  • i_target (int) – The target model is updated after every i_target global steps
  • explorer (Explorer) – Explorer to use in training
  • phi (callable) – Feature extractor function
  • average_q_decay (float) – Decay rate of average Q, only used for recording statistics
  • batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
class chainerrl.agents.PAL(*args, **kwargs)[source]

Persistent Advantage Learning.

See: http://arxiv.org/abs/1512.04860.

Parameters:alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1).

For other arguments, see DQN.

class chainerrl.agents.PCL(model, optimizer, replay_buffer=None, t_max=None, gamma=0.99, tau=0.01, phi=<function <lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, rollout_len=10, batchsize=1, disable_online_update=False, n_times_replay=1, replay_start_size=100, normalize_loss_by_steps=True, act_deterministically=False, average_loss_decay=0.999, average_entropy_decay=0.999, average_value_decay=0.999, explorer=None, logger=None, batch_states=<function batch_states>, backprop_future_values=True, train_async=False)[source]

PCL (Path Consistency Learning).

Both the batch PCL algorithm proposed in the paper and its asynchronous variant are implemented.

See https://arxiv.org/abs/1702.08892

Parameters:
  • model (chainer.Link) –

    Model to train. It must be a callable that accepts a batch of observations as input and returns two values:

    • action distributions (Distribution)
    • state values (chainer.Variable)
  • optimizer (chainer.Optimizer) – optimizer used to train the model
  • t_max (int or None) – The model is updated after every t_max local steps. If set None, the model is updated after every episode.
  • gamma (float) – Discount factor [0,1]
  • tau (float) – Weight coefficient for the entropy regularization term.
  • phi (callable) – Feature extractor function
  • pi_loss_coef (float) – Weight coefficient for the loss of the policy
  • v_loss_coef (float) – Weight coefficient for the loss of the value function
  • rollout_len (int) – Number of rollout steps
  • batchsize (int) – Number of episodes or sub-trajectories used for an update. The total number of transitions used will be (batchsize x t_max).
  • disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
  • n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
  • replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
  • normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
  • act_deterministically (bool) – If set true, choose most probable actions in act method.
  • average_loss_decay (float) – Decay rate of average loss. Used only to record statistics.
  • average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
  • average_value_decay (float) – Decay rate of average value. Used only to record statistics.
  • explorer (Explorer or None) – If not None, this explorer is used for selecting actions.
  • logger (None or Logger) – Logger to be used
  • batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
  • backprop_future_values (bool) – If set True, value gradients are computed not only wrt V(s_t) but also V(s_{t+d}).
  • train_async (bool) – If set True, use a process-local model to compute gradients and update the globally shared model.
class chainerrl.agents.PGT(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, beta=0.01, act_deterministically=False, gpu=-1, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

Policy Gradient Theorem with an approximate policy and a Q-function.

This agent is almost the same as DDPG, except that it uses likelihood-ratio gradient estimation instead of value gradients.

Parameters:
  • model (chainer.Chain) – Chain that contains both a policy and a Q-function
  • actor_optimizer (Optimizer) – Optimizer setup with the policy
  • critic_optimizer (Optimizer) – Optimizer setup with the Q-function
  • replay_buffer (ReplayBuffer) – Replay buffer
  • gamma (float) – Discount factor
  • explorer (Explorer) – Explorer that specifies an exploration strategy.
  • gpu (int) – GPU device id. -1 for CPU.
  • replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
  • minibatch_size (int) – Minibatch size
  • update_interval (int) – Model update interval in step
  • target_update_interval (int) – Target model update interval in step
  • phi (callable) – Feature extractor applied to observations
  • target_update_method (str) – ‘hard’ or ‘soft’.
  • soft_update_tau (float) – Tau of soft target update.
  • n_times_update (int) – Number of repetition of update
  • average_q_decay (float) – Decay rate of average Q, only used for recording statistics
  • average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
  • batch_accumulator (str) – ‘mean’ or ‘sum’
  • logger (Logger) – Logger used
  • beta (float) – Coefficient for entropy regularization
  • act_deterministically (bool) – Act deterministically by selecting most probable actions in test time
  • batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
class chainerrl.agents.REINFORCE(model, optimizer, beta=0, phi=<function <lambda>>, batchsize=1, act_deterministically=False, average_entropy_decay=0.999, backward_separately=False, batch_states=<function batch_states>, logger=None)[source]

Williams's episodic REINFORCE.

Parameters:
  • model (Policy) – Model to train. It must be a callable that accepts observations as input and returns action distributions (Distribution).
  • optimizer (chainer.Optimizer) – optimizer used to train the model
  • beta (float) – Weight coefficient for the entropy regularization term.
  • normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
  • act_deterministically (bool) – If set true, choose most probable actions in act method.
  • batchsize (int) – Number of episodes used for each update
  • backward_separately (bool) – If set true, call backward separately for each episode and accumulate only gradients.
  • average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
  • batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states
  • logger (logging.Logger) – Logger to be used.
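
A brief sketch (assuming chainerrl.policies.SoftmaxPolicy and chainerrl.links.MLP are available) of building a model that returns a Distribution, as REINFORCE expects; hyperparameters are placeholders.

    import chainer
    import gym
    import chainerrl

    env = gym.make('CartPole-v0')
    obs_size = env.observation_space.low.size
    n_actions = env.action_space.n

    # SoftmaxPolicy wraps an MLP so that calling the model on a batch of
    # observations returns a Distribution over discrete actions.
    model = chainerrl.policies.SoftmaxPolicy(
        model=chainerrl.links.MLP(obs_size, n_actions, hidden_sizes=(64,)))

    optimizer = chainer.optimizers.Adam()
    optimizer.setup(model)

    agent = chainerrl.agents.REINFORCE(
        model, optimizer,
        batchsize=10,              # number of episodes per update
        backward_separately=True)  # backprop per episode, accumulate gradients
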
class chainerrl.agents.ResidualDQN(*args, **kwargs)[source]

DQN variant that also backpropagates gradients through the max Q-value of the successor state, instead of treating it as a constant target.

class chainerrl.agents.SARSA(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)[source]

SARSA.

Unlike DQN, this agent uses the actions that were actually taken to compute target Q values, and is thus an on-policy algorithm.