Agents
Agent interfaces
-
class chainerrl.agent.Agent
Abstract agent class.
-
act_and_train(obs, reward)
Select an action for training.
Returns: action
Return type: object
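The following is a minimal sketch of how a training loop might drive the act_and_train(obs, reward) method, assuming the reward argument carries the reward received for the previous action. RandomAgent, CountdownEnv, and run_episode are hypothetical names invented for this illustration and are not part of ChainerRL; handling of the final reward at episode end is outside the scope of this sketch.

```python
import random


class RandomAgent:
    """Hypothetical stand-in that follows the act_and_train(obs, reward) signature."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.total_reward = 0.0

    def act_and_train(self, obs, reward):
        # Assumption: `reward` is the reward obtained for the previous action.
        self.total_reward += reward
        return self.rng.randrange(self.n_actions)


class CountdownEnv:
    """Hypothetical toy environment whose episodes end after `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0
        done = self.t >= self.length
        return self.t, reward, done


def run_episode(agent, env):
    """Run one episode and return the list of actions the agent selected."""
    obs = env.reset()
    reward = 0.0  # no reward is available before the first action
    done = False
    actions = []
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done = env.step(action)
        actions.append(action)
    return actions


actions = run_episode(RandomAgent(n_actions=2), CountdownEnv(length=5))
```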
-
Agent implementations
-
class chainerrl.agents.A3C(model, optimizer, t_max, gamma, beta=0.01, process_idx=0, phi=<function <lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, keep_loss_scale_same=False, normalize_grad_by_t_max=False, use_average_reward=False, average_reward_tau=0.01, act_deterministically=False, average_entropy_decay=0.999, average_value_decay=0.999, batch_states=<function batch_states>)
A3C: Asynchronous Advantage Actor-Critic.
See http://arxiv.org/abs/1602.01783
Parameters:
- model (A3CModel) – Model to train
- optimizer (chainer.Optimizer) – optimizer used to train the model
- t_max (int) – The model is updated after every t_max local steps
- gamma (float) – Discount factor in [0, 1]
- beta (float) – Weight coefficient for the entropy regularization term.
- process_idx (int) – Index of the process.
- phi (callable) – Feature extractor function
- pi_loss_coef (float) – Weight coefficient for the loss of the policy
- v_loss_coef (float) – Weight coefficient for the loss of the value function
- act_deterministically (bool) – If set true, choose most probable actions in act method.
- batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
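To illustrate how t_max and gamma interact in n-step actor-critic methods such as A3C, here is a minimal sketch of the discounted returns computed over one rollout of local steps. The function name n_step_returns and the bootstrap_value argument are assumptions made for this example; it is not ChainerRL's implementation.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted returns for a rollout of len(rewards) local steps.

    bootstrap_value is the value estimate of the state reached after the
    rollout (0.0 if that state is terminal).
    """
    returns = []
    running = bootstrap_value
    # Work backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# A rollout of t_max = 3 steps.
returns = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=0.5, gamma=0.5)
```

Subtracting the value estimate of each visited state from the corresponding return gives the advantage used to weight the policy gradient.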
-
class chainerrl.agents.ACER(model, optimizer, t_max, gamma, replay_buffer, beta=0.01, phi=<function <lambda>>, pi_loss_coef=1.0, Q_loss_coef=0.5, use_trust_region=True, trust_region_alpha=0.99, trust_region_delta=1, truncation_threshold=10, disable_online_update=False, n_times_replay=8, replay_start_size=10000, normalize_loss_by_steps=True, act_deterministically=False, use_Q_opc=False, average_entropy_decay=0.999, average_value_decay=0.999, average_kl_decay=0.999, logger=None)
ACER (Actor-Critic with Experience Replay).
See http://arxiv.org/abs/1611.01224
Parameters:
- model (ACERModel) – Model to train. It must be a callable that accepts observations as input and returns three values: action distributions (Distribution), Q values (ActionValue) and state values (chainer.Variable).
- optimizer (chainer.Optimizer) – optimizer used to train the model
- t_max (int) – The model is updated after every t_max local steps
- gamma (float) – Discount factor in [0, 1]
- replay_buffer (EpisodicReplayBuffer) – Replay buffer to use. If set to None, this agent won’t use experience replay.
- beta (float) – Weight coefficient for the entropy regularization term.
- phi (callable) – Feature extractor function
- pi_loss_coef (float) – Weight coefficient for the loss of the policy
- Q_loss_coef (float) – Weight coefficient for the loss of the value function
- use_trust_region (bool) – If set true, use efficient TRPO.
- trust_region_alpha (float) – Decay rate of the average model used for efficient TRPO.
- trust_region_delta (float) – Threshold used for efficient TRPO.
- truncation_threshold (float or None) – Threshold used to truncate larger importance weights. If set None, importance weights are not truncated.
- disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
- n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
- replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
- normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
- act_deterministically (bool) – If set true, choose most probable actions in act method.
- use_Q_opc (bool) – If set true, Q_opc, a Q-value estimate without importance sampling, is used to compute advantage values for policy gradients. The original paper recommends this for continuous action spaces.
- average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
- average_value_decay (float) – Decay rate of average value. Used only to record statistics.
- average_kl_decay (float) – Decay rate of the average KL value. Used only to record statistics.
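The effect of truncation_threshold can be sketched as follows, assuming the importance weight is the ratio of the current policy's action probability to the behaviour policy's. The function and argument names are hypothetical, and the bias-correction term that ACER pairs with truncation is omitted.

```python
def truncated_importance_weight(pi_prob, mu_prob, truncation_threshold=10):
    """Importance weight pi/mu, capped at truncation_threshold.

    pi_prob: probability of the action under the current policy.
    mu_prob: probability of the same action under the behaviour policy.
    If truncation_threshold is None, the weight is returned untruncated.
    """
    rho = pi_prob / mu_prob
    if truncation_threshold is None:
        return rho
    return min(rho, truncation_threshold)


capped = truncated_importance_weight(0.9, 0.01)    # raw ratio is about 90
uncapped = truncated_importance_weight(0.5, 0.25)  # raw ratio is 2, below the cap
```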
-
class chainerrl.agents.AL(*args, **kwargs)
Advantage Learning.
See: http://arxiv.org/abs/1512.04860.
Parameters: alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1). For other arguments, see DQN.
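As a rough illustration of the role of alpha, the sketch below computes an advantage-learning target as the usual Bellman target minus alpha times the action gap at the current state, which is how the cited paper defines the operator as far as this sketch assumes. The function al_target and its arguments are hypothetical; details such as which network supplies each Q-value are left out.

```python
def al_target(q_s, action, reward, q_next, gamma, alpha, done=False):
    """Advantage-learning target for one transition.

    q_s:    Q-values of all actions in the current state.
    q_next: Q-values of all actions in the next state.
    """
    bellman = reward + (0.0 if done else gamma * max(q_next))
    # The action gap is zero for the greedy action and positive otherwise.
    action_gap = max(q_s) - q_s[action]
    return bellman - alpha * action_gap


greedy = al_target([1.0, 3.0], 1, 1.0, [2.0, 4.0], gamma=0.5, alpha=0.5)
non_greedy = al_target([1.0, 3.0], 0, 1.0, [2.0, 4.0], gamma=0.5, alpha=0.5)
```

With alpha = 0 the target reduces to the ordinary Q-learning target; larger alpha lowers the targets of non-greedy actions.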
-
class chainerrl.agents.DDPG(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)
Deep Deterministic Policy Gradients.
This can be used as SVG(0) by specifying a Gaussian policy instead of a deterministic policy.
Parameters:
- model (DDPGModel) – DDPG model that contains both a policy and a Q-function
- actor_optimizer (Optimizer) – Optimizer setup with the policy
- critic_optimizer (Optimizer) – Optimizer setup with the Q-function
- replay_buffer (ReplayBuffer) – Replay buffer
- gamma (float) – Discount factor
- explorer (Explorer) – Explorer that specifies an exploration strategy.
- gpu (int) – GPU device ID. Used only if it is neither None nor negative.
- replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
- minibatch_size (int) – Minibatch size
- update_interval (int) – Model update interval in step
- target_update_interval (int) – Target model update interval in step
- phi (callable) – Feature extractor applied to observations
- target_update_method (str) – ‘hard’ or ‘soft’.
- soft_update_tau (float) – Tau of soft target update.
- n_times_update (int) – Number of times the update is repeated
- average_q_decay (float) – Decay rate of average Q, only used for recording statistics
- average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
- batch_accumulator (str) – ‘mean’ or ‘sum’
- episodic_update (bool) – Use full episodes for update if set True
- episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
- logger (Logger) – Logger used
- batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
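The difference between the 'hard' and 'soft' values of target_update_method can be sketched with plain lists of parameters. This is an illustration under the usual definitions (a hard update copies the parameters; a soft update moves the target a fraction tau towards the source); update_target is a hypothetical helper, not a ChainerRL function.

```python
def update_target(source, target, method="hard", tau=0.01):
    """Return new target parameters given source (online) parameters."""
    if method == "hard":
        # Copy the online parameters outright.
        return list(source)
    if method == "soft":
        # Move each target parameter a fraction tau towards the source.
        return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]
    raise ValueError("method must be 'hard' or 'soft'")


hard = update_target([1.0, 2.0], [0.0, 0.0], method="hard")
soft = update_target([1.0, 2.0], [0.0, 0.0], method="soft", tau=0.5)
```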
-
class chainerrl.agents.DoubleDQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)
Double DQN.
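For readers unfamiliar with the algorithm, the sketch below contrasts the standard Double DQN target with the plain DQN target: the online network chooses the next action, and the target network evaluates it. The function names and arguments are hypothetical and only illustrate the technique, not this class's internals.

```python
def dqn_target(reward, q_next_target, gamma, done=False):
    """DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[best_action]


plain = dqn_target(1.0, [4.0, 2.0], gamma=0.5)
double = double_dqn_target(1.0, [1.0, 5.0], [4.0, 2.0], gamma=0.5)
```

Here the online network prefers action 1, so the Double DQN target uses the target network's value for action 1 rather than its maximum.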
-
class chainerrl.agents.DPP(*args, **kwargs)
Dynamic Policy Programming with softmax operator.
Parameters: eta (float) – Positive constant. For other arguments, see DQN.
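As a hedged illustration of what eta controls, the sketch below assumes that "softmax operator" refers to the Boltzmann-weighted average of action values used in the DPP literature; the source does not define it. The function name boltzmann_softmax is hypothetical.

```python
import math


def boltzmann_softmax(q_values, eta):
    """Average of q_values weighted by exp(eta * q).

    Larger eta puts more weight on the largest values, so the result
    approaches max(q_values).
    """
    shift = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(eta * (q - shift)) for q in q_values]
    normaliser = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / normaliser


soft_value = boltzmann_softmax([0.0, 1.0], eta=1.0)
near_max = boltzmann_softmax([0.0, 1.0], eta=100.0)
```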
-
class chainerrl.agents.DQN(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)
Deep Q-Network algorithm.
Parameters:
- q_function (StateQFunction) – Q-function
- optimizer (Optimizer) – Optimizer that is already setup
- replay_buffer (ReplayBuffer) – Replay buffer
- gamma (float) – Discount factor
- explorer (Explorer) – Explorer that specifies an exploration strategy.
- gpu (int) – GPU device ID. Used only if it is neither None nor negative.
- replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
- minibatch_size (int) – Minibatch size
- update_interval (int) – Model update interval in step
- target_update_interval (int) – Target model update interval in step
- clip_delta (bool) – Clip delta if set True
- phi (callable) – Feature extractor applied to observations
- target_update_method (str) – ‘hard’ or ‘soft’.
- soft_update_tau (float) – Tau of soft target update.
- n_times_update (int) – Number of times the update is repeated
- average_q_decay (float) – Decay rate of average Q, only used for recording statistics
- average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
- batch_accumulator (str) – ‘mean’ or ‘sum’
- episodic_update (bool) – Use full episodes for update if set True
- episodic_update_len (int or None) – Subsequences of this length are used for update if set int and episodic_update=True
- logger (Logger) – Logger used
- batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
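The interaction of clip_delta and batch_accumulator can be sketched as below. This assumes that clipping means a Huber-style loss on the temporal-difference error and that the unclipped loss is half the squared error; the source states neither, so treat both as assumptions. The names huber and value_loss are hypothetical.

```python
def huber(error, delta=1.0):
    """Quadratic near zero, linear beyond |error| > delta."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def value_loss(predictions, targets, clip_delta=True, batch_accumulator="mean"):
    """Loss over a minibatch of Q-value predictions and targets."""
    errors = [p - t for p, t in zip(predictions, targets)]
    if clip_delta:
        losses = [huber(e) for e in errors]
    else:
        losses = [0.5 * e * e for e in errors]
    total = sum(losses)
    if batch_accumulator == "mean":
        return total / len(losses)
    if batch_accumulator == "sum":
        return total
    raise ValueError("batch_accumulator must be 'mean' or 'sum'")


clipped_mean = value_loss([0.0, 3.0], [0.5, 0.0])
unclipped_sum = value_loss([0.0, 3.0], [0.5, 0.0], clip_delta=False,
                           batch_accumulator="sum")
```

Clipping limits the influence of the large error (3.0) on the gradient, which is the usual motivation for this option.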
-
class chainerrl.agents.NSQ(q_function, optimizer, t_max, gamma, i_target, explorer, phi=<function <lambda>>, average_q_decay=0.999, logger=<logging.Logger object>, batch_states=<function batch_states>)
Asynchronous N-step Q-Learning.
See http://arxiv.org/abs/1602.01783
Parameters:
- q_function (A3CModel) – Model to train
- optimizer (chainer.Optimizer) – optimizer used to train the model
- t_max (int) – The model is updated after every t_max local steps
- gamma (float) – Discount factor in [0, 1]
- i_target (int) – The target model is updated after every i_target global steps
- explorer (Explorer) – Explorer to use in training
- phi (callable) – Feature extractor function
- average_q_decay (float) – Decay rate of average Q, only used for recording statistics
- batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
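Decay parameters such as average_q_decay are described as being used only for recording statistics. A common way to keep such a statistic is an exponential moving average, sketched below under that assumption; update_average is a hypothetical helper.

```python
def update_average(average, new_value, decay=0.999):
    """Exponential moving average: keep `decay` of the old value."""
    return decay * average + (1.0 - decay) * new_value


average_q = 0.0
for q in [2.0, 2.0]:
    # A small decay is used here so the effect is visible in two steps.
    average_q = update_average(average_q, q, decay=0.5)
```

A decay close to 1, such as the default 0.999, makes the recorded average change slowly and smooths out noise in individual Q-values.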
-
class chainerrl.agents.PAL(*args, **kwargs)
Persistent Advantage Learning.
See: http://arxiv.org/abs/1512.04860.
Parameters: alpha (float) – Weight of (persistent) advantages. Convergence is guaranteed only for alpha in [0, 1). For other arguments, see DQN.
-
class chainerrl.agents.PCL(model, optimizer, replay_buffer=None, t_max=None, gamma=0.99, tau=0.01, phi=<function <lambda>>, pi_loss_coef=1.0, v_loss_coef=0.5, rollout_len=10, batchsize=1, disable_online_update=False, n_times_replay=1, replay_start_size=100, normalize_loss_by_steps=True, act_deterministically=False, average_loss_decay=0.999, average_entropy_decay=0.999, average_value_decay=0.999, explorer=None, logger=None, batch_states=<function batch_states>, backprop_future_values=True, train_async=False)
PCL (Path Consistency Learning).
Both the batch PCL algorithm proposed in the paper and its asynchronous variant are implemented.
See https://arxiv.org/abs/1702.08892
Parameters:
- model (chainer.Link) – Model to train. It must be a callable that accepts a batch of observations as input and returns two values:
  - action distributions (Distribution)
  - state values (chainer.Variable)
- optimizer (chainer.Optimizer) – optimizer used to train the model
- t_max (int or None) – The model is updated after every t_max local steps. If set None, the model is updated after every episode.
- gamma (float) – Discount factor in [0, 1]
- tau (float) – Weight coefficient for the entropy regularization term.
- phi (callable) – Feature extractor function
- pi_loss_coef (float) – Weight coefficient for the loss of the policy
- v_loss_coef (float) – Weight coefficient for the loss of the value function
- rollout_len (int) – Number of rollout steps
- batchsize (int) – Number of episodes or sub-trajectories used for an update. The total number of transitions used will be (batchsize x t_max).
- disable_online_update (bool) – If set true, disable online on-policy update and rely only on experience replay.
- n_times_replay (int) – Number of times experience replay is repeated per one time of online update.
- replay_start_size (int) – Experience replay is disabled if the number of transitions in the replay buffer is lower than this value.
- normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
- act_deterministically (bool) – If set true, choose most probable actions in act method.
- average_loss_decay (float) – Decay rate of average loss. Used only to record statistics.
- average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
- average_value_decay (float) – Decay rate of average value. Used only to record statistics.
- explorer (Explorer or None) – If not None, this explorer is used for selecting actions.
- logger (None or Logger) – Logger to be used
- batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
- backprop_future_values (bool) – If set True, value gradients are computed not only wrt V(s_t) but also V(s_{t+d}).
- train_async (bool) – If set True, use a process-local model to compute gradients and update the globally shared model.
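To show how gamma, tau, and rollout_len enter the method, here is a minimal sketch of the path consistency error for one sub-trajectory, following the definition in the cited paper as this sketch understands it. The function name and arguments are hypothetical, and the sketch does not reproduce this class's loss computation.

```python
def path_consistency(value_start, value_end, rewards, log_probs, gamma, tau):
    """Consistency error over a sub-trajectory of d = len(rewards) steps.

    value_start: V(s_t); value_end: V(s_{t+d}).
    log_probs:   log pi(a_{t+j} | s_{t+j}) for each step j.
    """
    d = len(rewards)
    error = -value_start + (gamma ** d) * value_end
    for j, (r, log_p) in enumerate(zip(rewards, log_probs)):
        # Each reward is augmented with the entropy term -tau * log pi.
        error += (gamma ** j) * (r - tau * log_p)
    return error


error = path_consistency(1.0, 2.0, [1.0, 1.0], [-1.0, -2.0], gamma=0.5, tau=0.5)
loss = 0.5 * error ** 2
```

Training drives this error towards zero for sub-trajectories of length rollout_len; because value_end appears in the error, gradients can also flow to V(s_{t+d}), which is what backprop_future_values toggles.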
-
class chainerrl.agents.PGT(model, actor_optimizer, critic_optimizer, replay_buffer, gamma, explorer, beta=0.01, act_deterministically=False, gpu=-1, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, logger=<logging.Logger object>, batch_states=<function batch_states>)
Policy Gradient Theorem with an approximate policy and a Q-function.
This agent is almost the same as DDPG except that it uses likelihood ratio gradient estimation instead of value gradients.
Parameters:
- model (chainer.Chain) – Chain that contains both a policy and a Q-function
- actor_optimizer (Optimizer) – Optimizer setup with the policy
- critic_optimizer (Optimizer) – Optimizer setup with the Q-function
- replay_buffer (ReplayBuffer) – Replay buffer
- gamma (float) – Discount factor
- explorer (Explorer) – Explorer that specifies an exploration strategy.
- gpu (int) – GPU device id. -1 for CPU.
- replay_start_size (int) – if the replay buffer’s size is less than replay_start_size, skip update
- minibatch_size (int) – Minibatch size
- update_interval (int) – Model update interval in step
- target_update_interval (int) – Target model update interval in step
- phi (callable) – Feature extractor applied to observations
- target_update_method (str) – ‘hard’ or ‘soft’.
- soft_update_tau (float) – Tau of soft target update.
- n_times_update (int) – Number of times the update is repeated
- average_q_decay (float) – Decay rate of average Q, only used for recording statistics
- average_loss_decay (float) – Decay rate of average loss, only used for recording statistics
- batch_accumulator (str) – ‘mean’ or ‘sum’
- logger (Logger) – Logger used
- beta (float) – Coefficient for entropy regularization
- act_deterministically (bool) – Act deterministically by selecting the most probable actions at test time
- batch_states (callable) – method which makes a batch of observations. default is chainerrl.misc.batch_states.batch_states
-
class chainerrl.agents.REINFORCE(model, optimizer, beta=0, phi=<function <lambda>>, batchsize=1, act_deterministically=False, average_entropy_decay=0.999, backward_separately=False, batch_states=<function batch_states>, logger=None)
Williams's episodic REINFORCE.
Parameters:
- model (Policy) – Model to train. It must be a callable that accepts observations as input and returns action distributions (Distribution).
- optimizer (chainer.Optimizer) – optimizer used to train the model
- beta (float) – Weight coefficient for the entropy regularization term.
- normalize_loss_by_steps (bool) – If set true, losses are normalized by the number of steps taken to accumulate the losses
- act_deterministically (bool) – If set true, choose most probable actions in act method.
- batchsize (int) – Number of episodes used for each update
- backward_separately (bool) – If set true, call backward separately for each episode and accumulate only gradients.
- average_entropy_decay (float) – Decay rate of average entropy. Used only to record statistics.
- batch_states (callable) – Method which makes a batch of observations. default is chainerrl.misc.batch_states
- logger (logging.Logger) – Logger to be used.
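The episodic REINFORCE objective with an entropy bonus weighted by beta can be sketched as follows. The sketch assumes undiscounted reward-to-go (the signature above has no discount parameter) and takes log-probabilities and entropies as plain floats; reinforce_loss is a hypothetical helper, not this class's code.

```python
def reinforce_loss(rewards, log_probs, entropies, beta=0.0):
    """Loss for one episode: -sum_t log pi(a_t|s_t) * R_t - beta * entropy_t.

    R_t is the undiscounted sum of rewards from step t to the episode end.
    """
    loss = 0.0
    reward_to_go = 0.0
    steps = zip(reversed(rewards), reversed(log_probs), reversed(entropies))
    for r, log_p, entropy in steps:
        reward_to_go += r
        loss -= log_p * reward_to_go
        loss -= beta * entropy
    return loss


no_bonus = reinforce_loss([1.0, 2.0], [-0.5, -0.25], [1.0, 1.0])
with_bonus = reinforce_loss([1.0, 2.0], [-0.5, -0.25], [1.0, 1.0], beta=0.5)
```

With batchsize greater than 1, losses of this form from several episodes would be combined before a single update.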
-
class chainerrl.agents.ResidualDQN(*args, **kwargs)
DQN that also allows gradients to backpropagate through max Q.
-
class chainerrl.agents.SARSA(q_function, optimizer, replay_buffer, gamma, explorer, gpu=None, replay_start_size=50000, minibatch_size=32, update_interval=1, target_update_interval=10000, clip_delta=True, phi=<function <lambda>>, target_update_method=u'hard', soft_update_tau=0.01, n_times_update=1, average_q_decay=0.999, average_loss_decay=0.99, batch_accumulator=u'mean', episodic_update=False, episodic_update_len=None, logger=<logging.Logger object>, batch_states=<function batch_states>)
SARSA.
Unlike DQN, this agent uses the actions that were actually taken to compute target Q-values, and is therefore an on-policy algorithm.
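The distinction can be made concrete with a small sketch of the two targets for a single transition. The function names are hypothetical and the Q-values are plain lists; this illustrates the standard definitions rather than this class's internals.

```python
def sarsa_target(reward, q_next, next_action, gamma, done=False):
    """On-policy target: uses the action actually taken in the next state."""
    if done:
        return reward
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, gamma, done=False):
    """Off-policy target used by DQN: uses the greedy next action."""
    if done:
        return reward
    return reward + gamma * max(q_next)


# The agent explored and took action 0 in the next state, not the greedy action 1.
on_policy = sarsa_target(1.0, [1.0, 3.0], next_action=0, gamma=0.5)
off_policy = q_learning_target(1.0, [1.0, 3.0], gamma=0.5)
```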