Table of Contents
- Concepts in Reinforcement Learning
- Difficulties in RL
- A3C Method Brief Introduction
- Policy-based Approach - Learn an Actor (Policy Gradient Method)
- 1. Decide Function of Actor Model (NN? ...)
- 2. Decide Goodness of this Function
- 3. Choose the best function
- On-Policy v.s. Off-Policy
- Importance Sampling (On-Policy → Off-Policy)
- PPO Algorithm —— Proximal Policy Optimization
- Value-based Approach - Learn a Critic
- Q-learning
- Double DQN
- Other Advanced Structures of Q-Learning
- A3C Method - Asynchronous Advantage Actor-Critic
- Advantage Actor-Critic (A2C Method)
- Asynchronous Advantage Actor-Critic (A3C Method)
- Sparse Reward
- Reward Shaping
- Curriculum Learning
- Hierarchical Reinforcement Learning
- ICM —— Intrinsic Curiosity Module
Concepts in Reinforcement Learning
- The main goal of Reinforcement Learning is to maximize the Total Reward. The Total Reward is the sum of all rewards in one episode, so the model doesn't know which steps in the episode are good and which are bad.
- Only a few actions can get a positive reward (e.g., firing and killing the enemy in a Space War game gets a positive reward, while moving gets none), so letting the model find these right actions is very important.
Difficulties in RL
- Reward Delay
  - Only "Fire" obtains a reward, but moving before firing is also important (moving has no reward). How can the model learn to move properly?
  - In a chess game, it may be better to sacrifice immediate reward to gain more long-term reward.
- Agent's actions may affect the environment
  - How to explore the world (observations) as much as possible.
  - How to explore the action combinations as much as possible.
A3C Method Brief Introduction
The A3C method is the most popular model that combines the policy-based and value-based methods; the structure is shown below. To learn the A3C model, we first need to know the concepts of policy-based and value-based approaches. The details of A3C are shown [here](#A3C Method - Asynchronous Advantage Actor-Critic).
Policy-based Approach - Learn an Actor (Policy Gradient Method)
This approach tries to learn a policy (also called an actor). It accepts an observation as input and outputs an action. The policy (actor) can be any model; if you use a Neural Network as your actor, then you are doing Deep Reinforcement Learning.

$$Input(Observation) \rightarrow Actor/Policy \rightarrow Output(Action)$$
There are three steps to build DRL:
1. Decide Function of Actor Model (NN? …)
Here we use the NN as our Actor, so:
- The input of this NN is the machine's observation, represented as a vector or matrix (e.g., image pixels to a matrix).
- The output of this NN is the action probability. The most important point is that we shouldn't always choose the action with the highest probability; the decision should be stochastic, sampled according to the probability distribution.
- The advantage of an NN over a Q-table: in some complex scenes we can't enumerate all observations (e.g., we can't list all pixel combinations of a game), so we use a Neural Network to guarantee that we can always obtain an output even if the observation never appeared in the training set.
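To make this concrete, here is a minimal sketch (assuming PyTorch, a small MLP, and a discrete action space; the names `Actor`, `obs_dim`, `n_actions` are illustrative) of an actor network that outputs action probabilities and samples stochastically rather than taking the argmax:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Maps an observation vector to a probability distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)   # action probabilities

# Stochastic decision: sample from the distribution instead of argmax
actor = Actor(obs_dim=4, n_actions=2)
obs = torch.randn(4)                  # placeholder observation
probs = actor(obs)
action = Categorical(probs).sample()  # sampled action index
```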
2. Decide Goodness of this Function
Since we use a Neural Network as our function model, we need to decide what the goodness of this model is (a standard to judge the performance of the current model). We use $\overline{R(\theta)}$ to express this standard, where $\theta$ is the parameters of the current model.
- Given an actor $\pi_\theta(t)$ with network parameters $\theta$, where $t$ is the observation (input).
- Use the actor $\pi_\theta(t)$ to play the video game until the game finishes.
- Sum all rewards in this episode and mark the sum as $R(\theta) = \sum_{t=1}^T r_t$.

Note: $R(\theta)$ is a random variable, because even if we use the same actor $\pi_\theta(t)$ to play the same game many times, we can get different values of $R(\theta)$ (random mechanisms in the game and stochastic action choices). So we want to maximize $\overline{R(\theta)}$, which is the expectation of $R(\theta)$.

- Use $\overline{R(\theta)}$ to express the goodness of $\pi_\theta(t)$.
How to Calculate $\overline{R(\theta)}$?
- An episode is considered as a trajectory $\tau$:
  - $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}$ → all the history in this episode
  - $R(\tau) = \sum_{t=1}^T r_t$
- Different $\tau$ have different probabilities of appearing; the probability of $\tau$ depends on the parameters $\theta$ of the actor $\pi_\theta(t)$. So we define the probability of $\tau$ as $P(\tau|\theta)$:

$$\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$$

- We use the actor $\pi_\theta(t)$ to play the game N times, obtaining the list $\{\tau^1, \tau^2, ..., \tau^N\}$. Each $\tau^n$ has a reward $R(\tau^n)$, and the mean of these $R(\tau^n)$ approximately equals the expectation $\overline{R(\theta)}$:

$$\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)$$
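A minimal sketch of this estimate, assuming a hypothetical `env` object with `reset()`/`step()` methods that return `(obs, reward, done)` and a hypothetical `actor.sample_action` that draws stochastically from $\pi_\theta$:

```python
def estimate_R_bar(env, actor, n_episodes):
    """Approximate R_bar(theta) as the mean total reward over N sampled episodes."""
    totals = []
    for _ in range(n_episodes):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            action = actor.sample_action(obs)    # stochastic choice from pi_theta
            obs, reward, done = env.step(action)
            total += reward                      # accumulate R(tau^n)
        totals.append(total)
    return sum(totals) / n_episodes              # (1/N) * sum_n R(tau^n)
```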
3. Choose the best function
Now we need to know how to optimize $\theta$; here we use the Gradient Ascent method.
- Problem statement:

$$\theta^* = \arg\max_\theta \overline{R(\theta)}, \qquad \overline{R(\theta)} = \sum_{\tau} P(\tau|\theta)R(\tau)$$

- Gradient ascent:
  - Start with $\theta^0$.
  - $\theta^1 = \theta^0 + \eta\nabla\overline{R(\theta^0)}$
  - $\theta^2 = \theta^1 + \eta\nabla\overline{R(\theta^1)}$
  - …
- $\theta$ consists of the parameters of the current Neural Network, $\theta = \{w_1, w_2, w_3, ..., b_1, b_2, b_3, ...\}$, and

$$\nabla R(\theta) = \left[ \frac{\partial R(\theta)}{\partial w_1},\ \frac{\partial R(\theta)}{\partial w_2},\ ...,\ \frac{\partial R(\theta)}{\partial b_1},\ \frac{\partial R(\theta)}{\partial b_2},\ ... \right]^\top
$$
It's time to calculate the gradient of $\overline{R(\theta)} = \sum_{\tau}P(\tau|\theta)R(\tau)$. Since $R(\tau)$ has nothing to do with $\theta$, the gradient can be expressed as:
$$\nabla\overline{R(\theta)} = \sum_\tau R(\tau)\nabla P(\tau|\theta) = \sum_\tau R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau)P(\tau|\theta)\nabla\log P(\tau|\theta)$$
Note: $\frac{d\log f(x)}{dx} = \frac{1}{f(x)}\frac{df(x)}{dx}$
Use the policy $\pi_\theta$ to play the game N times, obtaining $\{\tau^1, \tau^2, \tau^3, ...\}$; then:
$$\nabla\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log P(\tau^n|\theta)$$
How to calculate $\nabla\log P(\tau|\theta)$?
Since $\tau$ is the history of one episode:

$$P(\tau|\theta) = P(s_1)\prod_{t=1}^T P(a_t|s_t, \theta)P(r_t, s_{t+1}|s_t, a_t)$$

$$\log P(\tau|\theta) = \log P(s_1) + \sum_{t=1}^T \left[\log P(a_t|s_t, \theta) + \log P(r_t, s_{t+1}|s_t, a_t)\right]$$
Ignoring the terms not related to $\theta$:
$$\nabla\log P(\tau|\theta) = \sum_{t=1}^T \nabla\log P(a_t|s_t, \theta)$$
So the final result for $\nabla\overline{R(\theta)}$ is:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T R(\tau^n)\nabla\log P(a_t^n|s_t^n, \theta)$$
The meaning of this equation is very clear:
- If $R(\tau^n)$ is positive → tune $\theta$ to increase $P(a_t^n|s_t^n)$.
- If $R(\tau^n)$ is negative → tune $\theta$ to decrease $P(a_t^n|s_t^n)$.
This method can resolve the [Reward Delay Problem](#Difficulties in RL) from the Difficulties in RL chapter, because here we use the cumulative reward of one entire episode $R(\tau^n)$, not just the immediate reward after taking one action.
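As an illustration, here is a minimal sketch (assuming PyTorch, the illustrative `Actor` above, and pre-collected episode tensors) of the REINFORCE-style update implied by this gradient: maximize $\sum_t R(\tau)\log P(a_t|s_t,\theta)$ by minimizing its negative.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(actor, observations, actions, episode_return):
    """Surrogate loss whose gradient is -R(tau) * sum_t grad log P(a_t|s_t, theta)."""
    probs = actor(observations)                        # (T, n_actions)
    log_probs = Categorical(probs).log_prob(actions)   # (T,)
    return -(episode_return * log_probs).sum()

# Usage sketch: one gradient-ascent step on theta for a single sampled episode
# optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
# loss = policy_gradient_loss(actor, obs_batch, act_batch, R_tau)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```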
Add a Baseline - b
To avoid the situation where all $R(\tau^n)$ are positive (there should be some negative term to tell the model not to take a given action in a given state), we can add a baseline $b$. The equation changes to:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(R(\tau^n) - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Assign a Suitable Weight to Each Action
Using the total reward $R(\tau)$ to tune the probabilities of all actions in an episode also has a disadvantage, shown below:
The left picture shows an episode whose total reward R is 5, so the probabilities of all actions in this episode are increased (e.g., ×5). However, the main positive reward comes from $a_1$, while $a_2$ and $a_3$ give no positive reward, yet the probabilities of $a_2$ and $a_3$ are also increased. Similarly, in the right picture, $a_1$ is a bad action, but $a_2$ may not be, so the probability of $a_2$ shouldn't be decreased.
To avoid this problem, we assign a different $R$ to each $a_t$: the discounted sum of all rewards obtained after taking $a_t$. Now the equation becomes:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Note: $\gamma$ is called the discount factor, $\gamma < 1$.
We can use $A^\theta(s_t, a_t)$ to denote $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ in the above equation, which is called the Advantage Function. This function evaluates how good it is to take $a_t$ at state $s_t$ rather than other actions.
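A minimal sketch (assuming a plain list of per-step rewards and, for simplicity, a constant baseline) of computing these discounted reward-to-go weights:

```python
def discounted_returns(rewards, gamma=0.99, baseline=0.0):
    """Compute sum_{t'=t..T} gamma^(t'-t) * r_t' - b for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running - baseline
    return returns

# Example: rewards from one short episode
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```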
On-Policy v.s. Off-Policy
On-Policy and Off-Policy are two different modes of learning:
- On-Policy: the agent learns by interacting with the environment itself (learn from itself).
- Off-Policy: the agent learns by watching others interact with the environment (learn from others).
Our Policy Gradient Method is an On-Policy method, so why do we need the Off-Policy mode? Because we sample N episodes and take the mean to approximate the expectation $\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$. But when we update $\theta$, $P(\tau|\theta)$ changes, so we need to do the N samplings again and recompute the mean. This costs a lot of time after every update of $\theta$. The solution is to build a model $\pi_\theta$ that accepts training data collected by another model $\pi_{\theta'}$: use $\pi_{\theta'}$ to collect data and train $\theta$ with it. Since $\theta'$ doesn't change, the sampled data can be reused.
Importance Sampling (On-Policy → Off-Policy)
Importance Sampling is a method to estimate the expectation $E_{x\sim p}[f(x)]$ by sampling from another distribution $q(x)$. We already know that, with samples $x^i$ drawn from $p$:

$$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i)$$
But if we only have $\{x^i\}$ sampled from $q(x)$, how can we use these samples to estimate $E_{x\sim p}[f(x)]$? We can transform the equation above:

$$E_{x\sim p}[f(x)] = \int p(x)f(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = E_{x\sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]$$
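Before applying this to RL, a quick numerical sanity check (with hypothetical Gaussian choices for p and q) that reweighted samples from q recover the expectation under p:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(0, 1), proposal q = N(1, 2)
xs = rng.normal(1.0, 2.0, size=200_000)                      # samples from q
weights = gauss_pdf(xs, 0.0, 1.0) / gauss_pdf(xs, 1.0, 2.0)  # p(x) / q(x)
print(np.mean(f(xs) * weights))   # ≈ 1.0, which is E_{x~p}[x^2]
```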
That is, we can get the expectation under distribution $p(x)$ by sampling $\{x^i\}$ from another distribution $q(x)$; we only need a correction term, $\frac{p(x)}{q(x)}$, called the importance weight. Now we can treat our $\pi_\theta$ model as $p(x)$ and $\pi_{\theta'}$ as $q(x)$, and use $q(x)$ to sample data to tune $p(x)$:

$$\nabla\overline{R(\theta)} = E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)] = E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)\right]$$
Then we can use $\theta'$ to sample data and train $\theta$ many times; after many iterations, we update $\theta'$. Continuing to transform the equation:

$$E_{(s_t, a_t)\sim\pi_\theta}[A^\theta(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)] = E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\right]$$
Let $P_{\theta'}(s_t, a_t) = P_{\theta'}(a_t|s_t)P_{\theta'}(s_t)$ and $P_\theta(s_t, a_t) = P_\theta(a_t|s_t)P_\theta(s_t)$. We assume the probability of observing $s_t$ does not depend on the actor's parameters (ignoring how actions change the environment), so $P_\theta(s_t) = P_{\theta'}(s_t)$ and the equation becomes:

$$E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\right]$$
Here we define:

$$J^{\theta'}(\theta) = E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\right]$$
Note: Since we use $\theta'$ to sample data for $\theta$, the distribution of $\theta$ can't be very different from that of $\theta'$. How do we measure the difference between the two distributions and constrain the training when $\theta$ drifts too far from $\theta'$? Now let's learn the PPO Algorithm.
PPO Algorithm —— Proximal Policy Optimization
PPO is the solution to the above question; it avoids the problem arising from the difference between $\theta$ and $\theta'$. The objective function is shown below:

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta')$$

where $KL(\theta, \theta')$ is the KL divergence between the action distributions output by policy $\theta$ and policy $\theta'$. The algorithm flow is:
- Initialize the policy parameters $\theta^0$.
- In each iteration:
  - Use $\theta^k$ to interact with the environment, and collect $\{s_t, a_t\}$ to compute $A^{\theta^k}(s_t, a_t)$.
  - Update $J_{PPO}^{\theta^k}(\theta)$ several times: $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta KL(\theta, \theta^k)$
  - If $KL(\theta, \theta^k) > KL_{max}$, the KL term carries too much weight in the objective, so increase $\beta$.
  - If $KL(\theta, \theta^k) < KL_{min}$, the KL term carries too little weight, so decrease $\beta$.
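A minimal sketch (assuming PyTorch, a discrete-action `Actor` as sketched earlier, and pre-computed advantages $A^{\theta^k}(s_t, a_t)$) of this KL-penalty objective, with the adaptive $\beta$ rule noted in comments:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_penalty_loss(actor, old_actor, obs, actions, advantages, beta):
    """J_PPO = E[(p_theta / p_theta') * A] - beta * KL(theta', theta); return its negative to minimize."""
    dist = Categorical(actor(obs))
    with torch.no_grad():
        old_dist = Categorical(old_actor(obs))
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    kl = kl_divergence(old_dist, dist).mean()
    objective = (ratio * advantages).mean() - beta * kl
    return -objective, kl

# Adaptive beta (illustrative thresholds):
# if kl > KL_max: beta *= 2
# elif kl < KL_min: beta *= 0.5
```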
Value-based Approach - Learn a Critic
A critic doesn't choose an action (this differs from an actor); it evaluates the performance of a given actor. An actor can then be derived from a critic.
Q-learning
Q-Learning is a classical value-based method. It evaluates the score of an observation under an actor $\pi$; this function is called the state value function $V^\pi(s)$. The score is the total reward from the current observation to the end of the episode.
How to estimate $V^\pi(s)$?
We need the total reward to express the performance of the current actor $\pi_\theta$, but how do we get this value?
- Monte-Carlo based approach
  Starting from state $S_a$ (observation), the cumulated reward until the end of the episode is $G_a$; starting from state $S_b$, the cumulated reward until the end of the episode is $G_b$. That means we can estimate the value of an observation $s_a$ under an actor $\pi_\theta$; a low value has two possible explanations:
  a) the current observation is bad, so even a good actor cannot get a high value;
  b) the actor performs badly.
  In many cases, we can't enumerate all observations to compute all the rewards $G_i$. The solution is to use a Neural Network to fit the function from observation to value $G$.
  Fit the NN with pairs $(S_a, G_a)$, minimizing the difference between the NN output $V^\pi(S_a)$ and the Monte-Carlo reward $G_a$.
- Temporal-Difference approach
  The MC approach works, but the problem is that you must wait until the end of an episode to get the total reward. In some cases it takes a very long time to reach the end state; the Temporal-Difference approach addresses this problem.
  Given a trajectory $\{..., s_t, a_t, r_t, s_{t+1}, ...\}$, we have:

  $$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$$

  so we can fit the NN by minimizing the difference between $V^\pi(s_t) - V^\pi(s_{t+1})$ and $r_t$.
  A practical tip: we are training a single model $V^\pi$, so the two outputs $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ are generated from one parameter group $\theta$. When we update $\theta$ after one iteration, both $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ change in the next iteration, which makes training unstable.
  The tip is: fix a parameter group $\theta'$ to generate $V^\pi(s_{t+1})$, and update $\theta$ for $V^\pi(s_t)$. After N iterations, set $\theta' = \theta$. The fixed-parameter network is called the Target Network.
- MC v.s. TD
  - Monte-Carlo has larger variance. This is caused by the randomness of $G_a$: since $G_a$ is the sum of all rewards $r_t$, and each $r_t$ is a random variable, the sum has a larger variance. Playing a game N times with the same policy, the reward set $\{G_a, G_b, G_c, ...\}$ has a large variance.
  - Temporal-Difference also has a problem: $V^\pi(s_{t+1})$ may be estimated incorrectly (since, unlike the Monte-Carlo approach, it doesn't accumulate rewards until the end of the episode), so even if $r_t$ is correct, $V^\pi(s_t) - V^\pi(s_{t+1})$ may not be.
  In practice, people prefer the TD method.
- Q-value approach → $Q^\pi(s, a)$
  In the current state (observation), enumerate all valid actions and calculate the Q-value of each.
  Note: in the current state we force the actor to take the specific action $a$ when computing its value, but choose actions according to the actor $\pi_\theta$ in the following steps until the end of the episode.
Use Q-value to learn an actor
We can learn an actor $\pi$ with the Q-value function; here is the algorithm flow:
$\pi'$ is better than $\pi$ if:

$$V^{\pi'}(s_i) \geqslant V^{\pi}(s_i), \qquad \forall s_i \in S$$

We can use the equation below to compute $\pi'$ from $\pi$:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
Note: This approach is not suitable for continuous actions, only for discrete actions.
But if we always choose the best action according to $Q^\pi$, some other potentially better actions will never be discovered. So we use an Exploration method when choosing an action.
Epsilon Greedy
Set a probability $\varepsilon$: take the max Q-value action or a random action as shown below. Typically, $\varepsilon$ decreases as time goes by.

$$a = \begin{cases} \arg\max_a Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{random action}, & \text{with probability } \varepsilon \end{cases}$$
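A minimal sketch of this $\varepsilon$-greedy rule over a vector of Q-values:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action; otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.7, -0.3], epsilon=0.1))  # usually returns 1
```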
Boltzmann Exploration
Since $Q^\pi$ is a Neural Network, we can convert its outputs into a probability over actions and use this probability to decide which action to take, as shown below:

$$P(a_i|s) = \frac{\exp(Q(s, a_i))}{\sum_a \exp(Q(s, a))}$$
Q-values may be negative, so we apply the exp function to make them all positive.
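A minimal sketch of Boltzmann action selection (the temperature parameter is an illustrative addition, not in the formula above):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    z = np.array(q_values) / temperature
    z -= z.max()                          # numerical stability; exp still maps negatives to positives
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

print(boltzmann_action([-1.2, 0.3, 0.0]))
```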
Replay Buffer
A replay buffer stores a lot of experience data. When you train your Q-Network, randomly choose a batch from the buffer to fit it.
- An experience is a tuple like $\{s_t, a_t, r_t, s_{t+1}\}$.
- The experiences in the buffer may come from different policies $\{\pi_{\theta_1}, \pi_{\theta_2}, \pi_{\theta_3}, ...\}$.
- Old experiences are dropped when the buffer is full.
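A minimal sketch of such a buffer, assuming experiences are stored as plain tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; old experiences are dropped automatically when full."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```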
Typical Q-Learning Algorithm
Here is the main algorithm flow of Q-learning:
- Initialize the Q-function $Q$; initialize the target Q-function $\hat{Q} = Q$
- In each episode:
  - For each step $t$:
    - Given state $s_t$, take an action $a_t$ based on $Q$ ($\varepsilon$-greedy exploration)
    - Obtain the reward $r_t$ and next state $s_{t+1}$
    - Store the experience $\{s_t, a_t, r_t, s_{t+1}\}$ in the replay buffer
    - Sample a batch of experiences $\{(s_i, a_i, r_i, s_{i+1}), (s_j, a_j, r_j, s_{j+1}), ...\}$ from the buffer
    - Compute the target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
    - Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$
    - Every N steps, set $\hat{Q} = Q$
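A minimal sketch of the inner update step, assuming PyTorch, a hypothetical `QNet`-style module mapping states to per-action Q-values for both `q_net` and `target_net`, and batches drawn from the `ReplayBuffer` above (a discount factor $\gamma$ is included, as real implementations use one even though the flow above omits it):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning step: pull Q(s_i, a_i) toward y = r_i + gamma * max_a Q_hat(s_{i+1}, a)."""
    states, actions, rewards, next_states = batch   # tensors: (B, obs_dim), (B,), (B,), (B, obs_dim)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every N steps: target_net.load_state_dict(q_net.state_dict())
```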
Double DQN
Double DQN is designed to solve a problem of DQN, shown below:
Q-values are always over-estimated in DQN training (orange curve: DQN network output; blue curve: Double DQN network output; orange line: real cumulative reward of DQN; blue line: real cumulative reward of Double DQN). Note that the blue lines are above the orange lines, which means Double DQN achieves a greater true value than DQN.
Why does DQN always over-estimate the Q-value?
This is because when we calculate the target $y = r_t + \max_a Q^\pi(s_{t+1}, a)$, we always choose the action with the highest Q-value. This tends to over-estimate the target value, so the real cumulative reward may be lower than the target. Since the Q function tries to fit the target value, the output of the Q-Network ends up higher than the actual cumulative reward.
$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + \max_a Q(s_{t+1}, a)$$
Double DQN Solution
To avoid the above problem, we use two Q-Networks in training: one is in charge of choosing the best action, and the other estimates the Q-value.

$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$$
Here $Q$ selects the best action in each state, while $Q'$ estimates the Q-value of that action. This method has two advantages:
- If $Q$ over-estimates the Q-value of action $a$, then although this action is selected, its final Q-value won't be over-estimated (because $Q'$ estimates it).
- If $Q'$ over-estimates some action $a$, it's also safe, because $Q$ won't select action $a$ ($a$ is not the best action under $Q$).
In the DQN algorithm we already have two networks: the original network $\theta$ and the target network $\theta'$ (which is kept fixed). So here we use the original network $\theta$ to select the action and the target network $\theta'$ to estimate the Q-value.
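A minimal sketch of the Double DQN target, reusing the hypothetical `q_net`/`target_net` pair from the DQN sketch above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """y = r + gamma * Q'(s_{t+1}, argmax_a Q(s_{t+1}, a)): q_net picks the action, target_net scores it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        q_prime = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_prime
```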
Other Advanced Structures of Q-Learning
- Dueling DQN
  Split the output into two parts: $Q^\pi(s_t, a_t) = V(s_t) + A(s_t, a_t)$, which means the final Q-value is the sum of the state value and the action advantage (a sketch follows this list).
- Prioritized Replay
  When we sample a batch of experiences from the replay buffer, we don't select uniformly at random. Prioritized Replay marks the experiences that had a high loss in the previous iteration and increases the probability of selecting them in the next batch.
- Multi-Step
  Change the experience format in the Replay Buffer: instead of storing one step $\{s_t, a_t, r_t, s_{t+1}\}$, store N steps $\{s_t, a_t, r_t, s_{t+1}, ..., s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1}\}$.
- Noise Net
  This method is used to explore more actions: add some noise to the current network $Q$ at the beginning of each episode.
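A minimal sketch of the Dueling DQN head mentioned above (assuming PyTorch; subtracting the mean advantage is a common implementation detail, added here only to keep the two heads identifiable):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a), with A centered so that V and A are identifiable."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s_t)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s_t, a_t)

    def forward(self, obs):
        h = self.shared(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)
```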
Here is a comparison of the different algorithms:
A3C Method - Asynchronous Advantage Actor-Critic
Why do we need the A3C method? It is designed to solve the variance problem of Policy Gradient. In the Policy Gradient method, even if we visit the same state $s_t$ and take the same action $a_t$ N times, we may get very different total rewards $G$. This is because of the randomness in the cumulative reward in the equation below:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
The term $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ can vary a lot, because $r_t$ is a random variable with a large variance; the result may look like this:
Only by sampling enough times to cover all possible rewards would the model be stable, but that is hard to do. If we could replace $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n$ with its expectation, we could solve this problem.
Advantage Actor-Critic (A2C Method)
We have already introduced the value-based method: $Q^{\pi_\theta}(s_t, a_t)$ is defined as the expected total reward of taking action $a_t$ at the current state $s_t$, and $V^{\pi_\theta}(s_t)$ is the expected reward of the current state $s_t$ (just the state value, without specifying which action to take). Now we change the equation to:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Note: here we use the state value to replace the baseline $b$.
We can infer the value of $Q^{\pi_\theta}(s_t, a_t)$ from $V^{\pi_\theta}$:

$$Q^{\pi}(s_t, a_t) = E[r_t + V^\pi(s_{t+1})] \quad \rightarrow \quad r_t + V^\pi(s_{t+1})$$
We should use the expectation because $r_t$ is a random variable, but it is hard to compute, so we drop the expectation. Now the equation becomes:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(r_t^n + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n)\right)\nabla\log P_\theta(a_t^n|s_t^n)$$
The algorithm flow of the Advantage Actor-Critic method is shown below:
Tips
- There are two networks to train in this algorithm: the Actor $\pi_\theta$ and the Critic $V^\pi(s)$. The two networks accept the same input $s$ and differ only in output: a scalar $V(s)$ for the Critic network and a probability distribution $P(a|s)$ for the Actor network. So the two networks can share some front layers, like this (see the sketch after these tips):
- Use the output entropy as a regularization term for $\pi(s)$; this makes the action probabilities more even so that the model can do more exploration.
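A minimal sketch (assuming PyTorch and a discrete action space) of an actor-critic network with a shared trunk, a policy head, and a value head:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared front layers, then two heads: action probabilities P(a|s) and a scalar state value V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # -> P(a|s)
        self.value_head = nn.Linear(hidden, 1)            # -> V(s)

    def forward(self, obs):
        h = self.shared(obs)
        probs = torch.softmax(self.policy_head(h), dim=-1)
        value = self.value_head(h).squeeze(-1)
        return probs, value
```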
Asynchronous Advantage Actor-Critic (A3C Method)
A3C is designed to speed up A2C. It maintains one global network and creates N workers; each worker interacts with its own environment, computes gradients, and updates the global network.
- Copy the global parameters $\theta_1$
- Sample some data
- Compute the gradients $\nabla\theta_1$
- Update the global model
Note: All workers run in parallel, which means that when $\nabla\theta_1$ has been computed and sent back to the global model, $\theta$ may already have changed (updated by another worker, so it may no longer equal $\theta_1$). But we still use $\nabla\theta_1$ to update the current parameters: $\theta \rightarrow \theta + \eta\nabla\theta_1$.
Sparse Reward
In reinforcement learning, the reward is very important for the agent to know which actions are good. But only a few actions can obtain a positive reward (e.g., only firing and destroying the enemy gives a positive reward in the Spaceship game), and most actions get no reward (e.g., moving left or right). This phenomenon is called Sparse Reward.
Reward Shaping
Typically, few states yield a positive reward during training, so we can create some extra rewards to guide the agent's actions in the current state. For example, if we want to train a plane agent to destroy an enemy plane, the actual reward is obtained from "fire and destroy the enemy". But at the start of the game our plane doesn't know how to find the enemy, so we can create an extra positive reward when our plane flies toward the enemy plane.
Note: This method needs domain knowledge to design which behaviors deserve a positive reward and how much reward should be assigned.
Curriculum Learning
Typically, a hard task can be split into many simple tasks. Curriculum Learning starts from simple training examples, which then become harder and harder.
The most common technique is Reverse Curriculum Learning, explained below:
- Given a goal state $s_g$.
- Sample some states $\{s_1, s_2, ...\}$ close to $s_g$.
- Each state has a reward for reaching the goal state $s_g$; compute $\{R(s_1), R(s_2), ...\}$.
- Delete the states whose reward is too large (it's too easy to reach the goal from this state) or too small (it's too difficult to reach the goal from this state).
- Sample more states near the remaining $\{s_1, s_2, ...\}$.
Hierarchical Reinforcement Learning
An entire model can be split into different hierarchies: a top-level model only gives high-level orders, and a low-level model chooses the actual actions. For example, if we want to train a plane agent, the top-level model only gives the waypoint of the next target, while the low-level model controls the plane to fly to that target (turn left or turn right).
Here is a game example: the blue point is the agent, which is asked to reach the yellow point. The pink point is a temporary target given by the high-level model, while the low-level model follows this instruction and controls the agent to reach the pink point.
ICM —— Intrinsic Curiosity Module
The ICM algorithm lets the model do more exploration. It adds an extra reward function $r^i_t$ that accepts three parameters $(s_t, a_t, s_{t+1})$. The network needs to maximize the total value $\sum_{t=1}^N(r^i_t + r_t)$.
Now let's see how ICM calculates the reward $r^i_t$ at each step $t$:
There are two networks in the ICM module:
- Network 1: This network predicts the next state $s_{t+1}$ after taking action $a_t$; if $s_{t+1}$ is hard to predict, the reward $r^i_t$ is high.
- Network 2: Many features may be irrelevant to choosing actions (e.g., the position of the sun in the Spaceship War game). If we just maximize the reward for states that are hard to predict, the agent will stand still and watch the sun move. So we need to find the features that are useful for choosing actions; this is the job of Network 2, which predicts the action $a_t$ from $s_t$ and $s_{t+1}$.
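A minimal sketch (assuming PyTorch, vector states, and a discrete action space; all class and layer names are illustrative) of the two ICM networks: a forward model whose prediction error serves as the intrinsic reward $r^i_t$, and an inverse model that predicts $a_t$ from $(s_t, s_{t+1})$ to keep the learned features action-relevant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Network 1 (forward model): predict phi(s_{t+1}) from phi(s_t) and a_t
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a_t from phi(s_t) and phi(s_{t+1})
        self.inverse_model = nn.Linear(feat_dim * 2, n_actions)
        self.n_actions = n_actions

    def forward(self, s_t, a_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        # Intrinsic reward: how badly the forward model predicts the next features
        r_intrinsic = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        # Inverse loss keeps the encoder focused on action-relevant features
        action_logits = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(action_logits, a_t)
        return r_intrinsic, inverse_loss
```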