Table of Contents
- Concepts in Reinforcement Learning
- Difficulties in RL
- A3C Method Brief Introduction
- Policy-based Approach - Learn an Actor (Policy Gradient Method)
- 1. Decide Function of Actor Model (NN? ...)
- 2. Decide Goodness of this Function
- 3. Choose the best function
- On-Policy v.s. Off-Policy
- Importance Sampling (On-Policy → Off-Policy)
- PPO Algorithm —— Proximal Policy Optimization
- Value-based Approach - Learn a Critic
- Q-learning
- Double DQN
- Other Advanced Structures of Q-Learning
- A3C Method - Asynchronous Advantage Actor-Critic
- Advantage Actor-Critic (A2C Method)
- Asynchronous Advantage Actor-Critic (A3C Method)
- Sparse Reward
- Reward Shaping
- Curriculum Learning
- Hierarchical Reinforcement Learning
- ICM —— Intrinsic Curiosity Module
Concepts in Reinforcement Learning
- The main goal of Reinforcement Learning is to maximize the Total Reward. The Total Reward is the sum of all rewards in one episode, so the model doesn't know which steps in the episode are good and which are bad.
- Only a few actions can get a positive reward (e.g., firing and killing the enemy in a Space War game gets a positive reward, while moving gets none), so letting the model find these right actions is very important.
Difficulties in RL
- Reward Delay
  - Only "Fire" obtains a reward, but moving before firing is also important (moving has no reward). How can the model learn to move properly?
  - In a chess game, it may be better to sacrifice immediate reward to gain more long-term reward.
- Agent's actions may affect the environment
  - How to explore the world (observations) as much as possible.
  - How to explore the action combinations as much as possible.
A3C Method Brief Introduction
The A3C method is the most popular model that combines the policy-based and value-based methods; the structure is shown below. To learn the A3C model, we first need to know the concepts of policy-based and value-based approaches. The details of A3C are shown [here](#A3C Method - Asynchronous Advantage Actor-Critic).
Policy-based Approach - Learn an Actor (Policy Gradient Method)
This approach tries to learn a policy (also called an actor). It accepts an observation as input and outputs an action. The policy (actor) can be any model; if you use a Neural Network as your actor, then you are doing Deep Reinforcement Learning.

$$Input(Observation) \rightarrow Actor/Policy \rightarrow Output(Action)$$
There are three steps to build DRL:
1. Decide Function of Actor Model (NN? …)
Here we use the NN as our Actor, so:
- The input of this NN is the machine's observation, represented as a vector or matrix (e.g., image pixels to a matrix).
- The output of this NN is the action probability. The most important point is that we shouldn't always choose the action with the highest probability; the decision should be stochastic, sampled according to the probability distribution.
- The advantage of an NN over a Q-table: in some complex scenes we can't enumerate all observations (e.g., we can't list all pixel combinations of a game), so we use a Neural Network to guarantee that we can always obtain an output even if the observation never appeared in the training set.
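To make this concrete, here is a minimal sketch (assuming PyTorch, a small MLP, and a discrete action space; the names `Actor`, `obs_dim`, `n_actions` are illustrative) of an actor network that outputs action probabilities and samples stochastically rather than taking the argmax:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Maps an observation vector to a probability distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)   # action probabilities

# Stochastic decision: sample from the distribution instead of argmax
actor = Actor(obs_dim=4, n_actions=2)
obs = torch.randn(4)                  # placeholder observation
probs = actor(obs)
action = Categorical(probs).sample()  # sampled action index
```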
2. Decide Goodness of this Function
Since we use a Neural Network as our function model, we need to decide what the goodness of this model is (a standard to judge the performance of the current model). We use $\overline{R(\theta)}$ to express this standard, where $\theta$ is the parameters of the current model.
- Given an actor $\pi_\theta(t)$ with network parameters $\theta$, where $t$ is the observation (input).
- Use the actor $\pi_\theta(t)$ to play the video game until the game finishes.
- Sum all rewards in this episode and mark the sum as $R(\theta) = \sum_{t=1}^T r_t$.

Note: $R(\theta)$ is a random variable, because even if we use the same actor $\pi_\theta(t)$ to play the same game many times, we can get different values of $R(\theta)$ (random mechanisms in the game and stochastic action choices). So we want to maximize $\overline{R(\theta)}$, which is the expectation of $R(\theta)$.

- Use $\overline{R(\theta)}$ to express the goodness of $\pi_\theta(t)$.
How to Calculate $\overline{R(\theta)}$?
- An episode is considered as a trajectory $\tau$:
  - $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}$ → all the history in this episode
  - $R(\tau) = \sum_{t=1}^T r_t$
- Different $\tau$ have different probabilities of appearing; the probability of $\tau$ depends on the parameters $\theta$ of the actor $\pi_\theta(t)$. So we define the probability of $\tau$ as $P(\tau|\theta)$:

$$\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$$

- We use the actor $\pi_\theta(t)$ to play the game N times, obtaining the list $\{\tau^1, \tau^2, ..., \tau^N\}$. Each $\tau^n$ has a reward $R(\tau^n)$, and the mean of these $R(\tau^n)$ approximately equals the expectation $\overline{R(\theta)}$:

$$\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)$$
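A minimal sketch of this estimate, assuming a hypothetical `env` object with `reset()`/`step()` methods that return `(obs, reward, done)` and a hypothetical `actor.sample_action` that draws stochastically from $\pi_\theta$:

```python
def estimate_R_bar(env, actor, n_episodes):
    """Approximate R_bar(theta) as the mean total reward over N sampled episodes."""
    totals = []
    for _ in range(n_episodes):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            action = actor.sample_action(obs)    # stochastic choice from pi_theta
            obs, reward, done = env.step(action)
            total += reward                      # accumulate R(tau^n)
        totals.append(total)
    return sum(totals) / n_episodes              # (1/N) * sum_n R(tau^n)
```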
3. Choose the best function
Now we need to know how to optimize $\theta$; here we use the Gradient Ascent method.
- Problem statement:

$$\theta^* = \arg\max_\theta \overline{R(\theta)}, \qquad \overline{R(\theta)} = \sum_{\tau} P(\tau|\theta)R(\tau)$$

- Gradient ascent:
  - Start with $\theta^0$.
  - $\theta^1 = \theta^0 + \eta\nabla\overline{R(\theta^0)}$
  - $\theta^2 = \theta^1 + \eta\nabla\overline{R(\theta^1)}$
  - …
- $\theta$ consists of the parameters of the current Neural Network, $\theta = \{w_1, w_2, w_3, ..., b_1, b_2, b_3, ...\}$, and

$$\nabla R(\theta) = \left[ \frac{\partial R(\theta)}{\partial w_1},\ \frac{\partial R(\theta)}{\partial w_2},\ ...,\ \frac{\partial R(\theta)}{\partial b_1},\ \frac{\partial R(\theta)}{\partial b_2},\ ... \right]^\top
$$
It's time to calculate the gradient of $\overline{R(\theta)} = \sum_{\tau}P(\tau|\theta)R(\tau)$. Since $R(\tau)$ has nothing to do with $\theta$, the gradient can be expressed as:
$$\nabla\overline{R(\theta)} = \sum_\tau R(\tau)\nabla P(\tau|\theta) = \sum_\tau R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau)P(\tau|\theta)\nabla\log P(\tau|\theta)$$
Note: $\frac{d\log f(x)}{dx} = \frac{1}{f(x)}\frac{df(x)}{dx}$
Use the policy $\pi_\theta$ to play the game N times, obtaining $\{\tau^1, \tau^2, \tau^3, ...\}$; then:
$$\nabla\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log P(\tau^n|\theta)$$
How to calculate $\nabla\log P(\tau|\theta)$?
Since $\tau$ is the history of one episode:

$$P(\tau|\theta) = P(s_1)\prod_{t=1}^T P(a_t|s_t, \theta)P(r_t, s_{t+1}|s_t, a_t)$$

$$\log P(\tau|\theta) = \log P(s_1) + \sum_{t=1}^T \left[\log P(a_t|s_t, \theta) + \log P(r_t, s_{t+1}|s_t, a_t)\right]$$
Ignoring the terms not related to $\theta$:
$$\nabla\log P(\tau|\theta) = \sum_{t=1}^T \nabla\log P(a_t|s_t, \theta)$$
So the final result for $\nabla\overline{R(\theta)}$ is:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T R(\tau^n)\nabla\log P(a_t^n|s_t^n, \theta)$$
The meaning of this equation is very clear:
- If $R(\tau^n)$ is positive → tune $\theta$ to increase $P(a_t^n|s_t^n)$.
- If $R(\tau^n)$ is negative → tune $\theta$ to decrease $P(a_t^n|s_t^n)$.
This method can resolve the [Reward Delay Problem](#Difficulties in RL) from the Difficulties in RL chapter, because here we use the cumulative reward of one entire episode $R(\tau^n)$, not just the immediate reward after taking one action.
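As an illustration, here is a minimal sketch (assuming PyTorch, the illustrative `Actor` above, and pre-collected episode tensors) of the REINFORCE-style update implied by this gradient: maximize $\sum_t R(\tau)\log P(a_t|s_t,\theta)$ by minimizing its negative.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(actor, observations, actions, episode_return):
    """Surrogate loss whose gradient is -R(tau) * sum_t grad log P(a_t|s_t, theta)."""
    probs = actor(observations)                        # (T, n_actions)
    log_probs = Categorical(probs).log_prob(actions)   # (T,)
    return -(episode_return * log_probs).sum()

# Usage sketch: one gradient-ascent step on theta for a single sampled episode
# optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
# loss = policy_gradient_loss(actor, obs_batch, act_batch, R_tau)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```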
Add a Baseline - b
To avoid the situation where all $R(\tau^n)$ are positive (there should be some negative term to tell the model not to take a given action in a given state), we can add a baseline $b$. The equation changes to:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(R(\tau^n) - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Assign a Suitable Weight to Each Action
Using the total reward $R(\tau)$ to tune the probabilities of all actions in an episode also has a disadvantage, shown below:
The left picture shows an episode whose total reward R is 5, so the probabilities of all actions in this episode are increased (e.g., ×5). However, the main positive reward comes from $a_1$, while $a_2$ and $a_3$ give no positive reward, yet the probabilities of $a_2$ and $a_3$ are also increased. Similarly, in the right picture, $a_1$ is a bad action, but $a_2$ may not be, so the probability of $a_2$ shouldn't be decreased.
To avoid this problem, we assign a different $R$ to each $a_t$: the discounted sum of all rewards obtained after taking $a_t$. Now the equation becomes:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Note: $\gamma$ is called the discount factor, $\gamma < 1$.
We can use $A^\theta(s_t, a_t)$ to denote $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ in the above equation, which is called the Advantage Function. This function evaluates how good it is to take $a_t$ at state $s_t$ rather than other actions.
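A minimal sketch (assuming a plain list of per-step rewards and, for simplicity, a constant baseline) of computing these discounted reward-to-go weights:

```python
def discounted_returns(rewards, gamma=0.99, baseline=0.0):
    """Compute sum_{t'=t..T} gamma^(t'-t) * r_t' - b for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running - baseline
    return returns

# Example: rewards from one short episode
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```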
On-Policy v.s. Off-Policy
On-Policy and Off-Policy are two different modes of learning:
- On-Policy: the agent learns by interacting with the environment itself (learn from itself).
- Off-Policy: the agent learns by watching others interact with the environment (learn from others).
Our Policy Gradient Method is an On-Policy method, so why do we need the Off-Policy mode? Because we sample N episodes and take the mean to approximate the expectation $\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$. But when we update $\theta$, $P(\tau|\theta)$ changes, so we need to do the N samplings again and recompute the mean. This costs a lot of time after every update of $\theta$. The solution is to build a model $\pi_\theta$ that accepts training data collected by another model $\pi_{\theta'}$: use $\pi_{\theta'}$ to collect data and train $\theta$ with it. Since $\theta'$ doesn't change, the sampled data can be reused.
Importance Sampling (On-Policy → Off-Policy)
Importance Sampling is a method to estimate the expectation $E_{x\sim p}[f(x)]$ by sampling from another distribution $q(x)$. We already know that, with samples $x^i$ drawn from $p$:

$$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i)$$
But if we only have $\{x^i\}$ sampled from $q(x)$, how can we use these samples to estimate $E_{x\sim p}[f(x)]$? We can transform the equation above:

$$E_{x\sim p}[f(x)] = \int p(x)f(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = E_{x\sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]$$
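Before applying this to RL, a quick numerical sanity check (with hypothetical Gaussian choices for p and q) that reweighted samples from q recover the expectation under p:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(0, 1), proposal q = N(1, 2)
xs = rng.normal(1.0, 2.0, size=200_000)                      # samples from q
weights = gauss_pdf(xs, 0.0, 1.0) / gauss_pdf(xs, 1.0, 2.0)  # p(x) / q(x)
print(np.mean(f(xs) * weights))   # ≈ 1.0, which is E_{x~p}[x^2]
```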
That is, we can get the expectation under distribution $p(x)$ by sampling $\{x^i\}$ from another distribution $q(x)$; we only need a correction term, $\frac{p(x)}{q(x)}$, called the importance weight. Now we can treat our $\pi_\theta$ model as $p(x)$ and $\pi_{\theta'}$ as $q(x)$, and use $q(x)$ to sample data to tune $p(x)$:

$$\nabla\overline{R(\theta)} = E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)] = E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)\right]$$
Then we can use $\theta'$ to sample data and train $\theta$ many times; after many iterations, we update $\theta'$. Continuing to transform the equation:

$$E_{(s_t, a_t)\sim\pi_\theta}[A^\theta(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)] = E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\right]$$
Let $P_{\theta'}(s_t, a_t) = P_{\theta'}(a_t|s_t)P_{\theta'}(s_t)$ and $P_\theta(s_t, a_t) = P_\theta(a_t|s_t)P_\theta(s_t)$. We assume the probability of observing $s_t$ does not depend on the actor's parameters (ignoring how actions change the environment), so $P_\theta(s_t) = P_{\theta'}(s_t)$ and the equation becomes:

$$E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\right]$$
Here we define:

$$J^{\theta'}(\theta) = E_{(s_t, a_t)\sim\pi_{\theta'}}\!\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\right]$$
Note: Since we use $\theta'$ to sample data for $\theta$, the distribution of $\theta$ can't be very different from that of $\theta'$. How do we measure the difference between the two distributions and constrain the training when $\theta$ drifts too far from $\theta'$? Now let's learn the PPO Algorithm.
PPO Algorithm —— Proximal Policy Optimization
PPO is the solution to the above question; it avoids the problem arising from the difference between $\theta$ and $\theta'$. The objective function is shown below:

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta')$$

where $KL(\theta, \theta')$ is the KL divergence between the action distributions output by policy $\theta$ and policy $\theta'$. The algorithm flow is:
- Initialize the policy parameters $\theta^0$.
- In each iteration:
  - Use $\theta^k$ to interact with the environment, and collect $\{s_t, a_t\}$ to compute $A^{\theta^k}(s_t, a_t)$.
  - Update $J_{PPO}^{\theta^k}(\theta)$ several times: $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta KL(\theta, \theta^k)$
  - If $KL(\theta, \theta^k) > KL_{max}$, the KL term carries too much weight in the objective, so increase $\beta$.
  - If $KL(\theta, \theta^k) < KL_{min}$, the KL term carries too little weight, so decrease $\beta$.
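A minimal sketch (assuming PyTorch, a discrete-action `Actor` as sketched earlier, and pre-computed advantages $A^{\theta^k}(s_t, a_t)$) of this KL-penalty objective, with the adaptive $\beta$ rule noted in comments:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_penalty_loss(actor, old_actor, obs, actions, advantages, beta):
    """J_PPO = E[(p_theta / p_theta') * A] - beta * KL(theta', theta); return its negative to minimize."""
    dist = Categorical(actor(obs))
    with torch.no_grad():
        old_dist = Categorical(old_actor(obs))
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    kl = kl_divergence(old_dist, dist).mean()
    objective = (ratio * advantages).mean() - beta * kl
    return -objective, kl

# Adaptive beta (illustrative thresholds):
# if kl > KL_max: beta *= 2
# elif kl < KL_min: beta *= 0.5
```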
Value-based Approach - Learn a Critic
A critic doesn't choose an action (this differs from an actor); it evaluates the performance of a given actor. An actor can then be derived from a critic.
Q-learning
Q-Learning is a classical value-based method. It evaluates the score of an observation under an actor $\pi$; this function is called the state value function $V^\pi(s)$. The score is the total reward from the current observation to the end of the episode.
How to estimate $V^\pi(s)$?
We need the total reward to express the performance of the current actor $\pi_\theta$, but how do we get this value?
- Monte-Carlo based approach
  Starting from state $S_a$ (observation), the cumulated reward until the end of the episode is $G_a$; starting from state $S_b$, the cumulated reward until the end of the episode is $G_b$. That means we can estimate the value of an observation $s_a$ under an actor $\pi_\theta$; a low value has two possible explanations:
  a) the current observation is bad, so even a good actor cannot get a high value;
  b) the actor performs badly.
  In many cases, we can't enumerate all observations to compute all the rewards $G_i$. The solution is to use a Neural Network to fit the function from observation to value $G$.
  Fit the NN with pairs $(S_a, G_a)$, minimizing the difference between the NN output $V^\pi(S_a)$ and the Monte-Carlo reward $G_a$.
- Temporal-Difference approach
  The MC approach works, but the problem is that you must wait until the end of an episode to get the total reward. In some cases it takes a very long time to reach the end state; the Temporal-Difference approach addresses this problem.
  Given a trajectory $\{..., s_t, a_t, r_t, s_{t+1}, ...\}$, we have:

  $$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$$

  so we can fit the NN by minimizing the difference between $V^\pi(s_t) - V^\pi(s_{t+1})$ and $r_t$.
  A practical tip: we are training a single model $V^\pi$, so the two outputs $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ are generated from one parameter group $\theta$. When we update $\theta$ after one iteration, both $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ change in the next iteration, which makes training unstable.
  The tip is: fix a parameter group $\theta'$ to generate $V^\pi(s_{t+1})$, and update $\theta$ for $V^\pi(s_t)$. After N iterations, set $\theta' = \theta$. The fixed-parameter network is called the Target Network.
- MC v.s. TD
  - Monte-Carlo has larger variance. This is caused by the randomness of $G_a$: since $G_a$ is the sum of all rewards $r_t$, and each $r_t$ is a random variable, the sum has a larger variance. Playing a game N times with the same policy, the reward set $\{G_a, G_b, G_c, ...\}$ has a large variance.
  - Temporal-Difference also has a problem: $V^\pi(s_{t+1})$ may be estimated incorrectly (since, unlike the Monte-Carlo approach, it doesn't accumulate rewards until the end of the episode), so even if $r_t$ is correct, $V^\pi(s_t) - V^\pi(s_{t+1})$ may not be.
  In practice, people prefer the TD method.
- Q-value approach → $Q^\pi(s, a)$
  In the current state (observation), enumerate all valid actions and calculate the Q-value of each.
  Note: in the current state we force the actor to take the specific action $a$ when computing its value, but choose actions according to the actor $\pi_\theta$ in the following steps until the end of the episode.
Use Q-value to learn an actor
We can learn an actor $\pi$ with the Q-value function; here is the algorithm flow:
$\pi'$ is better than $\pi$ if:

$$V^{\pi'}(s_i) \geqslant V^{\pi}(s_i), \qquad \forall s_i \in S$$

We can use the equation below to compute $\pi'$ from $\pi$:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
Note: This approach is not suitable for continuous actions, only for discrete actions.
But if we always choose the best action according to $Q^\pi$, some other potentially better actions will never be discovered. So we use an Exploration method when choosing an action.
Epsilon Greedy
Set a probability $\varepsilon$: take the max Q-value action or a random action as shown below. Typically, $\varepsilon$ decreases as time goes by.

$$a = \begin{cases} \arg\max_a Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{random action}, & \text{with probability } \varepsilon \end{cases}$$
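A minimal sketch of this $\varepsilon$-greedy rule over a vector of Q-values:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action; otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.7, -0.3], epsilon=0.1))  # usually returns 1
```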
Boltzmann Exploration
Since $Q^\pi$ is a Neural Network, we can convert its outputs into a probability over actions and use this probability to decide which action to take, as shown below:

$$P(a_i|s) = \frac{\exp(Q(s, a_i))}{\sum_a \exp(Q(s, a))}$$
Q-values may be negative, so we apply the exp function to make them all positive.
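A minimal sketch of Boltzmann action selection (the temperature parameter is an illustrative addition, not in the formula above):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    z = np.array(q_values) / temperature
    z -= z.max()                          # numerical stability; exp still maps negatives to positives
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

print(boltzmann_action([-1.2, 0.3, 0.0]))
```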
Replay Buffer
A replay buffer stores a lot of experience data. When you train your Q-Network, randomly choose a batch from the buffer to fit it.
- An experience is a tuple like $\{s_t, a_t, r_t, s_{t+1}\}$.
- The experiences in the buffer may come from different policies $\{\pi_{\theta_1}, \pi_{\theta_2}, \pi_{\theta_3}, ...\}$.
- Old experiences are dropped when the buffer is full.
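A minimal sketch of such a buffer, assuming experiences are stored as plain tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; old experiences are dropped automatically when full."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```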
Typical Q-Learning Algorithm
Here is the main algorithm flow of Q-learning:
- Initialize the Q-function $Q$; initialize the target Q-function $\hat{Q} = Q$
- In each episode:
  - For each step $t$:
    - Given state $s_t$, take an action $a_t$ based on $Q$ ($\varepsilon$-greedy exploration)
    - Obtain the reward $r_t$ and next state $s_{t+1}$
    - Store the experience $\{s_t, a_t, r_t, s_{t+1}\}$ in the replay buffer
    - Sample a batch of experiences $\{(s_i, a_i, r_i, s_{i+1}), (s_j, a_j, r_j, s_{j+1}), ...\}$ from the buffer
    - Compute the target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
    - Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$
    - Every N steps, set $\hat{Q} = Q$
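A minimal sketch of the inner update step, assuming PyTorch, a hypothetical `QNet`-style module mapping states to per-action Q-values for both `q_net` and `target_net`, and batches drawn from the `ReplayBuffer` above (a discount factor $\gamma$ is included, as real implementations use one even though the flow above omits it):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning step: pull Q(s_i, a_i) toward y = r_i + gamma * max_a Q_hat(s_{i+1}, a)."""
    states, actions, rewards, next_states = batch   # tensors: (B, obs_dim), (B,), (B,), (B, obs_dim)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every N steps: target_net.load_state_dict(q_net.state_dict())
```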
Double DQN
Double DQN is designed to solve a problem of DQN, shown below:
Q-values are always over-estimated in DQN training (orange curve: DQN network output; blue curve: Double DQN network output; orange line: real cumulative reward of DQN; blue line: real cumulative reward of Double DQN). Note that the blue lines are above the orange lines, which means Double DQN achieves a greater true value than DQN.
Why does DQN always over-estimate the Q-value?
This is because when we calculate the target $y = r_t + \max_a Q^\pi(s_{t+1}, a)$, we always choose the action with the highest Q-value. This tends to over-estimate the target value, so the real cumulative reward may be lower than the target. Since the Q function tries to fit the target value, the output of the Q-Network ends up higher than the actual cumulative reward.
$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + \max_a Q(s_{t+1}, a)$$
Double DQN Solution
To avoid the above problem, we use two Q-Networks in training: one is in charge of choosing the best action, and the other estimates the Q-value.

$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$$
Here $Q$ selects the best action in each state, while $Q'$ estimates the Q-value of that action. This method has two advantages:
- If $Q$ over-estimates the Q-value of action $a$, then although this action is selected, its final Q-value won't be over-estimated (because $Q'$ estimates it).
- If $Q'$ over-estimates some action $a$, it's also safe, because $Q$ won't select action $a$ ($a$ is not the best action under $Q$).
In the DQN algorithm we already have two networks: the original network $\theta$ and the target network $\theta'$ (which is kept fixed). So here we use the original network $\theta$ to select the action and the target network $\theta'$ to estimate the Q-value.
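A minimal sketch of the Double DQN target, reusing the hypothetical `q_net`/`target_net` pair from the DQN sketch above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """y = r + gamma * Q'(s_{t+1}, argmax_a Q(s_{t+1}, a)): q_net picks the action, target_net scores it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        q_prime = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_prime
```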
Other Advanced Structures of Q-Learning
- Dueling DQN
  Split the output into two parts: $Q^\pi(s_t, a_t) = V(s_t) + A(s_t, a_t)$, which means the final Q-value is the sum of the state value and the action advantage (a sketch follows this list).
- Prioritized Replay
  When we sample a batch of experiences from the replay buffer, we don't select uniformly at random. Prioritized Replay marks the experiences that had a high loss in the previous iteration and increases the probability of selecting them in the next batch.
- Multi-Step
  Change the experience format in the Replay Buffer: instead of storing one step $\{s_t, a_t, r_t, s_{t+1}\}$, store N steps $\{s_t, a_t, r_t, s_{t+1}, ..., s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1}\}$.
- Noise Net
  This method is used to explore more actions: add some noise to the current network $Q$ at the beginning of each episode.
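A minimal sketch of the Dueling DQN head mentioned above (assuming PyTorch; subtracting the mean advantage is a common implementation detail, added here only to keep the two heads identifiable):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a), with A centered so that V and A are identifiable."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s_t)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s_t, a_t)

    def forward(self, obs):
        h = self.shared(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)
```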
Here is a comparison of the different algorithms:
A3C Method - Asynchronous Advantage Actor-Critic
Why do we need the A3C method? It is designed to solve the variance problem of Policy Gradient. In the Policy Gradient method, even if we visit the same state $s_t$ and take the same action $a_t$ N times, we may get very different total rewards $G$. This is because of the randomness in the cumulative reward in the equation below:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
The term $\left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n - b\right)$ can vary a lot, because $r_t$ is a random variable with a large variance; the result may look like this:
Only by sampling enough times to cover all possible rewards would the model be stable, but that is hard to do. If we could replace $\sum_{t'=t}^T \gamma^{t'-t} r_{t'}^n$ with its expectation, we could solve this problem.
Advantage Actor-Critic (A2C Method)
We have already introduced the value-based method: $Q^{\pi_\theta}(s_t, a_t)$ is defined as the expected total reward of taking action $a_t$ at the current state $s_t$, and $V^{\pi_\theta}(s_t)$ is the expected reward of the current state $s_t$ (just the state value, without specifying which action to take). Now we change the equation to:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)\right)\nabla\log P(a_t^n|s_t^n, \theta)$$
Note: here we use the state value to replace the baseline $b$.
We can infer the value of $Q^{\pi_\theta}(s_t, a_t)$ from $V^{\pi_\theta}$:

$$Q^{\pi}(s_t, a_t) = E[r_t + V^\pi(s_{t+1})] \quad \rightarrow \quad r_t + V^\pi(s_{t+1})$$
We should use the expectation because $r_t$ is a random variable, but it is hard to compute, so we drop the expectation. Now the equation becomes:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \left(r_t^n + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n)\right)\nabla\log P_\theta(a_t^n|s_t^n)$$
The algorithm flow of the Advantage Actor-Critic method is shown below:
Tips
- There are two networks to train in this algorithm: the Actor $\pi_\theta$ and the Critic $V^\pi(s)$. The two networks accept the same input $s$ and differ only in output: a scalar $V(s)$ for the Critic network and a probability distribution $P(a|s)$ for the Actor network. So the two networks can share some front layers, like this (see the sketch after these tips):
- Use the output entropy as a regularization term for $\pi(s)$; this makes the action probabilities more even so that the model can do more exploration.
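A minimal sketch (assuming PyTorch and a discrete action space) of an actor-critic network with a shared trunk, a policy head, and a value head:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared front layers, then two heads: action probabilities P(a|s) and a scalar state value V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # -> P(a|s)
        self.value_head = nn.Linear(hidden, 1)            # -> V(s)

    def forward(self, obs):
        h = self.shared(obs)
        probs = torch.softmax(self.policy_head(h), dim=-1)
        value = self.value_head(h).squeeze(-1)
        return probs, value
```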
Asynchronous Advantage Actor-Critic (A3C Method)
A3C is designed to speed up A2C. It maintains one global network and creates N workers; each worker interacts with its own environment, computes gradients, and updates the global network.
- Copy the global parameters $\theta_1$
- Sample some data
- Compute the gradients $\nabla\theta_1$
- Update the global model
Note: All workers run in parallel, which means that when $\nabla\theta_1$ has been computed and sent back to the global model, $\theta$ may already have changed (updated by another worker, so it may no longer equal $\theta_1$). But we still use $\nabla\theta_1$ to update the current parameters: $\theta \rightarrow \theta + \eta\nabla\theta_1$.
Sparse Reward
In reinforcement learning, the reward is very important for the agent to know which actions are good. But only a few actions can obtain a positive reward (e.g., only firing and destroying the enemy gives a positive reward in the Spaceship game), and most actions get no reward (e.g., moving left or right). This phenomenon is called Sparse Reward.
Reward Shaping
Typically, few states yield a positive reward during training, so we can create some extra rewards to guide the agent's actions in the current state. For example, if we want to train a plane agent to destroy an enemy plane, the actual reward is obtained from "fire and destroy the enemy". But at the start of the game our plane doesn't know how to find the enemy, so we can create an extra positive reward when our plane flies toward the enemy plane.
Note: This method needs domain knowledge to design which behaviors deserve a positive reward and how much reward should be assigned.
Curriculum Learning
Typically, a hard task can be split into many simple tasks. Curriculum Learning starts from simple training examples, which then become harder and harder.
The most common technique is Reverse Curriculum Learning, explained below:
- Given a goal state $s_g$.
- Sample some states $\{s_1, s_2, ...\}$ close to $s_g$.
- Each state has a reward for reaching the goal state $s_g$; compute $\{R(s_1), R(s_2), ...\}$.
- Delete the states whose reward is too large (it's too easy to reach the goal from this state) or too small (it's too difficult to reach the goal from this state).
- Sample more states near the remaining $\{s_1, s_2, ...\}$.
Hierarchical Reinforcement Learning
An entire model can be split into different hierarchies: a top-level model only gives high-level orders, and a low-level model chooses the actual actions. For example, if we want to train a plane agent, the top-level model only gives the waypoint of the next target, while the low-level model controls the plane to fly to that target (turn left or turn right).
Here is a game example: the blue point is the agent, which is asked to reach the yellow point. The pink point is a temporary target given by the high-level model, while the low-level model follows this instruction and controls the agent to reach the pink point.
ICM —— Intrinsic Curiosity Module
The ICM algorithm lets the model do more exploration. It adds an extra reward function $r^i_t$ that accepts three parameters $(s_t, a_t, s_{t+1})$. The network needs to maximize the total value $\sum_{t=1}^N(r^i_t + r_t)$.
Now let's see how ICM calculates the reward $r^i_t$ at each step $t$:
There are two networks in the ICM module:
- Network 1: This network predicts the next state $s_{t+1}$ after taking action $a_t$; if $s_{t+1}$ is hard to predict, the reward $r^i_t$ is high.
- Network 2: Many features may be irrelevant to choosing actions (e.g., the position of the sun in the Spaceship War game). If we just maximize the reward for states that are hard to predict, the agent will stand still and watch the sun move. So we need to find the features that are useful for choosing actions; this is the job of Network 2, which predicts the action $a_t$ from $s_t$ and $s_{t+1}$.
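A minimal sketch (assuming PyTorch, vector states, and a discrete action space; all class and layer names are illustrative) of the two ICM networks: a forward model whose prediction error serves as the intrinsic reward $r^i_t$, and an inverse model that predicts $a_t$ from $(s_t, s_{t+1})$ to keep the learned features action-relevant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Network 1 (forward model): predict phi(s_{t+1}) from phi(s_t) and a_t
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a_t from phi(s_t) and phi(s_{t+1})
        self.inverse_model = nn.Linear(feat_dim * 2, n_actions)
        self.n_actions = n_actions

    def forward(self, s_t, a_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        # Intrinsic reward: how badly the forward model predicts the next features
        r_intrinsic = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        # Inverse loss keeps the encoder focused on action-relevant features
        action_logits = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(action_logits, a_t)
        return r_intrinsic, inverse_loss
```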