Introductory Notes on Reinforcement Learning (Hung-yi Lee's Course)

Table of Contents

      • Concepts in Reinforcement Learning
      • Difficulties in RL
      • A3C Method Brief Introduction
      • Policy-based Approach - Learn an Actor (Policy Gradient Method)
      • Value-based Approach - Learn a Critic
        • Q-learning
        • Double DQN
        • Other Advanced Structure of Q-Learning
      • A3C Method - Asynchronous Advantage Actor-Critic
        • Advantage Actor-Critic (A2C Method)
        • Asynchronous Advantage Actor-Critic (A3C Method)
      • Sparse Reward
        • Reward Shaping
        • Curriculum Learning
        • Hierarchical Reinforcement Learning
        • ICM - Intrinsic Curiosity Module

Concepts in Reinforcement Learning

  1. The main goal of Reinforcement Learning is to maximize the total reward.
  2. The total reward is the sum of all rewards in one episode, so the model doesn't know which steps in the episode are good and which are bad.
  3. Only a few actions obtain a positive reward (e.g., firing and killing an enemy in a space-shooter game gives a positive reward, while moving gives none), so getting the model to discover these right actions is very important.

Difficulties in RL

  1. Reward Delay
    • Only “fire” obtains a reward, but moving before firing is also important (even though moving itself has no reward). How can the model learn to move properly?
    • In a chess game, it may be better to sacrifice immediate reward to gain more long-term reward.
  2. Agent’s actions may affect the environment
    • How to explore the observation space as much as possible.
    • How to explore action combinations as much as possible.

A3C Method Brief Introduction

The A3C method is a popular model that combines the policy-based and value-based methods. To learn the A3C model, we first need the concepts of policy-based and value-based learning; the details of A3C are covered in the section "A3C Method - Asynchronous Advantage Actor-Critic" below.

Policy-based Approach - Learn an Actor (Policy Gradient Method)

This approach tries to learn a policy (also called an actor). It takes the observation as input and outputs an action. The policy (actor) can be any model; if you use a neural network as your actor, then you are doing Deep Reinforcement Learning.

$$\text{Input (Observation)} \rightarrow \text{Actor / Policy} \rightarrow \text{Output (Action)}$$

There are three steps to build DRL:

1. Decide Function of Actor Model (NN? …)

Here we use a NN as our actor, so:

  • The input of this NN is the machine's observation, represented as a vector or matrix (e.g., image pixels as a matrix).
  • The output of this NN is a probability for each action. The key point is that we shouldn't always choose the action with the highest probability; the decision should be stochastic, sampled according to the probability distribution (see the sketch after this list).
  • The advantage of a NN over a Q-table is generalization: in complex scenes we can't enumerate all observations (e.g., all pixel combinations of a game), but a neural network still produces an output even for an observation that never appeared in the training set.
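As a concrete illustration, here is a minimal sketch (PyTorch, with a hypothetical observation size and action count) of an actor network whose output is a probability distribution from which the action is sampled stochastically rather than taken greedily:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """A minimal policy network: observation in, action probabilities out."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax turns the raw network outputs (logits) into action probabilities.
        return torch.softmax(self.net(obs), dim=-1)

# Sample an action stochastically instead of taking the argmax.
actor = Actor(obs_dim=4, n_actions=3)          # hypothetical sizes
obs = torch.randn(4)                           # hypothetical observation
probs = actor(obs)
action = torch.distributions.Categorical(probs).sample()
```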
2. Decide Goodness of this Function

Since we use a neural network as our function model, we need to decide what the goodness of this model is (a standard to judge the performance of the current model). We use $\overline{R(\theta)}$ to express this standard, where $\theta$ denotes the parameters of the current model.

  • Given an actor $\pi_\theta(s)$ with network parameters $\theta$, where $s$ is the observation (input).
  • Use the actor $\pi_\theta$ to play the video game until the game finishes.
  • Sum all the rewards in this episode and denote it as $R(\theta) = \sum_{t=1}^T r_t$.
    Note: $R(\theta)$ is a random variable: even if we use the same actor $\pi_\theta$ to play the same game many times, we can get different values of $R(\theta)$ (because of randomness in the game and in action selection). So we want to maximize $\overline{R(\theta)}$, the expectation of $R(\theta)$.
  • Use $\overline{R(\theta)}$ to express the goodness of $\pi_\theta$.

How to Calculate $\overline{R(\theta)}$?

  • An episode is considered as a trajectory $\tau$:

    • $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}$, i.e. all the history of this episode
    • $R(\tau) = \sum_{t=1}^T r_t$
  • Different trajectories $\tau$ appear with different probabilities, and the probability of $\tau$ depends on the actor parameters $\theta$, so we write it as $P(\tau|\theta)$:
    $$\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$$

  • We use the actor $\pi_\theta$ to play the game N times and obtain $\{\tau^1, \tau^2, ..., \tau^N\}$. Each $\tau^n$ has a reward $R(\tau^n)$, and the mean of these $R(\tau^n)$ approximates the expectation $\overline{R(\theta)}$:
    $$\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)$$

3. Choose the best function

Now we need to know how to find the best $\theta$; here we use the gradient ascent method.

  • Problem statement:
    $$\theta^* = \arg\max_\theta \overline{R(\theta)}, \qquad \overline{R(\theta)} = \sum_{\tau}P(\tau|\theta)R(\tau)$$
  • Gradient ascent:
    • Start with $\theta^0$.
    • $\theta^1 = \theta^0 + \eta\nabla\overline{R(\theta^0)}$
    • $\theta^2 = \theta^1 + \eta\nabla\overline{R(\theta^1)}$
  • $\theta$ consists of the parameters of the current neural network, $\theta = \{w_1, w_2, w_3, ..., b_1, b_2, b_3, ...\}$, and the gradient is $\nabla\overline{R(\theta)} = \big[\frac{\partial \overline{R(\theta)}}{\partial w_1}, \frac{\partial \overline{R(\theta)}}{\partial w_2}, ..., \frac{\partial \overline{R(\theta)}}{\partial b_1}, \frac{\partial \overline{R(\theta)}}{\partial b_2}, ...\big]^\top$.

Now we calculate the gradient of $\overline{R(\theta)} = \sum_{\tau}P(\tau|\theta)R(\tau)$. Since $R(\tau)$ does not depend on $\theta$, the gradient can be expressed as:

$$\nabla\overline{R(\theta)} = \sum_\tau R(\tau)\nabla P(\tau|\theta) = \sum_\tau R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau)P(\tau|\theta)\nabla\log P(\tau|\theta)$$

Note: $\frac{d\log f(x)}{dx} = \frac{1}{f(x)}\frac{df(x)}{dx}$

Using the policy with parameters $\theta$, play the game N times to obtain $\{\tau^1, \tau^2, \tau^3, ...\}$; then
$$\nabla\overline{R(\theta)} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log P(\tau^n|\theta)$$

How to Calculate $\nabla\log P(\tau|\theta)$?

Since $\tau$ is the history of one episode:
$$P(\tau|\theta) = P(s_1)\prod_{t=1}^{T}P(a_t|s_t, \theta)P(r_t, s_{t+1}|s_t, a_t)$$
$$\log P(\tau|\theta) = \log P(s_1) + \sum_{t=1}^T\big[\log P(a_t|s_t, \theta) + \log P(r_t, s_{t+1}|s_t, a_t)\big]$$

Ignoring the terms not related to $\theta$:

$$\nabla\log P(\tau|\theta) = \sum_{t=1}^T\nabla\log P(a_t|s_t, \theta)$$

So the final result for $\nabla\overline{R(\theta)}$ is:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T R(\tau^n)\nabla\log P(a_t^n|s_t^n, \theta)$$

The meaning of this equation is very clear:

  • if $R(\tau^n)$ is positive → tune $\theta$ to increase $P(a_t^n|s_t^n)$.
  • if $R(\tau^n)$ is negative → tune $\theta$ to decrease $P(a_t^n|s_t^n)$.

This method resolves the reward delay problem from the "Difficulties in RL" section, because we use the cumulative reward of the entire episode $R(\tau^n)$, not just the immediate reward obtained after taking one action.
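A minimal sketch of this update (assuming the episodes have already been played and each step's $\log P(a_t|s_t,\theta)$ was recorded as a PyTorch tensor); the whole-episode return weights every log-probability in that episode:

```python
import torch

def reinforce_update(optimizer, episodes):
    """episodes: list of (log_probs, rewards) pairs, one per sampled trajectory.
    log_probs: list of scalar tensors log P(a_t|s_t, theta); rewards: list of floats."""
    loss = 0.0
    for log_probs, rewards in episodes:
        episode_return = sum(rewards)                      # R(tau^n)
        # Maximize R(tau^n) * sum_t log P(a_t|s_t); minimize the negative instead.
        loss = loss - episode_return * torch.stack(log_probs).sum()
    loss = loss / len(episodes)                            # average over the N episodes
    optimizer.zero_grad()
    loss.backward()                                        # gradient ascent via the negated loss
    optimizer.step()
```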

Add a Baseline - b

To avoid the situation where every $R(\tau^n)$ is positive (there should be some negative weights to tell the model not to take certain actions in certain states), we can subtract a baseline. The equation becomes:
$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T (R(\tau^n) - b)\nabla\log P(a_t^n|s_t^n, \theta)$$

Assign a Suitable Weight to Each Action

Using the total reward $R(\tau)$ to tune the probabilities of all actions in the episode also has a disadvantage, illustrated by the following example:

Consider an episode whose total reward R is 5 (the left picture in the original slides): the probabilities of all actions in this episode are increased, but the positive reward mainly comes from $a_1$, while $a_2$ and $a_3$ contributed no positive reward; their probabilities are increased anyway. Similarly (the right picture), $a_1$ is a bad action but $a_2$ may not be, so the probability of $a_2$ shouldn't be decreased.

To avoid this problem, we assign a different weight to each $a_t$: the discounted sum of the rewards obtained after $a_t$. The equation becomes:

$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\Big(\sum_{t'=t}^T\gamma^{t'-t}r_{t'}^n - b\Big)\nabla\log P(a_t^n|s_t^n, \theta)$$

Note: $\gamma$ is called the discount factor, with $\gamma < 1$.

We can write the term $\big(\sum_{t'=t}^T\gamma^{t'-t}r_{t'}^n - b\big)$ as $A^\theta(s_t, a_t)$, called the Advantage Function. It evaluates how good it is to take $a_t$ in state $s_t$ rather than the other actions.
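A small sketch (plain Python, hypothetical reward list) of computing this discounted reward-to-go for every step of an episode:

```python
def rewards_to_go(rewards, gamma=0.99):
    """For each step t, return sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# The earlier an action happens, the more future rewards it is credited with.
print(rewards_to_go([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```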

On-Policy v.s. Off-Policy

On-Policy and Off-Policy are two different modes of learning:

  • On-Policy: the agent learns by interacting with the environment itself (learning from its own experience).
  • Off-Policy: the agent learns by watching another policy interact with the environment (learning from others' experience).

Our policy gradient method is on-policy, so why do we need an off-policy mode? Because we approximate the expectation $\overline{R(\theta)} = \sum_\tau P(\tau|\theta)R(\tau)$ by sampling N times and taking the mean. But once we update $\theta$, $P(\tau|\theta)$ changes, so we have to sample N trajectories all over again, and sampling after every update takes a lot of time. The solution is to build a model $\pi_\theta$ that accepts training data from another model $\pi_{\theta'}$: use $\pi_{\theta'}$ to collect data and train $\theta$ with it, and since $\theta'$ does not change, the sampled data can be reused.

Importance Sampling (On-Policy → Off-Policy)

Importance sampling is a method to estimate an expectation under a distribution $p(x)$, i.e. $E_{x\sim p}[f(x)]$, by sampling from another distribution $q(x)$. We already know:

$$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i), \qquad x^i \sim p(x)$$

But if we only have samples $\{x^i\}$ drawn from $q(x)$, how can we use them to estimate $E_{x\sim p}[f(x)]$? We can rewrite the expectation:

$$E_{x\sim p}[f(x)] = \int p(x)f(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = E_{x\sim q}\Big[f(x)\frac{p(x)}{q(x)}\Big]$$

That means we can estimate the expectation under distribution $p(x)$ by sampling $\{x^i\}$ from another distribution $q(x)$; we only need a correction factor $\frac{p(x)}{q(x)}$, called the importance weight. Now consider our model $\pi_\theta$ as $p(x)$ and $\pi_{\theta'}$ as $q(x)$: we use $q(x)$ to sample data and tune $p(x)$.
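A tiny numerical sketch (NumPy, with toy Gaussian distributions chosen purely for illustration) showing that re-weighting samples from $q$ recovers the expectation under $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

# Target p = N(1, 1) and proposal q = N(0, 2); their densities give the importance weight.
p_pdf = lambda x: np.exp(-(x - 1.0) ** 2 / 2.0) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-x ** 2 / 8.0) / np.sqrt(2 * np.pi * 4.0)

x_q = rng.normal(0.0, 2.0, size=200_000)                 # samples drawn from q only
estimate = np.mean(f(x_q) * p_pdf(x_q) / q_pdf(x_q))
print(estimate)                                           # close to E_{x~p}[x^2] = 2
```

Applying the same trick to the policy gradient gives: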

$$\nabla\overline{R(\theta)} = E_{\tau\sim p_\theta(\tau)}\big[R(\tau)\nabla\log p_\theta(\tau)\big] = E_{\tau\sim p_{\theta'}(\tau)}\Big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)\Big]$$

Then we can use $\theta'$ to sample data and train $\theta$ on it many times; only after many iterations do we update $\theta'$. Continuing to transform the equation (now per state-action pair):

$$E_{(s_t, a_t)\sim\pi_\theta}\big[A^\theta(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\big] = E_{(s_t, a_t)\sim\pi_{\theta'}}\Big[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\Big]$$

Factor $P_{\theta'}(s_t, a_t) = P_{\theta'}(a_t|s_t)P_{\theta'}(s_t)$ and $P_\theta(s_t, a_t) = P_\theta(a_t|s_t)P_\theta(s_t)$. If we assume the observed states $s$ do not depend on the actor $\theta$ (ignoring how actions change the environment), then $P_\theta(s_t) = P_{\theta'}(s_t)$ and the equation becomes:

$$E_{(s_t, a_t)\sim\pi_{\theta'}}\Big[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\nabla\log p_\theta(a_t^n|s_t^n)\Big]$$

From this gradient we define the objective:

$$J^{\theta'}(\theta) = E_{(s_t, a_t)\sim\pi_{\theta'}}\Big[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)\Big]$$

Note: since we use $\theta'$ to sample data for $\theta$, the distribution of $\pi_\theta$ cannot be allowed to drift very far from $\pi_{\theta'}$. How do we measure the difference between the two distributions and constrain training when they diverge? This is where the PPO algorithm comes in.

PPO Algorithm - Proximal Policy Optimization

PPO is the solution to the question above: it avoids the problem caused by the difference between $\theta$ and $\theta'$. The objective function is:
$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\,KL(\theta, \theta')$$
where $KL(\theta, \theta')$ is the KL divergence between the action distributions of policy $\theta$ and policy $\theta'$. The algorithm flow is:

  • Initialize the policy parameters $\theta$
  • In each iteration:
    • Use $\theta^k$ (the parameters from the previous iteration) to interact with the environment, collect $\{(s_t, a_t)\}$, and compute the advantage $A^{\theta^k}(s_t, a_t)$
    • Update $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\,KL(\theta, \theta^k)$ several times
    • If $KL(\theta, \theta^k) > KL_{max}$, the two policies differ too much (the penalty is too weak), so increase $\beta$
    • If $KL(\theta, \theta^k) < KL_{min}$, the KL penalty dominates the objective, so decrease $\beta$
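A condensed sketch of this adaptive KL-penalty update (PyTorch; the data collection, advantage estimates, and policy networks that output action probabilities are assumed to exist elsewhere):

```python
import torch

def ppo_kl_update(policy, old_policy, optimizer, states, actions, advantages,
                  beta, kl_min=0.005, kl_max=0.03, epochs=5):
    """One PPO iteration with the adaptive KL penalty; returns the updated beta."""
    with torch.no_grad():
        old_logp = torch.distributions.Categorical(old_policy(states)).log_prob(actions)
    for _ in range(epochs):
        new_logp = torch.distributions.Categorical(policy(states)).log_prob(actions)
        ratio = torch.exp(new_logp - old_logp)             # P_theta(a|s) / P_theta'(a|s)
        kl = (old_logp - new_logp).mean()                  # Monte-Carlo estimate of KL(theta', theta)
        loss = -(ratio * advantages).mean() + beta * kl    # maximize J - beta * KL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Adapt beta depending on how far the updated policy drifted from theta'.
    if kl.item() > kl_max:
        beta *= 2.0
    elif kl.item() < kl_min:
        beta *= 0.5
    return beta
```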

Value-based Approach - Learn a Critic

A critic doesn't choose actions (unlike an actor); it evaluates the performance of a given actor. An actor can then be derived from a critic.

Q-learning

Q-learning is a classical value-based method. It evaluates the score of an observation under an actor $\pi$; this function is called the state value function $V^\pi(s)$. The score is the total reward expected from the current observation until the end of the episode.

How to estimate $V^\pi(s)$?

We need the total reward to express the performance of the current actor $\pi_\theta$, but how do we estimate this value?

  • Monte-Carlo based approach

Starting from state $S_a$ (observation) and playing until the end of the episode, the cumulated reward is $G_a$; starting from state $S_b$, the cumulated reward is $G_b$. In other words, we can estimate the value of an observation $s_a$ under an actor $\pi_\theta$; a low value can be explained by two possibilities:

a) the current observation is bad; even a good actor cannot get a high value from it.

b) the actor itself performs badly.

In many cases we can't enumerate all observations to compute all the returns $G_i$. The solution is to use a neural network to fit the mapping from an observation to its value $G$.

Fit the NN with pairs $(S_a, G_a)$, minimizing the difference between the NN output $V^\pi(S_a)$ and the Monte-Carlo return $G_a$.

  • Temporal-Difference approach

The MC approach works, but you must wait until the end of an episode to obtain the total reward, and reaching the end state may take a very long time in some cases. The Temporal-Difference approach addresses this problem.

Given a trajectory $\{..., s_t, a_t, r_t, s_{t+1}, ...\}$, the following should hold:

$$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$$

So we can fit the NN by minimizing the difference between $V^\pi(s_t) - V^\pi(s_{t+1})$ and $r_t$.

A practical tip: we are training a single model $V^\pi$, so the two outputs $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ are generated by the same parameter group $\theta$. When we update $\theta$ after one iteration, both $V^\pi(s_t)$ and $V^\pi(s_{t+1})$ change in the next iteration, which makes training unstable.

The trick is: use a fixed parameter group $\theta'$ to generate $V^\pi(s_{t+1})$, and only update $\theta$ for $V^\pi(s_t)$; after N iterations, set $\theta' = \theta$. The fixed-parameter network is called the Target Network, as sketched below.
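A minimal sketch of this TD update with a target network (PyTorch; the two value networks and the optimizer are assumed to be defined elsewhere):

```python
import torch
import torch.nn.functional as F

def td_update(v_net, v_target, optimizer, s_t, r_t, s_next, gamma=1.0):
    """One TD step: push V(s_t) towards r_t + V(s_{t+1}), keeping V(s_{t+1}) frozen.
    gamma = 1 matches the relation above; in practice a discount < 1 is common."""
    with torch.no_grad():                                  # theta' (target network) gets no gradient
        target = r_t + gamma * v_target(s_next)
    loss = F.mse_loss(v_net(s_t), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(v_net, v_target):
    """Every N iterations, copy theta into theta'."""
    v_target.load_state_dict(v_net.state_dict())
```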

  • MC v.s. TD

    • Monte-Carlo has larger variance. This is caused by the randomness of $G_a$: since $G_a$ is the sum of many rewards $r_t$, each of which is a random variable, their sum has a larger variance. Playing the same game N times with the same policy, the set of returns $\{G_a, G_b, G_c, ...\}$ varies a lot.
    • Temporal-Difference also has a problem: $V^\pi(s_{t+1})$ may be estimated incorrectly (unlike the Monte-Carlo approach, it does not accumulate rewards until the end of the episode), so even if $r_t$ is correct, $V^\pi(s_t) - V^\pi(s_{t+1})$ may not be.

    In practice, people prefer the TD method.

  • Q-value approach → $Q^\pi(s, a)$

In the current state (observation), enumerate all valid actions and calculate the Q-value of each action.

Note: in the current state we force the actor to take the specified action in order to compute that action's value, but in the following steps the actions are chosen by the actor $\pi_\theta$ until the end of the episode.

Use Q-value to learn an actor

We can learn an actor $\pi$ with the Q-value function. The loop is: the actor $\pi$ interacts with the environment, we learn $Q^\pi(s, a)$ from the collected data (by TD or MC), then find a new actor $\pi'$ that is "better" than $\pi$ based on $Q^\pi$, replace $\pi$ with $\pi'$, and repeat.

The question is: what does it mean for $\pi'$ to be better than $\pi$?

If $\pi'$ is better than $\pi$, then:

$$V^{\pi'}(s_i) \geqslant V^{\pi}(s_i), \qquad \forall s_i \in S$$

We can use the equation below to obtain $\pi'$ from $\pi$:
$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
Note: this approach is not suitable for continuous actions, only for discrete actions.

But if we always choose the best action according to $Q^\pi$, there may be better actions we never discover. So we add exploration when choosing which action to take.

Epsilon Greedy

Set a probability $\varepsilon$: with probability $1-\varepsilon$ take the action with the highest Q-value, otherwise take a random action. Typically, $\varepsilon$ decreases as time goes by.
$$a = \begin{cases} \arg\max_a Q(s, a) & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$$
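A short sketch of epsilon-greedy action selection (NumPy, with a hypothetical Q-value array):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """q_values: 1-D array of Q(s, a) for every action in the current state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: highest Q-value

action = epsilon_greedy(np.array([0.1, 0.5, 0.3]), epsilon=0.1)
```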

Boltzmann Exploration

Since $Q^\pi$ is a neural network, we can turn its Q-value outputs into a probability for each action and sample the action from this distribution:
$$P(a_i|s) = \frac{\exp(Q(s, a_i))}{\sum_a \exp(Q(s, a))}$$
Q-values may be negative, so the exponential function is applied to make them positive.
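A sketch of this softmax-over-Q-values sampling (NumPy; the temperature parameter is an added assumption, and setting it to 1 matches the formula above):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                        # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))

action = boltzmann_action([0.1, -0.5, 0.3])
```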

Replay Buffer

A replay buffer stores a lot of experience data. When training the Q-network, a random batch is drawn from the buffer to fit it.

  • An experience is a tuple $(s_t, a_t, r_t, s_{t+1})$.

  • The experiences in the buffer may come from different policies $\{\pi_{\theta_1}, \pi_{\theta_2}, \pi_{\theta_3}, ...\}$.

  • Old experiences are dropped when the buffer is full (see the sketch below).
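A minimal replay buffer sketch in plain Python (the capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # old experiences drop off automatically

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```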

Typical Q-Learning Algorithm

Here is the main algorithm flow of Q-learning:

  • Initialize Q-function $Q$; initialize the target Q-function $\hat{Q} = Q$
  • In each episode
    • For each step t
      • Given state $s_t$, take an action $a_t$ based on $Q$ ($\varepsilon$-greedy exploration)
      • Obtain the reward $r_t$ and the next state $s_{t+1}$
      • Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer
      • Sample a batch of experiences $\{(s_i, a_i, r_i, s_{i+1}), (s_j, a_j, r_j, s_{j+1}), ...\}$ from the buffer
      • Compute the target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
      • Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$
      • Every N steps, set $\hat{Q} = Q$
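Putting the pieces together, here is a condensed sketch of the inner training step (PyTorch; it assumes a Q-network that outputs one Q-value per action, the ReplayBuffer above storing tensor states, and the usual discount factor gamma, which the target above leaves implicit):

```python
import torch
import torch.nn.functional as F

def dqn_step(q_net, q_target, optimizer, batch, gamma=0.99):
    """One Q-learning update on a sampled batch of (s, a, r, s_next) tuples."""
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a, r = torch.tensor(a), torch.tensor(r, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_i, a_i)
    with torch.no_grad():
        y = r + gamma * q_target(s_next).max(dim=1).values   # r_i + gamma * max_a Q_hat(s_{i+1}, a)

    loss = F.mse_loss(q_sa, y)                               # pull Q(s_i, a_i) towards y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```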

Double DQN

Double DQN is designed to solve a problem of DQN, described below:

Q-values are almost always over-estimated during DQN training. In the original figure, the orange curve is the Q-value estimated by the DQN network and the orange line is DQN's real cumulative reward, while the blue curve and blue line are the corresponding values for Double DQN. The estimated curves sit well above the real rewards, and the blue line sits above the orange line, which means Double DQN achieves a higher true value than DQN.

Why does DQN always over-estimate the Q-value?

This is because when we compute the target $y = r_t + \max_a Q(s_{t+1}, a)$, we always pick the action with the highest Q-value. If any action's value is over-estimated, the max picks it up, so the target tends to be higher than the real cumulative reward. Since the Q-function is trained to match this target, the Q-network output ends up higher than the actual cumulative reward.
$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + \max_a Q(s_{t+1}, a)$$
The Double DQN solution

To avoid this problem, Double DQN uses two Q-networks during training: one is in charge of choosing the best action, and the other estimates its Q-value.
$$Q(s_t, a_t) \qquad \Longleftrightarrow \qquad r_t + Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$$
Here $Q$ selects the best action in each state, while $Q'$ estimates the Q-value of that action. This has two advantages:

  • If $Q$ over-estimates the value of an action $a$, that action may be selected, but its final value won't be over-estimated, because $Q'$ provides the estimate.
  • If $Q'$ over-estimates some action $a$, it is also safe, because $Q$ won't select that action ($a$ is not the best action according to $Q$).

In the DQN algorithm we already have two networks: the online network $\theta$ and the (fixed) target network $\theta'$. So we simply use the online network $\theta$ to select the action and the target network $\theta'$ to estimate its Q-value, as sketched below.
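A sketch of the Double DQN target, which would replace the target line in the DQN step sketched earlier (same assumptions as before):

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, gamma=0.99):
    """Online network picks the action; target network evaluates it."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)       # argmax_a Q(s_{t+1}, a)
        q_eval = q_target(s_next).gather(1, best_a).squeeze(1)   # Q'(s_{t+1}, best_a)
    return r + gamma * q_eval
```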


Other Advanced Structure of Q-Learning

  • Dueling DQN

Split the output into two parts: $Q^\pi(s_t, a_t) = V(s_t) + A(s_t, a_t)$, which means the final Q-value is the sum of a state value and an action advantage (see the sketch after this list).

  • Prioritized Replay

When sampling a batch of experiences from the replay buffer, we no longer select uniformly at random. Prioritized replay marks the experiences that had a high loss in the previous iteration and increases the probability of selecting them in the next batch.

  • Multi-Step

Change the experience format in the replay buffer: instead of storing one step $(s_t, a_t, r_t, s_{t+1})$, store N steps $(s_t, a_t, r_t, s_{t+1}, ..., s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1})$.

  • Noise Net

This method encourages more exploration of actions: add some noise to the current network $Q$ at the beginning of each episode.

A comparison of these different algorithms is shown in the original lecture slides.
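To make the Dueling DQN idea concrete, here is a minimal sketch of a dueling output head (PyTorch; subtracting the mean advantage is the usual trick to keep the V/A split identifiable and is an assumption not spelled out above):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)           # V(s_t)
        self.a_head = nn.Linear(hidden, n_actions)   # A(s_t, a_t)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.v_head(h), self.a_head(h)
        # Q(s, a) = V(s) + A(s, a); centering A keeps the decomposition unique.
        return v + a - a.mean(dim=-1, keepdim=True)
```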


A3C Method - Asynchronous Advantage Actor-Critic

Why do we need the A3C method? It is designed to address the variance problem of policy gradient. In the policy gradient method, even if we visit the same state $s_t$ and take the same action $a_t$ N times, we may get very different total rewards $G$, because of the randomness in the cumulative reward of the equation below:
$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\Big(\sum_{t'=t}^T\gamma^{t'-t}r_{t'}^n - b\Big)\nabla\log P(a_t^n|s_t^n, \theta)$$
The term $\big(\sum_{t'=t}^T\gamma^{t'-t}r_{t'}^n - b\big)$ can vary a lot because each $r_{t'}$ is a random variable with large variance.

Unless we sample enough times to cover all possible outcomes, the model will not be stable, and that is hard to do. If we could replace the sampled $\sum_{t'=t}^T\gamma^{t'-t}r_{t'}^n$ with its expectation, we could solve this problem.

Advantage Actor-Critic (A2C Method)

We have already introduced the value-based method: $Q^{\pi_\theta}(s_t, a_t)$ is the expected total reward of taking action $a_t$ in state $s_t$, and $V^{\pi_\theta}(s_t)$ is the expected total reward from state $s_t$ (the state value, without specifying which action to take). Now we rewrite the equation:
$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\big(Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)\big)\nabla\log P(a_t^n|s_t^n, \theta)$$
Note: here the state value plays the role of the baseline b.

We can express $Q^{\pi_\theta}(s_t, a_t)$ in terms of $V^{\pi_\theta}$:
$$Q^\pi(s_t, a_t) = E[r_t + V^\pi(s_{t+1})] \quad \rightarrow \quad r_t + V^\pi(s_{t+1})$$
Strictly we should keep the expectation because $r_t$ is a random variable, but that is hard to compute, so we drop it. Now the equation becomes:
$$\nabla\overline{R(\theta)} = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\big(r_t^n + V^{\pi_\theta}(s_{t+1}^n) - V^{\pi_\theta}(s_t^n)\big)\nabla\log P_\theta(a_t^n|s_t^n)$$
The Advantage Actor-Critic loop then works as follows: the actor $\pi$ interacts with the environment, the critic $V^\pi(s)$ is fitted to the collected data (by TD or MC), and the actor is updated with the gradient above; the loop repeats.

Tips

  1. There are two networks to train in this algorithm: the actor $\pi_\theta$ and the critic $V^\pi(s)$. Both accept the same input $s$ and differ only in the output: a scalar $V(s)$ for the critic network and a probability distribution $P(a|s)$ for the actor network. So the two networks can share the first few layers (see the sketch after this list).
  2. Use the output entropy of $\pi(s)$ as a regularizer; this makes the action probabilities more even so that the model explores more.
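A minimal sketch of such a shared-trunk actor-critic network and its A2C loss with entropy regularization (PyTorch; the entropy coefficient and discount are arbitrary assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # shared layers
        self.policy_head = nn.Linear(hidden, n_actions)                    # -> P(a|s)
        self.value_head = nn.Linear(hidden, 1)                             # -> V(s)

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def a2c_loss(model, s, a, r, s_next, gamma=0.99, entropy_coef=0.01):
    probs, v = model(s)
    with torch.no_grad():
        _, v_next = model(s_next)
        advantage = r + gamma * v_next - v          # r_t + V(s_{t+1}) - V(s_t)
    dist = torch.distributions.Categorical(probs)
    actor_loss = -(dist.log_prob(a) * advantage).mean()
    critic_loss = F.mse_loss(v, r + gamma * v_next)
    # Entropy bonus keeps the action distribution from collapsing too early.
    return actor_loss + critic_loss - entropy_coef * dist.entropy().mean()
```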

Asynchronous Advantage Actor-Critic (A3C Method)

A3C is designed to speed up A2C. It maintains one global network and creates N workers; each worker interacts with its own copy of the environment, computes gradients, and updates the global network. Each worker repeatedly:

  • Copies the global parameters $\theta_1$
  • Samples some data
  • Computes the gradients $\nabla\theta_1$
  • Sends them back to update the global model

Note: all workers run in parallel, so by the time $\nabla\theta_1$ has been computed and sent back, the global parameters $\theta$ may already have been changed by another worker (they may no longer equal $\theta_1$). We still use $\nabla\theta_1$ to update the current parameters: $\theta \leftarrow \theta + \eta\nabla\theta_1$.

Sparse Reward

In reinforcement learning, the reward tells the agent which actions are good. But only a few actions obtain a positive reward (e.g., only firing and destroying an enemy gives a positive reward in the spaceship game), and most actions get no reward at all (e.g., moving left or right). This phenomenon is called sparse reward.

Reward Shaping

Typically, few states yield a positive reward during training, so we can create some extra rewards to guide the agent toward useful actions. For example, if we want to train a plane agent to destroy the enemy plane, the real reward only comes from "fire and destroy the enemy". But at the start of training our plane doesn't know how to find the enemy, so we can give an extra positive reward whenever our plane flies toward the enemy plane.

Note: this method needs domain knowledge to decide which behaviors deserve extra reward and how much reward should be assigned.

Curriculum Learning

Typically, a hard task can be split into many simpler tasks. Curriculum learning starts from simple training examples and gradually makes them harder and harder.

A common technique is Reverse Curriculum Learning, explained below:

  • Given a goal state $s_g$.
  • Sample some states $\{s_1, s_2, ...\}$ close to $s_g$.
  • Starting from each sampled state, compute the reward of reaching the goal state $s_g$: $\{R(s_1), R(s_2), ...\}$.
  • Delete the states whose reward is too large (reaching the goal from them is too easy) or too small (too difficult).
  • Sample more states near the remaining $\{s_1, s_2, ...\}$ and repeat.

Hierarchical Reinforcement Learning

An entire model can be split into different hierarchies: the top-level model only gives high-level orders, and the low-level model chooses the actual actions. For example, to train a plane agent, the top-level model only gives the waypoint of the next target, while the low-level model controls the plane to fly to that target (turn left or turn right).

In the maze-game example from the slides, the blue point is the agent, which is asked to reach the yellow point. The pink point is a temporary target given by the high-level model, and the low-level model follows this instruction and controls the agent to reach the pink point.

ICM - Intrinsic Curiosity Module

The ICM algorithm encourages the model to explore more. It adds an extra intrinsic reward function $r_t^i$ that takes three inputs $(s_t, a_t, s_{t+1})$. The network then needs to maximize the total value $\sum_{t=1}^N(r_t^i + r_t)$.

Now let's see how ICM calculates the reward $r_t^i$ at each step $t$:

There are two networks in the ICM module:

  • Network 1: predicts the next state $s_{t+1}$ after taking action $a_t$; if $s_{t+1}$ is hard to predict, the intrinsic reward $r^i$ is high.
  • Network 2: many features are unrelated to the actions (e.g., the position of the sun in the spaceship game). If we simply reward states that are hard to predict, the agent will just stand still and watch the sun move. So we need to extract only the features that are useful for choosing actions; this is the job of Network 2, which predicts the action $a_t$ from $s_t$ and $s_{t+1}$ (see the sketch below).
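A rough sketch of the two ICM networks and the intrinsic reward (PyTorch; the feature encoder, its dimension, and the simple linear models are assumptions, not details from the original lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Network 1 (forward model): predict phi(s_{t+1}) from phi(s_t) and a_t.
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a_t from phi(s_t) and phi(s_{t+1}).
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, s_t, a_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        # Intrinsic reward r^i_t: prediction error of the forward model.
        r_intrinsic = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        # Inverse-model loss keeps the encoder focused on action-relevant features.
        a_logits = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(a_logits, a_t)
        return r_intrinsic, inverse_loss
```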
