Reinforcement Learning with Code 【Chapter 7. Temporal-Difference Learning】

Reinforcement Learning with Code

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.

Table of Contents

  • Reinforcement Learning with Code
    • Chapter 7. Temporal-Difference Learning
    • Reference

Chapter 7. Temporal-Difference Learning

Temporal-difference (TD) algorithms can be seen as special Robbins-Monro (RM) algorithms that solve the expectation form of the Bellman equation or the Bellman optimality equation.

7.1 TD learning of state value

​ Recall the Bellman equation in section 2.2

$$
v_\pi(s) = \sum_a \pi(a|s) \Big(\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \Big) \quad \text{(elementwise form)} \\
v_\pi = r_\pi + \gamma P_\pi v_\pi \quad \text{(matrix-vector form)}
$$

where

$$
[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)\,r, \qquad [P_\pi]_{s,s'} = \sum_a \pi(a|s)\, p(s'|s,a)
$$

Recall the definition of state value: the state value is the expectation of the return obtained when starting from a given state. We can therefore rewrite the above equation as

$$
\textcolor{red}{v_\pi = \mathbb{E}[R+\gamma v_\pi]} \quad \text{(matrix-vector form)} \\
\textcolor{red}{v_\pi(s) = \mathbb{E}[R+\gamma v_\pi(S')\mid S=s]}, \quad s\in\mathcal{S} \quad \text{(elementwise form)}
$$

where $S$, $S'$, and $R$ are the random variables representing the current state, the next state, and the immediate reward, respectively. This equation is also called the Bellman expectation equation.

We can use the Robbins-Monro algorithm introduced in chapter 6 to solve the Bellman expectation equation. Reformulate the problem as finding the root $v_\pi(s)$ of $g(v_\pi(s)) = v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S')\mid S=s] = 0$, where $R$ and $S'$ are sampled by following policy $\pi$ from $s$. We can only obtain measurements with noise:

$$
\begin{aligned}
\tilde{g}(v_\pi(s),\eta) & = v_\pi(s) - \big(r+\gamma v_\pi(s')\big) \\
& = \underbrace{v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S')\mid S=s]}_{g(v_\pi(s))} + \underbrace{\Big( \mathbb{E}[R+\gamma v_\pi(S')\mid S=s] - \big(r+\gamma v_\pi(s')\big) \Big)}_{\eta}
\end{aligned}
$$

Hence, according to the Robbins-Monro algorithm, we can get the TD learning algorithm as

$$
v_{k+1}(s) = v_k(s) - \alpha_k \Big(v_k(s) - \big(r_k+\gamma v_\pi(s'_k)\big) \Big)
$$

We make some modifications to remove certain assumptions of this algorithm. One modification is that the sampled data $\{(s, r_k, s'_k)\}$ are replaced by $\{(s_t, r_{t+1}, s_{t+1})\}$ collected along a trajectory. Due to this modification, the algorithm is called temporal-difference learning. Rewriting it more concisely:

$$
\text{TD learning} : \left\{
\begin{aligned}
\textcolor{red}{\underbrace{v_{t+1}(s_t)}_{\text{new estimation}}} & \textcolor{red}{= \underbrace{v_t(s_t)}_{\text{current estimation}} - \alpha_t(s_t) \overbrace{\Big[v_t(s_t) - \underbrace{\big(r_{t+1} +\gamma v_t(s_{t+1})\big)}_{\text{TD target } \bar{v}_t} \Big]}^{\text{TD error or innovation } \delta_t}} \\
\textcolor{red}{v_{t+1}(s)} & \textcolor{red}{= v_t(s)}, \quad \text{for all } s\ne s_t
\end{aligned}
\right.
$$

where $t=0,1,2,\dots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate for $s_t$ at time $t$. Moreover,

$$
\bar{v}_t \triangleq r_{t+1}+\gamma v_t(s_{t+1})
$$

is called the TD target and

$$
\delta_t \triangleq v_t(s_t) - \big(r_{t+1}+\gamma v_t(s_{t+1})\big) = v_t(s_t) - \bar{v}_t
$$

is called the TD error. The TD error reflects the discrepancy between the current estimate $v_t$ and the true state value $v_\pi$.
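
To make the update rule concrete, here is a minimal tabular TD(0) sketch in Python (not from the original note). It assumes an illustrative Gym-style environment where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and a fixed behavior policy given as a function `policy(s)`; these names are assumptions for illustration.

```python
import collections

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    """Tabular TD(0): v(s_t) <- v(s_t) - alpha * [v(s_t) - (r_{t+1} + gamma * v(s_{t+1}))]."""
    v = collections.defaultdict(float)                    # v_t(s), initialized to 0
    for _ in range(episodes):
        s = env.reset()                                   # assumed interface: returns the initial state
        done = False
        while not done:
            a = policy(s)                                 # fixed behavior policy pi
            s_next, r, done = env.step(a)                 # assumed interface: (s_{t+1}, r_{t+1}, done)
            td_target = r + gamma * v[s_next] * (not done)   # TD target v_bar_t
            v[s] -= alpha * (v[s] - td_target)            # subtract alpha * TD error delta_t
            s = s_next
    return v
```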

7.2 TD learning of action value: Sarsa

Sarsa is an algorithm that directly estimates action values. Estimating action values is important because the policy can be improved based on them.

​ Recall the Bellman equation of action value in section 2.5

$$
\begin{aligned}
q_\pi(s,a) & = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\, v_\pi(s') \\
& = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a) \sum_{a' \in \mathcal{A}(s')}\pi(a'|s')\, q_\pi(s',a') \quad \text{(elementwise form)}
\end{aligned}
$$

Using the chain rule of conditional probability, $p(a,b)=p(b)\,p(a|b)$, we have

$$
\begin{aligned}
p(s', a' \mid s,a) & = p(s'\mid s,a)\, p(a'\mid s', s, a) \quad \text{(conditional probability)} \\
& = p(s'\mid s,a)\, p(a'\mid s') \quad \text{(conditional independence)} \\
& = p(s'\mid s,a)\, \pi(a'\mid s')
\end{aligned}
$$

Using this relation, the Bellman equation above becomes

$$
q_\pi(s,a) = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} \sum_{a'} p(s',a'\mid s,a)\, q_\pi(s',a')
$$

Regarding $p(r|s,a)$ and $p(s',a'\mid s,a)$ as the distributions of the random variables $R$ and $(S',A')$ respectively, we can rewrite the above equation in expectation form:

$$
\textcolor{red}{ q_\pi(s,a) = \mathbb{E}\Big[ R + \gamma q_\pi(S',A') \,\Big|\, S=s, A=a\Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

where $R$, $S'$, and $A'$ are random variables denoting the immediate reward, the next state, and the next action, respectively.

Hence, we can use the Robbins-Monro algorithm to solve the Bellman equation of action values. Define

$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S',A') \,\Big|\, S=s, A=a\Big]
$$

We can only obtain observations with noise:

$$
\begin{aligned}
\tilde{g}\big(q_\pi(s,a),\eta\big) & = q_\pi(s,a) - \big(r+\gamma q_\pi(s',a')\big) \\
& = \underbrace{q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S',A')\,\Big|\, S=s, A=a\Big]}_{g(q_\pi(s,a))} + \underbrace{\Big(\mathbb{E}\Big[ R + \gamma q_\pi(S',A') \,\Big|\, S=s, A=a\Big] - \big(r+\gamma q_\pi(s',a')\big)\Big)}_{\eta}
\end{aligned}
$$

Hence, according to the Robbins-Monro algorithm, we can get Sarsa as

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k \Big[ q_k(s,a) - \big(r_k+\gamma q_k(s'_k,a'_k)\big) \Big]
$$

As with the TD learning of state values in the previous section, we modify the above equation: the sampled data $(s,a,r_k,s'_k,a'_k)$ are replaced by $(s_t,a_t,r_{t+1},s_{t+1},a_{t+1})$. Hence, Sarsa becomes

$$
\text{Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1} +\gamma q_t(s_{t+1},a_{t+1})\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$

where $t=0,1,2,\dots$. Here, $q_t(s_t,a_t)$ is the estimated action value of $(s_t,a_t)$, and $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.

Sarsa is nothing but an action-value version of the TD algorithm. Sarsa is usually combined with a policy improvement step such as the $\epsilon$-greedy algorithm. One point should be noted: in the $q$-value update step, unlike model-based policy iteration or value iteration, where the values of all states are updated in each iteration, Sarsa only updates the single state-action pair visited at time step $t$.

Pseudocode:

[Image: Sarsa pseudocode]
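
The pseudocode in the figure is not reproduced here; instead, the following is a rough Python sketch of tabular Sarsa with $\epsilon$-greedy policy improvement, under the same assumed `env.reset()`/`env.step(a)` interface and a finite action list `actions` (all names are illustrative, not from the original note).

```python
import collections
import random

def sarsa(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    q = collections.defaultdict(float)                    # q_t(s, a)

    def eps_greedy(s):
        # policy improvement step: epsilon-greedy w.r.t. the current q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)                 # assumed interface
            a_next = eps_greedy(s_next)                   # sample a_{t+1} from the same policy (on-policy)
            target = r + gamma * q[(s_next, a_next)] * (not done)
            q[(s, a)] -= alpha * (q[(s, a)] - target)     # only the visited pair (s_t, a_t) is updated
            s, a = s_next, a_next
    return q
```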

7.3 TD learning of action value: Expected Sarsa

​ Recall the Bellman equation of action value

$$
q_\pi(s,a) = \sum_r p(r|s,a)\,r + \gamma\sum_{s'} p(s'|s,a)\, v_\pi(s') \quad \text{(elementwise form)}
$$

Regarding $p(r|s,a)$ and $p(s'|s,a)$ as the distributions of the random variables $R$ and $S'$, we have the expectation form of the Bellman equation of action values:

$$
q_\pi(s,a) = \mathbb{E}[R + \gamma v_\pi(S')\mid S=s,A=a] \quad \text{(expectation form)} \qquad (1)
$$

According to the definition of state value we have

$$
\begin{aligned}
\mathbb{E}[q_\pi(s, A) \mid s] & = \sum_{a\in\mathcal{A}(s)} \pi(a|s)\, q_\pi(s,a) = v_\pi(s) \\
\to \ \mathbb{E}[q_\pi(S', A) \mid S'] & = v_\pi(S') \qquad (2)
\end{aligned}
$$

Substituting $(2)$ into $(1)$, we have

$$
\textcolor{red}{q_\pi(s,a) = \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S', A)\mid S' \big] \,\Big|\, S=s, A=a \Big]}, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

Rewrite it in root-finding form:

$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S', A)\mid S' \big] \,\Big|\, S=s, A=a \Big]
$$

We can only obtain observations with noise $\eta$:

$$
\begin{aligned}
\tilde{g}(q_\pi(s,a), \eta) & = q_\pi(s,a) - \Big(r + \gamma\, \mathbb{E}\big[ q_\pi(s', A)\mid s' \big] \Big) \\
& = \underbrace{q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S', A)\mid S' \big] \,\Big|\, S=s, A=a \Big]}_{g(q_\pi(s,a))} + \underbrace{\mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S', A)\mid S' \big] \,\Big|\, S=s, A=a \Big] - \Big(r + \gamma\, \mathbb{E}\big[ q_\pi(s', A)\mid s' \big] \Big)}_{\eta}
\end{aligned}
$$

Hence, we can apply the Robbins-Monro algorithm to find the root of $g(q_\pi(s,a))$:

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Big[ q_k(s,a) - \Big(r_k + \gamma\, \mathbb{E}\big[ q_k(s'_k, A)\mid s'_k \big] \Big) \Big]
$$

As with the TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s'_k)$ are replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Expected Sarsa becomes

$$
\text{Expected Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big( r_{t+1} +\gamma\, \mathbb{E}\big[q_t(s_{t+1},A)\mid s_{t+1}\big] \big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$

where $\mathbb{E}\big[q_t(s_{t+1},A)\mid s_{t+1}\big] = \sum_a \pi_t(a|s_{t+1})\, q_t(s_{t+1},a)$ is the expectation of $q_t(s_{t+1},\cdot)$ under the current policy $\pi_t$.
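
As a small illustration (not from the original note), the single-step Expected Sarsa update can be sketched as below, assuming the target policy is $\epsilon$-greedy so that the expectation over $A$ can be computed explicitly; `q` is a dict keyed by `(state, action)` and all names are illustrative.

```python
def expected_sarsa_update(q, s, a, r, s_next, actions, gamma=0.9, alpha=0.1, epsilon=0.1):
    """One Expected Sarsa step: the TD target averages q over the policy at s_{t+1}."""
    greedy = max(actions, key=lambda b: q[(s_next, b)])
    expected_q = 0.0
    for b in actions:
        # assumed epsilon-greedy probabilities pi(b | s_{t+1})
        prob = epsilon / len(actions) + (1.0 - epsilon) * (b == greedy)
        expected_q += prob * q[(s_next, b)]
    target = r + gamma * expected_q                       # r_{t+1} + gamma * E[q(s_{t+1}, A) | s_{t+1}]
    q[(s, a)] -= alpha * (q[(s, a)] - target)
```

Compared with Sarsa, the target no longer depends on the sampled $a_{t+1}$, which reduces the variance of the update at the cost of computing the expectation over actions.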

7.4 TD learning of action values: $n$-step Sarsa

Recall that the action value is defined as

$$
q_\pi(s,a) = \mathbb{E}[G_t\mid S_t=s, A_t=a]
$$

The discounted return $G_t$ can be written in different forms as

$$
\begin{aligned}
\text{Sarsa} \longleftarrow G_t^{(1)} & = R_{t+1} + \gamma q_\pi(S_{t+1},A_{t+1}) \\
G_t^{(2)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2},A_{t+2}) \\
& \ \vdots \\
n\text{-step Sarsa} \longleftarrow G_t^{(n)} & = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^n q_\pi(S_{t+n},A_{t+n}) \\
& \ \vdots \\
\text{Monte Carlo} \longleftarrow G_t^{(\infty)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3}+ \cdots
\end{aligned}
$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.

​ Sarsa aims to solve

$$
q_\pi(s,a) = \mathbb{E}[G_t^{(1)}\mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\mid s,a\big]
$$

MC learning aims to solve

$$
q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)}\mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma R_{t+2} + \gamma^2 R_{t+3}+\cdots \mid s,a\big]
$$

$n$-step Sarsa aims to solve

$$
q_\pi(s,a) = \mathbb{E}[G_t^{(n)}\mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma R_{t+2} +\cdots + \gamma^n q_\pi(S_{t+n},A_{t+n}) \mid s,a\big]
$$

The $n$-step Sarsa algorithm is

$$
n\text{-step Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1}+ \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n},a_{t+n})\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
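
To show how the longer target is assembled, here is a minimal sketch (not from the original note) that computes the $n$-step Sarsa target from a stored trajectory segment; `rewards` holds $[r_{t+1},\dots,r_{t+n}]$ and `(s_tail, a_tail)` is $(s_{t+n}, a_{t+n})$, with all names being illustrative.

```python
def n_step_target(rewards, q, s_tail, a_tail, gamma=0.9):
    """n-step Sarsa target: r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})."""
    n = len(rewards)
    target = sum(gamma ** i * r for i, r in enumerate(rewards))   # discounted reward sum
    target += gamma ** n * q[(s_tail, a_tail)]                    # bootstrap from the tail state-action pair
    return target

def n_step_sarsa_update(q, s, a, target, alpha=0.1):
    # same stochastic-approximation form as Sarsa, but with the n-step target
    q[(s, a)] -= alpha * (q[(s, a)] - target)
```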

7.5 TD learning of optimal action values: Q-learning

It should be noted that Sarsa can only estimate the action values of a given policy. It must be combined with a policy improvement step to find optimal policies and hence the optimal action values. By contrast, Q-learning can directly estimate optimal action values.

Recall the Bellman optimality equation of state values in section 3.2:

$$
\begin{aligned}
v(s) & = \max_\pi \sum_{a\in\mathcal{A}(s)} \pi(a|s) \Big[ \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \Big] \\
v(s) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \Big]
\end{aligned}
$$

where $v(s)\triangleq \max_{a\in\mathcal{A}(s)} q(s,a)$. Hence we have

$$
\begin{aligned}
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \Big] \\
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a) \max_{a\in\mathcal{A}(s')} q(s',a) \Big] \\
\to \ q(s,a) & = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a) \max_{a\in\mathcal{A}(s')} q(s',a) \quad \text{(elementwise form)}
\end{aligned}
$$

Rewrite it into expectation form

$$
\textcolor{red}{ q(s,a) = \mathbb{E}\Big[R+\gamma \max_{a\in\mathcal{A}(S')} q(S',a) \,\Big|\, S=s,A=a \Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

This equation is the Bellman optimality equation expressed in terms of action values.

Rewrite it into

$$
g(q(s,a)) \triangleq q(s,a) - \mathbb{E}\Big[R+\gamma \max_{a\in\mathcal{A}(S')} q(S',a) \,\Big|\, S=s,A=a \Big]
$$

We can obtain observations with noise:

$$
\begin{aligned}
\tilde{g}(q(s,a)) & = q(s,a) - \Big[r + \gamma \max_{a\in\mathcal{A}(s')} q(s',a) \Big] \\
& = \underbrace{q(s,a) - \mathbb{E}\Big[R+\gamma \max_{a\in\mathcal{A}(S')} q(S',a) \,\Big|\, S=s,A=a \Big]}_{g(q(s,a))} + \underbrace{\mathbb{E}\Big[R+\gamma \max_{a\in\mathcal{A}(S')} q(S',a) \,\Big|\, S=s,A=a \Big] - \Big[r + \gamma \max_{a\in\mathcal{A}(s')} q(s',a) \Big]}_{\eta}
\end{aligned}
$$

Hence, we can apply the Robbins-Monro algorithm to find the root:

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Big[q_k(s,a) - \Big(r_k + \gamma \max_{a\in\mathcal{A}(s'_k)} q_k(s'_k,a) \Big) \Big]
$$

As with the TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s'_k)$ are replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Q-learning becomes

$$
\text{Q-learning} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \Big(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})} q_t(s_{t+1},a)\Big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
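
For comparison with the Sarsa sketch above, here is a minimal tabular Q-learning sketch (not from the original note), assuming the same illustrative `env.reset()`/`env.step(a)` interface. The behavior policy used to collect data is $\epsilon$-greedy, while the update target takes the max over actions (the greedy target policy), which is what makes Q-learning off-policy, as discussed next.

```python
import collections
import random

def q_learning(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    q = collections.defaultdict(float)                    # estimate of the optimal action value q*(s, a)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy w.r.t. the current q (could be any exploratory policy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q[(s, b)])
            s_next, r, done = env.step(a)                 # assumed interface
            # TD target bootstraps with max_a q(s_{t+1}, a): the greedy target policy
            best_next = max(q[(s_next, b)] for b in actions)
            target = r + gamma * best_next * (not done)
            q[(s, a)] -= alpha * (q[(s, a)] - target)
            s = s_next
    return q
```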

Off-policy vs on-policy:

There exist two policies in a TD learning task: the behavior policy and the target policy. The behavior policy is used to generate experience samples. The target policy is constantly updated toward an optimal policy. When the behavior policy is the same as the target policy, the learning is called on-policy; when they are different, the learning is called off-policy.

The advantage of off-policy learning over on-policy learning is that it can search for optimal policies based on experience generated by any other policy.

How can we determine whether an algorithm is on-policy or off-policy? If the algorithm solves a Bellman equation, it is on-policy, because the Bellman equation evaluates the state or action values under a given policy. If the algorithm solves the Bellman optimality equation, it is off-policy, because the Bellman optimality equation does not involve any particular policy; hence, the behavior policy and the target policy can be different.

Online learning vs offline learning:

Online learning refers to the case where the value and policy can be updated as soon as an experience sample is obtained. Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas Monte Carlo learning is offline.

Pseudocode:

(On-policy version)

[Image: Q-learning pseudocode (on-policy version)]

(Off-policy version)

[Image: Q-learning pseudocode (off-policy version)]

Reference

Zhao Shiyu's course, Mathematical Foundations of Reinforcement Learning.
