Reinforcement Learning with Code 【Chapter 7. Temporal-Difference Learning】

Reinforcement Learning with Code

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Much of the material is referenced from Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.

Table of Contents

  • Reinforcement Learning with Code
    • Chapter 7. Temporal-Difference Learning
    • Reference

Chapter 7. Temporal-Difference Learning

Temporal-difference (TD) algorithms can be seen as special Robbins-Monro (RM) algorithms that solve the expectation form of the Bellman equation or the Bellman optimality equation.

7.1 TD learning of state value

Recall the Bellman equation from Section 2.2:

$$
v_\pi(s) = \sum_a \pi(a|s) \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime}p(s^\prime|s,a)v_\pi(s^\prime) \Big) \quad \text{(elementwise form)}
$$

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi \quad \text{(matrix-vector form)}
$$

where

$$
[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r \qquad [P_\pi]_{s,s^\prime} \triangleq \sum_a \pi(a|s)\, p(s^\prime|s,a)
$$

Recall the definition of state value: the state value is the mean of all possible returns starting from a state, i.e., the expectation of the return obtained from that state. We can rewrite the above equation as

$$
\textcolor{red}{v_\pi = \mathbb{E}[R+\gamma v_\pi]} \quad \text{(matrix-vector form)}
$$

$$
\textcolor{red}{v_\pi(s) = \mathbb{E}[R+\gamma v_\pi(S^\prime)|S=s]}, \quad s\in\mathcal{S} \quad \text{(elementwise form)}
$$

where $S$, $S^\prime$ and $R$ are the random variables representing the current state, the next state, and the immediate reward, respectively. This equation is also called the Bellman expectation equation.

We can use the Robbins-Monro algorithm introduced in Chapter 6 to solve the Bellman expectation equation. Reformulate the problem as finding the root $v_\pi(s)$ of the equation $g(v_\pi(s)) = v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S^\prime)|S=s] = 0$. Since only samples $(r, s^\prime)$ of $(R, S^\prime)$ are available, we can only obtain a measurement corrupted by noise:

$$
\begin{aligned}
\tilde{g}(v_\pi(s),\eta) & = v_\pi(s) - (r+\gamma v_\pi(s^\prime)) \\
& = \underbrace{v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S^\prime)|S=s]}_{g(v_\pi(s))} + \underbrace{\Big( \mathbb{E}[R+\gamma v_\pi(S^\prime)|S=s] - [r+\gamma v_\pi(s^\prime)] \Big)}_{\eta}
\end{aligned}
$$

Hence, according to the Robbins-Monro algorithm, we can get the TD learning algorithm as

$$
v_{k+1}(s) = v_k(s) - \alpha_k \Big(v_k(s) - [r_k+\gamma v_\pi(s^\prime_k)] \Big)
$$

We make some modifications to remove the assumptions behind this update. One modification is that the sample set $\{(s,r_k,s_k^\prime)\}$ is replaced by the sequential samples $\{(s_t, r_{t+1}, s_{t+1})\}$ collected along an episode; another is that the unknown $v_\pi(s^\prime)$ is replaced by its current estimate. With these modifications, the algorithm is called temporal-difference learning. Rewriting it more concisely:

$$
\text{TD learning} : \left\{
\begin{aligned}
\textcolor{red}{\underbrace{v_{t+1}(s_t)}_{\text{new estimation}}} & \textcolor{red}{= \underbrace{v_t(s_t)}_{\text{current estimation}} - \alpha_t(s_t) \overbrace{\Big[v_t(s_t) - \underbrace{(r_{t+1} +\gamma v_t(s_{t+1}))}_{\text{TD target } \bar{v}_t} \Big]}^{\text{TD error or innovation } \delta_t}} \\
\textcolor{red}{v_{t+1}(s)} & \textcolor{red}{= v_t(s)}, \quad \text{for all } s\ne s_t
\end{aligned}
\right.
$$

where $t=0,1,2,\dots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate for $s_t$ at time $t$. Moreover,

$$
\bar{v}_t \triangleq r_{t+1}+\gamma v_t(s_{t+1})
$$

is called the TD target and

$$
\delta_t \triangleq v_t(s_t) - [r_{t+1}+\gamma v_t(s_{t+1})] = v_t(s_t) - \bar{v}_t
$$

is called the TD error. The TD error reflects the discrepancy between the current estimate $v_t$ and the true state value $v_\pi$.
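The update rule above maps almost directly onto code. Below is a minimal sketch of tabular TD(0) state-value estimation in Python. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the `policy` callable are hypothetical conventions introduced for illustration, not any specific library's API.

```python
import numpy as np

def td0_state_values(env, policy, num_states, gamma=0.9,
                     alpha=0.1, num_episodes=1000):
    """Tabular TD(0) estimation of v_pi for a given policy (sketch).

    Assumed interface (hypothetical, for illustration only):
      - env.reset() -> initial state index
      - env.step(a) -> (next_state, reward, done)
      - policy(s)   -> action sampled from pi(.|s)
    """
    v = np.zeros(num_states)              # v_t(s), initialized to zero
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_target = r + gamma * v[s_next]   # TD target  \bar{v}_t
            td_error = v[s] - td_target         # TD error   \delta_t
            v[s] -= alpha * td_error            # update only the visited state s_t
            s = s_next
    return v
```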

7.2 TD learning of action value: Sarsa

Sarsa is an algorithm that directly estimates action values. Estimating action values is important because the policy can be improved based on them.

Recall the Bellman equation for action values from Section 2.5:

$$
\begin{aligned}
q_\pi(s,a) & = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \\
& = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) \sum_{a^\prime \in \mathcal{A}(s^\prime)}\pi(a^\prime|s^\prime) q_\pi(s^\prime,a^\prime) \quad \text{(elementwise form)}
\end{aligned}
$$

By the chain rule of conditional probability, $p(a,b)=p(b)\,p(a|b)$, we have

$$
\begin{aligned}
p(s^\prime, a^\prime |s,a) & = p(s^\prime|s,a)\,p(a^\prime|s^\prime, s, a) \quad \text{(conditional probability)} \\
& = p(s^\prime|s,a)\, p(a^\prime|s^\prime) \quad \text{(conditional independence)} \\
& = p(s^\prime|s,a)\, \pi(a^\prime|s^\prime)
\end{aligned}
$$

Due to the above equation, we have

$$
q_\pi(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} \sum_{a^\prime} p(s^\prime,a^\prime|s,a)\, q_\pi(s^\prime,a^\prime)
$$

Regard $p(r|s,a)$ and $p(s^\prime, a^\prime|s,a)$ as the distributions of the random variables $R$ and $(S^\prime, A^\prime)$, respectively. Then rewrite the above equation in expectation form:

$$
\textcolor{red}{ q_\pi(s,a) = \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \Big| S=s, A=a\Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

where $R$, $S$, $A$, $S^\prime$ and $A^\prime$ are random variables denoting the immediate reward, the current state, the current action, the next state, and the next action, respectively.

Hence, we can use the Robbins-Monro algorithm to solve the Bellman equation for action values. Define

$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \Big| S=s, A=a\Big]
$$

We can only obtain a noisy observation:

$$
\begin{aligned}
\tilde{g}\Big(q_\pi(s,a),\eta \Big) & = q_\pi(s,a) - \Big[r+\gamma q_\pi(s^\prime,a^\prime) \Big] \\
& = \underbrace{q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime)\Big| S=s, A=a\Big]}_{g(q_\pi(s,a))} + \underbrace{\Bigg[\mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \Big| S=s, A=a\Big] - \Big[r+\gamma q_\pi(s^\prime,a^\prime)\Big] \Bigg]}_{\eta}
\end{aligned}
$$

Hence, according to the Robbins-Monro algorithm, we can get Sarsa as

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k \Big[ q_k(s,a) - \big(r_k+\gamma q_k(s^\prime_k,a^\prime_k) \big) \Big]
$$

Similar to the TD learning of state values in the last section, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k,a^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1},a_{t+1})$. Hence, Sarsa becomes

$$
\text{Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - (r_{t+1} +\gamma q_t(s_{t+1},a_{t+1})) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$

where $t=0,1,2,\dots$. Here, $q_t(s_t,a_t)$ is the estimated action value of $(s_t,a_t)$, and $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.

Sarsa is nothing but an action-value version of the TD algorithm. In practice, Sarsa is combined with a policy improvement step, such as the $\epsilon$-greedy strategy. One point should be noted: in the $q$-value update step, unlike model-based policy iteration or value iteration, where the values of all states are updated in each iteration, Sarsa only updates the single state-action pair visited at time step $t$.

Pseudocode:

[Image: Sarsa pseudocode]
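As a concrete counterpart to the pseudocode, the following is a minimal sketch of tabular Sarsa with $\epsilon$-greedy policy improvement. It assumes the same hypothetical environment interface as the earlier TD(0) sketch; hyperparameter values are placeholders.

```python
import numpy as np

def sarsa(env, num_states, num_actions, gamma=0.9, alpha=0.1,
          epsilon=0.1, num_episodes=1000):
    """Tabular Sarsa with epsilon-greedy policy improvement (sketch)."""
    q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        # Behave greedily w.r.t. q with probability 1 - epsilon.
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)      # Sarsa samples the next action first
            td_target = r + gamma * q[s_next, a_next] * (not done)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s, a = s_next, a_next            # only the visited pair (s_t, a_t) was updated
    return q
```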

7.3 TD learning of action value: Expected Sarsa

Recall the Bellman equation for action values:

$$
q_\pi(s,a) = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \quad \text{(elementwise form)}
$$

Regard $p(r|s,a)$ and $p(s^\prime|s,a)$ as the distributions of the random variables $R$ and $S^\prime$, respectively. Then we obtain the expectation form of the Bellman equation for action values:

$$
q_\pi(s,a) = \mathbb{E}[R + \gamma v_\pi(S^\prime)|S=s,A=a] \quad \text{(expectation form)} \quad (1)
$$

According to the definition of state value we have

$$
\begin{aligned}
\mathbb{E}[q_\pi(s, A) \,|\, s] & = \sum_{a\in\mathcal{A}(s)} \pi(a|s)\, q_\pi(s,a) = v_\pi(s) \\
\to \quad \mathbb{E}[q_\pi(S^\prime, A) \,|\, S^\prime] & = v_\pi(S^\prime) \quad (2)
\end{aligned}
$$

Substituting $(2)$ into $(1)$, we have

$$
\textcolor{red}{q_\pi(s,a) = \mathbb{E} \Big[ R+\gamma \mathbb{E} \big[ q_\pi(S^\prime, A)|S^\prime \big] \Big| S=s, A=a \Big]}, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

Rewrite it in root-finding form:

$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma \mathbb{E} \big[ q_\pi(S^\prime, A)|S^\prime \big] \Big| S=s, A=a \Big]
$$

We can only obtain an observation with noise $\eta$:

$$
\begin{aligned}
\tilde{g}(q_\pi(s,a), \eta) & = q_\pi(s,a) - \Big(r + \gamma \mathbb{E} \big[ q_\pi(s^\prime, A)|s^\prime \big] \Big) \\
& = \underbrace{q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma \mathbb{E} \big[ q_\pi(S^\prime, A)|S^\prime \big] \Big| S=s, A=a \Big]}_{g(q_\pi(s,a))} + \underbrace{\mathbb{E} \Big[ R+\gamma \mathbb{E} \big[ q_\pi(S^\prime, A)|S^\prime \big] \Big| S=s, A=a \Big] - \Big(r + \gamma \mathbb{E} \big[ q_\pi(s^\prime, A)|s^\prime \big] \Big)}_{\eta}
\end{aligned}
$$

Hence, we can apply the Robbins-Monro algorithm to find the root of $g(q_\pi(s,a))$:

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Bigg[ q_k(s,a) - \Big(r_k + \gamma \mathbb{E} \big[ q_k(s^\prime_k, A)|s^\prime_k \big] \Big) \Bigg]
$$

Similar to the TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Expected Sarsa becomes

$$
\text{Expected Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big( r_{t+1} +\gamma \mathbb{E}[q_t(s_{t+1},A)\,|\,s_{t+1}] \big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
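Compared with Sarsa, the only change is the TD target: instead of sampling $a_{t+1}$, Expected Sarsa averages $q_t(s_{t+1},\cdot)$ under the current policy. Below is a minimal sketch of that target computation, assuming (purely for illustration) that the policy is $\epsilon$-greedy with respect to the current tabular `q`; any explicit $\pi(a|s)$ would do.

```python
import numpy as np

def expected_sarsa_target(q, s_next, r, gamma=0.9, epsilon=0.1):
    """TD target of Expected Sarsa: r + gamma * E_{A ~ pi(.|s_next)}[ q(s_next, A) ].

    pi is assumed to be epsilon-greedy w.r.t. q (an illustrative assumption).
    """
    num_actions = q.shape[1]
    probs = np.full(num_actions, epsilon / num_actions)   # exploration probability mass
    probs[np.argmax(q[s_next])] += 1.0 - epsilon          # greedy action gets the rest
    expected_q = np.dot(probs, q[s_next])                 # E[ q(s_next, A) | s_next ]
    return r + gamma * expected_q

# Update of the visited pair (s_t, a_t), everything else unchanged:
# q[s, a] -= alpha * (q[s, a] - expected_sarsa_target(q, s_next, r, gamma, epsilon))
```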

7.4 TD learning of action values: $n$-step Sarsa

Recall that the definition of the action value is

$$
q_\pi(s,a) = \mathbb{E}[G_t|S_t=s, A_t=a]
$$

The discounted return $G_t$ can be written in different forms:

$$
\begin{aligned}
\text{Sarsa} \longleftarrow G_t^{(1)} & = R_{t+1} + \gamma q_\pi(S_{t+1},A_{t+1}) \\
G_t^{(2)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2},A_{t+2}) \\
& \;\vdots \\
n\text{-step Sarsa} \longleftarrow G_t^{(n)} & = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^n q_\pi(S_{t+n},A_{t+n}) \\
& \;\vdots \\
\text{Monte Carlo} \longleftarrow G_t^{(\infty)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3}+ \cdots
\end{aligned}
$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.

Sarsa aims to solve

$$
q_\pi(s,a) = \mathbb{E}[G_t^{(1)}|s,a] = \mathbb{E} [R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\,|\,s,a ]
$$

MC learning aims to solve

$$
q_\pi(s,a) = \mathbb{E} [G_t^{(\infty)}|s,a] = \mathbb{E} [R_{t+1}+\gamma R_{t+2} + \gamma^2 R_{t+3}+\cdots \,|\,s,a]
$$

$n$-step Sarsa aims to solve

$$
q_\pi(s,a) = \mathbb{E} [G_t^{(n)}|s,a] = \mathbb{E} [R_{t+1}+\gamma R_{t+2} +\cdots + \gamma^n q_\pi(S_{t+n},A_{t+n}) \,|\,s,a]
$$

The $n$-step Sarsa algorithm is

$$
n\text{-step Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - (r_{t+1}+ \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n},a_{t+n})) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
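A minimal sketch of the $n$-step target computation is given below. The trajectory fragment is assumed to be stored as a list `rewards = [r_{t+1}, ..., r_{t+n}]` together with the pair $(s_{t+n}, a_{t+n})$; these names and the storage format are assumptions for illustration, not part of any particular implementation.

```python
def n_step_sarsa_target(rewards, s_tn, a_tn, q, gamma=0.9):
    """n-step Sarsa target (sketch):
        r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})

    `rewards` holds the n rewards collected after (s_t, a_t);
    (s_tn, a_tn) is the state-action pair reached after those n steps.
    """
    n = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))   # discounted reward sum
    return g + gamma**n * q[s_tn, a_tn]                    # bootstrap with q at step t+n

# The update of the visited pair is then the same as in one-step Sarsa:
# q[s_t, a_t] -= alpha * (q[s_t, a_t] - n_step_sarsa_target(rewards, s_tn, a_tn, q, gamma))
```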

7.5 TD learning of optimal action values: Q-learning

It should be noted that Sarsa can only estimate the action values of a given policy. It must be combined with a policy improvement step to find optimal policies and hence their optimal action values. By contrast, Q-learning can directly estimate optimal action values.

Recall the Bellman optimality equation for state values from Section 3.2:

$$
\begin{aligned}
v(s) & = \max_\pi \sum_{a\in\mathcal{A}(s)} \pi(a|s) \Big[ \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^{\prime}|s,a) v(s^{\prime}) \Big] \\
v(s) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v(s^\prime) \Big]
\end{aligned}
$$

where $v(s)\triangleq \max_{a\in\mathcal{A}(s)} q(s,a)$. Hence we have

$$
\begin{aligned}
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a)\, v(s^\prime) \Big] \\
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \Big] \\
\to \quad q(s,a) & = \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \quad \text{(elementwise form)}
\end{aligned}
$$

Rewrite it into expectation form

$$
\textcolor{red}{ q(s,a) = \mathbb{E}\Big[R+\gamma \max_{a\in\mathcal{A}(S^\prime)} q(S^\prime,a) \,\Big|\, S=s,A=a \Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$

This is the Bellman optimality equation expressed in terms of action values.

Rewrite it in root-finding form:

$$
g(q(s,a)) \triangleq q(s,a) - \mathbb{E} \Big[R+\gamma \max_{a\in\mathcal{A}(S^\prime)} q(S^\prime,a) \,\Big|\, S=s,A=a \Big]
$$

We can only obtain a noisy observation:

$$
\begin{aligned}
\tilde{g}(q(s,a)) & = q(s,a) - \Big[r + \gamma \max_{a\in\mathcal{A}(s^\prime)} q(s^\prime,a) \Big] \\
& = \underbrace{q(s,a) - \mathbb{E} \Big[R+\gamma \max_{a\in\mathcal{A}(S^\prime)} q(S^\prime,a) \,\Big|\, S=s,A=a \Big]}_{g(q(s,a))} + \underbrace{\mathbb{E} \Big[R+\gamma \max_{a\in\mathcal{A}(S^\prime)} q(S^\prime,a) \,\Big|\, S=s,A=a \Big] - \Big[r + \gamma \max_{a\in\mathcal{A}(s^\prime)} q(s^\prime,a) \Big]}_{\eta}
\end{aligned}
$$

Hence, we can apply the Robbins-Monro algorithm to find the root:

$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Big[q_k(s,a) - \big(r_k + \gamma \max_{a\in\mathcal{A}(s^\prime_k)} q_k(s^\prime_k,a) \big) \Big]
$$

Similar to the TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Q-learning becomes

$$
\text{Q-learning} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})} q_t(s_{t+1},a)\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
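For comparison with the Sarsa sketch, here is a minimal tabular Q-learning sketch under the same hypothetical environment interface. Note that the TD target uses the greedy maximum over $q_t(s_{t+1},\cdot)$ rather than the action actually taken next, which is why the behavior policy ($\epsilon$-greedy here) can differ from the target policy (greedy); this is the off-policy property discussed below.

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, num_episodes=1000):
    """Tabular Q-learning (off-policy TD control), sketch."""
    q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current q.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # Target policy: greedy, so the TD target uses max_a q(s_next, a).
            td_target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s = s_next
    return q
```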

Off-policy vs on-policy:

There exist two policies in a TD learning task: the behavior policy and the target policy. The behavior policy is used to generate experience samples. The target policy is constantly updated toward an optimal policy. When the behavior policy is the same as the target policy, the learning is called on-policy. When they are different, the learning is called off-policy.

The advantage of off-policy learning over on-policy learning is that it can search for optimal policies based on experience generated by other policies.

How can we determine whether an algorithm is on-policy or off-policy? If the algorithm solves a Bellman equation, it is on-policy, because the Bellman equation evaluates the state or action values of a given policy. If the algorithm solves the Bellman optimality equation, it is off-policy, because the Bellman optimality equation does not involve any particular policy, and hence the behavior policy and the target policy can be different.

Online learning vs offline learning:

Online learning refers to the case where the value and policy can be updated as soon as an experience sample is obtained. Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas Monte Carlo learning is offline.

Pseudocode:

(On-policy version)

[Image: Q-learning pseudocode, on-policy version]

(Off-policy version)

[Image: Q-learning pseudocode, off-policy version]

Reference

Zhao Shiyu (赵世钰), Mathematical Foundations of Reinforcement Learning course.
