#! https://zhuanlan.zhihu.com/p/686235508
Deep Reinforcement Learning (2): Bellman Equations
1. Bellman equation (expressing $Q_\pi$ in terms of $Q_\pi$)
Theorem: Assume $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then
$$
Q_\pi\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t+\gamma \cdot Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]. \tag{1.1}
$$
Proof: Let $\mathcal{S}_{t+1:}=\left\{S_{t+1}, S_{t+2}, \cdots\right\}$ and $\mathcal{A}_{t+1:}=\left\{A_{t+1}, A_{t+2}, \cdots\right\}$. By the definition of the return $U_t$, we have $U_t=R_t+\gamma \cdot U_{t+1}$, so
$$
\begin{aligned}
Q_\pi\left(s_t, a_t\right)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t+\gamma \cdot U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t, A_t=a_t\right]}_{(1)}+\gamma \cdot \underbrace{\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]}_{(2)}
\end{aligned}
$$
Here, once $S_t=s_t$ and $A_t=a_t$ are given, the only randomness left in the reward $R_t$ at time $t$ comes from the next state $S_{t+1}$, and the distribution of $S_{t+1}$ depends only on $S_t$ and $A_t$. Hence
$$
\begin{aligned}
(1)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[R_t \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
Rewriting term $(2)$, with $\mathcal{S}_{t+2:}$ and $\mathcal{A}_{t+2:}$ defined analogously, gives
$$
\begin{aligned}
(2)&=\mathbb{E}_{\mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}, \mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[\mathbb{E}_{\mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, A_{t+1}, S_t=s_t, A_t=a_t\right] \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[\mathbb{E}_{\mathcal{S}_{t+2:}, \mathcal{A}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, A_{t+1}\right] \mid S_t=s_t, A_t=a_t\right] \qquad \text{(Markov property)}\\
&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
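Substituting $(1)$ and $(2)$ back and merging the two expectations over $(S_{t+1}, A_{t+1})$ gives (1.1), which completes the proof.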
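As a sanity check on (1.1), the sketch below computes $Q_\pi$ directly from its definition as an expected discounted return, then measures how far it is from satisfying the Bellman equation. The 2-state, 2-action MDP, the policy, and every name in the code (`P`, `R`, `PI`, `q_from_definition`, ...) are made up for illustration and do not come from the text above. The infinite sum is truncated at a long horizon, which is an approximation.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[[1.0, 0.0], [0.0, 2.0]],    # R[s][a][s2] = r(s, a, s2)
     [[0.5, 0.5], [-1.0, 3.0]]]
PI = [[0.6, 0.4],                 # PI[s][a] = pi(a | s)
      [0.3, 0.7]]


def expected_reward(s, a):
    """E[R_t | S_t = s, A_t = a], i.e. term (1) of the proof."""
    return sum(P[s][a][s2] * R[s][a][s2] for s2 in STATES)


def q_from_definition(s0, a0, horizon=400):
    """Approximate Q_pi(s0, a0) = E[U_t] by summing gamma^k * E[R_{t+k}]."""
    # d[s][a] is the probability of being at (s, a) after k steps.
    d = [[0.0 for _ in ACTIONS] for _ in STATES]
    d[s0][a0] = 1.0
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * sum(d[s][a] * expected_reward(s, a)
                                for s in STATES for a in ACTIONS)
        # Push the distribution one step forward: s2 ~ p(.|s,a), a2 ~ pi(.|s2).
        d = [[sum(d[s][a] * P[s][a][s2] for s in STATES for a in ACTIONS)
              * PI[s2][a2] for a2 in ACTIONS] for s2 in STATES]
        discount *= GAMMA
    return total


Q = [[q_from_definition(s, a) for a in ACTIONS] for s in STATES]


def bellman_rhs(s, a):
    """Right-hand side of (1.1), evaluated with the Q computed above."""
    return sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * sum(
        PI[s2][a2] * Q[s2][a2] for a2 in ACTIONS)) for s2 in STATES)


residual = max(abs(Q[s][a] - bellman_rhs(s, a))
               for s in STATES for a in ACTIONS)
print("largest Bellman residual:", residual)
```

The residual should be at the level of floating-point noise, since the truncated tail is of order $\gamma^{400}$.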
2. Bellman equation (expressing $Q_\pi$ in terms of $V_\pi$)
Theorem: Assume $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then
$$
Q_\pi\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t=a_t\right] \tag{1.2}
$$
Proof: Since
$$
V_\pi\left(S_{t+1}\right)=\mathbb{E}_{A_{t+1} \sim \pi\left(\cdot \mid S_{t+1}\right)}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right)\right]=\mathbb{E}_{A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right],
$$
term $(2)$ from the previous proof becomes
$$
\begin{aligned}
(2)&=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{A_{t+1}}\left[Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right] \mid S_t=s_t, A_t=a_t\right]\\
&=\mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
\end{aligned}
$$
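Combining this with term $(1)$, which is an expectation over $S_{t+1}$ alone, gives (1.2) and completes the proof.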
3. Bellman equation (expressing $V_\pi$ in terms of $V_\pi$)
Theorem: Assume $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then
$$
V_\pi\left(s_t\right)=\mathbb{E}_{A_t, S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right] \tag{1.3}
$$
Proof:
$$
\begin{aligned}
V_\pi\left(s_t\right)&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_t \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t+\gamma U_{t+1} \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{A_t, \mathcal{S}_{t+1:}, \mathcal{A}_{t+1:}}\left[U_{t+1} \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{A_t, \mathcal{A}_{t+1:}, \mathcal{S}_{t+2:}}\left[U_{t+1} \mid S_{t+1}, S_t=s_t\right] \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{S_{t+1}}\left[\mathbb{E}_{\mathcal{A}_{t+1:}, \mathcal{S}_{t+2:}}\left[U_{t+1} \mid S_{t+1}\right] \mid S_t=s_t\right] \qquad \text{(Markov property)}\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{A_t, S_{t+1}}\left[V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right] \qquad \text{(marginalizing over } A_t \text{)}\\
&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]
\end{aligned}
$$
This completes the proof.
Alternatively, start from (1.2) and take the expectation of both sides over $A_t \sim \pi(\cdot \mid s_t)$:
$$
\begin{aligned}
\mathbb{E}_{A_t \sim \pi(\cdot \mid s_t)}\left[Q_\pi\left(s_t, A_t\right)\right]&=\mathbb{E}_{A_t \sim \pi(\cdot \mid s_t)}\left[\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t\right]\right]\\
&\Updownarrow\\
\mathbb{E}_{A_t}\left[Q_\pi\left(S_t, A_t\right) \mid S_t=s_t\right]&=\mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t, A_t\right] \mid S_t=s_t\right]\\
&=\mathbb{E}_{S_{t+1}, A_t}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]
\end{aligned}
$$
The left-hand side is $V_\pi(s_t)$ by definition, so this is again (1.3).
Starting from (1.3), we can write out the explicit expression for discrete state and action spaces:
$$
\begin{aligned}
V_\pi(s_t)&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{A_t, S_{t+1}}\left[V_\pi(S_{t+1}) \mid S_t=s_t\right]\\
&=\mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[R_t \mid A_t, S_t=s_t\right] \mid S_t=s_t\right]+\gamma\, \mathbb{E}_{A_t}\left[\mathbb{E}_{S_{t+1}}\left[V_\pi(S_{t+1}) \mid A_t, S_t=s_t\right] \mid S_t=s_t\right]\\
&=\sum_{a_t} \pi(a_t \mid s_t)\, \mathbb{E}_{S_{t+1}}\left[R_t \mid A_t=a_t, S_t=s_t\right]+\gamma \sum_{a_t} \pi(a_t \mid s_t)\, \mathbb{E}_{S_{t+1}}\left[V_\pi(S_{t+1}) \mid A_t=a_t, S_t=s_t\right]\\
&=\sum_{a_t} \pi(a_t \mid s_t) \sum_{s_{t+1}} r \cdot p(s_{t+1} \mid s_t, a_t)+\gamma \sum_{a_t} \pi(a_t \mid s_t) \sum_{s_{t+1}} V_\pi(s_{t+1}) \cdot p(s_{t+1} \mid s_t, a_t)
\end{aligned}
$$
where $r=r(s_t, s_{t+1}, a_t)$.
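The explicit sum can be used as a fixed-point update: start from any guess for $V_\pi$ and apply the right-hand side repeatedly until the values stop changing. This iterative scheme (usually called iterative policy evaluation) is not part of the derivation above. The sketch below applies it to a made-up 2-state, 2-action MDP; the tables `P`, `R`, `PI` and the function name `policy_evaluation` are assumptions for illustration. It then recovers $Q_\pi$ from $V_\pi$ through (1.2).

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[[1.0, 0.0], [0.0, 2.0]],    # R[s][a][s2] = r(s, s2, a)
     [[0.5, 0.5], [-1.0, 3.0]]]
PI = [[0.6, 0.4],                 # PI[s][a] = pi(a | s)
      [0.3, 0.7]]


def backup(s, a, v):
    """Inner sum over s2 of p(s2 | s, a) * (r + gamma * v[s2])."""
    return sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * v[s2]) for s2 in STATES)


def policy_evaluation(tol=1e-12, max_iters=10_000):
    """Repeat the explicit form of (1.3) until the largest change is below tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = [sum(PI[s][a] * backup(s, a, v) for a in ACTIONS)
                 for s in STATES]
        change = max(abs(new_v[s] - v[s]) for s in STATES)
        v = new_v
        if change < tol:
            break
    return v


V = policy_evaluation()

# Equation (1.2): Q_pi(s, a) = E[R_t + gamma * V_pi(S_{t+1}) | s, a].
Q = [[backup(s, a, V) for a in ACTIONS] for s in STATES]

print("V_pi:", V)
print("Q_pi:", Q)
```

Because $\gamma < 1$, each update shrinks the error by a factor of at most $\gamma$, so the loop terminates well before `max_iters` for this example.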
4. Optimal Bellman equation
Theorem: Assume $R_t$ is a function of $S_t$, $A_t$, and $S_{t+1}$. Then
$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1} \sim p\left(\cdot \mid s_t, a_t\right)}\left[R_t+\gamma \cdot \max _{A \in \mathcal{A}} Q_{\star}\left(S_{t+1}, A\right) \mid S_t=s_t, A_t=a_t\right] \tag{1.4}
$$
Proof: Applying the Bellman equation (1.1) to the optimal policy, whose action-value function is $Q_\star$, gives
$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}, A_{t+1}}\left[R_t+\gamma \cdot Q_{\star}\left(S_{t+1}, A_{t+1}\right) \mid S_t=s_t, A_t=a_t\right]
$$
Under the optimal policy, the action $A_{t+1}=\operatorname{argmax}_A Q_{\star}\left(S_{t+1}, A\right)$ is a deterministic function of the state $S_{t+1}$, so the expectation over $A_{t+1}$ drops out and $Q_\star(S_{t+1}, A_{t+1})=\max_{A \in \mathcal{A}} Q_\star(S_{t+1}, A)$. Therefore
$$
Q_{\star}\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot \max _{A \in \mathcal{A}} Q_{\star}\left(S_{t+1}, A\right) \mid S_t=s_t, A_t=a_t\right]
$$
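When the transition probabilities are known, (1.4) can also be turned into a fixed-point update on a table of Q-values, often called Q-value iteration. That algorithm is not derived in the text above. The sketch below runs it on a made-up 2-state, 2-action MDP; the tables `P`, `R` and the names `optimal_backup`, `q_value_iteration` are assumptions for illustration.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[[1.0, 0.0], [0.0, 2.0]],    # R[s][a][s2] = reward for (s, a, s2)
     [[0.5, 0.5], [-1.0, 3.0]]]


def optimal_backup(s, a, q):
    """Right-hand side of (1.4): E[R_t + gamma * max_A q(S_{t+1}, A) | s, a]."""
    return sum(P[s][a][s2] * (R[s][a][s2] + GAMMA * max(q[s2]))
               for s2 in STATES)


def q_value_iteration(tol=1e-12, max_iters=10_000):
    """Apply the optimal Bellman backup until the table stops changing."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_iters):
        new_q = [[optimal_backup(s, a, q) for a in ACTIONS] for s in STATES]
        change = max(abs(new_q[s][a] - q[s][a])
                     for s in STATES for a in ACTIONS)
        q = new_q
        if change < tol:
            break
    return q


Q_STAR = q_value_iteration()

# Greedy action in each state: argmax_A Q_star(s, A), deterministic in s.
GREEDY = [max(ACTIONS, key=lambda a: Q_STAR[s][a]) for s in STATES]

print("Q_star:", Q_STAR)
print("greedy actions:", GREEDY)
```

The greedy policy read off from `Q_STAR` is the deterministic function of the state used in the last step of the proof.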