Reinforcement Learning Study Notes (Hung-yi Lee)

Policy Gradient

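The setting has three parts: actor, env, and reward function. The env and the reward function cannot be controlled; the only thing that can change is the actor. The policy $\pi$ is a network with parameters $\theta$ whose input is the current observation and whose output is the action to take. In a game, for example, the input is the game frame $s_1$ and the output is the chosen operation $a_1$ . Once the action $a_1$ is decided, the actor receives the corresponding reward $r_1$ and the frame changes accordingly to $s_2$ . Repeating this process yields a trajectory $\tau = \{s_1,a_1,s_2,a_2,\cdots,s_T,a_T\}$ .
With the network parameters fixed, the probability of a given trajectory is $p_\theta(\tau) = p(s_1)p_\theta(a_1|s_1)p(s_2|s_1,a_1)p_\theta(a_2|s_2)\cdots = p(s_1)\prod_{t = 1}^Tp_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$ , and the reward of a trajectory is $R(\tau)=\sum_{t = 1}^Tr_t$ . The goal is to adjust the network parameters so that the expected reward $\overline{R}_\theta = \sum_\tau R(\tau)p_\theta(\tau)$ is as large as possible.
To optimize $\theta$ , use gradient ascent (the objective is maximized, so the update adds the gradient): $\nabla \overline{R}_\theta=\sum_\tau R(\tau)\nabla p_\theta(\tau) = \sum_\tau R(\tau)p_\theta(\tau)\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}=\sum_\tau R(\tau)p_\theta(\tau)\nabla \log p_\theta(\tau) = E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)] \approx \frac{1}{N}\sum_{n = 1}^NR(\tau^n)\nabla\log p_\theta(\tau^n) = \frac{1}{N}\sum_{n = 1}^N\sum_{t = 1}^{T_n}R(\tau^n)\nabla\log p_\theta(a^n_t|s_t^n)$ . The last step holds because $p(s_1)$ and $p(s_{t+1}|s_t,a_t)$ do not depend on $\theta$ , so only the $\log p_\theta(a_t|s_t)$ terms survive the gradient. The parameters are then updated as $\theta\leftarrow \theta + \eta\nabla \overline{R}_\theta$ .
Training data is collected by playing the game with the current network to sample different trajectories, recording the pairs $(s_t^n, a_t^n)$ together with the trajectory reward $R(\tau^n)$ , computing the gradient, and updating the parameters. After the update, trajectories are sampled again.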
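The sample, estimate, update loop can be sketched in a few lines. This is a minimal sketch and not the lecture's implementation: it assumes a tabular softmax policy (`theta[s][a]` are logits) and a made-up two-state toy environment in which action 1 always pays reward 1. The names `sample_trajectory`, `policy_gradient`, `train`, and all hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_trajectory(theta, rng, horizon=3):
    """Toy environment (an assumption for illustration): 2 states, 2 actions.
    Action 1 yields reward 1, action 0 yields reward 0; the next state is random."""
    s = rng.randrange(2)
    traj = []
    for _ in range(horizon):
        probs = softmax(theta[s])
        a = 0 if rng.random() < probs[0] else 1
        traj.append((s, a, float(a)))
        s = rng.randrange(2)
    return traj

def policy_gradient(theta, trajectories):
    """(1/N) * sum_n sum_t R(tau^n) * grad log p_theta(a_t^n | s_t^n)."""
    grad = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        total_reward = sum(r for _, _, r in traj)  # R(tau^n)
        for s, a, _ in traj:
            probs = softmax(theta[s])
            for k in range(len(probs)):
                # For a softmax policy: d log pi(a|s) / d theta[s][k] = 1[k == a] - pi(k|s)
                indicator = 1.0 if k == a else 0.0
                grad[s][k] += total_reward * (indicator - probs[k])
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]

def train(steps=200, batch=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        # On-policy: fresh trajectories are sampled from the current policy each step.
        trajs = [sample_trajectory(theta, rng) for _ in range(batch)]
        grad = policy_gradient(theta, trajs)
        # Gradient ascent: theta <- theta + eta * grad
        theta = [[w + lr * g for w, g in zip(row, grow)]
                 for row, grow in zip(theta, grad)]
    return theta

if __name__ == "__main__":
    learned = train()
    print([softmax(row) for row in learned])
```

In this toy setup the probability of action 1 should drift toward 1 in both states, since only action 1 is ever rewarded.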
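In essence this can be viewed as a classification problem: given input $s_t^n$ , the network should output the action $a_t^n$ that makes the reward $R(\tau^n)$ large, where the reward refers to the whole game. The objective is therefore a log likelihood weighted by reward, and we want the weighted likelihood to be as large as possible. This raises the probability of $a_t^n$ given $s_t^n$ when the resulting reward is large, just as classification training raises the probability of the correct class.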
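Because the gradient is estimated from samples, rewards that are all positive can cause a problem: every sampled action has its probability pushed up, so an action that happened not to be sampled loses probability even if it is a good one. To counter this, subtract a constant baseline $b$ from the reward, i.e. use $R(\tau^n) - b$ ; $b$ can be taken to be the expected reward, $b \approx E[R(\tau)]$ .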
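A small illustration of the baseline, assuming the expected reward is estimated by the mean of the sampled trajectory rewards (the function name `centered_returns` is hypothetical):

```python
def centered_returns(returns):
    """Subtract a constant baseline b (here: the sample mean) from each R(tau^n)."""
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]

# All raw rewards are positive, but after centering the below-average
# trajectory gets a negative weight and the above-average one a positive weight.
weights = centered_returns([10.0, 12.0, 14.0])  # [-2.0, 0.0, 2.0]
```

With these weights, the actions of the worst trajectory are made less likely, not merely "less strongly encouraged".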
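So far the reward is assigned at trajectory granularity, but not every action in a trajectory is necessarily good, so different steps should receive different credit. The weight of step $t$ becomes the reward collected from $t$ onward, and then its discounted version, in which later rewards decay exponentially with time: $R(\tau^n)\rightarrow \sum_{t'=t}^{T_n}r_{t'}^n\rightarrow\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n$ . After subtracting the baseline, this weight is written $A^\theta(s_t,a_t)$ .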
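The discounted reward-to-go $\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n$ can be computed for every step in one backward pass. A minimal sketch (the function name is illustrative):

```python
def discounted_reward_to_go(rewards, gamma):
    """Return G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'} for every step t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so that G_t = r_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# rewards r_1..r_3 = 1, 0, 2 with gamma = 0.5:
# G_3 = 2, G_2 = 0 + 0.5 * 2 = 1, G_1 = 1 + 0.5 * 1 = 1.5
credits = discounted_reward_to_go([1.0, 0.0, 2.0], 0.5)  # [1.5, 1.0, 2.0]
```

Setting `gamma=1.0` gives the undiscounted reward-to-go, the middle expression in the chain above.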

Proximal Policy Optimization

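On-policy: the agent being trained and the agent interacting with the environment are the same. The method above is on-policy. Its drawback is that once the parameters are updated, the previously sampled data can no longer be reused.
Off-policy: the agent being trained and the agent interacting with the environment are different. The aim is to reuse sampled data many times: data sampled from $\pi_{\theta'}$ is used to train $\pi_\theta$ , where $\theta'$ is fixed.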
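Importance sampling: $E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i = 1}^Nf(x^i)$ with $x^i$ sampled from $p$ . If we cannot sample from $p$ and can only sample from $q(x)$ , rewrite the expectation as $E_{x\sim p}[f(x)] = \int f(x)p(x)dx = \int f(x)\frac{p(x)}{q(x)}q(x)dx = E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]$ . The correction factor $\frac{p(x)}{q(x)}$ is the importance weight. The catch is that $p$ and $q$ must not differ too much: the two expectations are equal, but the variance of the weighted estimator grows as the distributions move apart, so a finite sample can give a poor estimate.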
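A small numeric check of the identity and of the variance issue. This is a sketch under made-up assumptions: the discrete distributions `P` and `Q` over $x \in \{0,1,2\}$ and the function $f(x) = x^2$ are chosen only for illustration.

```python
import random

def f(x):
    return x * x

P = [0.1, 0.2, 0.7]  # target distribution p(x), assumed for illustration
Q = [0.6, 0.3, 0.1]  # sampling distribution q(x), deliberately far from p

def exact_expectation(f, p):
    """E_{x~p}[f(x)] computed by direct summation."""
    return sum(p[x] * f(x) for x in range(len(p)))

def importance_sampling_estimate(f, p, q, n, rng):
    """Estimate E_{x~p}[f(x)] from n samples drawn from q, reweighted by p/q."""
    xs = rng.choices(range(len(q)), weights=q, k=n)
    return sum(f(x) * p[x] / q[x] for x in xs) / n

def estimator_variance(f, p, q):
    """Exact per-sample variance of f(x) * p(x) / q(x) when x ~ q."""
    mean = exact_expectation(f, p)
    second_moment = sum(q[x] * (f(x) * p[x] / q[x]) ** 2 for x in range(len(q)))
    return second_moment - mean ** 2

if __name__ == "__main__":
    rng = random.Random(0)
    print(exact_expectation(f, P))
    print(importance_sampling_estimate(f, P, Q, 200000, rng))
    print(estimator_variance(f, P, P), estimator_variance(f, P, Q))
```

Here the exact expectation is 3. The per-sample variance is 2.4 when sampling from `P` directly, and roughly 69.5 when sampling from `Q` and reweighting, so many more samples are needed for the same accuracy.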
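The corresponding gradient is $\nabla \overline R_\theta = E_{\tau\sim p_{\theta'}(\tau)}[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla\log p_\theta(\tau)]=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)]=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)]\approx E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)]$ . The second expression moves from whole trajectories to state-action pairs and replaces $R(\tau)$ with the advantage, which is estimated from data collected by $\pi_{\theta'}$ and is therefore written $A^{\theta'}$ . The third factors $p(s_t,a_t) = p(a_t|s_t)p(s_t)$ . The last step assumes $p_\theta(s_t)\approx p_{\theta'}(s_t)$ and drops that ratio. Using $\nabla f(x) = f(x)\nabla\log f(x)$ with $f = p_\theta(a_t|s_t)$ , the objective whose gradient this is can be recovered as $J^{\theta'}(\theta) =E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)]$ .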
PPO adds a term that keeps $p_\theta$ and $p_{\theta'}$ from differing too much: $J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta)-\beta KL(\theta,\theta')$ , where $\beta$ is adjusted dynamically. If $KL(\theta,\theta')>KL_{max}$ , increase $\beta$ ; if $KL(\theta,\theta')<KL_{min}$ , decrease $\beta$ .
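The penalized objective and the adaptive $\beta$ rule can be sketched as below. Several details are assumptions, not stated in the notes: the KL term is computed between action distributions (here for a single state, in the direction $KL(\pi_{\theta'} \| \pi_\theta)$ ), the expectation is replaced by a sample mean, and $\beta$ is scaled by a factor of 2. All function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ppo_penalty_objective(ratios, advantages, kl, beta):
    """Sample estimate of J_PPO = E[ratio * A] - beta * KL.

    ratios[i]     = p_theta(a_i | s_i) / p_theta'(a_i | s_i)
    advantages[i] = A^{theta'}(s_i, a_i)
    """
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - beta * kl

def adapt_beta(beta, kl, kl_min, kl_max, factor=2.0):
    """Increase beta when KL is too large, decrease it when KL is too small.
    The scaling factor is an assumption; the notes only give the direction."""
    if kl > kl_max:
        return beta * factor
    if kl < kl_min:
        return beta / factor
    return beta

if __name__ == "__main__":
    old_policy = [0.5, 0.5]   # pi_theta'(. | s), illustrative numbers
    new_policy = [0.6, 0.4]   # pi_theta(. | s), illustrative numbers
    kl = kl_divergence(old_policy, new_policy)
    print(ppo_penalty_objective([1.2, 0.8], [1.0, -1.0], kl, beta=1.0))
    print(adapt_beta(1.0, kl, kl_min=0.001, kl_max=0.01))
```

A larger $\beta$ makes the next round of updates stay closer to $\pi_{\theta'}$ , while a smaller $\beta$ lets the policy move further per update.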