A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0


At the root of all the sophisticated actor-critic algorithms that are designed and applied these days is the vanilla policy gradient algorithm, which essentially is an actor-only algorithm. Nowadays, the actor that learns the decision-making policy is often represented by a neural network. In continuous control problems, this network outputs the relevant distribution parameters to sample appropriate actions.


With so many deep reinforcement learning algorithms in circulation, you’d expect it to be easy to find abundant plug-and-play TensorFlow implementations for a basic actor network in continuous control, but this is hardly the case. Various reasons may exist for this. First, TensorFlow 2.0 was released only in September 2019, differing quite substantially from its predecessor. Second, most implementations focus on discrete action spaces rather than continuous ones. Third, there are many different implementations in circulation, yet some are tailored such that they only work in specific problem settings. It can be a tad frustrating to plow through several hundred lines of code riddled with placeholders and class members, only to find out the approach is not suitable to your problem after all. This article — based on our ResearchGate note [1] — provides a minimal working example that functions in TensorFlow 2.0. We will show that the real magic happens in only three lines of code!


Some mathematical background

In this article, we present a simple and generic implementation for an actor network in the context of the vanilla policy gradient algorithm REINFORCE [2]. In the continuous variant, we usually draw actions from a Gaussian distribution; the goal is to learn an appropriate mean μ and a standard deviation σ. The actor network learns and outputs these parameters.


Let’s formalize this actor network a bit more. Here, the input is the state s or a feature array ϕ(s), followed by one or more hidden layers that transform the input, with the outputs being μ and σ. Once this output is obtained, an action a is randomly drawn from the corresponding Gaussian distribution. Thus, we have a = μ(s) + σ(s)ξ, where ξ ∼ 𝒩(0,1).

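For illustration, drawing such an action in code could look as follows; the values of mu and sigma are placeholders standing in for the network output:

"""Illustrative only: drawing an action a = mu + sigma * xi"""
import tensorflow as tf

mu, sigma = 0.0, 1.0               # placeholder values standing in for the network output
xi = tf.random.normal(shape=(1,))  # xi ~ N(0,1), standard normal noise
action = mu + sigma * xi           # equivalent to sampling from N(mu, sigma^2)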

After taking our action a, we observe a corresponding reward signal v. Together with some learning rate α, we may update the weights in a direction that improves the expected reward of our policy. The corresponding update rule [2] — based on gradient ascent — is given by:


θ_μ ← θ_μ + α · v · (a − μ_θ(s)) / σ_θ(s)² · ∇_θ μ_θ(s)

θ_σ ← θ_σ + α · v · ((a − μ_θ(s))² − σ_θ(s)²) / σ_θ(s)³ · ∇_θ σ_θ(s)

If we use a linear approximation scheme μ_θ(s) = θ^⊤ϕ(s), we may directly apply these update rules to each feature weight. For neural networks, however, it is not as straightforward how to perform this update.

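To make the linear case concrete, here is a minimal sketch of such a direct update for the mean weights; the learning rate, feature vector and reward are arbitrary example values, and σ is kept fixed for simplicity:

"""Illustrative REINFORCE update for a linear Gaussian policy (fixed sigma)"""
import numpy as np

alpha = 0.01                     # learning rate (example value)
phi = np.array([1.0, 0.5])       # feature vector phi(s) (example values)
theta_mu = np.zeros(2)           # feature weights of the mean
sigma = 1.0                      # standard deviation, kept fixed here

mu = theta_mu @ phi              # mu_theta(s) = theta^T phi(s)
a = np.random.normal(mu, sigma)  # draw action from N(mu, sigma^2)
v = 1.0                          # observed reward (example value)

# Gradient-ascent step on the mean weights; for the linear case grad_theta mu = phi(s)
theta_mu += alpha * v * (a - mu) / sigma**2 * phi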

Neural networks are trained by minimizing a loss function. We often compute the loss as the mean-squared error (squaring the difference between the predicted and observed values). For instance, in a critic network the loss could be defined as (rₜ + Qₜ₊₁ − Qₜ)², with Qₜ being the predicted value and rₜ + Qₜ₊₁ the observed value. After computing the loss, we backpropagate it through the network, computing the partial losses and gradients required to update the network weights.

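For comparison, such a critic loss is just a squared error; a minimal sketch with made-up numbers:

"""Illustrative squared-error loss for a critic (example values)"""
import tensorflow as tf

q_predicted = tf.constant(2.0)  # Q_t, the predicted value
reward = tf.constant(1.0)       # r_t, the observed reward
q_next = tf.constant(1.5)       # Q_{t+1}, the bootstrapped next value

target = reward + q_next                # observed value r_t + Q_{t+1}
loss = tf.square(target - q_predicted)  # (r_t + Q_{t+1} - Q_t)^2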

At first glance, the update equations have little in common with such a loss function. We simply try to improve our policy by moving in a certain direction, but do not have an explicit ‘target’ or ‘true value’ in mind. Indeed, we will need to define a ‘pseudo loss function’ that helps us update the network [3]. The link between the traditional update rules and this loss function becomes clearer when the update rule is expressed in its generic form:


θ ← θ + α · v · ∇_θ log π_θ(a|s)

Transformation into a loss function is fairly straightforward. As the loss is only the input for the backpropagation procedure, we first drop the learning rate α and gradient ∇_θ. Furthermore, neural networks are updated using gradient descent instead of gradient ascent, so we must add a minus sign. These steps yield the following loss function:


L(θ) = −v · log π_θ(a|s)

Quite similar to the update rule, right? To provide some intuition: recall that the log transformation yields a negative number for all values smaller than 1. If we have an action with a low probability and a high reward, we’d want to observe a large loss, i.e., a strong signal to update our policy in the direction of that high reward. The loss function does precisely that.

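A toy calculation (with made-up numbers) makes this concrete:

"""Toy calculation: low probability plus high reward yields a large loss"""
import numpy as np

pdf_value = 0.05                    # the action was unlikely under the current policy
reward = 10.0                       # yet it earned a high reward

loss = -reward * np.log(pdf_value)  # -10 * log(0.05) ≈ 30, a strong update signal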

To apply the update for a Gaussian policy, we can simply substitute π_θ with the Gaussian probability density function (pdf) — note that in the continuous domain we work with pdf values rather than actual probabilities — to obtain the so-called weighted Gaussian log likelihood loss function:


L(θ) = −v · log( 1/(σ_θ(s)·√(2π)) · exp(−(a − μ_θ(s))² / (2σ_θ(s)²)) )

TensorFlow 2.0 implementation

Enough mathematics for now, it’s time for the implementation.


We just defined the loss function, but unfortunately we cannot directly apply it in TensorFlow 2.0. When training a neural network, you may be used to something like model.compile(loss='mse', optimizer=opt), followed by model.fit or model.train_on_batch, but this doesn’t work here. First of all, the Gaussian log likelihood loss function is not a default one in TensorFlow 2.0 — it is in the Theano library, for example [4] — meaning we have to create a custom loss function. More restrictive though: TensorFlow 2.0 requires a loss function to have exactly two arguments, y_true and y_predicted. As we just saw, we have three arguments due to multiplying with the reward. Let’s worry about that later though and first present our custom Gaussian loss function:


"""Weighted Gaussian log likelihood loss function for RL"""
def custom_loss_gaussian(state, action, reward):# Predict mu and sigma with actor networkmu, sigma = actor_network(state)# Compute Gaussian pdf valuepdf_value = tf.exp(-0.5 *((action - mu) / (sigma))**2)* 1 / (sigma * tf.sqrt(2 * np.pi))# Convert pdf value to log probabilitylog_probability = tf.math.log(pdf_value + 1e-5)# Compute weighted lossloss_actor = - reward * log_probabilityreturn loss_actor
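As a side note, the same loss can also be written with TensorFlow Probability’s Normal distribution. The sketch below is an alternative, not the implementation from the linked repository; it assumes the tensorflow_probability package is installed and the same actor_network is available:

"""Alternative sketch of the weighted loss using TensorFlow Probability"""
import tensorflow_probability as tfp

def custom_loss_gaussian_tfp(state, action, reward):
    # Predict mu and sigma with actor network
    mu, sigma = actor_network(state)

    # log pi_theta(a|s) via the Normal distribution object
    log_probability = tfp.distributions.Normal(loc=mu, scale=sigma).log_prob(action)

    # Compute weighted loss
    return -reward * log_probability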

So we have the correct loss function now, but we cannot apply it!? Of course we can — otherwise all of this would have been fairly pointless — it’s just slightly different than you might be used to.


This is where the GradientTape functionality comes in, which is a novel addition to TensorFlow 2.0 [5]. It essentially records your forward steps on a ‘tape’ such that it can apply automatic differentiation. The updating approach consists of three steps [6]. First, in our custom loss function we make a forward pass through the actor network — which is memorized — and calculate the loss. Second, via .trainable_variables, we retrieve the trainable weights encountered during the forward pass. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. Third, with optimizer.apply_gradients we update the network weights, where the optimizer is one of your choosing (e.g., SGD, Adam, RMSprop). In Python, the update steps look as follows:


"""Compute and apply gradients to update network weights"""
with tf.GradientTape() as tape:# Compute Gaussian loss with custom loss functionloss_value = custom_loss_gaussian(state, action, reward)# Compute gradients for actor networkgrads = tape.gradient(loss_value, actor_network.trainable_variables)# Apply gradients to update network weightsoptimizer.apply_gradients(zip(grads, actor_network.trainable_variables))

So in the end, we only need a few lines of code to perform the update!


Numerical example

We present a minimal working example for a continuous control problem; the full code can be found on my GitHub. We consider an extremely simple problem, namely a one-shot game with only one state and a trivial optimal policy. The closer we are to the (fixed but unknown) target, the higher our reward. The reward function is formally denoted as R = ζβ / max(ζ, |τ − a|), with β as the maximum reward, τ as the target and ζ as the target range.

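The reward function translates into a one-liner; in the sketch below β, τ and ζ are set to arbitrary example values (the actual settings are in the linked repository):

"""Illustrative reward function R = zeta * beta / max(zeta, |tau - a|)"""
def reward_function(action, beta=1.0, tau=0.5, zeta=0.1):
    # beta: maximum reward, tau: target, zeta: target range (example values)
    return zeta * beta / max(zeta, abs(tau - action))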

To represent the actor we define a dense neural network (using Keras) that takes the fixed state (a tensor with value 1) as input, performs transformations in two hidden layers with ReLUs as activation functions (five neurons per layer) and returns μ and σ as output. We initialize the bias weights such that we start with μ=0 and σ=1. For our optimizer, we use Adam with its default learning rate of 0.001.

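A Keras sketch of such an actor network is shown below. The layer sizes follow the description above; the zeroed output kernels and the constant bias initializers are one illustrative way to start at μ=0 and σ≈1, not necessarily the exact initialization used in the repository:

"""Sketch of a dense actor network outputting mu and sigma"""
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,))                      # fixed dummy state
hidden1 = layers.Dense(5, activation="relu")(inputs)   # first hidden layer
hidden2 = layers.Dense(5, activation="relu")(hidden1)  # second hidden layer

# Output mu; zero kernel and zero bias give mu = 0 at the start
mu = layers.Dense(1, activation="linear",
                  kernel_initializer="zeros",
                  bias_initializer=tf.keras.initializers.Constant(0.0))(hidden2)

# Output sigma; softplus keeps it positive, and softplus(0.5413) ≈ 1 at the start
sigma = layers.Dense(1, activation="softplus",
                     kernel_initializer="zeros",
                     bias_initializer=tf.keras.initializers.Constant(0.5413))(hidden2)

actor_network = tf.keras.Model(inputs=inputs, outputs=[mu, sigma])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

Calling actor_network(state) then returns the two outputs that custom_loss_gaussian unpacks as mu and sigma.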

Some sample runs are shown in the figure below. Note that the convergence pattern is in line with our expectations. At first the losses are relatively high, causing μ to move in the direction of higher rewards and σ to increase, allowing for more exploration. Once the target is hit, the observed losses decrease, causing μ to stabilize and σ to drop to nearly 0.


[Figure: sample runs showing the convergence of μ (own work by author [1])]

Key points

  • The policy gradient method does not work with traditional loss functions; we must define a pseudo-loss to update actor networks. For continuous control, the pseudo-loss function is simply the negative log of the pdf value multiplied by the reward signal.

  • Several TensorFlow 2.0 update functions only accept custom loss functions with exactly two arguments. The GradientTape functionality does not have this restriction.


  • Actor networks are updated using three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables and (iii) apply the gradients to update the weights of the actor network.


This article is partially based on my ResearchGate paper: ‘Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0’, available at https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20


The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_continuous_control


[1] Van Heeswijk, W.J.A. (2020) Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0. https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20


[2] Williams, R. J. (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229-256.


[3] Levine, S. (2019) CS 285 at UC Berkeley Deep Reinforcement Learning: Policy Gradients. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf


[4] Theanets 0.7.3 documentation. Gaussian Log Likelihood Function. https://theanets.readthedocs.io/en/stable/api/generated/theanets.losses.GaussianLogLikelihood.html#theanets.losses.GaussianLogLikelihood


[5] Rosebrock, A. (2020) Using TensorFlow and GradientTape to train a Keras model. https://www.tensorflow.org/api_docs/python/tf/GradientTape


[6] Nandan, A. (2020) Actor Critic Method. https://keras.io/examples/rl/actor_critic_cartpole/


Originally published at: https://towardsdatascience.com/a-minimal-working-example-for-continuous-policy-gradients-in-tensorflow-2-0-d3413ec38c6b
