A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
Training discrete actor networks with TensorFlow 2.0 is easy once you know how to do it, but also rather different from implementations in TensorFlow 1.0. As the 2.0 version was only released in September 2019, most examples that circulate on the web are still designed for TensorFlow 1.0. In a related article, in which we also discuss the mathematics in more detail, we already treated the continuous case. Here, we use a simple multi-armed bandit problem to show how we can implement and update an actor network in the discrete setting [1].
A bit of mathematics
We use the classical policy gradient algorithm REINFORCE, in which the actor is represented by a neural network known as the actor network. In the discrete case, the network output is simply the probability of selecting each of the actions. So, if the set of actions is denoted by A and an action by a ∈ A, then the network outputs the probabilities p(a), ∀a ∈ A. The input layer contains the state s or a feature array ϕ(s), followed by one or more hidden layers that transform the input, with the output being the probabilities for each action that might be selected.
The policy π is parameterized by θ, which in deep reinforcement learning represents the neural network weights. After each action we take, we observe a reward v. Computing the gradients for θ and using learning rate α, the update rule typically encountered in textbooks looks as follows [2,3]:
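In standard notation (a reconstruction, as the equation appeared as an image in the original article), the textbook REINFORCE update rule reads:

$$
\theta \leftarrow \theta + \alpha \, v \, \nabla_\theta \log \pi_\theta(a \mid s)
$$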
When applying backpropagation updates to neural networks we must slightly modify this update rule, but the procedure follows the same lines. Although we might update the network weights manually, we typically prefer to let TensorFlow (or whatever library you use) handle the update. We only need to provide a loss function; the computer handles the calculation of gradients and other fancy tricks such as customized learning rates. In fact, the sole thing we have to do is add a minus sign, as we perform gradient descent rather than ascent. Thus, the loss function, which is known as the log loss function or cross-entropy loss function [4], looks like this:
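In the same notation, the pseudo-loss becomes:

$$
\mathcal{L}(\theta) = -v \, \log \pi_\theta(a \mid s)
$$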
TensorFlow 2.0 implementation
Now let’s move on to the actual implementation. If you have some experience with TensorFlow, you likely first compile your network with model.compile and then perform model.fit or model.train_on_batch to fit the network to your data. As TensorFlow 2.0 requires a loss function to have exactly two arguments (y_true and y_predicted), we cannot use these methods here, since we need the action, state and reward as input arguments. The GradientTape functionality, which did not exist in TensorFlow 1.0 [5], conveniently solves this problem. After storing a forward pass through the actor network on a 'tape', it is able to perform automatic differentiation in a backward pass later on.
We start by defining our cross-entropy loss function:
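The corresponding snippet is not included in this copy; a minimal sketch of such a pseudo-loss, where the name cross_entropy_loss and the small stability offset are our own choices, could look like this:

```python
import tensorflow as tf

def cross_entropy_loss(probability_of_action, reward):
    # Pseudo-loss: negative log probability of the chosen action, weighted by the reward
    log_probability = tf.math.log(probability_of_action + 1e-8)  # small offset for numerical stability
    return -reward * log_probability
```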
In the next step, we use .trainable_variables to retrieve the network weights. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. With optimizer.apply_gradients we update the network weights using a selected optimizer. As mentioned earlier, it is crucial that the forward pass (in which we obtain the action probabilities from the network) is included in the GradientTape. The code to update the weights is as follows:
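The snippet itself is missing here as well; a sketch of such an update step, assuming a hypothetical actor_network Keras model, a Keras optimizer, and the cross_entropy_loss function sketched above, might look as follows:

```python
def update_actor(actor_network, optimizer, state, action, reward):
    with tf.GradientTape() as tape:
        # Forward pass: must be recorded on the tape for automatic differentiation
        action_probabilities = actor_network(state, training=True)
        probability_of_action = action_probabilities[0, action]
        loss = cross_entropy_loss(probability_of_action, reward)

    # Backward pass: compute gradients of the loss w.r.t. the trainable variables
    gradients = tape.gradient(loss, actor_network.trainable_variables)

    # Apply the gradients to the network weights with the chosen optimizer
    optimizer.apply_gradients(zip(gradients, actor_network.trainable_variables))
```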
Multi-armed bandit
In the multi-armed bandit problem, we are able to play several slot machines with unique pay-off properties [6]. Each machine i has a mean payoff μ_i and a standard deviation σ_i, which are unknown to the player. At every decision moment you play one of the machines and observe the reward. After sufficient iterations and exploration, you should be able to fairly accurately estimate the mean reward of each machine. Naturally, the optimal policy is to always play the slot machine with the highest expected payoff.
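As an illustration, the payoff of a machine could be simulated by drawing from a normal distribution with the chosen mean and standard deviation; the normal distribution is an assumption on our part, as only μ_i and σ_i are specified:

```python
import numpy as np

def sample_reward(machine, means, standard_deviations):
    # Hypothetical payoff model: draw from a normal distribution per machine
    return np.random.normal(means[machine], standard_deviations[machine])
```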
Using Keras, we define a dense actor network. It takes a fixed state (a tensor with value 1) as input. We have two hidden layers that use five ReLUs per layer as activation functions. The network outputs the probabilities of playing each slot machine. The bias weights are initialized in such a way that each machine has equal probability at the beginning. Finally, the chosen optimizer is Adam with its default learning rate of 0.001.
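A sketch of such a network, where the zero initialization of the output layer is our assumption for obtaining equal initial probabilities, might look as follows:

```python
from tensorflow import keras

num_machines = 4  # assumed number of slot machines

actor_network = keras.Sequential([
    keras.layers.Dense(5, activation="relu", input_shape=(1,)),  # first hidden layer
    keras.layers.Dense(5, activation="relu"),                    # second hidden layer
    keras.layers.Dense(num_machines, activation="softmax",       # output: probability per machine
                       kernel_initializer="zeros",
                       bias_initializer="zeros"),                # uniform probabilities at the start
])

optimizer = keras.optimizers.Adam(learning_rate=0.001)
```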
We test four settings with differing mean payoffs. For simplicity we set all standard deviations equal. The figures below show the learned probabilities for each slot machine, testing with four machines. As expected, the policy learns to play the machine(s) with the highest expected payoff. Some exploration naturally persists, especially when payoffs are close together. A bit of fine-tuning and you surely will do a lot better during your next Vegas trip.
Key points
- We define a pseudo-loss to update actor networks. For discrete control, the pseudo-loss function is simply the negative log probability multiplied by the reward signal, also known as the log loss or cross-entropy loss function.
- Common TensorFlow 2.0 functions only accept loss functions with exactly two arguments. The GradientTape does not have this restriction.
- Actor networks are updated in three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables, and (iii) apply the gradients to update the weights of the actor network.
This article is partially based on my method paper: ‘Implementing Actor Networks for Discrete Control in TensorFlow 2.0’ [1]
The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_discrete_control
Translated from: https://towardsdatascience.com/a-minimal-working-example-for-discrete-policy-gradients-in-tensorflow-2-0-d6a0d6b1a6d7