SWAP: Softmax-Weighted Average Pooling
Blake Elias is a Researcher at the New England Complex Systems Institute. Shawn Jain is an AI Resident at Microsoft Research.
Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window.
We present a pooling method for convolutional neural networks as an alternative to max-pooling or average pooling. Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window. While the forward-pass values are nearly identical to those of max-pooling, SWAP’s backward pass has the property that all elements in the window receive a gradient update, rather than just the maximum one. We hypothesize that these richer, more accurate gradients can improve the learning dynamics. Here, we instantiate this idea and investigate learning behavior on the CIFAR-10 dataset. We find that SWAP neither allows us to increase learning rate nor yields improved model performance.
Origins
While watching James Martens’ lecture on optimization, from DeepMind / UCL’s Deep Learning course, we noted his point that as learning progresses, you must either lower the learning rate or increase batch size to ensure convergence. Either of these techniques results in a more accurate estimate of the gradient. This got us thinking about the need for accurate gradients. Separately, we had been doing an in-depth review of how backpropagation computes gradients for all types of layers. In doing this exercise for convolution and pooling, we noted that max-pooling only computes a gradient with respect to the maximum value in a window. This discards information — how can we make this better? Could we get a more accurate estimate of the gradient by using all the information?
Max-pooling discards gradient information — how can we make this better?
Further Background
Max-Pooling is typically used in CNNs for vision tasks as a downsampling method. For example, AlexNet used 3x3 Max-Pooling. [cite]
In vision applications, max-pooling takes a feature map as input, and outputs a smaller feature map. If the input image is 4x4, a 2x2 max-pooling operator with a stride of 2 (no overlap) will output a 2x2 feature map. The 2x2 kernel of the max-pooling operator has 2x2 non-overlapping ‘positions’ on the input feature map. For each position, the maximum value in the 2x2 window is selected as the value in the output feature map. The other values are discarded.
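To make this concrete, here is a minimal PyTorch sketch of the 4x4 → 2x2 case described above; the input values are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# A made-up 4x4 feature map, shaped (batch, channels, height, width).
x = torch.tensor([[1., 3., 2., 0.],
                  [4., 2., 1., 5.],
                  [0., 1., 7., 2.],
                  [3., 6., 4., 1.]]).reshape(1, 1, 4, 4)

# 2x2 max-pooling with stride 2: each non-overlapping window keeps its max.
out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.reshape(2, 2))
# tensor([[4., 5.],
#         [6., 7.]])
```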
The implicit assumption is that “bigger values are better,” i.e. larger values are more important to the final output. This modelling decision is motivated by our intuition, although it may not be absolutely correct. [Ed.: Maybe the other values matter as well! In a near-tie situation, propagating gradients to the second-largest value could make it the largest value. This may change the trajectory the model takes as it learns. Updating the second-largest value as well could be the better learning trajectory to follow.]
You might be wondering: is this differentiable? After all, deep learning requires that all operations in the model be differentiable in order to compute gradients. In the purely mathematical sense, this is not a differentiable operation. In practice, in the backward pass, the positions corresponding to the maximum simply copy the inbound gradients, while all the non-maximum positions set their gradients to zero. PyTorch implements this as a custom CUDA kernel.
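The sparse backward pass is easy to see with the same toy input; this is a small sketch of the behavior, not PyTorch’s actual CUDA implementation.

```python
import torch
import torch.nn.functional as F

# Same made-up 4x4 input as before, now tracking gradients.
x = torch.tensor([[1., 3., 2., 0.],
                  [4., 2., 1., 5.],
                  [0., 1., 7., 2.],
                  [3., 6., 4., 1.]], requires_grad=True)

out = F.max_pool2d(x.reshape(1, 1, 4, 4), kernel_size=2, stride=2)
out.sum().backward()  # send a gradient of 1 back through every pooled output

# Only the max position in each 2x2 window receives a gradient; the rest get zero.
print(x.grad)
# tensor([[0., 0., 0., 0.],
#         [1., 0., 0., 1.],
#         [0., 0., 1., 0.],
#         [0., 1., 0., 0.]])
```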
In other words, Max-Pooling generates sparse gradients. And it works! From AlexNet [cite] to ResNet [cite] to Reinforcement Learning [cite cite], it’s widely used.
Many variants have been developed. Average-Pooling outputs the average, instead of the max, over the window. Dilated Max-Pooling makes the window non-contiguous; instead, it uses a checkerboard-like pattern.
Controversially, Geoff Hinton doesn’t like Max-Pooling:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its [sic] true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
[Source: Geoff Hinton on Reddit.]
Motivation
Max-Pooling generates sparse gradients. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?
Sparse gradients discard too much information. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?
Although the outbound gradients generated by Max-Pool are sparse, this operation is typically used in a Conv → Max-Pool chain of operations. Notice that the trainable parameters (i.e., the filter values, F) are all in the Conv operator. Note also that:
dL/dF = Conv(X, dL/dO), where:
dL/dF are the gradients with respect to the convolutional filter
dL/dO is the outbound gradient from Max-Pool, and
X is the input to Conv (forward).
As a result, all positions in the convolutional filter F get gradients. However, those gradients are computed from a sparse matrix dL/dO instead of a dense matrix. (The degree of sparsity depends on the Max-Pool window size.)
Forward:
Backward:
Note also that dL/dF is not sparse, as each non-zero entry of dL/dO sends a gradient value back to all entries of dL/dF.
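A rough numerical check of this, assuming a toy Conv → Max-Pool chain (the layer sizes here are arbitrary and not meant to match our experimental setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)                       # X, the input to Conv
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # the filter F lives here

o = conv(x)          # O = Conv(X, F)
o.retain_grad()      # keep dL/dO around so we can inspect it
F.max_pool2d(o, kernel_size=2, stride=2).sum().backward()

print((o.grad == 0).float().mean())            # 0.75: three of four entries of dL/dO are zero
print((conv.weight.grad == 0).float().mean())  # 0.0: dL/dF is dense
```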
But this raises a question. While dL/dF is not sparse itself, its entries are calculated based on an averaging of sparse inputs. If its inputs (dL/dO, the outbound gradient of Max-Pool) were dense, could dL/dF be a better estimate of the true gradient? How can we make dL/dO dense while still retaining the “bigger values are better” assumption of Max-Pool?
One solution is Average-Pooling. There, all activations pass a gradient backwards, rather than just the max in each window. However, it violates Max-Pool’s assumption that “bigger values are better.”
Enter Softmax-Weighted Average-Pooling (SWAP). The forward pass is best explained as pseudo-code:
average_pool(O, weights=softmax_per_window(O))
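For concreteness, here is one way to write this forward pass in PyTorch for non-overlapping windows, using unfold to gather each window; this is a simplified sketch, and the implementation in our repository may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWAP(nn.Module):
    """Softmax-weighted average pooling (sketch): average_pool(O, weights=softmax_per_window(O))."""

    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride

    def forward(self, o):
        n, c, h, w = o.shape
        # Gather every pooling window into its own column: (N, C * k*k, L).
        windows = F.unfold(o, kernel_size=self.k, stride=self.s)
        windows = windows.view(n, c, self.k * self.k, -1)
        # Softmax over the elements of each window, then take the weighted average.
        weights = windows.softmax(dim=2)
        pooled = (weights * windows).sum(dim=2)           # (N, C, L)
        out_h = (h - self.k) // self.s + 1
        out_w = (w - self.k) // self.s + 1
        return pooled.view(n, c, out_h, out_w)
```

Since the softmax weights in each window sum to one, the weighted sum here is exactly the weighted average in the pseudo-code.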
The softmax operator normalizes the values into a probability distribution; however, it heavily favors large values. This gives it a max-pool-like effect.
On the backward pass, dL/dO is dense, because each pooled output activation depends on all activations in its window — not just the max value. Non-max values in O now receive relatively small, but non-zero, gradients. Bingo!
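A quick check of both properties, reusing the SWAP sketch above: the forward output sits between the window average and the window max (pulled toward the max, with how close depending on the scale of the activations), and every input position receives a gradient.

```python
import torch
import torch.nn.functional as F

o = torch.randn(1, 1, 4, 4, requires_grad=True)
swap = SWAP(kernel_size=2, stride=2)   # the sketch class defined above

print(swap(o))                 # compare against the two baselines:
print(F.max_pool2d(o, 2, 2))
print(F.avg_pool2d(o, 2, 2))

swap(o).sum().backward()
print((o.grad != 0).all())     # dense gradient: tensor(True)
```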
Experimental Setup
We conducted our experiments on CIFAR10. Our code is available here. We fixed the architecture of the network to:
We tested three different variants of the “Pool” layer: two baselines (Max-Pool and Average-Pool), in addition to SWAP. Models were trained for 100 epochs using SGD, LR=1e-3 (unless otherwise mentioned).
We also trained SWAP with a {25, 50, 400}% increase in LR. This was to test the idea that, with more accurate gradients we could take larger steps, and with larger steps the model would converge faster.
Results
Discussion
SWAP shows worse performance compared to both baselines. We do not understand why this is the case. An increase in LR provided no benefit; generally, worse performance vs baseline was observed as LR increased. We attribute the 400% increase in LR performing better than the 50% increase to randomness; we tested with only a single random seed and reported only a single trial. Another possible explanation for the 400% increase performing better is simply the ability to “cover more ground” with a higher LR.
An increase in LR provided no benefit; generally, worse performance vs baseline was observed as LR increased.
Future Work and Conclusion
While SWAP did not show improvement, we still want to try several experiments:
Overlapping pool windows. One possibility is to use overlapping pool windows (i.e. stride = 1), rather than the disjoint windows we used here (with stride = 2). Modern convolutional architectures like AlexNet and ResNet both use overlapping pool windows. So, for a fair comparison, it would be sensible to compare with something closer to the state of the art, rather than the architecture we used here for simplicity. Indeed, Hinton’s critique of max-pooling is most stringent in the case of non-overlapping pool windows, with the reasoning that this throws out spatial information.
Histogram of activations. We would like to try Max-Pool & SWAP with the exact same initialization, train both, and compare the distributions of gradients. Investigating the difference in gradients may offer a better understanding of the stark contrast in training behavior.
Improving gradient accuracy is still an exciting area. How else can we modify the model or the gradient computation to improve gradient accuracy?