SWAP: Softmax-Weighted Average Pooling

Blake Elias is a Researcher at the New England Complex Systems Institute. Shawn Jain is an AI Resident at Microsoft Research.

Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window.

We present a pooling method for convolutional neural networks as an alternative to max-pooling or average pooling. Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window. While the forward-pass values are nearly identical to those of max-pooling, SWAP’s backward pass has the property that all elements in the window receive a gradient update, rather than just the maximum one. We hypothesize that these richer, more accurate gradients can improve the learning dynamics. Here, we instantiate this idea and investigate learning behavior on the CIFAR-10 dataset. We find that SWAP neither allows us to increase learning rate nor yields improved model performance.

Origins

While watching James Martens’ lecture on optimization, from DeepMind / UCL’s Deep Learning course, we noted his point that as learning progresses, you must either lower the learning rate or increase batch size to ensure convergence. Either of these techniques results in a more accurate estimate of the gradient. This got us thinking about the need for accurate gradients. Separately, we had been doing an in-depth review of how backpropagation computes gradients for all types of layers. In doing this exercise for convolution and pooling, we noted that max-pooling only computes a gradient with respect to the maximum value in a window. This discards information — how can we make this better? Could we get a more accurate estimate of the gradient by using all the information?

Max-pooling discards gradient information — how can we make this better?

Further Background

Max-Pooling is typically used in CNNs for vision tasks as a downsampling method. For example, AlexNet used 3x3 Max-Pooling. [cite]

In vision applications, max-pooling takes a feature map as input, and outputs a smaller feature map. If the input image is 4x4, a 2x2 max-pooling operator with a stride of 2 (no overlap) will output a 2x2 feature map. The 2x2 kernel of the max-pooling operator has 2x2 non-overlapping ‘positions’ on the input feature map. For each position, the maximum value in the 2x2 window is selected as the value in the output feature map. The other values are discarded.

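To make this concrete, here is a minimal sketch in PyTorch (the 4x4 feature-map values are made up for illustration):

import torch
import torch.nn.functional as F

# A made-up 4x4 feature map (batch=1, channel=1).
x = torch.tensor([[[[1., 3., 2., 0.],
                    [5., 4., 1., 1.],
                    [0., 2., 6., 3.],
                    [1., 1., 2., 2.]]]])

# 2x2 window, stride 2: four non-overlapping positions.
out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out)
# tensor([[[[5., 2.],
#           [2., 6.]]]])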

The implicit assumption is “bigger values are better,” i.e. larger values are more important to the final output. This modelling decision is motivated by our intuition, although it may not be absolutely correct. [Ed.: Maybe the other values matter as well! In a near-tie situation, propagating gradients to the second-largest value could make it the largest value. This may change the trajectory the model takes as it learns. Updating the second-largest value as well could be the better learning trajectory to follow.]

You might be wondering: is this differentiable? After all, deep learning requires that all operations in the model be differentiable in order to compute gradients. In the purely mathematical sense, max-pooling is not a differentiable operation. In practice, on the backward pass, each position holding its window’s maximum simply copies the inbound gradient, and every non-maximum position sets its gradient to zero. PyTorch implements this as a custom CUDA kernel.

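This is easy to observe with autograd. In the small sketch below (reusing the made-up 4x4 input from above), back-propagating a gradient of 1 through every output entry leaves a non-zero gradient only at each window’s arg-max:

import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 3., 2., 0.],
                    [5., 4., 1., 1.],
                    [0., 2., 6., 3.],
                    [1., 1., 2., 2.]]]], requires_grad=True)

out = F.max_pool2d(x, kernel_size=2, stride=2)
out.sum().backward()   # a gradient of 1 reaches every output entry

print(x.grad)
# Only the maximum of each 2x2 window receives a gradient; the rest get zero:
# tensor([[[[0., 0., 1., 0.],
#           [1., 0., 0., 0.],
#           [0., 1., 1., 0.],
#           [0., 0., 0., 0.]]]])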

In other words, Max-Pooling generates sparse gradients. And it works! From AlexNet [cite] to ResNet [cite] to Reinforcement Learning [cite cite], it’s widely used.

Many variants have been developed. Average-Pooling outputs the average, instead of the max, over the window. Dilated Max-Pooling makes the window non-contiguous; instead, it uses a checkerboard-like pattern.

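For reference, both variants are available directly in PyTorch (average pooling as its own module, and dilation as a parameter of max-pooling):

import torch.nn as nn

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)                 # mean over each window
dilated_max = nn.MaxPool2d(kernel_size=2, stride=2, dilation=2)  # non-contiguous, checkerboard-like window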

[Figure: a dilated, checkerboard-like pooling window. Source: arXiv, via StackOverflow.]

Controversially, Geoff Hinton doesn’t like Max-Pooling:

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its [sic] true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).

[Source: Geoff Hinton on Reddit.]

Motivation

Max-Pooling generates sparse gradients. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?

Sparse gradients discard too much information. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?

Although the outbound gradients generated by Max-Pool are sparse, this operation is typically used in a Conv → Max-Pool chain of operations. Notice that the trainable parameters (i.e., the filter values, F) are all in the Conv operator. Note also that:

dL/dF = Conv(X, dL/dO), where:

  • dL/dF are the gradients with respect to the convolutional filter

  • dL/dO is the outbound gradient from Max-Pool, and

  • X is the input to Conv (forward).

As a result, all positions in the convolutional filter F get gradients. However, those gradients are computed from a sparse matrix dL/dO instead of a dense matrix. (The degree of sparsity depends on the Max-Pool window size.)

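Both claims, the identity dL/dF = Conv(X, dL/dO) and the sparsity of dL/dO, are easy to sanity-check numerically. The sketch below is our own check (single input channel, stride 1, no padding, and PyTorch’s cross-correlation convention for Conv):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1, 1, 6, 6)                           # input to the conv
W = torch.randn(1, 1, 3, 3, requires_grad=True)       # conv filter F

O = F.conv2d(X, W)                                    # conv output, shape 1x1x4x4
O.retain_grad()                                       # keep dL/dO for inspection
loss = F.max_pool2d(O, kernel_size=2, stride=2).sum()
loss.backward()

print((O.grad != 0).float().mean())                   # 0.25: one non-zero entry per 2x2 window
manual_dW = F.conv2d(X, O.grad)                       # Conv(X, dL/dO)
print(torch.allclose(W.grad, manual_dW, atol=1e-5))   # True: matches autograd's dL/dF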

Forward:

[Figure: the forward pass of the Conv → Max-Pool chain.]

Backward:

Figure 3: Max pooling generates sparse gradients. (Authors’ image)

Note also that dL/dF is not sparse, as each non-zero entry of the sparse dL/dO sends a gradient value back to all entries of dL/dF.

But this raises a question. While dL/dF is not sparse itself, its entries are calculated from an averaging of sparse inputs. If its input (dL/dO, the outbound gradient of Max-Pool) were dense, could dL/dF be a better estimate of the true gradient? How can we make dL/dO dense while still retaining the “bigger values are better” assumption of Max-Pool?

One solution is Average-Pooling. There, all activations pass a gradient backwards, rather than just the max in each window. However, it violates Max-Pool’s assumption that “bigger values are better.”

Enter Softmax-Weighted Average-Pooling (SWAP). The forward pass is best explained as pseudo-code:

average_pool(O, weights=softmax_per_window(O))

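One way to turn this pseudo-code into an actual layer is sketched below. This is our own PyTorch implementation of the idea (assuming non-overlapping windows extracted with unfold); the authors’ released code may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SWAP(nn.Module):
    """Softmax-weighted average pooling: average each window, weighting its
    entries by their softmax so that the largest entry dominates."""

    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride

    def forward(self, x):
        n, c, h, w = x.shape
        k, s = self.kernel_size, self.stride
        # Gather each pooling window: (N, C*k*k, L) -> (N, C, k*k, L)
        windows = F.unfold(x, kernel_size=k, stride=s).view(n, c, k * k, -1)
        weights = torch.softmax(windows, dim=2)       # softmax per window
        pooled = (weights * windows).sum(dim=2)       # softmax-weighted average
        h_out = (h - k) // s + 1
        w_out = (w - k) // s + 1
        return pooled.view(n, c, h_out, w_out)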

Figure 4: SWAP produces a value almost the same as max-pooling — but passes gradients back to all entries in the window. (Authors’ image)

The softmax operator normalizes the values into a probability distribution; however, it heavily favors large values. This gives it a max-pool-like effect.

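For example (made-up numbers), a window whose largest entry clearly dominates receives nearly all of the weight:

import torch

w = torch.softmax(torch.tensor([6.0, 2.0, 1.0, 0.0]), dim=0)
print(w)   # ~[0.973, 0.018, 0.007, 0.002]: almost all the weight sits on the largest value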

On the backward pass, dL/dO is dense, because each output activation in A (the pooled output) depends on all activations in its window, not just the max value. Non-max values in O now receive relatively small, but non-zero, gradients. Bingo!

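Reusing the SWAP sketch and the made-up 4x4 input from earlier, this is easy to verify:

x = torch.tensor([[[[1., 3., 2., 0.],
                    [5., 4., 1., 1.],
                    [0., 2., 6., 3.],
                    [1., 1., 2., 2.]]]], requires_grad=True)

pool = SWAP(kernel_size=2, stride=2)
out = pool(x)          # each output is pulled toward its window's max ([[5, 2], [2, 6]] under max-pooling)
out.sum().backward()
print(x.grad)          # dense: every entry is non-zero, with the largest gradients at each window's max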

Experimental Setup

We conducted our experiments on CIFAR-10. Our code is available here. We fixed the architecture of the network to:

[Figure: the fixed network architecture used in our experiments.]

We tested three different variants of the “Pool” layer: two baselines (Max-Pool and Average-Pool), in addition to SWAP. Models were trained for 100 epochs using SGD, LR=1e-3 (unless otherwise mentioned).

We also trained SWAP with a {25, 50, 400}% increase in LR. This was to test the idea that, with more accurate gradients, we could take larger steps, and with larger steps the model would converge faster.

Results

[Figure: results.]

Discussion

SWAP shows worse performance than both baselines. We do not understand why this is the case. An increase in LR provided no benefit; generally, worse performance versus baseline was observed as LR increased. We attribute the 400% LR increase performing better than the 50% increase to randomness; we tested with only a single random seed and reported only a single trial. Another possible explanation for the 400% increase performing better is simply the ability to “cover more ground” with a higher LR.

An increase in LR provided no benefit; generally, worse performance vs baseline was observed as LR increased.

Future Work and Conclusion

While SWAP did not show improvement, we still want to try several experiments:

  • Overlapping pool windows. One possibility is to use overlapping pool windows (i.e. stride = 1), rather than the disjoint windows we used here (with stride = 2). Modern convolutional architectures like AlexNet and ResNet both use overlapping pool windows. So, for a fair comparison, it would be sensible to compare against something closer to the state of the art, rather than the architecture we used here for simplicity. Indeed, Hinton’s critique of max-pooling is most stringent in the case of non-overlapping pool windows, his reasoning being that this throws out spatial information.

  • Histogram of activations. We would like to try Max-Pool & SWAP with the exact same initialization, train both, and compare the distributions of gradients. Investigating the difference in gradients may offer a better understanding of the stark contrast in training behavior.

Improving gradient accuracy is still an exciting area. How else can we modify the model or the gradient computation to improve gradient accuracy?

Translated from: https://towardsdatascience.com/swap-softmax-weighted-average-pooling-70977a69791b
