A Quick Guide to Gradient Descent and Its Variants

In this article, I am going to discuss the Gradient Descent algorithm. The next article will be a continuation of this one, where I will discuss optimizers in neural networks. To understand those optimizers, it’s important to first get a deep understanding of Gradient Descent.

Content-

  1. Gradient Descent
  2. Choice of Learning Rate
  3. Batch Gradient Descent
  4. Stochastic Gradient Descent
  5. Mini Batch Gradient Descent
  6. Conclusion
  7. Credits

Gradient Descent-

Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To get the parameter values that minimize our objective function, we iteratively move in the direction opposite to the gradient of that function; in simple terms, in each iteration we take a step in the direction of steepest descent. The size of each step is determined by a parameter called the Learning Rate. Gradient Descent is a first-order algorithm because it uses the first-order derivative of the loss function to find minima. Gradient Descent works in spaces of any number of dimensions.

Steps in Gradient Descent-

  1. Initialize the parameters (weights and bias) randomly.
  2. Choose the learning rate (‘η’).
  3. Repeat the following update until convergence-

    wₜ₊₁ = wₜ − η · ∂L/∂wₜ

Here ‘wₜ’ is the parameter whose value we have to find, ‘η’ is the learning rate, and L represents the cost function.

By repeat until convergence we mean: repeat until the old value of the weight is approximately equal to its new value, i.e. repeat until the difference between the old value and the new value is very small.

Another important thing to keep in mind is that all weights need to be updated simultaneously; updating one parameter before computing the gradient of another will yield a wrong implementation.
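Here is a minimal sketch of these steps in NumPy, assuming a mean-squared-error cost on a toy linear-regression problem (the function name, the tolerance-based convergence check, and all other names are illustrative choices, not a fixed recipe). Note how both gradients are computed from the current values of the parameters before either of them is updated.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=1000, tol=1e-8):
    """Gradient descent sketch for linear regression with an MSE cost (illustrative)."""
    m, d = X.shape
    w = np.random.randn(d)                 # 1. initialize the weights randomly
    b = 0.0                                #    and the bias
    for _ in range(n_iters):               # 3. repeat the update ...
        error = X @ w + b - y
        grad_w = (2 / m) * X.T @ error     # dL/dw, using the *current* w and b
        grad_b = (2 / m) * error.sum()     # dL/db, also using the current w and b
        new_w = w - lr * grad_w            # simultaneous update: both gradients
        new_b = b - lr * grad_b            # were computed before anything changed
        if np.max(np.abs(new_w - w)) < tol and abs(new_b - b) < tol:
            w, b = new_w, new_b
            break                          # ... until the parameters barely change
        w, b = new_w, new_b
    return w, b
```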

[Image: illustration of Gradient Descent on a series of level sets (Source: Wikipedia)]

Choice of Learning Rate (η)-

Choosing an appropriate value of the learning rate is very important because it determines how far we descend in each iteration. If the learning rate is too small, each step will be small and convergence will be delayed or may not happen at all. On the other hand, if the learning rate is too large, gradient descent will overshoot the minimum point and will ultimately fail to converge.

To check this, the best thing is to calculate the cost function at each iteration and then plot it against the number of iterations. If the cost ever increases, we need to decrease the learning rate, and if the cost is decreasing at a very slow rate, we need to increase the learning rate.
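As a rough illustration of this diagnostic (again assuming the toy MSE setup from the earlier sketch; the names are illustrative), we can record the cost at every iteration and plot the resulting curve: a rising curve suggests the learning rate is too large, a nearly flat one suggests it is too small.

```python
import numpy as np

def gradient_descent_with_history(X, y, lr=0.01, n_iters=500):
    """Run gradient descent and record the MSE cost at every iteration (illustrative)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    costs = []
    for _ in range(n_iters):
        error = X @ w + b - y
        costs.append(np.mean(error ** 2))      # cost at the current parameters
        w = w - lr * (2 / m) * X.T @ error
        b = b - lr * (2 / m) * error.sum()
    return w, b, costs

# Example usage (assuming X, y are defined):
#   w, b, costs = gradient_descent_with_history(X, y, lr=0.05)
#   import matplotlib.pyplot as plt
#   plt.plot(costs); plt.xlabel("iteration"); plt.ylabel("cost"); plt.show()
```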

Apart from choosing the right value of the learning rate, another thing that can be done to optimize gradient descent is to normalize the data to a specific range. For this, we can use any kind of standardization technique, like min-max standardization, mean-variance standardization, etc. If we don’t normalize our data, features with a large scale will dominate and gradient descent will take many unnecessary steps.
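For example, minimal sketches of the two techniques just mentioned might look like this (the small epsilon is only a safeguard against constant features, not part of the technique itself):

```python
import numpy as np

def min_max_standardize(X):
    """Rescale each feature to the [0, 1] range (min-max standardization)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)   # epsilon avoids division by zero

def mean_variance_standardize(X):
    """Rescale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
```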

In your school mathematics, you probably came across a method of solving optimization problems by computing the derivative, equating it to zero, and then using the second derivative to check whether that point is a minimum, a maximum, or a saddle point. A question comes to mind: why don’t we use that method in machine learning for optimization? The problem with that method is that its time complexity is very high, and it will be very slow if our dataset is large. Hence gradient descent is preferred.
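For linear regression specifically, setting the derivative to zero leads to the closed-form normal equation; a minimal sketch is shown below (bias term omitted for brevity). Solving the resulting d × d system costs roughly O(d³) in the number of features, on top of forming XᵀX, which is why the iterative approach scales better.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution obtained by setting dL/dw = 0 (illustrative).

    Solves (X^T X) w = X^T y. Factorizing the d x d matrix costs about O(d^3),
    which quickly becomes expensive for high-dimensional data.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```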

Gradient descent finds a minimum of a function. If that function is convex, then its local minimum is also its global minimum. However, if the function is not convex, we might end up at a saddle point. To prevent this from happening, there are some optimizations that we can apply to Gradient Descent.

Limitations of Gradient Descent-

  1. The convergence rate of gradient descent is slow. If we try to speed it up by increasing the learning rate, we may overshoot the minima.
  2. If we apply Gradient Descent to a non-convex function, we may end up at a local minimum or a saddle point.
  3. For large datasets, memory consumption will be very high.

Gradient Descent is the most common optimization technique used throughout machine learning. Let’s discuss some variations of Gradient Descent.

Batch Gradient Descent-

Batch Gradient Descent is one of the most common versions of Gradient Descent. When we say Gradient Descent in general, we are usually talking about batch gradient descent. It works by taking all the data points available in the dataset to compute the gradient and update the parameters. It works fairly well for a convex function and follows a straight trajectory to the minimum point. However, it is slow and hard to compute for large datasets.
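To make this concrete, here is how the batch_gradient_descent sketch from earlier could be run on a hypothetical toy dataset; every single update uses all 1,000 points at once.

```python
import numpy as np

# Hypothetical toy data: 1,000 points, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=1000)

# Every update in batch gradient descent uses all 1,000 points at once.
w, b = batch_gradient_descent(X, y, lr=0.05, n_iters=2000)
print(w, b)   # should end up close to [2.0, -1.0, 0.5] and 3.0
```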

Advantages-

  1. Gives a stable trajectory to the minimum point.
  2. Computationally efficient for small datasets.

Limitations-

  1. Slow for large datasets.

Stochastic Gradient Descent-

Stochastic Gradient Descent is a variation of gradient descent that considers only one point at a time to update the weights. We do not calculate the total error for the whole dataset in one step; instead, we calculate the error of each point and use it to update the weights. So basically it increases the number of updates, but each update requires much less computation. It is based on the assumption that the error over the dataset is additive across points. Since we are considering just one example at a time, the cost will fluctuate and will not necessarily decrease at each step, but in the long run it will decrease. The steps in Stochastic Gradient Descent are-

  1. Initialize weights randomly and choose a learning rate.
  2. Repeat until an approximate minimum is obtained-
  • Randomly shuffle the dataset.
  • For each point in the dataset, i.e. for each of the m points, update-

    wₜ₊₁ = wₜ − η · ∂Lᵢ/∂wₜ

where Lᵢ is the loss computed on the i-th point alone.

Shuffling the whole dataset is done to reduce variance and to make sure the model remains general and overfits less. By shuffling the data, we ensure that each data point creates an “independent” change in the model, without being biased by the points that came before it.
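A minimal sketch of these steps, assuming the same toy linear-regression setup and MSE loss as in the earlier sketches (names are illustrative): the dataset is reshuffled every epoch and the weights are updated once per data point.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=20):
    """SGD sketch for linear regression: one data point per weight update (illustrative)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):          # randomly shuffle the dataset each epoch
            error = X[i] @ w + b - y[i]       # error of a single point
            w = w - lr * 2 * error * X[i]     # noisy gradient from that one point
            b = b - lr * 2 * error
    return w, b
```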

[Image: trajectories of SGD and batch gradient descent toward the minimum (source in the original article)]

It’s clear from the above image that SGD moves toward the minima with a lot of fluctuations, whereas GD follows a straight trajectory.

Advantages-

  1. It is easy to fit in memory as only one data point needs to be processed at a time.
  2. It updates the weights more frequently than batch gradient descent and hence it converges faster.
  3. Each update is computationally less expensive than in batch gradient descent.
  4. In the case of a non-convex function, the randomness or noise introduced by stochastic gradient descent allows us to escape local minima and reach a better minimum.

Disadvantages-

  1. Because of the large fluctuations in each step, SGD may never settle exactly at the minima and may keep oscillating around them.
  2. Each step of SGD is very noisy, and the descent fluctuates in different directions.

So, as discussed above, SGD is a better idea than batch GD in the case of large datasets, but with SGD we have to compromise on accuracy. However, there are various variations of SGD, which I will discuss in the next blog, that improve it to a great extent.

Mini Batch Gradient Descent-

In Mini Batch Gradient Descent, instead of using the complete dataset for calculating the gradient, we use only a mini-batch of it. The size of a batch is a hyperparameter and is generally chosen as a multiple of 32, e.g. 32, 64, 128, 256, etc. Let’s see its update rule-

  1. Initialize weights randomly and choose a learning rate.
  2. Repeat until convergence- for each mini-batch of b points, update-

    wₜ₊₁ = wₜ − η · (1/b) Σᵢ ∂Lᵢ/∂wₜ

Here ‘b’ is the batch size and the sum runs over the points of the current mini-batch.
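A minimal sketch of mini-batch gradient descent under the same toy linear-regression assumptions as before (the batch size, the names, and the per-epoch shuffle are illustrative choices): the gradient of each update is averaged over the points of the current mini-batch.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, lr=0.01, batch_size=32, n_epochs=50):
    """Mini-batch gradient descent sketch for linear regression (illustrative)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(m)                       # shuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]        # points of the current mini-batch
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb
            w = w - lr * (2 / len(idx)) * Xb.T @ error   # gradient averaged over the batch
            b = b - lr * (2 / len(idx)) * error.sum()
    return w, b
```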

Advantages-

  1. Faster than the batch version as it considers only a small batch of data at a time for calculating gradients.
  2. Computationally efficient and easily fits in memory.
  3. The noise in the updates makes it less prone to overfitting.
  4. Like SGD, in the case of a non-convex function, the randomness or noise introduced by mini-batch gradient descent allows us to escape local minima and reach a better minimum.
  5. It can take advantage of vectorization.

Disadvantages-

  1. Like SGD, due to the noise, mini-batch gradient descent may not converge exactly at the minima and may oscillate around them.
  2. Although each step of mini-batch gradient descent is faster to compute than a step of batch gradient descent (only a small set of points is considered), in the long run, due to the noise, it takes more steps to reach the minima.

We can say that SGD is simply mini-batch gradient descent with a batch size of 1.

If we compare mini-batch gradient descent and SGD in particular, it’s clear that SGD is noisier than mini-batch gradient descent and hence fluctuates more on its way to convergence. However, it is computationally less expensive per update, and with some variations it can perform much better.

Conclusion-

In this article, we discussed Gradient Descent along with its variations and some related terminologies. In the next article, we will discuss optimizers in neural networks.

Credits-

  1. https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
  2. https://towardsdatascience.com/difference-between-batch-gradient-descent-and-stochastic-gradient-descent-1187f1291aa1
  3. https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1
  4. https://en.wikipedia.org/wiki/Stochastic_gradient_descent
  5. https://en.wikipedia.org/wiki/Gradient_descent

That’s all from my side. Thanks for reading this article. Sources for the few images used are mentioned; the rest of them are my creation. Feel free to post comments and suggest corrections and improvements. Connect with me on Linkedin or mail me at sahdevkansal02@gmail.com. I look forward to hearing your feedback. Check out my Medium profile for more such articles.

Translated from: https://towardsdatascience.com/quick-guide-to-gradient-descent-and-its-variants-97a7afb33add
