The Aha! Moments in 4 Popular Machine Learning Algorithms


Most people fall into one of two camps:

  • I don’t understand these machine learning algorithms.
  • I understand how the algorithms work, but not why they work.

This article seeks to explain not only how these algorithms work, but also to give an intuitive understanding of why they work, to deliver that lightbulb aha! moment.

Decision Trees

Decision Trees divide the feature space using horizontal and vertical lines. For example, consider a very simplistic Decision Tree below, which has one conditional node and two class nodes, indicating a condition and under which category a training point that satisfies it will fall into.



Note that there is a lot of overlap between the regions marked as each color and the data points within them that actually are that color; this mismatch is (roughly) the entropy. The decision tree is constructed to minimize the entropy. In this scenario, we can add an additional layer of complexity: if we add another condition, x less than 6 and y greater than 6, we can designate points in that area as red. This move lowers the entropy.
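The two-condition tree described above can be sketched as nested conditionals. This is a minimal illustration only: the first condition and the labels of the remaining regions are assumptions, since the article specifies just the x < 6, y > 6 region as red.

```python
def classify(x, y):
    """A hypothetical two-condition decision tree: points with x < 6 and
    y > 6 fall into the 'red' region; other regions are assumed 'blue'."""
    if x < 6:
        if y > 6:
            return 'red'
        return 'blue'  # assumed label for x < 6, y <= 6
    return 'blue'      # assumed label for x >= 6

print(classify(3, 8))  # 'red'
print(classify(3, 2))  # 'blue'
```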


At each step, the Decision Tree algorithm attempts to build the tree such that the entropy is minimized. Think of entropy more formally as the amount of ‘disorder’ or ‘confusion’ a certain divider (the conditions) leaves behind, and its opposite as ‘information gain’: how much a divider adds information and insight to the model. Feature splits with the highest information gain (and the lowest entropy) are placed at the top.
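Entropy and information gain can be computed in a few lines. This is a minimal sketch using Shannon entropy over class labels, the standard formulation in tree algorithms like ID3/C4.5; the toy labels are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ['red'] * 4 + ['blue'] * 4

# A clean split separates the classes perfectly: maximum information gain.
clean = information_gain(parent, ['red'] * 4, ['blue'] * 4)   # 1.0

# A split that leaves both children half-and-half gains nothing.
mixed = information_gain(parent, ['red', 'red', 'blue', 'blue'],
                         ['red', 'red', 'blue', 'blue'])      # 0.0
```

The tree-building algorithm evaluates candidate splits like these and keeps the one with the highest gain.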


The conditions may split their one-dimensional features somewhat like this:



Note that condition 1 has clean separation, and therefore low entropy and high information gain. The same cannot be said for condition 3, which is why it is placed near the bottom of the Decision Tree. This construction of the tree ensures that it can remain as lightweight as possible.


You can read more about entropy and its use in Decision Trees as well as neural networks (cross-entropy as a loss function) here.


Random Forest

Random Forest is a bagged (bootstrap aggregated) version of the Decision Tree. The primary idea is that several Decision Trees are each trained on a subset of data. Then, an input is passed through each model, and their outputs are aggregated through a function like a mean to produce a final output. Bagging is a form of ensemble learning.

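The bootstrap-and-aggregate loop can be sketched in a few lines. The ‘models’ below are deliberately trivial placeholders (each just memorises the majority class of its bootstrap sample); a real Random Forest would train a full decision tree on each sample instead.

```python
import random

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate outputs by taking the most common one."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)
data = ['red'] * 7 + ['blue'] * 3

# Each trivial 'tree' votes with the majority class of its own bootstrap sample.
votes = [majority_vote(bootstrap_sample(data, rng)) for _ in range(25)]

# The bagged prediction aggregates the individual votes.
prediction = majority_vote(votes)
```

For regression, the aggregation function would be a mean over the trees’ outputs rather than a vote.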


There are many analogies for why Random Forest works well. Here is a common one:

You need to decide which restaurant to go to next. To ask someone for their recommendation, you must answer a variety of yes/no questions, which will lead them to make their decision for which restaurant you should go to.


Would you rather only ask one friend or ask several friends, then find the mode or general consensus?


Unless you only have one friend, most people would choose the second. The insight this analogy provides is that each tree has some sort of ‘diversity of thought’, because each was trained on different data and hence has different ‘experiences’.

This analogy, clean and simple as it is, never really stood out to me. In the real world, the single friend has less experience than all the friends combined, but in machine learning, the decision tree and random forest models are trained on the same data, and hence have the same experiences. The ensemble model is not actually receiving any new information. If I could ask one all-knowing friend for a recommendation, I would see no objection to that.

How can a model trained on the same data that randomly pulls subsets of the data to simulate artificial ‘diversity’ perform better than one trained on the data as a whole?


Take a sine wave with heavy normally distributed noise. This is your single Decision Tree classifier, which is naturally a very high-variance model.



Suppose 100 ‘approximators’ are chosen. These approximators randomly select points along the noisy sine wave and generate a sinusoidal fit, much like decision trees being trained on subsets of the data. These fits are then averaged to form a bagged curve. The result is a much smoother curve.
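The smoothing effect can be checked numerically. As a simplified stand-in for the article's curve-fitting experiment, each ‘approximator’ below returns the true sine value plus heavy Gaussian noise, and averaging 100 of them at each point recovers a far closer estimate.

```python
import math
import random

rng = random.Random(42)

def noisy_model(x):
    """One high-variance 'approximator': the true sine plus heavy noise."""
    return math.sin(x) + rng.gauss(0, 0.5)

def bagged_model(x, n_models=100):
    """Average the predictions of many independent noisy approximators."""
    return sum(noisy_model(x) for _ in range(n_models)) / n_models

# Averaging shrinks the noise: the standard deviation of the mean of
# 100 draws is 0.5 / sqrt(100) = 0.05, a tenth of a single model's.
xs = [i * 0.1 for i in range(60)]
avg_error = sum(abs(bagged_model(x) - math.sin(x)) for x in xs) / len(xs)
```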


The reason bagging works is that it reduces the variance of models and improves their ability to generalize by artificially making the model more ‘confident’. This is also why bagging does not work as well on already low-variance models like logistic regression.

You can read more about the intuition and more rigorous proof of the success of bagging here.


Support Vector Machines

Support Vector Machines attempt to find a hyperplane that can divide the data best, relying on the concept of ‘support vectors’ to maximize the divide between the two classes.



Unfortunately, most datasets are not so easily separable, and if they were, SVM would likely not be the best algorithm to handle them. Consider this one-dimensional separation task: there is no good divider, since any single split will lump points from two separate classes together.

One proposal for a split.

SVM solves these kinds of problems using a so-called ‘kernel trick’, which projects data into new dimensions to make the separation task easier. For instance, let’s create a new dimension, simply defined as x² (where x is the original dimension):


Now that the data has been projected onto the new dimension (each data point represented in two dimensions as (x, x²)), it is cleanly separable.
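The projection can be verified directly. The two classes below are hypothetical points chosen to match the picture: inseparable on the number line, but separable by a horizontal line once each point x is mapped to (x, x²).

```python
# Class A sits on the outside, class B in the middle: no single 1-D
# threshold can separate them.
class_a = [-3, -2, 2, 3]
class_b = [-1, 0, 1]

def project(x):
    """Map a 1-D point into 2-D feature space as (x, x**2)."""
    return (x, x ** 2)

# In the projected space, the horizontal line x**2 = 2 separates the
# classes perfectly.
separable = (all(project(x)[1] > 2 for x in class_a) and
             all(project(x)[1] < 2 for x in class_b))
print(separable)  # True
```

A kernel lets SVM work in such a space implicitly, without ever materialising the projected coordinates.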

Using a variety of kernels (most popularly, polynomial, sigmoid, and RBF kernels), the kernel trick does the heavy lifting of creating a transformed space in which the separation task is simple.

Neural Networks

Neural Networks are the pinnacle of machine learning. Their discovery, and the unlimited variations and improvements that can be made upon them, have warranted them a field of their own: deep learning. Admittedly, our understanding of why neural networks succeed is still incomplete (“neural networks are matrix multiplications that no one understands”), but the easiest way to explain them is through the Universal Approximation Theorem (UAT).

At their core, all supervised algorithms seek to model some underlying function of the data; usually this is either a regression plane or the feature boundary. Consider a function y = f(x), which can be modelled to arbitrary accuracy with several horizontal steps.


This is essentially what a neural network can do. Perhaps it can be a little more complex and model relationships beyond horizontal steps (like the quadratic and linear segments below), but at its core, the neural network is a piecewise function approximator.
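A minimal sketch of this idea, using y = x² as an assumed example function: approximate it with horizontal steps, where adding more steps, much like adding more neurons, shrinks the worst-case error.

```python
def step_approximation(f, x, n_steps):
    """Approximate f on [0, 1) with n_steps horizontal steps, each holding
    the value of f at the left edge of its interval."""
    width = 1.0 / n_steps
    step_index = min(int(x / width), n_steps - 1)
    return f(step_index * width)

def f(x):
    return x ** 2

# Worst-case error over a fine grid, for a coarse and a fine step count.
xs = [i / 1000 for i in range(1000)]
coarse_error = max(abs(step_approximation(f, x, 5) - f(x)) for x in xs)
fine_error = max(abs(step_approximation(f, x, 100) - f(x)) for x in xs)
# More steps give a tighter fit, mirroring how more hidden units
# refine the pieces of the approximation.
```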


Each node is delegated to one part of the piecewise function, and the purpose of the network is to activate certain neurons responsible for parts of the feature space. For instance, if one were to classify images of men with or without beards, several nodes should be delegated specifically to pixel locations where beards often appear. Somewhere in multi-dimensional space, these nodes represent a numerical range.

Note, again, that the question “why do neural networks work” is still unanswered. The UAT doesn’t answer this question, but states that neural networks, under certain human interpretations, can model any function. The field of Explainable/Interpretable AI is emerging to answer these questions with methods like activation maximization and sensitivity analysis.


You can read a more in-depth explanation and view visualizations of the Universal Approximation Theorem here.


All four algorithms, and many others, look very simplistic in low dimensions. A key realization in machine learning is that a lot of the ‘magic’ and ‘intelligence’ we purport to see in AI is really a simple algorithm hidden under the guise of high dimensionality.

Decision trees splitting regions into squares is simple, but decision trees splitting high-dimensional space into hypercubes is less so. SVM performing a kernel trick to improve separability from one to two dimensions is understandable, but SVM doing the same thing on a dataset with hundreds of dimensions is almost magic.

Our admiration and confusion of machine learning is predicated on our lack of understanding of high-dimensional spaces. Learning how to get around high dimensionality and to understand algorithms in their native space is instrumental to an intuitive understanding.

All images created by author.


Translated from: https://towardsdatascience.com/the-aha-moments-in-4-popular-machine-learning-algorithms-f7e75ef5b317
