t-SNE Explained
t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction method, used mainly for visualizing data in 2D and 3D maps. It can capture non-linear structure in the data, which is why it is so popular. In this post, I’ll give an intuitive explanation of how t-SNE works and then describe the math behind it.
See your data in a lower dimension
So when and why would you want to visualize your data in a low dimension? When working on data with more than 2–3 features you might want to check if your data has clusters in it. This information can help you understand your data and, if needed, choose the number of clusters for clustering models such as k-means.
Now let’s look at a short example that will help us understand what we want to achieve. Let’s say we have data in a 2D space and we want to reduce its dimension to 1D. Here’s an example of data in 2D:
In this example, each color represents a cluster. We can see that each cluster has a different density. We will see how the model deals with that in the dimensional reduction process.
Now, if we try to simply project the data onto just one of its dimensions, we see an overlap of at least two of the clusters:
So we understand that we need to find a better way to do this dimension reduction.
The t-SNE algorithm deals with this problem, and I’ll explain how it works in three stages:
- Calculating a joint probability distribution that represents the similarities between the data points (don’t worry, I’ll explain that soon!).
- Creating a dataset of points in the target dimension and then calculating the joint probability distribution for them as well.
- Using gradient descent to change the dataset in the low-dimensional space so that the joint probability distribution representing it would be as similar as possible to the one in the high dimension.
The Algorithm
First Stage — Dear points, how likely are you to be my neighbors?
The first stage of the algorithm is calculating the Euclidean distance of each point from all of the other points. These distances are then transformed into conditional probabilities that represent the similarity between every two points. What does that mean? It means that we want to evaluate how similar every two points in the data are, or in other words, how likely they are to be neighbors.
The conditional probability of point xⱼ being next to point xᵢ is represented by a Gaussian centered at xᵢ with a standard deviation of σᵢ (I’ll mention later what influences σᵢ). It is written mathematically in the following way:
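In the notation of the original t-SNE paper (van der Maaten & Hinton, 2008), the numerator is the Gaussian affinity between xᵢ and xⱼ, and the denominator sums that same affinity over all other points k ≠ i:

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}
$$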
The reason for dividing by the sum over all the other points of the Gaussian centered at xᵢ is that we may need to deal with clusters of different densities. To explain that, let’s go back to the example of Figure 1. As you can see, the density of the orange cluster is lower than the density of the blue cluster. Therefore, if we calculated the similarity of every two points with a Gaussian alone, we would see lower similarities between the orange points than between the blue ones. In our final output we won’t mind that some clusters had different densities; we just want to see them as clusters, and that is why we do this normalization.
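To make this concrete, here is a minimal NumPy sketch of the computation. It assumes the σᵢ values are already given per point; in the full algorithm each σᵢ is found by a binary search so that the row’s perplexity matches a user-chosen value (more on perplexity below).

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Compute p(j|i) for every pair of points in X (shape: n x d).

    A minimal sketch: `sigmas` (shape: n) is assumed to be given per point,
    whereas real t-SNE finds each sigma_i from the perplexity setting.
    """
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian affinities, one row per "center" point i.
    affinities = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbor
    # Row-wise normalization: this is the step that compensates
    # for clusters having different densities.
    return affinities / affinities.sum(axis=1, keepdims=True)
```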
From the conditional distributions we have created, we calculate the joint probability distribution using the following equation:
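For a dataset of n points, the symmetrized form from the original paper is simply the average of the two conditional probabilities, scaled by the number of points:

$$
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
$$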
Using the joint probability distribution rather than the conditional probabilities is one of the improvements of t-SNE over the original SNE method. The symmetry of the pairwise similarities (pᵢⱼ = pⱼᵢ) helps simplify the calculation in the third stage of the algorithm.
Second Stage — Creating data in a low dimension
In this stage, we create a dataset of points in a low-dimensional space and calculate a joint probability distribution for them as well.
To do that, we build a random dataset with the same number of points as the original dataset and K features, where K is our target dimension. Usually, K will be 2 or 3 if we want to use the dimension reduction for visualization. Going back to our example, at this stage the algorithm builds a random dataset of points in 1D:
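A rough sketch of this initialization in NumPy, assuming (as the original paper does) a narrow Gaussian around the origin for the starting positions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 1                       # number of points, target dimension K
# Small random starting positions for the low-dimensional map; the exact
# scale is not critical, it just needs to start near the origin.
Y = rng.normal(loc=0.0, scale=1e-4, size=(n, k))
```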
For this set of points, we will again create a joint probability distribution, but this time using the t-distribution rather than the Gaussian distribution we used for the original dataset. This is another advantage of t-SNE over the original SNE (the t in t-SNE stands for t-distribution), which I will explain shortly. We will denote the probabilities here by q, and the points by y.
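Concretely, the original paper defines the low-dimensional similarities with a Student t-distribution with one degree of freedom (a Cauchy distribution):

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$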
The reason for choosing the t-distribution rather than the Gaussian distribution is the heavy-tails property of the t-distribution. This property causes moderate distances between points in the high-dimensional space to become more extreme in the low-dimensional space, which helps prevent “crowding” of the points in the lower dimension. Another advantage of using the t-distribution is an improvement in the optimization process in the third part of the algorithm.
Third Stage — Let the magic happen!
Or in other words, change your dataset in the low-dimensional space so it will best visualize your data
Now we use the Kullback–Leibler divergence to make the joint probability distribution of the data points in the low dimension as similar as possible to the one from the original dataset. If this transformation succeeds, we get a good dimension reduction.
I’ll briefly explain what the Kullback–Leibler divergence (KL divergence) is. The KL divergence is a measure of how different two distributions are from one another. For distributions P and Q defined over the probability space χ, the KL divergence is defined by:
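For discrete distributions this is:

$$
D_{KL}(P \parallel Q) = \sum_{x \in \chi} P(x) \log \frac{P(x)}{Q(x)}
$$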
The more similar the two distributions are to each other, the smaller the value of the KL divergence, reaching zero when the distributions are identical.
Back to our algorithm — we try to change the low-dimensional dataset such that its joint probability distribution will be as similar as possible to the one from the original data. This is done by using gradient descent. The cost function that the gradient descent tries to minimize is the KL divergence of the joint probability distribution P from the high-dimensional space and Q from the low-dimensional space.
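Written out, the cost function that gradient descent minimizes, and its gradient with respect to each low-dimensional point yᵢ (both taken from the original paper), are:

$$
C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
$$

Intuitively, each point yᵢ is pulled toward points it should be close to (pᵢⱼ > qᵢⱼ) and pushed away from points it should not (pᵢⱼ < qᵢⱼ); the simple form of this gradient is exactly where the symmetric pᵢⱼ and the t-distribution pay off.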
From this optimization, we get the values of the points in the low-dimensional dataset and use them for our visualization. In our example, we see the clusters in the low-dimensional space as follows:
Parameters in the model
There are several parameters in this model that you can adjust to your needs. Some of them relate to the gradient descent process, of which the most important are the learning rate and the number of iterations. If you are not familiar with gradient descent, I recommend reading up on it for a better understanding.
Another parameter in t-SNE is the perplexity. It is used for choosing the standard deviation σᵢ of the Gaussian representing the conditional distribution in the high-dimensional space. I will not elaborate on the math behind it, but it can be interpreted as the number of effective neighbors of each point. The model is rather robust for perplexities between 5 and 50, but you can see some examples of how changes in perplexity affect t-SNE results in the following article.
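For reference, here is a minimal sketch of how these knobs appear in scikit-learn’s implementation (parameter names follow scikit-learn and may differ slightly between versions and libraries; X stands for your own data array):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,     # target dimension K
    perplexity=30,      # roughly the number of effective neighbors per point
    learning_rate=200,  # step size of the gradient descent
    random_state=0,     # the result depends on the random initialization
)
X_embedded = tsne.fit_transform(X)  # X: (n_samples, n_features) -> (n_samples, 2)
```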
Conclusion
That’s it! I hope this post helped you better understand the algorithm behind t-SNE and will help you use it effectively. For more details on the math of the method, I recommend looking at the original t-SNE paper. Thank you for reading :)
Translated from: https://medium.com/swlh/t-sne-explained-math-and-intuition-94599ab164cf