Cosine Similarity and Euclidean Distance
This is a quick, straight-to-the-point introduction to Euclidean distance and cosine similarity, with a focus on NLP.
Euclidean Distance
The Euclidean distance metric tells you how far apart two points or two vectors are.
Now suppose you are a high school student and you have three classes: a math class, a philosophy class, and a psychology class. You want to check the similarity between these classes based on the words your teachers use in class. For the sake of simplicity, let’s consider just two words: “theory” and “harmony”. You could then create a table like this to record the occurrence of these words in each class:

              Math   Philosophy   Psychology
    theory      60           20           25
    harmony     10           40           70
In this table, the word “theory” occurs 60 times in math class, 20 times in philosophy class, and 25 times in psychology class, whereas the word “harmony” occurs 10, 40, and 70 times in math, philosophy, and psychology classes respectively. Let’s plot this data on a 2D plane, with one axis per word, so that each class becomes a point.
The Euclidean distance is simply the straight-line distance between the points. In the graph, you can see clearly that d1, the distance between psychology and philosophy, is smaller than d2, the distance between philosophy and math. But how do you calculate d1 and d2?
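The generic formula for two n-dimensional vectors v and w is the following:

$$d(v, w) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}$$

In our case, for d1, d(v, w) = d(philosophy, psychology), which is:

$$d_1 = \sqrt{(20 - 25)^2 + (40 - 70)^2} = \sqrt{925} \approx 30.41$$

And d2 = d(philosophy, math):

$$d_2 = \sqrt{(20 - 60)^2 + (40 - 10)^2} = \sqrt{2500} = 50$$

As expected, d2 > d1.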
How do you do this in Python?
    import numpy as np

    # define the vectors as [theory, harmony] counts
    math = np.array([60, 10])
    philosophy = np.array([20, 40])
    psychology = np.array([25, 70])

    # calculate d1
    d1 = np.linalg.norm(philosophy - psychology)

    # calculate d2
    d2 = np.linalg.norm(philosophy - math)
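To connect the NumPy call back to the generic formula, here is a minimal sketch that spells out the sum of squared differences by hand. The helper name `euclidean_distance` and the variable name `math_class` (used so the standard-library `math` module is not shadowed) are illustrative choices, not part of the original article.

```python
import math

def euclidean_distance(v, w):
    """Square root of the summed squared coordinate differences."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((vi - wi) ** 2 for vi, wi in zip(v, w)))

# (theory, harmony) counts from the first table
math_class = (60, 10)   # illustrative name, avoids shadowing the math module
philosophy = (20, 40)
psychology = (25, 70)

d1 = euclidean_distance(philosophy, psychology)
d2 = euclidean_distance(philosophy, math_class)
print(round(d1, 2), round(d2, 2))  # 30.41 50.0
```

On Python 3.8 or later, the standard library's `math.dist` should return the same values for these inputs.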
Cosine Similarity
Suppose you only have 2 hours of psychology class per week and 5 hours each of math class and philosophy class. Because you spend more time in these two classes, the words “theory” and “harmony” will occur more often in them than in psychology class. Thus the updated table:

              Math   Philosophy   Psychology
    theory      80           50           15
    harmony     45           60           20
And the updated 2D graph:
Using the formula we gave earlier for Euclidean distance, we find that, in this case, d1 is greater than d2. But we know psychology is closer to philosophy than it is to math. The difference in class hours tricks the Euclidean distance metric: the psychology vector is shorter simply because fewer words were counted, not because the class uses the words differently. Cosine similarity is here to solve this problem.
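The reversal can be checked numerically. The sketch below reuses the `np.linalg.norm` approach from the earlier code; the updated counts are taken from the article's cosine-similarity code, and the order [theory, harmony] is assumed to match the first table. The name `math_class` is an illustrative rename.

```python
import numpy as np

# updated [theory, harmony] counts (assumed order, see lead-in)
math_class = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

d1 = np.linalg.norm(philosophy - psychology)   # philosophy vs. psychology
d2 = np.linalg.norm(philosophy - math_class)   # philosophy vs. math
print(f"d1 = {d1:.2f}, d2 = {d2:.2f}")  # d1 = 53.15, d2 = 33.54
```

With these counts, Euclidean distance ranks math as the closer neighbor of philosophy, which is the opposite of the first example.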
Instead of calculating the straight line distance between the points, cosine similarity cares about the angle between the vectors.
Zooming in on the graph, we can see that the angle α between philosophy and psychology is smaller than the angle β between philosophy and math. That’s all cosine similarity wants to know. In other words, the smaller the angle, the closer the vectors are to each other.
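The generic formula goes as follows:

$$\cos(\beta) = \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert}$$

β is the angle between the vectors philosophy (represented by v) and math (represented by w). Plugging in the updated counts gives:

$$\cos(\beta) = \frac{50 \cdot 80 + 60 \cdot 45}{\sqrt{50^2 + 60^2}\,\sqrt{80^2 + 45^2}} = \frac{6700}{\sqrt{6100}\,\sqrt{8425}} \approx 0.935$$

Whereas cos(α) ≈ 0.999 for philosophy and psychology, which is higher than cos(β), meaning philosophy is closer to psychology than it is to math.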
Recall that cos(0°) = 1 and cos(90°) = 0.
This implies that the smaller the angle, the greater your cosine similarity will be and the greater your cosine similarity, the more similar your vectors are.
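As a quick illustration of that relationship, the sketch below prints the cosine for a few angles between 0° and 90°, using only the standard library. The chosen angles are arbitrary examples.

```python
import math

angles = (0, 30, 60, 90)  # degrees, arbitrary sample points
cosines = [math.cos(math.radians(a)) for a in angles]

for a, c in zip(angles, cosines):
    print(f"{a:>2} degrees -> cos = {c:.3f}")
```

The printed values fall steadily from 1 toward 0 as the angle grows, which is why a larger cosine means more similar vectors.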
Python implementation
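    import numpy as np

    # word counts per class: [theory, harmony]
    math = np.array([80, 45])
    philosophy = np.array([50, 60])
    psychology = np.array([15, 20])

    # cosine of beta, the angle between philosophy and math
    cos_beta = np.dot(philosophy, math) / (np.linalg.norm(philosophy) * np.linalg.norm(math))
    print(cos_beta)

    # cosine of alpha, the angle between philosophy and psychology
    cos_alpha = np.dot(philosophy, psychology) / (np.linalg.norm(philosophy) * np.linalg.norm(psychology))
    print(cos_alpha)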
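If you want the angles themselves rather than their cosines, one option is to invert the cosine with `np.arccos`. This is a sketch under the same assumed vectors; the helper `angle_degrees` and the name `math_class` are illustrative, and the `np.clip` call is a defensive addition against floating-point values that land slightly outside [-1, 1].

```python
import numpy as np

def angle_degrees(v, w):
    """Angle between two vectors, in degrees."""
    cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    # clip guards against tiny floating-point overshoot outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

math_class = np.array([80, 45])
philosophy = np.array([50, 60])
psychology = np.array([15, 20])

alpha = angle_degrees(philosophy, psychology)
beta = angle_degrees(philosophy, math_class)
print(f"alpha = {alpha:.1f} degrees, beta = {beta:.1f} degrees")
# alpha = 2.9 degrees, beta = 20.8 degrees
```

The small α confirms what the cosine values already say: philosophy and psychology point in almost the same direction.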
Takeaway
By now you should know how Euclidean distance and cosine similarity work. The former considers the straight-line distance between two points, whereas the latter cares about the angle between the two vectors in question.
Euclidean distance is more straightforward and works well when your feature distribution is balanced, for example when the vectors being compared have similar overall magnitudes. But most of the time, we deal with unbalanced data. In such cases, it’s better to use cosine similarity.
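One way to see why is to scale a vector and compare both metrics. In the sketch below, `short_doc` and `long_doc` are hypothetical word-count vectors with identical proportions but different totals, and `cosine_similarity` is an illustrative helper built on the same NumPy calls used above.

```python
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# hypothetical counts: same word proportions, ten times the volume
short_doc = np.array([2.0, 4.0])
long_doc = 10 * short_doc

distance = np.linalg.norm(long_doc - short_doc)
similarity = cosine_similarity(short_doc, long_doc)

print(f"Euclidean distance: {distance:.2f}")   # grows with the size gap
print(f"Cosine similarity:  {similarity:.4f}")  # stays at 1 up to rounding
```

Euclidean distance reports the two vectors as far apart, while cosine similarity treats them as pointing in the same direction.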
Translated from: https://medium.com/@josmyfaure/euclidean-distance-and-cosine-similarity-which-one-to-use-and-when-28c97a18fe68