How to Build KNN from Scratch in Python

k-Nearest Neighbors (KNN)

k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for either regression or classification tasks. KNN is non-parametric, which means that the algorithm does not make assumptions about the underlying distributions of the data. This is in contrast to a technique like linear regression, which is parametric, and requires us to find a function that describes the relationship between dependent and independent variables.

KNN has the advantage of being quite intuitive to understand. When used for classification, a query point (or test point) is classified based on the k labeled training points that are closest to that query point.

For a simplified example, see the figure below. The left panel shows a 2-d plot of sixteen data points — eight are labeled as green, and eight are labeled as purple. Now, the right panel shows how we would classify a new point (the black cross), using KNN when k=3. We find the three closest points, and count up how many ‘votes’ each color has within those three points. In this case, two of the three points are purple — so, the black cross will be labeled as purple.

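To make the voting mechanics concrete, here is a minimal sketch of the figure's logic. The cluster coordinates and query point below are made up for illustration — they are not the actual points from the figure:

import numpy as np
from collections import Counter

# Two illustrative clusters of eight labeled points each
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([2, 2], 0.8, (8, 2)),   # 'green' cluster
                    rng.normal([5, 5], 0.8, (8, 2))])  # 'purple' cluster
labels = ['green'] * 8 + ['purple'] * 8

query = np.array([4.0, 4.2])  # the "black cross"

# Euclidean distance from the query to every labeled point
dists = np.linalg.norm(points - query, axis=1)

# The three nearest neighbors vote; the majority label wins
nearest = np.argsort(dists)[:3]
print(Counter(labels[i] for i in nearest).most_common(1)[0][0])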

[Figure: 2-d classification using KNN when k=3]

Calculating Distance

The distance between points is determined by using one of several versions of the Minkowski distance equation. The generalized formula for Minkowski distance can be represented as follows:

D(X, Y) = ( Σ_{i=1}^{n} |x_i - y_i|^p )^{1/p}

where X and Y are data points, n is the number of dimensions, and p is the Minkowski power parameter. When p=1, the distance is known as the Manhattan (or Taxicab) distance, and when p=2 the distance is known as the Euclidean distance. In two dimensions, the Manhattan and Euclidean distances between two points are easy to visualize (see the graph below), but at higher values of p, the Minkowski distance becomes more abstract.

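As a quick numeric check of the formula (using a throwaway helper just for this example — the post builds its own minkowski_distance function below):

# Two 2-d points whose Manhattan and Euclidean distances are easy to verify
a, b = (1.0, 2.0), (4.0, 6.0)

def minkowski(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

print(minkowski(a, b, p=1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3**2 + 4**2) = 5.0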

[Figure: Manhattan and Euclidean distances in 2-d]

KNN in Python

To implement my own version of the KNN classifier in Python, I’ll first want to import a few common libraries to help out.

# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading Data

To test the KNN classifier, I’m going to use the iris data set from sklearn.datasets. The data set has measurements (Sepal Length, Sepal Width, Petal Length, Petal Width) for 150 iris plants, split evenly among three species (0 = setosa, 1 = versicolor, and 2 = virginica). Below, I load the data and store it in a dataframe.

# Load iris data and store in dataframe
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

[Output: the first five rows of the iris dataframe]

I’ll also separate the data into features (X) and the target variable (y), which is the species label for each plant.

# Separate X and y data
X = df.drop('target', axis=1)
y = df.target

Building out the KNN Framework

Creating a functioning KNN classifier can be broken down into several steps. While KNN includes a bit more nuance than this, here’s my bare-bones to-do list:

  1. Define a function to calculate the distance between two points
  2. Use the distance function to get the distance between a test point and all known data points
  3. Sort distance measurements to find the points closest to the test point (i.e., find the nearest neighbors)
  4. Use majority class labels of those closest points to predict the label of the test point
  5. Repeat steps 2 through 4 until all test data points are classified

1. Define a function to calculate the distance between two points

First, I define a function called minkowski_distance, that takes an input of two data points (a & b) and a Minkowski power parameter p, and returns the distance between the two points. Note that this function calculates distance exactly like the Minkowski formula I mentioned earlier. By making p an adjustable parameter, I can decide whether I want to calculate Manhattan distance (p=1), Euclidean distance (p=2), or some higher order of the Minkowski distance.

# Calculate distance between two points
def minkowski_distance(a, b, p=1):
    # Store the number of dimensions
    dim = len(a)

    # Set initial distance to 0
    distance = 0

    # Calculate Minkowski distance using parameter p
    for d in range(dim):
        distance += abs(a[d] - b[d])**p
    distance = distance**(1/p)

    return distance

# Test the function
minkowski_distance(a=X.iloc[0], b=X.iloc[1], p=1)
0.6999999999999993
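As an optional sanity check (assuming scipy is available, which it is wherever scikit-learn is installed), the result can be compared against scipy's own Minkowski implementation:

# Cross-check against scipy.spatial.distance.minkowski
from scipy.spatial.distance import minkowski as scipy_minkowski

print(scipy_minkowski(X.iloc[0], X.iloc[1], p=1))  # expect the same ~0.7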

2. Use the distance function to get the distance between a test point and all known data points

For step 2, I simply repeat the minkowski_distance calculation for all labeled points in X and store them in a dataframe.

# Define an arbitrary test point
test_pt = [4.8, 2.7, 2.5, 0.7]

# Calculate distance between test_pt and all points in X
distances = []

for i in X.index:
    distances.append(minkowski_distance(test_pt, X.iloc[i]))

df_dists = pd.DataFrame(data=distances, index=X.index, columns=['dist'])
df_dists.head()

[Output: the first five rows of df_dists]

3. Sort distance measurements to find the points closest to the test point

In step 3, I use the pandas .sort_values() method to sort by distance, and return only the top 5 results.

# Find the 5 nearest neighbors
df_nn = df_dists.sort_values(by=['dist'], axis=0)[:5]
df_nn

[Output: the five nearest neighbors and their distances]

4. Use majority class labels of those closest points to predict the label of the test point

For this step, I use collections.Counter to keep track of the labels that coincide with the nearest neighbor points. I then use the .most_common() method to return the most commonly occurring label. Note: if two or more labels are tied for most common, the one that was first encountered by the Counter() object is the one that gets returned.

from collections import Counter

# Create counter object to track the labels
counter = Counter(y[df_nn.index])

# Get most common label of all the nearest neighbors
counter.most_common()[0][0]
1
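As an aside, here is a quick demonstration of the tie-breaking behavior noted above:

from collections import Counter

# Two labels tied at two votes each; most_common() keeps first-seen order
votes = Counter(['versicolor', 'virginica', 'virginica', 'versicolor'])
print(votes.most_common())        # [('versicolor', 2), ('virginica', 2)]
print(votes.most_common()[0][0])  # 'versicolor' -- encountered first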

5. Repeat steps 2 through 4 until all test data points are classified

In this step, I put the code I’ve already written to work and write a function to classify the data using KNN. First, I perform a train_test_split on the data (75% train, 25% test), and then scale the data using StandardScaler(). Since KNN is distance-based, it is important to make sure that the features are scaled properly before feeding them into the algorithm.

Additionally, to avoid data leakage, it is good practice to scale the features after the train_test_split has been performed. First, fit the scaler on the training set only (scaler.fit_transform(X_train)), and then use that information to scale the test set (scaler.transform(X_test)). This way, I can ensure that no information outside of the training data is used to create the model.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data - 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Scale the X data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Next, I define a function called knn_predict that takes in all of the training and test data, k, and p, and returns the predictions my KNN classifier makes for the test set (y_hat_test). This function doesn’t really include anything new — it is simply applying what I’ve already worked through above. The function should return a list of label predictions containing only 0’s, 1’s and 2’s.

def knn_predict(X_train, X_test, y_train, y_test, k, p):
    # Counter to help with label voting
    from collections import Counter

    # Make predictions on the test data
    # Need output of 1 prediction per test data point
    y_hat_test = []

    for test_point in X_test:
        distances = []

        for train_point in X_train:
            distance = minkowski_distance(test_point, train_point, p=p)
            distances.append(distance)

        # Store distances in a dataframe
        df_dists = pd.DataFrame(data=distances, columns=['dist'], index=y_train.index)

        # Sort distances, and only consider the k closest points
        df_nn = df_dists.sort_values(by=['dist'], axis=0)[:k]

        # Create counter object to track the labels of the k closest neighbors
        counter = Counter(y_train[df_nn.index])

        # Get most common label of all the nearest neighbors
        prediction = counter.most_common()[0][0]

        # Append prediction to output list
        y_hat_test.append(prediction)

    return y_hat_test

# Make predictions on test dataset
y_hat_test = knn_predict(X_train, X_test, y_train, y_test, k=5, p=1)
print(y_hat_test)
[0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0]

And there they are! These are the predictions that this home-brewed KNN classifier has made on the test set. Let’s see how well it worked:

# Get test accuracy score
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_hat_test))
0.9736842105263158

Looks like the classifier achieved 97% accuracy on the test set. Not too bad at all! But how do I know if it actually worked correctly? Let’s check the result of sklearn’s KNeighborsClassifier on the same data:

# Testing to see results from sklearn.neighbors.KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, p=1)
clf.fit(X_train, y_train)
y_pred_test = clf.predict(X_test)

print(f"Sklearn KNN Accuracy: {accuracy_score(y_test, y_pred_test)}")
Sklearn KNN Accuracy: 0.9736842105263158

Nice! sklearn’s implementation of the KNN classifier gives us the exact same accuracy score.

Exploring the effect of varying k

My KNN classifier performed quite well with the selected value of k = 5. KNN doesn't have as many tunable parameters as other algorithms like Decision Trees or Random Forests, but k happens to be one of them. Let's see how the classification accuracy changes when I vary k:

# Obtain accuracy score varying k from 1 to 99
accuracies = []

for k in range(1, 100):
    y_hat_test = knn_predict(X_train, X_test, y_train, y_test, k, p=1)
    accuracies.append(accuracy_score(y_test, y_hat_test))

# Plot the results
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(range(1, 100), accuracies)
ax.set_xlabel('# of Nearest Neighbors (k)')
ax.set_ylabel('Accuracy (%)');

[Figure: test accuracy as a function of k]

In this case, using nearly any k value less than 20 results in great (>95%) classification accuracy on the test set. However, when k becomes greater than about 60, accuracy really starts to drop off. This makes sense, because the data set only has 150 observations — when k is that high, the classifier is probably considering labeled training data points that are way too far from the test points.

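As a small follow-up (this assumes the accuracies list from the loop above is still in scope), the best-performing value of k can be read off directly:

# Find the k with the highest test accuracy; +1 because k started at 1
best_k = int(np.argmax(accuracies)) + 1
print(f"Best k: {best_k}, accuracy: {max(accuracies):.4f}")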

Every neighbor gets a vote — or do they?

In writing my own KNN classifier, I chose to overlook one clear hyperparameter tuning opportunity: the weight that each of the k nearest points has in classifying a point. In sklearn’s KNeighborsClassifier, this is the weights parameter, and it can be set to ‘uniform’, ‘distance’, or another user-defined function.

When set to ‘uniform’, each of the k nearest neighbors gets an equal vote in labeling a new point. When set to ‘distance’, the neighbors closest to the new point are weighted more heavily than the neighbors farther away. There are certainly cases where weighting by ‘distance’ would produce better results, and the only way to find out is through hyperparameter tuning.

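To illustrate, here is a minimal sketch of inverse-distance voting as it could slot into knn_predict in place of the plain Counter vote. The helper name and the 1/d weighting scheme are my own choices; sklearn's ‘distance’ option behaves along these lines but is not guaranteed to match this sketch exactly:

from collections import defaultdict

def weighted_vote(labels, dists, eps=1e-9):
    # Each neighbor's vote counts 1/distance; eps guards against a zero distance
    scores = defaultdict(float)
    for label, dist in zip(labels, dists):
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# One very close 'setosa' neighbor outvotes two distant 'virginica' neighbors
print(weighted_vote(['setosa', 'virginica', 'virginica'], [0.1, 2.0, 2.5]))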

Final Thoughts

Now, make no mistake — sklearn’s implementation is undoubtedly more efficient and more user-friendly than what I’ve cobbled together here. However, I found it a valuable exercise to work through KNN from ‘scratch’, and it has only solidified my understanding of the algorithm. I hope it did the same for you!

Translated from: https://towardsdatascience.com/how-to-build-knn-from-scratch-in-python-5e22b8920bd2


1.Tomcat 默认的post参数的最大大小为2M, 当超过时将会出错,可以配置maxPostSize参数来改变大小。 从 apache-tomcat-7.0.63 开始,参数 maxPostSize 的含义就变了: 如果将值设置为 0,表示 POST 最大值为 0,…