Machine Learning: Observe How KNN Works by Predicting the Varieties of Italian Wines

Introduction

In this article, I’d like to introduce you to KNN with a practical example.

I will use one of my projects, which you can find on my GitHub profile. For this project, I used a dataset from Kaggle.

The dataset is the result of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars, organized into three classes. The analysis considered the quantities of 13 constituents found in each of the three types of wine.

This article is structured in three parts. In the first part, I give a theoretical description of KNN; then I focus on the exploratory data analysis, to show you the insights I found; finally, I show you the code I used to prepare and evaluate the machine learning model.

Part I: What is KNN and how does it work mathematically?

The k-nearest neighbours algorithm is not a complex algorithm. To predict and classify a new data point, KNN looks through the training data and finds the k training points that are closest to the new point. It then assigns the new point the class label most common among those k neighbours (with k = 1, simply the label of the nearest training point).

But how does KNN work? To answer this question, we have to refer to the formula for the Euclidean distance between two points. Suppose you have to compute the distance between two points A(5, 7) and B(1, 4) in a Cartesian plane. The formula you apply is very simple:

d(A, B) = √((x_A − x_B)² + (y_A − y_B)²) = √((5 − 1)² + (7 − 4)²) = √25 = 5
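As a quick sketch, the distance formula above can be written in a few lines of Python, using the A and B from the example:

```python
import math

# Euclidean distance between two points given as coordinate tuples
def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# the A(5, 7) and B(1, 4) from the example above
print(euclidean_distance((5, 7), (1, 4)))  # 5.0
```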

Okay, but how can we apply that in machine learning? Imagine you are a bookseller and you want to classify a new book, Ubick by Philip K. Dick, which has 240 pages and costs 14 euros. As you can see below, there are 5 possible classes in which to put our new book.

[Figure: table of books with their page counts, prices and classes (image by author)]

To find the best class for Ubick, we can use the Euclidean formula to compute the distance from the new book to each observation in the dataset.

Formula:

[Figure: the Euclidean distance formula applied to the book features (image by author)]

output:

[Figure: computed distances from Ubick to each book in the dataset (image by author)]

As you can see above, the nearest class for Ubick is class C.
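The whole bookseller example can be condensed into a short 1-NN sketch. Note that the (pages, price) values for the five classes below are invented for illustration; the article's real values are in the table above:

```python
import math

# Invented (pages, price in euros) prototypes for the five classes;
# the author's actual table holds the real values.
books = {
    "A": (320, 22),
    "B": (150, 8),
    "C": (250, 15),
    "D": (500, 35),
    "E": (90, 5),
}
ubick = (240, 14)  # the new book: 240 pages, 14 euros

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 1-NN: pick the class whose prototype is closest to the new book
nearest = min(books, key=lambda label: euclidean(books[label], ubick))
print(nearest)  # 'C' with these invented numbers
```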

Part II: insights that I found to create the model

Before speaking about the algorithm I used to create the model and predict the wine varieties, let me briefly show you the main insights I found.

The following heatmap shows the correlations between the different features. It is very useful for a first look at our dataset and for judging whether a classification algorithm can be applied.

[Figure: correlation heatmap of the 13 features (image by author)]

The heatmap is great for a first look, but it is not enough. I’d also like to know whether there are features whose total absolute correlation with the others is low, so that I can drop them before training the machine learning model. So I built the histogram you can see below.

You can see that there are three features with low total absolute correlation: ash, magnesium and color_intensity.

[Figure: histogram of the total absolute correlation of each feature (image by author)]
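A minimal sketch of how such a "total absolute correlation" figure could be computed with pandas. Since the Kaggle CSV is not reproduced here, the sketch uses random stand-in columns; with the real DataFrame you would use all 13 constituents:

```python
import numpy as np
import pandas as pd

# Stand-in for the wine DataFrame; the real one holds the 13 constituents.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["ash", "magnesium", "flavanoids", "proline"])

# Sum of absolute correlations of each feature with all the others
# (subtract 1.0 to drop each feature's correlation with itself).
total_abs_corr = df.corr().abs().sum() - 1.0
print(total_abs_corr.sort_values())
# total_abs_corr.plot(kind="bar")  # would draw a bar chart like the one above
```

Features with the smallest totals are the candidates to drop before training.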

Thanks to these observations, we can now be confident that a KNN algorithm can be applied to create a predictive model.

Part III: use scikit-learn to make predictions

In this part, we will see how to prepare and evaluate the model with scikit-learn.

Below you can see that I split the data into two parts: 80% for training and 20% for testing. I chose this proportion because the dataset is not big.

# split data into train and test sets
y = df['class']
X = input_data.drop(columns=['ash', 'magnesium', 'color_intensity'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# check that the data was split correctly (80% for train data and 20% for test data)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

out:

X_train shape: (141, 10)
y_train shape: (141,)
X_test shape: (36, 10)
y_test shape: (36,)

You have to know that all machine learning models in scikit-learn are implemented in their own classes. For example, the k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class.

The first step is to instantiate the class into an object, which I called cli, as you can see below. The object contains the algorithm I will use to build the model from the training data and make predictions on new data points. It also contains the information that the algorithm extracts from the training data.

Finally, to build the model on the training set, we call the fit method of the cli object.

from sklearn.neighbors import KNeighborsClassifier

cli = KNeighborsClassifier(n_neighbors=1)
cli.fit(X_train, y_train)

out:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

In the output of the fit method, you can see the parameters used in creating the model.

Now it is time to evaluate the model. Below, the first output shows us that the model correctly predicts 89% of the test data. The second output gives us a complete overview of the precision, recall and f1-score for each class.

y_pred = cli.predict(X_test)
print("Test set score: {:.2f}".format(cli.score(X_test, y_test)))

# below, the per-class metrics of the model
from sklearn.metrics import classification_report
print("Final result of the model \n {}".format(classification_report(y_test, y_pred)))

out:

Test set score: 0.89

out:

[Figure: classification report with per-class precision, recall and f1-score (image by author)]
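For readers who want to reproduce the pipeline without the Kaggle CSV, scikit-learn bundles the same UCI wine dataset, so the whole flow above can be sketched end to end. Here all 13 features are kept (nothing is dropped), so the exact score may differ from the 0.89 above:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# scikit-learn ships the same UCI wine data used in the article
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

cli = KNeighborsClassifier(n_neighbors=1)
cli.fit(X_train, y_train)

print("Test set score: {:.2f}".format(cli.score(X_test, y_test)))
print(classification_report(y_test, cli.predict(X_test)))
```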

Conclusion

I think that the best way to learn something is by practising. So, in my case, I downloaded the dataset from Kaggle, which is one of the best places to find good datasets on which you can apply your machine learning algorithms and learn how they work.

Thanks for reading this. There are some other ways you can keep in touch with me and follow my work:

  • Subscribe to my newsletter.

  • You can also get in touch via my Telegram group, Data Science for Beginners.

Translated from: https://towardsdatascience.com/machine-learning-observe-how-knn-works-by-predicting-the-varieties-of-italian-wines-a64960bb2dae
