Exploratory Data Analysis of the IRIS Dataset

Let's explore one of the simplest datasets, the IRIS dataset. It describes three species of iris flower by four measurements: sepal length, sepal width, petal length, and petal width. The dataset consists of 50 samples from each of the three species (Iris setosa, Iris virginica, and Iris versicolor), with every measurement recorded in centimeters. Our objective is to classify a new flower as belonging to one of the 3 classes given the 4 features.

Download the IRIS data from here.

Here I import the libraries that are useful for exploratory data analysis (pandas, matplotlib, numpy, and seaborn) in an IPython notebook launched from Anaconda Navigator (download: https://www.anaconda.com/products/individual).

Figure: Exploring the data

IRIS is a balanced dataset because every class (Setosa, Virginica, and Versicolor) has the same number of data points, 50. If the classes had different numbers of data points, it would be an imbalanced dataset.
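
The exploration steps described here (loading the data, looking at its shape and columns, counting points per class) can be sketched roughly as follows. The rows are a handful of hand-typed stand-ins rather than the full file, and the column name `species` and the file name `iris.csv` are assumptions, not taken from the original notebook.

```python
import pandas as pd

# Hand-typed stand-in for the downloaded file; with the real data this
# would be something like pd.read_csv("iris.csv") (assumed file name).
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.DataFrame(rows, columns=columns)

print(iris.shape)             # (number of rows, number of columns)
print(iris.columns.tolist())  # the four features plus the class label

# Count the data points per class; equal counts mean a balanced dataset.
counts = iris["species"].value_counts()
print(counts)
is_balanced = counts.nunique() == 1
```

On the full dataset the same `value_counts()` call would be expected to report 50 rows for each of the three species.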

2D Scatter Plot

Using the pandas object we created earlier, we can draw a simple 2D graph by passing the features we want as the x and y parameters of the pandas plot() method. The Matplotlib method show() then displays the figure.
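
A minimal sketch of such a plot, assuming a DataFrame named `iris` with `sepal_length` and `sepal_width` columns (the six rows are hand-typed for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few hand-typed rows standing in for the full dataset.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# pandas delegates the drawing to Matplotlib and returns the Axes object.
ax = iris.plot(kind="scatter", x="sepal_length", y="sepal_width")
ax.set_title("sepal_length vs sepal_width")

# plt.show()  # uncomment when running as a script; notebooks render inline
```

Every point has the same color here, so the plot says nothing yet about which species a point belongs to.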

Figure: 2D Scatter Plot

With Seaborn we can draw a more informative graph by color-coding the points by flower type.
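
One way to get the color-coding is Seaborn's `scatterplot()` with the `hue` parameter. This is a sketch on hand-typed rows, and the original notebook may well have used a different Seaborn call; the `species` column name is an assumption.

```python
import pandas as pd
import seaborn as sns

# A few hand-typed rows standing in for the full dataset.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# hue assigns one color per species and adds a legend.
ax = sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")
```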

Figure: 2D Scatter Plot using Seaborn

In the graph above, notice that the blue Setosa points can be separated from the orange Versicolor and green Virginica points by simply drawing a line, but the orange and green points are much harder to separate because they overlap. That is as much as the sepal_length and sepal_width features can tell us.

2D Scatter Plot: Pair Plot

Seaborn's pair plot draws a 2D scatter plot for every possible pair of features in one go.

Figure: Pair Plot by Seaborn

If we observe the pair plots, we can say that petal_length and petal_width are the most useful features for identifying the flower types. While Setosa is easily linearly separable, Virginica and Versicolor have some overlap. So we can separate the classes with a line and some "if-else" conditions.

1D Scatter Plot, Histogram, PDF & CDF

Figure: 1D Scatter Plot of petal_length

As we can observe in the graph, it is very hard to make sense of the data because the points overlap heavily. There are better ways to visualize a 1D scatter plot. With Seaborn, we can plot a Probability Density Function (PDF) together with a histogram.

Histogram: A histogram shows the frequency count of each data window (bin) of the feature being plotted (the bars in the graph).

PDF: A Probability Density Function is basically a smoothed histogram (the bell-shaped curve in the graph). It is estimated using Kernel Density Estimation. For each value on the x-axis, the y-axis value is the density of the data around that value, not a probability in itself; the probability of falling within a range is the area under the curve over that range. The higher the y value, the more data points lie near that value in the dataset.
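
To make the two definitions concrete, here is a small NumPy-only sketch. It is a hand-rolled Gaussian kernel density estimate with an assumed bandwidth of 0.1, not Seaborn's own implementation, and the ten petal lengths are hand-typed illustrative values.

```python
import numpy as np

# Ten illustrative petal_length values (cm), typed in by hand.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# Histogram: count how many values fall into each bin.
counts, edges = np.histogram(petal_length, bins=4)


def gaussian_kde_pdf(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample (a basic KDE)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


# Evaluate the smoothed curve on a fine grid (bandwidth is an assumed value).
grid = np.linspace(0.5, 2.5, 401)
pdf = gaussian_kde_pdf(petal_length, grid, bandwidth=0.1)

# The curve is a density: its values can exceed 1, but its area is about 1.
area = float(np.sum(pdf) * (grid[1] - grid[0]))
print(counts, round(area, 3))
```

The bandwidth controls how much smoothing is applied; a larger value gives a flatter, wider curve over the same histogram.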

Figure: PDF and histogram of petal_length
Figure: PDF and histogram of petal_width
Figure: PDF and histogram of sepal_length
Figure: PDF and histogram of sepal_width

From these graphs, we can observe that a simple model can be built from just one feature with an if..else condition: if petal_length < 2.5, then the flower type is Setosa.
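
Written out as code, the rule is a single comparison. The function name and the "not setosa" label are illustrative choices; the 2.5 cm threshold is the one read off the plots.

```python
def classify_by_petal_length(petal_length_cm):
    """One-feature rule: short petals mean Setosa."""
    if petal_length_cm < 2.5:
        return "setosa"
    # The rule says nothing about which of the other two species this is.
    return "not setosa"


print(classify_by_petal_length(1.4))
print(classify_by_petal_length(4.7))
```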

Now, what if we need the percentage of Versicolor points that have a petal_length of less than 5? Here the CDF comes to our rescue!

CDF: The Cumulative Distribution Function is the cumulative sum of the PDF. Every point on the CDF curve is the integral of the PDF up to that point, so it tells us what percentage of all points lie at or below that value.

To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size (for more information: https://www.datacamp.com/community/tutorials/histograms-matplotlib).
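
The binning and the cumulative sum can be sketched with NumPy alone. The ten values below are hand-typed illustrative petal lengths, not the full Versicolor class, so the resulting percentage only demonstrates the mechanics.

```python
import numpy as np

# Ten illustrative Versicolor petal_length values (cm), typed in by hand.
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 5.0, 5.1])

# Bin the range into 10 equal intervals and count the values in each.
counts, bin_edges = np.histogram(petal_length, bins=10)

# Fraction of points per bin, then the running total of those fractions.
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)

for upper_edge, fraction in zip(bin_edges[1:], cdf):
    print(f"petal_length <= {upper_edge:.2f}: {fraction:.0%}")

# The question from the text, answered directly from the raw values.
share_below_5 = np.mean(petal_length < 5.0)
print(share_below_5)
```

Reading the answer off the CDF curve and computing it directly from the raw values agree up to the width of a bin.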

By plotting the CDFs of petal_length for all the flower types together, we can get an overall picture of the data.

Mean, Variance and Standard Deviation

Mean: https://en.wikipedia.org/wiki/Mean

Variance: https://en.wikipedia.org/wiki/Variance

Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation
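
A short NumPy sketch of the three quantities on five hand-typed petal lengths. Note that `np.var` and `np.std` compute the population versions by default (`ddof=0`); whether the original notebook used these exact calls is an assumption.

```python
import numpy as np

# Five illustrative petal_length values (cm), typed in by hand.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4])

mean = np.mean(petal_length)     # average value
variance = np.var(petal_length)  # mean squared distance from the mean
std = np.std(petal_length)       # square root of the variance, in cm

print(mean, variance, std)
```

The mean is pulled around by extreme values, so a single mistyped measurement can shift it and the variance noticeably.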

Median, Percentile, Quantile, MAD, IQR

Median: https://en.wikipedia.org/wiki/Median

Percentile: https://en.wikipedia.org/wiki/Percentile

Quantile: https://en.wikipedia.org/wiki/Quantile

MAD: Median Absolute Deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation

IQR: Interquartile Range: https://en.wikipedia.org/wiki/Interquartile_range
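
These robust statistics can all be computed with NumPy, as in the sketch below on ten hand-typed petal lengths. The MAD here is the raw median of absolute deviations, without the scaling constant that some libraries apply.

```python
import numpy as np

# Ten illustrative petal_length values (cm), typed in by hand.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

median = np.median(petal_length)

# Percentiles use a 0-100 scale, quantiles a 0-1 scale; they are the same idea.
p25, p50, p75 = np.percentile(petal_length, [25, 50, 75])
q50 = np.quantile(petal_length, 0.5)

# Median Absolute Deviation: median distance from the median (unscaled).
mad = np.median(np.abs(petal_length - median))

# Interquartile range: width of the middle 50% of the data.
iqr = p75 - p25

print(median, p25, p75, mad, iqr)
```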

Box Plots

A box plot with whiskers is another, more intuitive way to visualize a 1D scatter plot. Each box in the graph represents the interquartile range: the bottom line of the box is the 25th percentile, the middle line is the 50th percentile, and the top line is the 75th percentile. The black lines outside the boxes are called whiskers. What the whiskers represent is not fixed: in some cases they mark the minimum and maximum values of the feature, while Seaborn by default extends them to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond as individual points.
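
A minimal sketch with `sns.boxplot()`, one box per species. The fifteen petal lengths are hand-typed illustrative values and the `species` column name is an assumption.

```python
import pandas as pd
import seaborn as sns

# Five illustrative petal_length values (cm) per species, typed in by hand.
iris = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 5 + ["virginica"] * 5,
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.7,
                     4.7, 4.5, 4.9, 4.0, 4.6,
                     6.0, 5.1, 5.9, 5.6, 5.8],
})

# One box per species: quartiles as the box, whiskers beyond it.
ax = sns.boxplot(data=iris, x="species", y="petal_length")
```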

Violin Plots

Seaborn's violin plot combines the PDF and the box plot. In a violin plot of petal_length, the PDF forms the sides of each of the three colored shapes, and the box plot is drawn in black at the center.
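
The corresponding call is `sns.violinplot()`. This sketch uses fifteen hand-typed illustrative values, and the `species` column name is an assumption.

```python
import pandas as pd
import seaborn as sns

# Five illustrative petal_length values (cm) per species, typed in by hand.
iris = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 5 + ["virginica"] * 5,
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.7,
                     4.7, 4.5, 4.9, 4.0, 4.6,
                     6.0, 5.1, 5.9, 5.6, 5.8],
})

# Each violin is a mirrored density curve with a small box plot inside.
ax = sns.violinplot(data=iris, x="species", y="petal_length")
```

A wider section of a violin means more data points near that petal_length value.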

Multivariate Probability Density: Contour Plot

Seaborn provides the jointplot() method for contour plots. It is called a "joint plot" because it shows the contours together with the PDFs of the individual features along the edges. The darker a region, the higher the probability density of the corresponding feature values.

Translated from: https://medium.com/swlh/exploratory-data-analysis-of-iris-dataset-2ab58e1a5dc6
