Does Your Data Let You Tell the Real Story?

Many of us are passionate about data analytics. Many love matplotlib and Seaborn. Many enjoy designing and working on classifiers. We are quick to grab a data set, launch Jupyter Notebook, import pandas and NumPy, and get to work. But wait a minute!

We may be great narrators, but it's important to check the facts before we get on stage. In other words, you may be an excellent data wrangler and analyst, but poor-quality data can lead you to poor-quality observations. So what is good-quality data?

Many factors measure and define good-quality data, among them accuracy, completeness, timeliness, and reliability. Some may say that a data set with no null values, missing data, or duplicate information is good-quality data. Today, I would like to draw your attention to two easily overlooked yet very important questions. How well does the data set represent your problem? Is it free of bias?

Let me explain with a quick example. You are trying to see whether men and women are equally prone to diabetes. Diabetes is often described as a lifestyle disease. Let us assume that the person who collected the data ended up reaching out to middle-aged women who do not do any form of physical exercise and have unhealthy eating habits. Say 75 out of 100 of these women were diabetic. This person also approached 50 men who work 8 hours a day on a construction site, always on their feet. 5 out of these 50 were diabetic. As analysts, if we did not inspect the data well before working with it, the result can be catastrophic. One could very easily state that 75 percent of the women were diabetic while the number was 10 percent for men, and conclude that women are more prone to diabetes than men. That conclusion does not follow, because in this sample gender cannot be separated from lifestyle.
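
The arithmetic in this example is easy to reproduce. Below is a minimal sketch in plain Python; the lifestyle labels and the helper names are shorthand introduced here for illustration and are not part of the original example. It computes the diabetes rate per gender and then checks whether any lifestyle is shared between the two genders in the sample.

```python
from collections import defaultdict

# Counts taken from the example above: (gender, lifestyle, diabetic, total).
# The lifestyle labels are illustrative shorthand, not part of a real data set.
SAMPLE = [
    ("women", "sedentary", 75, 100),
    ("men", "physically active", 5, 50),
]


def diabetes_rate_by_gender(rows):
    """Return {gender: share of diabetic people} for the given rows."""
    diabetic = defaultdict(int)
    total = defaultdict(int)
    for gender, _lifestyle, n_diabetic, n_total in rows:
        diabetic[gender] += n_diabetic
        total[gender] += n_total
    return {gender: diabetic[gender] / total[gender] for gender in total}


def lifestyles_seen_per_gender(rows):
    """Return {gender: set of lifestyles that gender appears with}."""
    seen = defaultdict(set)
    for gender, lifestyle, _n_diabetic, _n_total in rows:
        seen[gender].add(lifestyle)
    return dict(seen)


rates = diabetes_rate_by_gender(SAMPLE)
coverage = lifestyles_seen_per_gender(SAMPLE)

# If no lifestyle is shared between the genders, there is no subgroup in which
# women and men can be compared like for like.
distinct_lifestyles = set().union(*coverage.values())
no_shared_lifestyle = sum(len(s) for s in coverage.values()) == len(distinct_lifestyles)

for gender, rate in rates.items():
    print(f"{gender}: {rate:.0%} diabetic, lifestyles sampled: {sorted(coverage[gender])}")
print("No lifestyle shared between genders:", no_shared_lifestyle)
```

A check like this does not fix the sample, but it flags early that the headline rates compare two different lifestyles as much as two genders.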

While I kept the data set very simple, there are still big takeaways here. The data set should have included people from diverse backgrounds for each gender. It should have included an equal number of samples for both genders. Factors like age, income, geography, level of physical activity, food habits, and other diagnosed diseases, among others, could tell a different story. Each of these categories in isolation can tell a different tale. Depending on your problem statement, the right sample should be chosen to arrive at meaningful and sound conclusions.
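
One way to act on these takeaways is to draw the same number of records from every combination of the factors you care about. The sketch below is only an illustration under stated assumptions: the function name `balanced_sample`, the field names `gender` and `activity`, and the made-up population are all hypothetical, and equal-sized strata are just one possible sampling design.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(records, strata_keys, per_stratum, seed=0):
    """Draw the same number of records from every stratum.

    A stratum is one combination of the values of `strata_keys`,
    for example ("women", "low"). Raises ValueError if a stratum
    has fewer than `per_stratum` records, which signals under-coverage.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[key] for key in strata_keys)].append(record)

    chosen = []
    for stratum in sorted(groups):
        members = groups[stratum]
        if len(members) < per_stratum:
            raise ValueError(f"stratum {stratum} has only {len(members)} records")
        chosen.extend(rng.sample(members, per_stratum))
    return chosen


# Made-up population, for illustration only: deliberately lopsided counts.
population = (
    [{"gender": "women", "activity": "low"}] * 100
    + [{"gender": "women", "activity": "high"}] * 20
    + [{"gender": "men", "activity": "low"}] * 30
    + [{"gender": "men", "activity": "high"}] * 50
)

sample = balanced_sample(population, ["gender", "activity"], per_stratum=20)
counts = Counter((r["gender"], r["activity"]) for r in sample)
print(counts)
```

Note that this only balances strata that exist in the data. A combination that was never collected, such as physically active women in the diabetes example, cannot be sampled and has to be caught by inspecting the data first.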

Let me give another example, this time with the K-Nearest Neighbor (KNN) classification algorithm. For those of you who are not very familiar with the term, KNN classifies an object of unknown class/type into one of the categories present in the data set. The algorithm is first trained on data points (objects) with known classes/types and is then used to classify new objects. To classify a point, KNN computes the Euclidean distance from that point to the training points and picks the K closest neighbors, where K is a given value. The new object is assigned the class/type that receives the most votes among those neighbors.
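
The description above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production classifier: the function name `knn_predict` and the 2-D points are made up for the example, and ties between classes are not handled specially.

```python
import math
from collections import Counter


def knn_predict(training_points, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `training_points` is a list of ((x, y), label) pairs.
    """
    # Sort all training points by Euclidean distance to the query point.
    by_distance = sorted(
        training_points,
        key=lambda item: math.dist(item[0], query),
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _point, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D points, for illustration only.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
    ((6.5, 7.0), "B"),
]

print(knn_predict(training, (1.8, 1.4), k=3))
print(knn_predict(training, (6.2, 6.1), k=3))
```

"Training" here amounts to storing the labeled points; all of the work happens when a new point is classified.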

Figure: K-Nearest Neighbor Classifier

In the picture above, X should be classified as a Green Circle. If K=1, we get Class = Green Circle. When we set K=13, the object inevitably gets classified as a Blue Square. While in some data sets that could be the right classification, in this example it is not. There were fewer Green Circle samples, which is why they were outvoted and the object was incorrectly classified.
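
The outvoting effect can be reproduced with a small, hypothetical layout. The coordinates below are not taken from the figure; they are assumptions chosen so that 3 green circles sit close to X and 12 blue squares sit on a ring farther away. The helper `knn_votes` is likewise introduced only for this sketch.

```python
import math
from collections import Counter


def knn_votes(training_points, query, k):
    """Return the vote tally among the k training points nearest to `query`."""
    nearest = sorted(training_points, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _point, label in nearest)


# Hypothetical layout, not the coordinates of the figure:
# 3 green circles close to X, 12 blue squares on a ring of radius 3 around it.
green = [
    ((0.5, 0.0), "green circle"),
    ((-0.5, 0.3), "green circle"),
    ((0.0, -0.6), "green circle"),
]
blue = [
    ((3 * math.cos(math.radians(a)), 3 * math.sin(math.radians(a))), "blue square")
    for a in range(0, 360, 30)
]
training = green + blue
x = (0.0, 0.0)

for k in (1, 3, 13):
    votes = knn_votes(training, x, k)
    winner = votes.most_common(1)[0][0]
    print(f"K={k}: votes={dict(votes)} -> {winner}")
```

With only 3 green circles in this layout, any K of 7 or more hands the vote to the blue squares, however close X sits to the green cluster.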

In real life, the conclusions you draw, and the solutions or business decisions you propose based on them, can be make-or-break. Some decisions are highly critical, which makes drawing conclusions from well-represented data more crucial than we realize.

Disclaimer: Choosing the right K value is beyond the scope of this article.

Translated from: https://medium.com/analytics-vidhya/does-your-data-let-you-tell-the-real-story-7c4c7d656a01
