

Many are passionate about Data Analytics. Many love matplotlib and Seaborn. Many enjoy designing and working on Classifiers. We are quick to grab a data set and launch Jupyter Notebook, import pandas and NumPy and get to work. But wait a minute!

We may be great narrators, but it's important to check the facts before we get on stage. In other words, you may be an excellent data wrangler and analyst, but poor-quality data can lead you to poor-quality observations. So, what is Good Quality Data?

Many factors measure and define Good Quality Data, among them Accuracy, Completeness, Timeliness, and Reliability, to name a few. Some may say a data set with no null values, missing data, or duplicate information is Good Quality Data. Today, I would like to draw your attention to two easily overlooked yet very important questions. How well does the data set represent your problem? Is it free of bias?

Let me explain with a quick example. You are trying to see whether both genders are equally prone to Diabetes. Diabetes is often described as a lifestyle disease. Let us assume that the person who collected the data ended up reaching out to 100 middle-aged women who do not do any form of physical exercise and have unhealthy eating habits. Say 75 of these 100 women were Diabetic. This person also approached 50 men who work 8 hours a day on a construction site, always on their feet. 5 of these 50 men were Diabetic. As analysts, if we did not inspect the data well before working with it, the result could be catastrophic. One could very easily state that 75 percent of the women were Diabetic while the number was 10 percent for men, and wrongly conclude that Women are more prone to Diabetes than Men, when the two groups differ in lifestyle at least as much as in gender.
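The arithmetic behind that misleading conclusion can be sketched in a few lines. This is only an illustration built from the counts in the example; the record layout and the `lifestyle` labels are assumptions made for the sketch, not part of any real data set.

```python
from collections import Counter

# Counts come from the example above; the "lifestyle" labels are
# illustrative names for how each group was described.
records = (
    [{"gender": "F", "lifestyle": "sedentary", "diabetic": True}] * 75
    + [{"gender": "F", "lifestyle": "sedentary", "diabetic": False}] * 25
    + [{"gender": "M", "lifestyle": "active", "diabetic": True}] * 5
    + [{"gender": "M", "lifestyle": "active", "diabetic": False}] * 45
)

def diabetes_rate(rows):
    """Share of rows flagged as diabetic."""
    return sum(r["diabetic"] for r in rows) / len(rows)

rate_by_gender = {
    g: diabetes_rate([r for r in records if r["gender"] == g])
    for g in ("F", "M")
}
print(rate_by_gender)  # {'F': 0.75, 'M': 0.1}

# Every woman in the sample is sedentary and every man is active, so the
# effect of gender cannot be separated from the effect of lifestyle.
group_sizes = Counter((r["gender"], r["lifestyle"]) for r in records)
print(group_sizes)
```

The per-gender rates are computed correctly; the problem is that only two of the four gender and lifestyle combinations appear in the sample at all.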

While I kept the data set very simple, there are still big take-aways here. The data set should have included people from diverse backgrounds for each gender. It should have included a comparable number of samples for both genders. Factors like Age, Income, Geography, Level of Physical Activity, Food Habits, and Other Diagnosed Diseases, among others, could each tell a different story, even when looked at in isolation. Depending on your problem statement, the right data sample should be chosen to arrive at meaningful and sound conclusions.
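One way to act on this before any analysis is to count how many samples fall into each combination of the factors you care about. The sketch below is a rough illustration under stated assumptions: the `coverage_report` helper, the factor names, and the `min_count` threshold are all hypothetical, and a real check would depend on your problem statement.

```python
from collections import Counter
from itertools import product

def coverage_report(rows, factors, min_count=1):
    """Return factor combinations that have fewer than min_count samples.

    Hypothetical helper: `rows` is a list of dicts and `factors` lists the
    keys whose observed values should be crossed with each other.
    """
    levels = {f: sorted({r[f] for r in rows}) for f in factors}
    counts = Counter(tuple(r[f] for f in factors) for r in rows)
    return {
        combo: counts.get(combo, 0)
        for combo in product(*(levels[f] for f in factors))
        if counts.get(combo, 0) < min_count
    }

# The sample from the Diabetes example, reduced to two factors.
sample = (
    [{"gender": "F", "activity": "low"}] * 100
    + [{"gender": "M", "activity": "high"}] * 50
)
missing = coverage_report(sample, ["gender", "activity"])
print(missing)  # {('F', 'high'): 0, ('M', 'low'): 0}
```

Here the report shows that active women and inactive men are absent, which is exactly why the comparison by gender cannot be trusted.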

Let me give another example, this time using the K-Nearest Neighbor (KNN) classification algorithm. For those of you who are not very familiar with the term, the KNN algorithm classifies an object of unknown class/type into one of the categories present in the data set. The algorithm is first given data points (objects) with known classes/types and is then used to classify new objects. To classify a point, KNN computes its distance (typically Euclidean) to the known points, picks the K closest neighbors (K is a given value), and lets them vote. The new object is assigned the class/type with the most votes.
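The voting procedure described above can be sketched as follows. This is a minimal illustration, not a production implementation: the function name `knn_classify`, the 2-D points, and the labels are made up for the sketch, and it assumes Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative training data: two well-separated groups of three points.
train = [
    ((1.0, 1.0), "circle"), ((1.5, 2.0), "circle"), ((2.0, 1.0), "circle"),
    ((6.0, 6.0), "square"), ((6.5, 7.0), "square"), ((7.0, 6.0), "square"),
]
print(knn_classify(train, (2.0, 2.0), k=3))  # circle
```

In this sketch a tie goes to whichever tied label was encountered first among the neighbors; with two classes, an odd K avoids ties altogether.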

[Figure: K-Nearest Neighbor Classifier]

In the above picture, we can see that X should be classified as a Green Circle. If K=1, we get Class = Green Circle. When we set K=13, the object inevitably gets classified as a Blue Square. While in some data sets that could be the right classification, in the above example it is not. Green Circle samples were fewer in number, which is why they were out-voted and the object was incorrectly classified.
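The out-voting effect is easy to reproduce on synthetic data. The coordinates below are invented for illustration and do not come from the figure: three Green Circles sit close to the query point and twelve Blue Squares lie further away. The helper name `neighbor_votes` is likewise an assumption of this sketch.

```python
import math
from collections import Counter

def neighbor_votes(train, query, k):
    """Count class labels among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest)

query = (0.0, 0.0)
# Three green circles sit right next to the query point...
greens = [((0.5, 0.0), "green circle"), ((0.0, 0.6), "green circle"),
          ((-0.4, -0.4), "green circle")]
# ...while twelve blue squares lie on a ring of radius 3 around it.
blues = [((3 * math.cos(i * math.pi / 6), 3 * math.sin(i * math.pi / 6)),
          "blue square") for i in range(12)]
train = greens + blues

for k in (1, 13):
    votes = neighbor_votes(train, query, k)
    print(k, votes.most_common(1)[0][0], dict(votes))
```

With K=1 the single nearest neighbor is a Green Circle. With K=13 all three Green Circles still vote, but ten Blue Squares are pulled into the neighborhood and win 10 to 3, even though every one of them is farther from the query than any Green Circle.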

In real life, the conclusions you draw, and the solutions or business decisions you propose based on them, can be make-or-break. Some decisions are highly critical, which makes drawing conclusions from representative data more crucial than we realize.

Disclaimer: Choosing the right K value is beyond the scope of this article.

Original article: https://medium.com/analytics-vidhya/does-your-data-let-you-tell-the-real-story-7c4c7d656a01





