Does Your Data Let You Tell the Real Story?
Many are passionate about Data Analytics. Many love matplotlib and Seaborn. Many enjoy designing and working on Classifiers. We are quick to grab a data set and launch Jupyter Notebook, import pandas and NumPy and get to work. But wait a minute!
We may be great narrators, but it's important to check the facts before we get on stage. In other words, you may be an excellent data wrangler and analyst, but poor quality data can lead you to poor quality observations. So, what is Good Quality Data?
Many factors measure and define Good Quality Data: Accuracy, Completeness, Timeliness, and Reliability, to name a few. Some may say a data set with no null values, missing data, or duplicate information is Good Quality Data. Today, I would like to draw your attention to two easily overlooked yet very important questions. How well does the data set represent your problem? Is it free of bias?
Let me explain with a quick example. You are trying to see whether both genders are equally prone to Diabetes, which is often described as a lifestyle disease. Let us assume that the person who collected the data ended up reaching out to middle-aged women who do not do any form of physical exercise and have unhealthy eating habits. Say 75 out of 100 of these women were Diabetic. This person also approached 50 men who work 8 hours a day on a construction site and are always on their feet. 5 out of these 50 were Diabetic. As analysts, if we do not inspect the data well before working with it, the result can be catastrophic. One can very easily state that 75 percent of the women were Diabetic while the number was 10 percent for men, and then wrongly conclude that Women are more prone to Diabetes than Men. The two groups differ in lifestyle as much as in gender, so the data cannot tell us which of the two explains the gap.
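The arithmetic behind that misleading comparison can be sketched in a few lines of Python. The counts come from the example above; the `lifestyle` labels and the `confounded` check are illustrative assumptions added here, not part of any real data set.

```python
# Counts from the example above; the lifestyle labels are paraphrased (assumption).
samples = [
    {"gender": "women", "lifestyle": "sedentary", "n": 100, "diabetic": 75},
    {"gender": "men", "lifestyle": "physically active", "n": 50, "diabetic": 5},
]


def diabetes_rate(group):
    """Share of diabetic people in one sampled group."""
    return group["diabetic"] / group["n"]


# Naive per-gender rates: this is the comparison the careless analyst reports.
rates = {g["gender"]: diabetes_rate(g) for g in samples}

# Which lifestyles were actually sampled for each gender?
lifestyles_by_gender = {}
for g in samples:
    lifestyles_by_gender.setdefault(g["gender"], set()).add(g["lifestyle"])

# If some gender is missing a lifestyle that appears elsewhere in the sample,
# gender and lifestyle cannot be separated in this data.
all_lifestyles = set().union(*lifestyles_by_gender.values())
confounded = any(seen != all_lifestyles for seen in lifestyles_by_gender.values())

print(rates)
print("gender confounded with lifestyle:", confounded)
```

Here the rates come out as 75 percent and 10 percent, exactly as stated, but the check also flags that no sedentary men and no physically active women were sampled, which is why the gender comparison is not trustworthy.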
While I kept the data set very simple, there are still big takeaways here. The data set should have included people from diverse backgrounds for each gender. It should have included an equal number of samples for both genders. Factors like Age, Income, Geography, Level of Physical Activity, Food Habits, and Other Diagnosed Diseases, among others, could each tell a different story when looked at in isolation. Depending on your problem statement, choose a sample that represents it, so that you arrive at meaningful and sound conclusions.
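One way to get an equal number of samples per group is to downsample every group to the size of the smallest one. The sketch below assumes records stored as plain dictionaries; the `balanced_sample` helper, the field names, and the generated records are all hypothetical. Downsampling is only one option, and balancing on gender alone does not guarantee diversity in the other factors listed above.

```python
import random
from collections import defaultdict


def balanced_sample(records, key, seed=0):
    """Draw the same number of records from every group defined by `key`.

    Each group is downsampled to the size of the smallest group.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    size = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical records: 100 women and 50 men, half of each physically active.
records = (
    [{"gender": "F", "active": i % 2 == 0} for i in range(100)]
    + [{"gender": "M", "active": i % 2 == 0} for i in range(50)]
)

sample = balanced_sample(records, "gender")
counts = {g: sum(r["gender"] == g for r in sample) for g in ("F", "M")}
print(counts)  # {'F': 50, 'M': 50}
```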
Let me give another example, this time with the K-Nearest Neighbor (KNN) Classification Algorithm. For those of you who are not very familiar with the term, the KNN algorithm classifies an object of unknown class/type into one of the categories present in the data set. The algorithm is first trained on data points (objects) with known classes/types and is then used to classify new objects. KNN classifies a point by computing its distance (commonly the Euclidean distance) to the labeled points, picking the K closest neighbors (K is a value you choose), and letting them vote. The new object is assigned the class/type with the most votes.
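As a rough illustration of those steps, here is a minimal KNN classifier in plain Python. The `knn_classify` function, the class labels "A" and "B", and the coordinates are made up for this sketch; a real project would more likely use a library implementation.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return the majority label among the k training points closest to `query`.

    `train` is a list of ((x, y), label) pairs. Ties between labels are
    broken by whichever label was counted first, which is an arbitrary choice.
    """
    # Sort labeled points by Euclidean distance to the query and keep the k nearest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated classes.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify(train, (2.0, 2.0), k=3))  # A
print(knn_classify(train, (6.0, 7.0), k=3))  # B
```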
In the above picture, X should be classified as a Green Circle. With K=1, we get Class = Green Circle. With K=13, the object inevitably gets classified as a Blue Square. In some data sets that could be the right classification, but in the above example it is not. There were fewer Green Circle samples than Blue Square samples, so they were out-voted and the object was misclassified.
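The out-voting effect is easy to reproduce. The sketch below does not use the coordinates from the picture, which are not given in the text; it assumes 3 Green Circles close to X and 10 Blue Squares farther away, and the `knn_votes` helper is hypothetical.

```python
import math
from collections import Counter


def knn_votes(train, query, k):
    """Count the labels of the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest)


x = (0.0, 0.0)  # the object to classify

# Assumed layout: a small class right next to X, a large class farther away.
greens = [
    ((0.5, 0.0), "green circle"),
    ((0.0, 0.6), "green circle"),
    ((-0.7, 0.2), "green circle"),
]
blues = [((2.0 + 0.1 * i, 1.0), "blue square") for i in range(10)]
train = greens + blues

for k in (1, 3, 13):
    votes = knn_votes(train, x, k)
    winner = votes.most_common(1)[0][0]
    print(f"K={k}: {winner} {dict(votes)}")
```

With K=1 and K=3 every vote comes from a nearby Green Circle. With K=13 all points vote, and the 10 distant Blue Squares outnumber the 3 Green Circles even though none of them is close to X.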
In real life, the conclusions you draw, and the solutions or business decisions you propose based on them, can be make-or-break. Some decisions are highly critical, which makes drawing conclusions from well-represented data more crucial than we realize.
Disclaimer: Choosing the right K value is beyond the scope of this article.
Translated from: https://medium.com/analytics-vidhya/does-your-data-let-you-tell-the-real-story-7c4c7d656a01