Statistical Analysis on a Dataset You Don't Understand

Recently, I took the opportunity to work on a competition held by Wells Fargo (Mindsumo). The dataset provided was just a collection of numbers in various columns, with no indication of what the data represented. I had always thought that analyzing data efficiently required some knowledge of the data and its domain. A sample is attached below. It consisted of columns X0 to X29, which held continuous values, and XC, which held categorical data: 30 continuous variables plus one categorical variable. I set out to analyze the entire dataset in order to understand it.

[Figure: sample of the dataset]

Normality check of continuous variables

I used QQ plots to check whether each variable was normally distributed and whether the data was skewed. All the variables followed a normal distribution with very little deviation, so no transformation was needed at this point to obtain a Gaussian distribution. I prefer a QQ plot for the initial analysis because it makes it easy to see what kind of distribution the data follows, be it Gaussian, uniform, or something else.
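As a rough illustration of the QQ check, the sketch below computes the QQ points for one column with `scipy.stats.probplot`. The column `x0` is synthetic stand-in data, not the competition dataset, and reading the fitted line's `r` value as a summary of straightness is my own shortcut rather than something the original analysis describes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for one continuous column such as X0
x0 = rng.normal(loc=5.0, scale=2.0, size=500)

# probplot returns the theoretical normal quantiles, the ordered sample
# values, and a least-squares line fitted through the QQ points
(theoretical_q, ordered_x), (slope, intercept, r) = stats.probplot(x0, dist="norm")

# For normally distributed data the points lie close to a straight line,
# so r is close to 1, the slope approximates the standard deviation and
# the intercept approximates the mean
print(f"QQ line: slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` would draw the plot itself instead of only returning the points.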

[Figure: QQ plots of the continuous variables]

Once the QQ plot suggests that the data is normally distributed, a Shapiro-Wilk test can be performed to check that hypothesis. The test is designed specifically to test for normality. The null hypothesis, in this case, is that the variable is normally distributed. If the p-value obtained is less than 0.05, the null hypothesis is rejected and the variable is assumed not to be normally distributed. All the p-values were greater than 0.5, well above the 0.05 threshold, so the null hypothesis is not rejected for any variable and all of them can be treated as normally distributed.
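A minimal sketch of the Shapiro-Wilk step, using `scipy.stats.shapiro`. Both columns are synthetic and their names are hypothetical; the skewed column is included only to show what a rejection looks like.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical columns: one drawn from a normal distribution, one skewed
columns = {
    "normal_col": rng.normal(size=300),
    "skewed_col": rng.exponential(size=300),
}

p_values = {}
for name, values in columns.items():
    # Null hypothesis: the sample comes from a normal distribution
    statistic, p_value = stats.shapiro(values)
    p_values[name] = p_value
    verdict = "reject normality" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W={statistic:.4f}, p={p_value:.4g} -> {verdict}")
```

A p-value above the threshold does not prove normality; it only means the test found no evidence against it, which is why it is paired with the QQ plot.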

[Figure: Shapiro-Wilk test results for the continuous variables]

Categorical variable

I checked the distribution of the categorical variable to see whether the points in the dataset were evenly spread across its categories. The distribution is shown below. The variable took 5 unique values (A, B, C, D and E), all occurring with roughly equal frequency.
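One way to make "more or less equally distributed" concrete is to count the categories and run a chi-square goodness-of-fit test against a uniform split. The article only describes inspecting the distribution, so the test is an addition of mine, and the counts below are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical XC column with made-up category counts
xc = pd.Series(list("A" * 205 + "B" * 198 + "C" * 201 + "D" * 196 + "E" * 200), name="XC")

# Frequency of each category, ordered A..E
counts = xc.value_counts().sort_index()
print(counts)

# With no expected frequencies given, chisquare tests against equal counts
chi2, p_value = stats.chisquare(counts.to_numpy())
print(f"chi2={chi2:.2f}, p={p_value:.3f}")
```

A large p-value here means the observed counts are consistent with the categories being equally frequent.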

[Figure: distribution of the categorical variable XC]

I used one-hot encoding to convert the categorical variable into one binary variable per category, as shown below. Although this increased the dimensionality, I planned to check for correlations later on and remove or merge certain columns.
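The encoding step can be sketched with `pandas.get_dummies`. The tiny frame below is made up; the `XC_` prefix is chosen so the resulting column names match the `XC_C` style used later in the article.

```python
import pandas as pd

# Hypothetical slice of the categorical column
df = pd.DataFrame({"XC": ["A", "B", "C", "D", "E", "A", "C"]})

# One binary column per category: XC_A, XC_B, XC_C, XC_D, XC_E
encoded = pd.get_dummies(df, columns=["XC"], prefix="XC", dtype=int)
print(encoded)
```

Each row has exactly one 1 across the five new columns, which is why the encoding adds dimensions without adding information.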

[Figure: one-hot encoded categorical variable]

Data Correlation

Correlation analysis is an important technique for determining the relationships between variables and weeding out highly correlated ones. We do this because we don't want variables that are highly correlated with each other, since they affect the final dependent variable in the same way. Correlation values range from -1 to 1, with 1 signifying a perfect positive correlation between the two variables and -1 a perfect negative correlation. We also calculate the statistical significance of each correlation to determine whether the null hypothesis (there is no correlation) can be rejected. I used three significance levels as benchmarks: 0.1, 0.05 and 0.01. The table below is a small sub-sample of the correlation values for pairs of variables. A p-value below 0.01 signifies high statistical significance, meaning the null hypothesis can be rejected, and is marked with three '*'; a p-value between 0.05 and 0.1 signifies weaker significance and is marked with one '*'.
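The correlation-plus-significance table can be approximated with `scipy.stats.pearsonr`, which returns both the coefficient and its p-value. The article names only the one-star and three-star tiers; the two-star tier for p-values between 0.01 and 0.05 is my assumption, as are the helper name `significance_stars` and the synthetic columns.

```python
import numpy as np
from scipy import stats

def significance_stars(p_value):
    """Map a p-value to the star notation (two-star tier is assumed)."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""

rng = np.random.default_rng(2)
# Hypothetical columns: b depends on a, c is independent of a
a = rng.normal(size=400)
b = 0.8 * a + rng.normal(scale=0.5, size=400)
c = rng.normal(size=400)

r_ab, p_ab = stats.pearsonr(a, b)
r_ac, p_ac = stats.pearsonr(a, c)
print(f"a vs b: r={r_ab:.3f}{significance_stars(p_ab)}")
print(f"a vs c: r={r_ac:.3f}{significance_stars(p_ac)}")
```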

[Table: sub-sample of pairwise correlation values with significance stars]

None of the variable pairs above had a high correlation value. Thus, I went ahead on the assumption that the variables were not related to each other. I then explored the correlation of each variable with the dependent variable y.

[Table: correlation of each variable with y]

We are concerned with predicting the variable y. As seen in the table above, many variables have almost no correlation with y and are not statistically significant either, which means the relationship is weak and we can safely exclude them from the final dataset. These include X5, X8, X9, X10, XC_C, etc. I did not exclude the variables that have low correlation but high statistical significance, since they may still affect the dependent variable for a small part of the sample. We can reduce the number of variables further by merging some of them, choosing groups of variables that have the same correlation value with y. These groups are:


  • X7, X11, X17 and X21
  • X1 and X23
  • X22 and X26
  • X4, X15, X19 and X25
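The exclusion step that precedes this merging can be sketched as a loop over the columns that keeps only those whose correlation with y is statistically significant. The 0.1 cut-off reuses the loosest benchmark mentioned earlier; whether the author applied exactly this rule is an assumption, and the two feature columns are synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical features: one tied to y, one pure noise
features = pd.DataFrame({
    "X_signal": 1.5 * y + rng.normal(size=n),
    "X_noise": rng.normal(size=n),
})

kept, dropped = [], []
for name in features.columns:
    r, p_value = stats.pearsonr(features[name], y)
    # Keep a column only if its correlation with y is significant at 0.1
    (kept if p_value < 0.1 else dropped).append(name)
    print(f"{name}: r={r:.3f}, p={p_value:.4g}")

reduced = features[kept]
print("kept:", kept, "dropped:", dropped)
```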

I merged these variables using an optimization technique. Consider the variables X1 and X23. I assumed a linear combination mX1 + nX23 and looked for the values of m and n that maximize its correlation with y. In all cases, I fixed n at 1 and solved for m. The equation is shown below.
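One way to arrive at such an expression (a reconstruction from the setup described here, not necessarily the exact formula the author used) is to write the correlation of mX1 + X23 with y as a function of m, set its derivative to zero, and solve. The symbols follow the text; Cov and Var denote sample covariance and variance.

```latex
\rho(m) = \frac{m\,\mathrm{Cov}(X_1, y) + \mathrm{Cov}(X_{23}, y)}
               {\sigma_y \sqrt{m^2\,\mathrm{Var}(X_1) + 2m\,\mathrm{Cov}(X_1, X_{23}) + \mathrm{Var}(X_{23})}}

\frac{d\rho}{dm} = 0
\;\Longrightarrow\;
m = \frac{\mathrm{Cov}(X_1, y)\,\mathrm{Var}(X_{23}) - \mathrm{Cov}(X_{23}, y)\,\mathrm{Cov}(X_1, X_{23})}
         {\mathrm{Cov}(X_{23}, y)\,\mathrm{Var}(X_1) - \mathrm{Cov}(X_1, y)\,\mathrm{Cov}(X_1, X_{23})}
```

The denominator of this expression can vanish, in which case fixing n = 1 gives no finite solution for m.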

[Figure: equation for m]

Once n is set to 1, we can easily solve for m and substitute it into the linear combination above to compute the merged variable. All the groups listed above can be merged this way. If the denominator is 0, m can be set to 1 and the equation solved for n instead. Make sure that the correlation values with y are equal for both variables. After merging, I generated the correlation table again.
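A sketch of the merge for one pair, assuming the closed form obtained by setting the derivative of the correlation to zero with n fixed at 1: m = (a·s2 − b·c) / (b·s1 − a·c), where a and b are the covariances of the two variables with y, s1 and s2 their variances, and c their mutual covariance. This formula is my reconstruction, the function name `merge_weight` is hypothetical, and the data is synthetic.

```python
import numpy as np

def merge_weight(x1, x2, y):
    """Return m such that m*x1 + x2 has extremal correlation with y."""
    cov = np.cov(np.vstack([x1, x2, y]))
    s1, s2 = cov[0, 0], cov[1, 1]   # variances of x1 and x2
    c = cov[0, 1]                   # covariance between x1 and x2
    a, b = cov[0, 2], cov[1, 2]     # covariances of x1 and x2 with y
    denom = b * s1 - a * c
    if np.isclose(denom, 0.0):
        raise ValueError("denominator is 0; fix m = 1 and solve for n instead")
    return (a * s2 - b * c) / denom

def corr_with_y(values, y):
    return np.corrcoef(values, y)[0, 1]

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000).astype(float)
# Hypothetical pair with roughly equal correlation to y
x1 = 0.5 * y + rng.normal(size=1000)
x23 = 0.5 * y + rng.normal(size=1000)

m = merge_weight(x1, x23, y)
merged = m * x1 + x23
print(f"m={m:.3f}")
print(f"corr(x1, y)={corr_with_y(x1, y):.3f}, corr(x23, y)={corr_with_y(x23, y):.3f}")
print(f"corr(merged, y)={corr_with_y(merged, y):.3f}")
```

The merged column's absolute correlation with y is at least as large as that of either input, which is consistent with the increase reported after merging.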

[Table: correlations with y after merging]

We can see that the correlation values for the merged variables have increased and that all of them are highly statistically significant. I managed to reduce the number of variables from 30 to 15. I can now feed these variables into my machine learning model and check its accuracy against the validation dataset.


Training and Validating the data

I chose a logistic regression model to train and predict on this dataset for two reasons:


  • The dependent variable is binary
  • The independent variables are related to the dependent variable

After training the model, I evaluated it against the validation dataset, with the following results.
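The train-and-validate step might look roughly like the following with scikit-learn. The data is synthetic, and the split ratio, `max_iter` setting and feature count are assumptions, so the scores it prints say nothing about the figures reported for the competition dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000
# Hypothetical binary target and five features shifted by the class label
y = rng.integers(0, 2, size=n)
X = 2.0 * y[:, None] + rng.normal(size=(n, 5))

# Hold out 20% of the rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
print(f"accuracy={accuracy:.3f}, F1={f1:.3f}")
```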

[Figure: validation results]

The model had an accuracy of 99.6% with an F1 score of 99.45%.


Conclusion

This was a basic exploratory and statistical analysis to reduce the number of features and ensure that there are no correlated variables in the final dataset. Using a few simple techniques, we can get good results even if we initially do not understand what the data represents. The main steps are checking that the data is normally distributed and choosing an efficient encoding scheme for the categorical variables. The variables can then be removed or merged based on their correlations, after which an appropriate model can be chosen for analysis. You can find the code repository at https://github.com/Raul9595/AnonyData.


Translated from: https://towardsdatascience.com/statistical-analysis-on-a-dataset-you-dont-understand-f382f43c8fa5
