Statistical Analysis on a Dataset You Don't Understand

Recently, I took the opportunity to work on a competition held by Wells Fargo (Mindsumo). The dataset provided was just a bunch of numbers in various columns, with no indication of what the data might represent. I had always thought that analyzing data efficiently required some knowledge of the data and its domain. I have attached a sample below. It consisted of columns X0 to X29, which held continuous values, and XC, which held categorical data, i.e. 30 variables in total. I set out to analyze the entire dataset in order to understand the data.

[Figure: sample of the dataset]

Normality check of continuous variables

I used QQ plots to check whether each variable was normally distributed and whether there was any skew in the data. All the variables were normally distributed with very little deviation, so no processing was needed at this point to obtain a Gaussian distribution. I prefer a QQ plot for the initial analysis because it makes it easy to see what type of distribution the data follows, be it Gaussian, uniform, etc.

[Figure: QQ plots of the continuous variables]
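
The QQ check can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes SciPy is available, uses synthetic columns in place of the real X0 to X29, and summarizes each QQ plot by the correlation r of its points with a straight line. The helper name `qq_fit` is made up for this example.

```python
import numpy as np
from scipy import stats


def qq_fit(values):
    """Return the correlation r of the normal QQ plot for one variable.

    r close to 1 means the QQ points lie close to a straight line,
    i.e. the sample looks approximately Gaussian.
    """
    # probplot returns the QQ points and a least-squares line fit through them.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return r


# Synthetic stand-ins for real columns (hypothetical data).
rng = np.random.default_rng(0)
normal_col = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_col = rng.exponential(scale=1.0, size=500)

r_normal = qq_fit(normal_col)
r_skewed = qq_fit(skewed_col)
print(f"normal column: r = {r_normal:.3f}")
print(f"skewed column: r = {r_skewed:.3f}")
```

To draw the plot itself rather than only summarize it, `stats.probplot` also accepts a `plot` argument such as `matplotlib.pyplot`.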

Once the QQ plot suggests that the data is normally distributed, a Shapiro-Wilk test can be performed to confirm the hypothesis. The test is designed specifically to check for normality. The null hypothesis, in this case, is that the variable is normally distributed. If the p-value obtained is less than 0.05, the null hypothesis is rejected and the variable is assumed not to be normally distributed. All the p-values here are greater than 0.5, so the null hypothesis is not rejected for any variable, which is consistent with all of them following a normal distribution.

[Figure: Shapiro-Wilk test results]
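
A rough sketch of the Shapiro-Wilk step, assuming SciPy's `scipy.stats.shapiro` and synthetic data in place of the competition columns. The helper `shapiro_pvalues` and the column names are hypothetical.

```python
import numpy as np
from scipy import stats


def shapiro_pvalues(columns, alpha=0.05):
    """Run the Shapiro-Wilk test on each named column.

    Returns {name: (p_value, rejected)}, where rejected is True when
    p_value < alpha, i.e. normality is rejected at that level.
    """
    results = {}
    for name, values in columns.items():
        statistic, p_value = stats.shapiro(values)
        results[name] = (float(p_value), bool(p_value < alpha))
    return results


# Synthetic stand-ins for real columns (hypothetical data).
rng = np.random.default_rng(1)
columns = {
    "gaussian": rng.normal(size=300),
    "skewed": rng.exponential(size=300),
}

results = shapiro_pvalues(columns)
for name, (p_value, rejected) in results.items():
    print(f"{name}: p = {p_value:.4g}, normality rejected = {rejected}")
```

A p-value above the threshold only means that normality is not rejected; it does not prove that the variable is Gaussian.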

Categorical variable

I looked at the distribution of the categorical variable to check whether the points in the dataset were spread evenly across its categories. The distribution is shown below. The variable has 5 unique values (A, B, C, D and E), all more or less equally frequent.

[Figure: distribution of the categorical variable XC]
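
One simple way to check how evenly the categories are represented is to compute the share of rows in each one. The sketch below assumes pandas and a made-up, perfectly balanced column standing in for XC.

```python
import pandas as pd


def category_shares(series):
    """Share of rows in each category, sorted by category label."""
    return series.value_counts(normalize=True).sort_index()


# Hypothetical stand-in for the XC column: 20 rows per category.
xc = pd.Series(list("ABCDE") * 20, name="XC")

shares = category_shares(xc)
print(shares)
```

On real data the shares will not be exactly equal; what matters is that no category dominates or is nearly empty.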

I used one-hot encoding to convert the categorical variable into one binary variable per category, as shown below. Although this increased the dimensionality, I was hoping to check for correlations later on and remove or merge certain columns.

[Figure: one-hot encoded categorical variable]
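
One-hot encoding can be sketched with `pandas.get_dummies`. The tiny frame below is invented for illustration; with the column named XC, pandas produces names such as XC_A and XC_C, which match the names used later in the article.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one continuous and one categorical column.
df = pd.DataFrame({
    "X0": [0.1, -1.2, 0.7, 2.3, -0.4],
    "XC": ["A", "B", "C", "D", "E"],
})

# Replace XC with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["XC"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so the five new columns carry the same information as the original XC column.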

Data Correlation

Correlation is an important technique that helps in determining the relationships between variables and weeding out highly correlated data. We do this because we don't want variables that are highly correlated with each other, since they affect the final dependent variable in the same way. Correlation values range from -1 to 1, with 1 signifying a strong positive correlation between the two variables and -1 signifying a strong negative correlation. We also calculate the statistical significance of each correlation to decide whether the null hypothesis (there is no correlation) can be rejected. I have taken three values as benchmarks for measuring statistical significance: 0.1, 0.05 and 0.01. The table below is a small sub-sample of the correlation values for each pair of variables. A p-value below 0.01 signifies high statistical significance, meaning the null hypothesis can be rejected, and is marked with 3 '*'. The significance is weaker if the p-value is below 0.1 but above 0.05, which is marked with 1 '*'.

[Figure: sub-sample of the correlation table with significance markers]
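
A correlation table with significance stars can be sketched as follows. Several things here are assumptions rather than details from the article: the Pearson coefficient (via `scipy.stats.pearsonr`), the two-star band for p-values between 0.01 and 0.05, the helper names, and the synthetic columns.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def stars(p_value):
    """Map a p-value to the significance markers used in the table."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"  # assumed middle band; the article only describes 3 and 1 stars
    if p_value < 0.1:
        return "*"
    return ""


def correlation_table(columns):
    """Pearson r, p-value and stars for every pair of named columns."""
    rows = []
    for a, b in combinations(columns, 2):
        r, p = stats.pearsonr(columns[a], columns[b])
        rows.append((a, b, float(r), float(p), stars(p)))
    return rows


# Synthetic columns (hypothetical): X1 is built from X0, X2 is independent noise.
rng = np.random.default_rng(2)
base = rng.normal(size=400)
columns = {
    "X0": base,
    "X1": 2.0 * base + rng.normal(scale=0.1, size=400),
    "X2": rng.normal(size=400),
}

table = correlation_table(columns)
for a, b, r, p, marker in table:
    print(f"{a} vs {b}: r = {r:+.3f} {marker}")
```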

None of the variable pairs in the table above had a high correlation value. Thus, I went ahead with the assumption that the variables were not related to each other. I then explored the correlation of each variable with the dependent variable y.

[Figure: correlation of each variable with y]
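
Screening variables against y can be sketched like this. It is a simplified stand-in for the selection described in this section: it keeps a variable whenever its Pearson correlation with y is significant at a chosen level. The threshold of 0.1, the function name and the data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats


def significant_features(features, target, alpha=0.1):
    """Names of columns whose correlation with the target has p < alpha."""
    kept = []
    for name in features.columns:
        r, p = stats.pearsonr(features[name], target)
        if p < alpha:
            kept.append(name)
    return kept


n = 200
rng = np.random.default_rng(3)
y = np.tile([0, 1], n // 2)  # alternating binary target (hypothetical)
features = pd.DataFrame({
    # Built from y plus noise, so clearly related to it.
    "related": y + rng.normal(scale=0.5, size=n),
    # Period-4 pattern that is uncorrelated with the period-2 target by construction.
    "unrelated": np.tile([0, 0, 1, 1], n // 4),
})

kept = significant_features(features, y)
print(kept)
```

Filtering on significance rather than on the size of the correlation keeps variables with a small but reliable relationship to y, which is the choice the article makes.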

We are concerned with predicting the variable y. As seen, many variables have no correlation with y and are not statistically significant at all, which means that the relationship is weak and we can safely exclude them from the final dataset. These include variables such as X5, X8, X9, X10, XC_C, etc. I have not excluded the variables that have low correlation but high statistical significance, because they may still affect the final dependent variable for a small part of the sample and we cannot rule them out completely. We can further reduce the number of variables by merging some of them. We merge variables that have the same correlation value with y. These groups are:

  • X7, X11, X17 and X21
  • X1 and X23
  • X22 and X26
  • X4, X15, X19 and X25

I merged these variables using an optimization technique. Let us consider the variables X1 and X23. I assumed a linear combination m·X1 + n·X23 and looked for the values of m and n that maximize its correlation with y. In all cases, I fixed n at 1 and solved for m. The equation is shown below.

[Figure: equation for m]
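
The equation was published as an image and is not reproduced here. As a sketch under the stated assumptions (n fixed at 1, and the goal of maximizing the Pearson correlation of m·X1 + X23 with y), setting the derivative of that correlation with respect to m to zero gives the expression below. The notation is introduced for illustration, and the author's formula may be written differently.

```latex
\rho(m) = \frac{m\,\operatorname{Cov}(X_1, y) + \operatorname{Cov}(X_{23}, y)}
               {\sqrt{m^2\,\operatorname{Var}(X_1) + 2m\,\operatorname{Cov}(X_1, X_{23}) + \operatorname{Var}(X_{23})}\;\sqrt{\operatorname{Var}(y)}}

\frac{d\rho}{dm} = 0
\;\Longrightarrow\;
m = \frac{\operatorname{Cov}(X_1, y)\,\operatorname{Var}(X_{23}) - \operatorname{Cov}(X_{23}, y)\,\operatorname{Cov}(X_1, X_{23})}
         {\operatorname{Cov}(X_{23}, y)\,\operatorname{Var}(X_1) - \operatorname{Cov}(X_1, y)\,\operatorname{Cov}(X_1, X_{23})}
```

The denominator of m can vanish, in which case the roles of the two variables can be swapped by fixing m at 1 and solving for n.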

Once n is set to 1, we can easily solve for m and substitute it into the linear combination above to compute the merged value for each row. In this way, we can merge all the groups of variables listed above. If the denominator is 0, m can be taken as 1 and the equation solved for n instead. Make sure that the correlation values with y are equal for both variables. After merging, I generated the correlation table again.

[Figure: correlation table after merging]
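
The merge step can be sketched numerically. The code below is an illustration under assumptions, not the author's implementation: it fixes n at 1, takes m as the stationary point of the Pearson correlation of m·X1 + X23 with y (expressed through sample covariances), and runs on synthetic data with a continuous y. The function names are hypothetical.

```python
import numpy as np


def merge_weight(x1, x2, y):
    """Weight m for which m*x1 + x2 is a stationary point of its correlation with y.

    Returns None when the denominator is (numerically) zero; the caller can
    then fix m = 1 and solve for the other weight instead.
    """
    cov = np.cov(np.vstack([x1, x2, y]))
    v1, v2, c12 = cov[0, 0], cov[1, 1], cov[0, 1]
    c1, c2 = cov[0, 2], cov[1, 2]
    denominator = c2 * v1 - c1 * c12
    if np.isclose(denominator, 0.0):
        return None
    return (c1 * v2 - c2 * c12) / denominator


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


# Synthetic stand-ins for X1, X23 and y (hypothetical data).
rng = np.random.default_rng(4)
x1 = rng.normal(size=2000)
x23 = rng.normal(size=2000)
y = x1 + x23 + rng.normal(scale=0.5, size=2000)

m = merge_weight(x1, x23, y)
merged = m * x1 + x23

print(f"m = {m:.3f}")
print(f"corr(X1, y)     = {corr(x1, y):.3f}")
print(f"corr(X23, y)    = {corr(x23, y):.3f}")
print(f"corr(merged, y) = {corr(merged, y):.3f}")
```

On this synthetic data the merged variable correlates with y at least as strongly (in absolute value) as either input, which mirrors the increase reported for the merged variables.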

We can see that the correlation values for the merged variables have increased and that the statistical significance is high for all of them. I managed to reduce the number of variables from 30 to 15. I can now feed these variables into my machine learning model and check its accuracy against the validation dataset.

Training and Validating the data

I chose a logistic regression model for training and prediction on this dataset for multiple reasons:

  • The dependent variable is binary
  • The independent variables are related to the dependent variable

After training the model, I checked it against the validation dataset, with the following results.

[Figure: validation results]
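
Training and validation can be sketched with scikit-learn. This is only an illustration on invented data: the features, the train/validation split and the default model settings are assumptions, and the scores it prints have nothing to do with the figures reported in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (hypothetical data).
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=1000) > 0).astype(int)

# Hold out a quarter of the rows as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
f1 = f1_score(y_val, predictions)
print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```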

The model had an accuracy of 99.6% with an F1 score of 99.45%.

Conclusion

This was a basic exploratory and statistical analysis to reduce the number of features and ensure that there are no correlated variables in the final dataset. Using a few simple techniques, we can get good results even if we do not initially understand what the data represents. The main steps include checking that the data is normally distributed and choosing an efficient encoding scheme for the categorical variables. Further, variables can be removed and merged based on the correlations among them, after which an appropriate model can be chosen for analysis. You can find the code repository at https://github.com/Raul9595/AnonyData.

Original article: https://towardsdatascience.com/statistical-analysis-on-a-dataset-you-dont-understand-f382f43c8fa5
