Statistical Testing: Understanding How to Select the Best Test for Your Data!

This post is not meant for seasoned statisticians. It is geared towards data scientists and machine learning (ML) learners and practitioners who, like me, do not come from a statistical background.

For someone from a non-statistical background, the most confusing aspect of statistics is the set of fundamental statistical tests and knowing when to use which one. This post attempts to mark out the differences between the most common tests and their key assumptions.

Table of contents

  1. Terminologies (key terminologies for this post)
  2. Statistical Tests (Hypothesis Testing)
  3. Statistical Assumptions
  4. Parametric Tests
  5. Parametric Test Flowchart
  6. Dealing with Non-Normal Distributions (Non-Parametric Tests)

1) TERMINOLOGIES

DEPENDENT AND INDEPENDENT VARIABLES

An independent variable, often called a “predictor variable”, is a variable that is manipulated in order to observe its effect on a dependent variable, sometimes called an outcome or output variable.

  • Independent variable(s) -> Predictor variable(s)
  • Dependent variable(s) -> Outcome/Output variable(s)

TYPES OF VARIABLES

It is important to distinguish between the types of variables, because the variable type plays a key role in determining which statistical test to adopt. There are two main categories:

  • QUANTITATIVE: express amounts of things (e.g. the number of cigarettes in a pack). The two types of quantitative variables are:

  1. CONTINUOUS (a.k.a. ratio): describes measures and can usually be divided into units smaller than one (e.g. 1.50 kg).

  2. DISCRETE (a.k.a. integer): describes counts and usually can’t be divided into units smaller than one (e.g. 1 cigarette).

  • CATEGORICAL: express groupings of things (e.g. the different types of fruit). The three types of categorical variables are:

  1. ORDINAL: represent data with an order (e.g. rankings).

  2. NOMINAL: represent group names (e.g. brands or species names).

  3. BINARY: represent data with a yes/no or 1/0 outcome (e.g. LEFT or RIGHT).

Figure: TYPES OF VARIABLES SUMMARY (Image by author)

2) STATISTICAL TESTS

Statistics is all about data. Data alone is not interesting; it is the interpretation of the data that we are interested in.

One very important part of statistics is statistical testing. If statistics is “the interpretation of the data”, statistical testing can be considered the “formal procedure for investigating our ideas about the world”.

In other words, whenever we want to make claims about the distribution of data, or about whether one set of results differs from another, we must rely on hypothesis testing.

HYPOTHESIS TESTING

Using hypothesis testing, we try to draw conclusions about a population from sample data. We evaluate two mutually exclusive statements about the population to determine which one is better supported by the sample data.

THERE ARE FIVE MAIN STEPS IN HYPOTHESIS TESTING:

Step 1) State your hypothesis as a null (H0) and an alternate (Ha) hypothesis.

Step 2) Choose a significance level (also called alpha or α).

Step 3) Collect data in a way designed to test the hypothesis.

Step 4) Perform an appropriate statistical test: compute the p-value from the test and compare it to the significance level.

Step 5) Decide whether to “REJECT” the null hypothesis (H0) or “FAIL TO REJECT” the null hypothesis (H0).

Note: Though the specific details might vary, the procedure you use when testing a hypothesis will always follow some version of these steps.
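The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two groups are made-up numbers, the helper name `permutation_p_value` is hypothetical, and an exact permutation test is swapped in for the usual t-test so that the example needs nothing beyond the standard library.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Every way of splitting the pooled data into two groups of the
    original sizes is enumerated; the p-value is the share of splits
    whose mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so rounding error does not hide exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Step 1: H0 = both groups have the same mean, Ha = the means differ.
# Step 2: choose the significance level.
alpha = 0.05
# Step 3: collect the data (illustrative values only).
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]
# Step 4: compute the p-value and compare it to alpha.
p_value = permutation_p_value(group_a, group_b)
# Step 5: reject H0 or fail to reject H0.
reject_h0 = p_value < alpha
print(f"p = {p_value:.4f}, reject H0: {reject_h0}")
```

In practice a library routine such as `scipy.stats.ttest_ind` would usually take the place of the hand-written helper; the five-step structure stays the same.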

If you want to further understand hypothesis testing, I would highly recommend these two great posts on hypothesis testing.

3) STATISTICAL ASSUMPTIONS

Statistical tests make some common assumptions about the data being tested. If these assumptions are violated, the test may not be valid (e.g. the resulting p-value may not be correct).

  1. Independence of observations: the observations/variables you include in your test should not be related (e.g. several tests from the same test subject are not independent, while several tests from multiple different test subjects are independent).

  2. Homogeneity of variance: the variance within each group being compared should be similar to the variance in the other groups. If one group has a much larger variance than the other(s), this limits the test’s effectiveness.

  3. Normality of data: the data follow a normal distribution, i.e. a symmetric, bell-shaped curve. The standard normal distribution is the special case with mean 0 and standard deviation 1.

Figure: The standard normal bell curve
source: https://studylib.net/doc/10831020/the-bell-curve-the-standard-normal-bell-curve
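Two of these assumptions can be eyeballed with very little code. The sketch below is a rough descriptive check, not a formal test: the helper names `variance_ratio` and `fraction_within_one_sd` are hypothetical, and the idea of comparing against the share of a normal distribution that lies within one standard deviation is just one informal way of looking at normality.

```python
from statistics import NormalDist, mean, stdev, variance


def variance_ratio(*groups):
    """Largest sample variance divided by the smallest.

    Values close to 1 suggest similar spread across the groups.
    """
    variances = [variance(group) for group in groups]
    return max(variances) / min(variances)


def fraction_within_one_sd(sample):
    """Share of observations within one standard deviation of the mean."""
    centre, spread = mean(sample), stdev(sample)
    return sum(abs(x - centre) <= spread for x in sample) / len(sample)


# For normally distributed data, roughly 68% of values lie within
# one standard deviation of the mean.
expected_share = NormalDist().cdf(1) - NormalDist().cdf(-1)

sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(variance_ratio([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(fraction_within_one_sd(sample), round(expected_share, 4))
```

Formal checks exist for both assumptions (for example the Shapiro-Wilk test for normality and Levene's test for equal variances); the numbers above are only a first look at the data.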

4) PARAMETRIC TESTS

Parametric tests can only be run on data that satisfy the three statistical assumptions mentioned above. The most common types of parametric tests fall into three categories.

Regression tests:

These tests are used to test cause-and-effect relationships: whether a change in one or more continuous variables predicts a change in another variable.

  • Simple linear regression: tests how a change in the predictor variable predicts the level of change in the outcome variable.

  • Multiple linear regression: tests how changes in the combination of two or more predictor variables predict the level of change in the outcome variable.

  • Logistic regression: is used to describe data and to explain the relationship between one dependent (binary) variable and one or more nominal, ordinal, interval or ratio-level independent variable(s).
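As a small illustration of simple linear regression, the sketch below fits a straight line by ordinary least squares using only the standard library. The function name `fit_line` and the hours/score data are made up for the example, and the code estimates the slope and intercept only; it does not compute the p-value that a full regression test would report.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Illustrative data: predictor = hours studied, outcome = test score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(hours, score)
print(f"score is approximately {intercept:.1f} + {slope:.1f} * hours")
```

The slope is the estimated change in the outcome for a one-unit change in the predictor, which is the quantity the regression test asks about.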

Comparison tests:

These tests look for differences between the means of groups (comparison of means).

  • T-tests are used when comparing the means of precisely two groups (e.g. the average heights of men and women).

  • Independent t-test: tests the difference in the same variable between two different populations (e.g. comparing dogs to cats).

  • ANOVA and MANOVA tests are used to compare the means of more than two groups (e.g. the average weights of children, teenagers, and adults).
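To make the comparison of means concrete, here is a hedged sketch of the two test statistics computed by hand: the pooled-variance (Student) t statistic for two independent groups and the one-way ANOVA F statistic for several groups. The function names and the tiny data sets are illustrative assumptions, and the code stops at the statistic because turning it into a p-value needs the t or F distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent groups (equal variances)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def anova_f_statistic(*groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Illustrative values only.
children = [1.0, 2.0, 3.0]
teenagers = [2.0, 3.0, 4.0]
adults = [3.0, 4.0, 5.0]

t, df = pooled_t_statistic(children, teenagers)
f = anova_f_statistic(children, teenagers, adults)
print(f"t = {t:.4f} (df = {df}), F = {f:.4f}")
```

With exactly two groups the F statistic equals the square of the t statistic, which is one way to see that ANOVA generalises the t-test to more groups.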

Correlation tests:

These tests look for an association between variables, checking whether two variables are related.

  • Pearson Correlation: tests for the strength of the association between two continuous variables.

  • Spearman Correlation: tests for the strength of the association between two ordinal variables (it does not rely on the assumption of normally distributed data).

  • Chi-Square Test: tests whether two categorical variables are associated (like Spearman, it does not assume normally distributed data).
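The three statistics behind these tests are short enough to write out. The sketch below is illustrative only: the function names are hypothetical, the rank helper assumes there are no tied values, and the chi-square function returns the test statistic without a p-value.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two numeric sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic but not linear in x
print(pearson_r(x, y), spearman_rho(x, y))
print(chi_square_statistic([[10, 20], [20, 10]]))
```

Because `y` grows monotonically but not linearly with `x`, the Spearman coefficient is 1 while the Pearson coefficient is slightly lower, which shows the practical difference between the two.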

Figure: PARAMETRIC TESTS SUMMARY (Image by author)

5) FLOWCHART: CHOOSING A PARAMETRIC TEST

This flowchart will help you choose among the parametric tests described above. For non-parametric alternatives, check the following section.

Figure: PARAMETRIC TEST Flowchart (Image by author)

6) DEALING WITH NON-NORMAL DISTRIBUTIONS

Although the normal distribution takes centre stage in statistics, many processes follow non-normal distributions. Many datasets naturally fit a non-normal model:

- The number of accidents tends to fit a “Poisson distribution”.

- The lifetimes of products usually fit a “Weibull distribution”.

Examples of Non-Normal Distributions

  1. Beta Distribution.
  2. Exponential Distribution.
  3. Gamma Distribution.
  4. Inverse Gamma Distribution.
  5. Log-Normal Distribution.
  6. Logistic Distribution.
  7. Maxwell-Boltzmann Distribution.
  8. Poisson Distribution.
  9. Skewed Distribution.
  10. Symmetric Distribution.
  11. Uniform Distribution.
  12. Unimodal Distribution.
  13. Weibull Distribution.

Well then, how do we deal with non-normal distributions?

When your data is supposed to fit a normal distribution but doesn’t, there are a few ways to handle it:

  • You may still be able to run parametric tests if your sample size is large enough (usually over 20 items), interpreting the results with that caveat in mind.
  • You may choose to transform the data with a statistical technique (for example a log transformation) so that it fits a normal distribution more closely.
  • If the sample is small or skewed, or if it follows another distribution type, you might run a non-parametric test.
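The transformation option can be illustrated with a log transform, one common choice for right-skewed positive data. In this sketch the `skewness` helper is hypothetical and uses the simple moment-based definition, and the data are made-up values chosen so that the effect is easy to see.

```python
from math import log
from statistics import mean


def skewness(sample):
    """Moment-based sample skewness: 0 for perfectly symmetric data."""
    centre = mean(sample)
    n = len(sample)
    m2 = sum((x - centre) ** 2 for x in sample) / n
    m3 = sum((x - centre) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5


# Illustrative right-skewed data: each value doubles the previous one.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [log(x) for x in raw]

print(f"skewness before: {skewness(raw):.3f}")
print(f"skewness after log transform: {skewness(logged):.3f}")
```

The raw values have a long right tail, while their logarithms are evenly spaced and therefore symmetric. Real data will rarely transform this cleanly, so it is worth re-checking the assumptions after any transformation.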

Non-Parametric Tests

Non-parametric tests (figure below) don’t make as many assumptions about the data and are useful when one or more of the three statistical assumptions are violated.

Note that the inferences non-parametric tests make aren’t as strong as those from parametric tests: they generally have less statistical power when the parametric assumptions do hold.

Figure: NON-PARAMETRIC TESTS (Image by author)
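As one concrete example, a widely used non-parametric counterpart to the independent t-test is the Mann-Whitney U test, which compares two groups using only the ordering of the values. The sketch below computes just the U statistics with a hypothetical helper and made-up data; it is an illustration of the rank-based idea, not a reproduction of the figure, and it does not compute a p-value.

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistics for two independent samples.

    u_a counts the pairs (x from a, y from b) where x is larger,
    with tied pairs counting one half. u_a + u_b == len(a) * len(b).
    """
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Illustrative values only.
group_one = [1.0, 3.0, 5.0]
group_two = [2.0, 4.0, 6.0]

u_one, u_two = mann_whitney_u(group_one, group_two)
print(u_one, u_two)
```

Because only the ordering matters, an extreme outlier changes the U statistic no more than any other large value would, which is why rank-based tests cope better with skewed data. In practice `scipy.stats.mannwhitneyu` provides the full test, including the p-value.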

I hope you find this post informative and useful. Please let me know if you have any feedback. Thanks a lot for reading!

Translated from: https://towardsdatascience.com/statistical-testing-understanding-how-to-select-the-best-test-for-your-data-52141c305168
