Practical Statistics for Data Scientists: Statistics Every Data Scientist Must Know

Beginners often skip foundational statistics, yet these concepts are essential for understanding models and techniques properly. They form the baseline knowledge for much of data science, machine learning, and artificial intelligence.

Here is the list of concepts covered in this article.

  1. Measures of central tendency
  2. Measures of spread
  3. Population & sample
  4. Central limit theorem
  5. Sampling & sampling techniques
  6. Selection bias
  7. Correlation & various correlation coefficients

Let’s dive in!

1. Measures of central tendency

A measure of central tendency is a single value that describes a data set by identifying the central position within it. The three most common measures of center are:

  • Mean is the average of all the values in the data. For n data values x_1, ..., x_n:

    $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

  • Median is the middle value in the sorted (ordered) data. The median is often a more robust measure of center than the mean because it is far less affected by outliers.

  • Mode is the most frequent value in the data.
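
The three measures can be compared on a small data set. The sketch below uses Python's standard `statistics` module; the numbers are made up for illustration, and the single outlier (1000) is an assumption chosen to show how differently the mean and the median react.

```python
import statistics

# Hypothetical data: five typical values, then the same values plus one outlier.
values = [10, 20, 20, 30, 40]
with_outlier = values + [1000]

mean_clean = statistics.mean(values)      # sum of values divided by their count
median_clean = statistics.median(values)  # middle value of the sorted data
mode_clean = statistics.mode(values)      # most frequent value

mean_outlier = statistics.mean(with_outlier)      # pulled strongly toward 1000
median_outlier = statistics.median(with_outlier)  # average of the two middle values

print("without outlier:", mean_clean, median_clean, mode_clean)
print("with outlier:   ", round(mean_outlier, 2), median_outlier)
```

The mean jumps from 24 to roughly 186.67 once the outlier is added, while the median only moves from 20 to 25.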

2. Measures of spread

Measures of spread describe how similar or varied the observed values of a particular variable (data item) are. They include the range, quartiles, the interquartile range, variance, and standard deviation.

  • Range is the difference between the largest value and the smallest value in the data.

  • Quartiles divide an ordered data set into four equal parts; the term refers to the values at the boundaries between the quarters. The lower quartile (Q1) is the value that separates the lowest 25% of values from the highest 75%. It is also called the 25th percentile. The second quartile (Q2) is the middle value of the data set. It is also called the 50th percentile, or the median. The upper quartile (Q3) is the value that separates the lowest 75% of values from the highest 25%. It is also called the 75th percentile.

Figure: Distribution of quartiles (image by author)

The interquartile range (IQR) is the difference between the upper (Q3) and lower (Q1) quartiles, IQR = Q3 − Q1, and describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range because it is much less affected by outliers.

For N data points X_1, ..., X_N with mean μ, the variance is given by:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu\right)^2$$

The standard deviation is the square root of the variance. The standard deviation of a population is denoted by σ.

In data sets with a small spread, all values are close to the mean, resulting in a small variance and standard deviation. Where a data set is more dispersed, values lie further from the mean, leading to a larger variance and standard deviation.
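
As a rough illustration of these measures, the sketch below computes each of them with Python's standard `statistics` module on a made-up data set. Quartile conventions differ between tools; the `method="inclusive"` setting used here is one common choice and is an assumption of this example, so other software may report slightly different Q1 and Q3 values.

```python
import statistics

# Hypothetical, already sorted data set.
data = [2, 4, 4, 5, 7, 9, 11, 14]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles: the three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the values.
iqr = q3 - q1

# Population variance (divides by N) and population standard deviation.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print("range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("variance:", variance, "standard deviation:", round(std_dev, 3))
```

Note that Q2 coincides with the median, and the standard deviation squared gives back the variance.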

3. Population and sample

  • The population is the entire set of possible data values.

  • A sample is a part, or subset, of a population. A sample is typically much smaller than the population from which it is drawn.

Figure: Simple sketch illustrating population and sample (image by author)

For example, the set of all people in a country is the ‘population’, and any subset of those people is a ‘sample’, which is usually much smaller than the population.
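
The distinction can be made concrete with a small simulation. In the sketch below the "population" is a synthetic list of heights (the mean of 170 and standard deviation of 10 are arbitrary assumptions), and the sample is a random subset of it. The standard library distinguishes the two cases: `pvariance` divides by N for a full population, while `variance` divides by n − 1 when estimating from a sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 synthetic heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a subset of the population, drawn here without replacement.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

population_variance = statistics.pvariance(population)  # divides by N
sample_variance = statistics.variance(sample)           # divides by n - 1

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
print("population variance:", round(population_variance, 2))
print("sample variance:    ", round(sample_variance, 2))
```

The sample statistics are estimates of the population values: close, but not identical.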

4. Central Limit Theorem

The Central Limit Theorem (CLT) is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.

The CLT states that when samples of a sufficiently large size are drawn repeatedly from a population, the mean of each sample, known as the sample mean, will be approximately normally distributed. This holds regardless of the shape of the population distribution, provided the population has a finite variance.

Other insights related to the CLT are:

  • The sample mean converges in probability and almost surely to the population mean as the sample size grows (this is the law of large numbers).

  • The variance of the population equals the variance of the sample mean multiplied by the number of elements in each sample.
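
A quick simulation can make these statements tangible. The sketch below is illustrative only: it assumes an exponential population (mean 1, variance 1), which is strongly skewed and clearly not normal, and the sample size of 50 and the 5,000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def draw_sample_mean(n):
    """Mean of one sample of size n from an exponential population (mean 1, variance 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50  # elements per sample
sample_means = [draw_sample_mean(n) for _ in range(5_000)]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.pvariance(sample_means)
std_of_means = variance_of_means ** 0.5

# For a normal distribution, about 68% of values lie within one standard deviation.
within_one_std = sum(
    abs(m - mean_of_means) <= std_of_means for m in sample_means
) / len(sample_means)

print("mean of sample means:", round(mean_of_means, 3))             # close to 1
print("n * variance of sample means:", round(n * variance_of_means, 3))  # close to 1
print("share within one std:", round(within_one_std, 3))            # close to 0.68
```

The sample means cluster around the population mean, their variance is roughly the population variance divided by n, and their distribution looks approximately normal even though the population is skewed.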

5. Sampling and sampling techniques

Sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to identify patterns and trends in the larger data set under observation.

There are many different methods for drawing samples from data; the ideal one depends on the data set and the problem at hand. Commonly used sampling techniques are:

  • Simple random sampling: Each value in the sample is chosen entirely by chance, and each value of the population has an equal chance, or probability, of being selected.

  • Stratified sampling: The population is first divided into subgroups (or strata) whose members share a similar characteristic, and values are then sampled from each stratum. It is used when we might reasonably expect the measurement of interest to vary between the subgroups and we want to ensure representation from all of them.

  • Cluster sampling: Subgroups of the population, rather than individual values, are used as the sampling unit. The population is divided into subgroups, known as clusters, and a random selection of whole clusters is included in the study.

  • Systematic sampling: Individual values are selected at regular intervals from the sampling frame. The intervals are chosen to ensure an adequate sample size. If you need a sample of size n from a population of size x, select every (x/n)th individual.
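
The four techniques above can be sketched in a few lines of Python. Everything here is hypothetical: the population of 100 records, the `region` field used as stratum and cluster label, and the helper function names are all invented for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

REGIONS = ["north", "south", "east", "west"]

# Hypothetical population: 100 records, 25 per region.
population = [{"id": i, "region": REGIONS[i % 4]} for i in range(100)]

def group_by(items, key):
    """Group records into a dict keyed by the given field."""
    groups = {}
    for item in items:
        groups.setdefault(item[key], []).append(item)
    return groups

def simple_random_sample(items, n):
    """Every record has the same chance of being selected."""
    return random.sample(items, n)

def stratified_sample(items, key, n_per_stratum):
    """Draw the same number of records at random from every stratum."""
    sample = []
    for members in group_by(items, key).values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

def cluster_sample(items, key, n_clusters):
    """Pick whole clusters at random and keep all of their records."""
    clusters = group_by(items, key)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

def systematic_sample(items, n):
    """Take every (len(items) // n)-th record after a random start."""
    step = len(items) // n
    start = random.randrange(step)
    return items[start::step][:n]

simple = simple_random_sample(population, 10)
stratified = stratified_sample(population, "region", 5)
clustered = cluster_sample(population, "region", 2)
systematic = systematic_sample(population, 10)

print(len(simple), len(stratified), len(clustered), len(systematic))
```

Stratified sampling guarantees every region appears in the sample, whereas cluster sampling keeps only the regions that were drawn.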

6. Selection bias

Selection bias (also called sampling bias) is a systematic error caused by a non-random sample of a population: some values of the population are less likely to be included than others, resulting in a biased sample in which values are not equally balanced or objectively represented.

This means that proper randomization is not achieved, so the sample obtained is not representative of the population intended to be analyzed.

In the general case, selection bias cannot be overcome with statistical analysis of existing data alone. The degree of selection bias can be assessed by examining correlations.
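
The effect is easy to reproduce in a toy simulation. The sketch below assumes a made-up survey in which satisfaction scores run from 1 to 10 and the chance of responding is proportional to the score; both the scenario and the response rule are assumptions chosen only to show how a non-random sample distorts an estimate.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: satisfaction scores from 1 to 10.
population = [random.randint(1, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Properly randomized sample: every person is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Biased sample: a person with score s responds with probability s / 10,
# so satisfied people are over-represented among the first 500 respondents.
respondents = [score for score in population if random.random() < score / 10]
biased_sample = respondents[:500]

random_mean = statistics.fmean(random_sample)
biased_mean = statistics.fmean(biased_sample)

print("population mean:   ", round(true_mean, 2))
print("random sample mean:", round(random_mean, 2))
print("biased sample mean:", round(biased_mean, 2))
```

The random sample lands near the population mean, while the biased sample overestimates it noticeably, and drawing more biased responses would not fix that.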

7. Correlation

Correlation is, simply put, a metric that measures the extent to which variables (or features, samples, or any groups) are associated with one another. In almost any data analysis, data scientists compare two variables and examine how they relate to one another.

The following are the most widely used correlation techniques:

  1. Covariance
  2. Pearson Correlation Coefficient
  3. Spearman Rank Correlation Coefficient

1. Covariance

For two samples X and Y, let E(X) and E(Y) be the mean values of X and Y respectively, and let n be the total number of data points. The covariance of X and Y is given by:

$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - E(X)\bigr)\bigl(Y_i - E(Y)\bigr)$$

(Some definitions divide by n − 1 instead of n when the covariance is estimated from a sample.)

The sign of the covariance indicates the direction of the linear relationship between the variables: positive when they tend to increase together, negative when one tends to decrease as the other increases.
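
A minimal sketch of this calculation, assuming the divide-by-n form of the covariance. The `covariance` helper and the study-hours data are hypothetical and exist only for this example.

```python
def covariance(xs, ys):
    """Covariance of two equally long samples, dividing by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data: hours studied and two made-up sets of test scores.
hours = [1, 2, 3, 4, 5]
scores_rising = [52, 55, 61, 64, 68]   # tends to increase with hours
scores_falling = [68, 64, 61, 55, 52]  # tends to decrease with hours

cov_rising = covariance(hours, scores_rising)
cov_falling = covariance(hours, scores_falling)

print("rising: ", cov_rising)   # positive sign
print("falling:", cov_falling)  # negative sign
```

Only the sign is easy to interpret here. The magnitude depends on the units of both variables, which is one reason a normalized coefficient is often preferred.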

2. Pearson Correlation Coefficient

The Pearson Correlation Coefficient (PCC) is a statistic that also measures the linear correlation between two features, but unlike covariance it is normalized. For two samples X and Y, let σX and σY be the standard deviations of X and Y respectively. The PCC of X and Y is given by:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

It has a value between -1 and +1.

Figure: Sample plots of variables with PCC between -1 and 0, and between 0 and +1, respectively
Figure: Sample plots of variables with PCC of -1, 0, and +1, respectively
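
The coefficient can be computed directly from its definition. The sketch below is a plain-Python illustration; the `pearson` helper is hypothetical and assumes both inputs have the same length and non-zero spread. Because numerator and denominator use the same divisor, the result does not depend on whether n or n − 1 is used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

x = [1, 2, 3, 4, 5]
perfect_positive = pearson(x, [2 * v + 1 for v in x])   # exact increasing line
perfect_negative = pearson(x, [-3 * v for v in x])      # exact decreasing line
noisy = pearson(x, [52, 55, 61, 64, 68])                # nearly linear, made-up data
rescaled = pearson(x, [520, 550, 610, 640, 680])        # same data in other units

print(round(perfect_positive, 6), round(perfect_negative, 6))
print(round(noisy, 4), round(rescaled, 4))
```

Rescaling one variable changes the covariance but leaves the Pearson coefficient unchanged, which makes values comparable across data sets.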

3. Spearman Rank Correlation Coefficient

The Spearman Rank Correlation Coefficient (SRCC) assesses how well the relationship between two samples can be described using a monotonic function (whether linear or not), whereas the PCC can assess only linear relationships.

The Spearman rank correlation coefficient between two samples is equal to the Pearson correlation coefficient between the rank values of those two samples. Rank is the relative position label of the observations within the variable.

Intuitively, the Spearman rank correlation coefficient between two variables will be high when observations have a similar rank in both variables, and low when observations have a dissimilar rank in the two variables.

The Spearman rank correlation coefficient lies between +1 and -1, where

  • 1 is a perfect positive correlation

  • 0 is no correlation

  • −1 is a perfect negative correlation
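
Since the Spearman coefficient is the Pearson coefficient of the ranks, it can be sketched by ranking both samples first. The helpers below are hypothetical; giving tied values the average of their positions is a common convention and is assumed here. The cubic data set is made up to show a relationship that is monotonic but not linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long samples with non-zero spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y_cubic = [v ** 3 for v in x]  # monotonic but not linear

print("ranks with a tie:", ranks([10, 20, 20, 30]))
print("Pearson: ", round(pearson(x, y_cubic), 3))
print("Spearman:", round(spearman(x, y_cubic), 3))
```

For the cubic data the Pearson coefficient falls short of 1 because the relationship is not a straight line, while the Spearman coefficient reaches 1 because the ordering of the observations is identical in both variables.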

To learn more about correlation techniques and when to use which one, see the article linked in the original post.

Thanks for reading. I plan to write more beginner-friendly posts in the future. Follow me on Medium to hear about them. I welcome feedback and can be reached on Twitter (ramya_vidiyala) and LinkedIn (RamyaVidiyala). Happy learning!

Translated from: https://towardsdatascience.com/data-scientist-must-know-statistics-a161fa7c1bca
