Practical Statistics for Data Scientists
Beginners often skip foundational statistics, yet these concepts are essential for understanding how different models and techniques work. They form the baseline knowledge for much of data science, machine learning, and artificial intelligence.
Here is the list of concepts covered in this article.
- Measures of central tendency
- Measures of spread
- Population & sample
- Central limit theorem
- Sampling & sampling techniques
- Selection bias
- Correlation & various correlation coefficients
Let’s dive in!
1 - Measures of central tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. The three most common measures of center are:
- Mean is the arithmetic average of all the values in the data.
- Median is the middle value in the sorted (ordered) data. For skewed data it is often a better measure of center than the mean, because it is far less affected by outliers.
- Mode is the most frequent value in the data.
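As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module on a made-up list of salaries (the numbers are purely illustrative, not from the article). The last value is a deliberate outlier, to show how it pulls the mean but not the median.

```python
import statistics

# Hypothetical salaries (in thousands); 400 is a deliberate outlier.
salaries = [30, 32, 32, 35, 38, 40, 400]

mean_value = statistics.mean(salaries)      # sum of values / number of values
median_value = statistics.median(salaries)  # middle value of the sorted data
mode_value = statistics.mode(salaries)      # most frequent value

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
```

The single outlier drags the mean far above the typical salary, while the median stays in the middle of the bulk of the data.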
2 - Measures of spread
Measures of spread describe how similar or varied the observed values are for a particular variable (data item). They include the range, quartiles, the interquartile range, variance, and standard deviation.
- Range is the difference between the largest value and the smallest value in the data.
- Quartiles divide an ordered dataset into four equal parts; the term refers to the values at the boundaries between those parts. The lower quartile (Q1) is the value between the lowest 25% of values and the highest 75% of values. It is also called the 25th percentile. The second quartile (Q2) is the middle value of the data set. It is also called the 50th percentile, or the median. The upper quartile (Q3) is the value between the lowest 75% and the highest 25% of values. It is also called the 75th percentile.
The interquartile range (IQR) is the difference between the upper (Q3) and lower (Q1) quartiles, and describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range because it is much less affected by outliers.
- The variance of a set of N data points, where each data point is denoted by X_i and the mean is μ, is given by:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$$
The standard deviation is the square root of the variance. The standard deviation for a population is represented by σ.
In datasets with a small spread, all values are very close to the mean, resulting in a small variance and standard deviation. Where a dataset is more dispersed, values are spread further away from the mean, leading to a larger variance and standard deviation.
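The measures of spread described above can be sketched with the standard `statistics` module. The data set is invented for illustration. Quartiles have several competing conventions, so the `method="inclusive"` choice below is an assumption, and other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Hypothetical observations, already sorted for readability.
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# quantiles(n=4) returns the three cut points Q1, Q2, Q3.
# "inclusive" treats the data as the full population (an assumed convention).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Population variance and standard deviation (divide by N, not N - 1).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(f"range={data_range}, Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.4f}, std_dev={std_dev:.4f}")
```

`pvariance` and `pstdev` match the population formula with N in the denominator; `statistics.variance` and `statistics.stdev` would divide by N - 1 instead, which is the usual choice when the data are a sample.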
3 - Population and Sample
The population is the entire set of possible data values.
A sample is a part, or subset, of a population. Its size is at most the size of the population from which it is taken, and in practice it is usually much smaller.
For example, the set of all people in a country is a 'population', and any subset of those people is a 'sample'.
4 - Central Limit Theorem
The Central Limit Theorem (CLT) is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.
The CLT states that when independent random samples of a sufficiently large size are drawn from a population with finite variance, the distribution of their means (the sample means) is approximately normal. This holds regardless of the shape of the population distribution.
Other insights related to the CLT are:
- The sample mean converges in probability and almost surely to the population mean as the sample size grows (this is the law of large numbers).
- The variance of the population equals the variance of the sample means multiplied by the number of elements in each sample. Equivalently, the variance of the sample mean is the population variance divided by the sample size.
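A small simulation can make these statements concrete. The sketch below assumes a strongly skewed population (exponential with mean 2, so variance 4), and draws many samples of size 50; the distribution, sizes, and seed are illustrative choices, not part of the original article.

```python
import random
import statistics

random.seed(0)

population_mean = 2.0   # exponential distribution: mean 2, variance 4
population_var = population_mean ** 2
sample_size = 50        # elements per sample
num_samples = 5000      # number of repeated samples

# Draw many independent samples and record the mean of each one.
sample_means = [
    statistics.fmean(
        random.expovariate(1 / population_mean) for _ in range(sample_size)
    )
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

# For a normal distribution, about 68% of values lie within one
# standard deviation (here: one standard error) of the mean.
standard_error = (population_var / sample_size) ** 0.5
within_one_se = sum(
    abs(m - population_mean) <= standard_error for m in sample_means
) / num_samples

print("mean of sample means:", round(mean_of_means, 3))
print("variance of sample means * n:", round(var_of_means * sample_size, 3))
print("share within one standard error:", round(within_one_se, 3))
```

Even though the exponential distribution is far from normal, the sample means cluster symmetrically around the population mean, and their variance times the sample size comes out close to the population variance.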
5 - Sampling & Sampling techniques
Sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to identify patterns and trends in the larger data set under observation.
There are many different methods for drawing samples from data; the ideal one depends on the data set and the problem at hand. Commonly used sampling techniques are given below.
Simple random sampling: In this case, each value in the sample is chosen entirely by chance, and each value of the population has an equal chance, or probability, of being selected.
Stratified sampling: In this method, the population is first divided into subgroups (or strata) that share a similar characteristic, and a random sample is then drawn from each subgroup. It is used when we might reasonably expect the measurement of interest to vary between the different subgroups, and we want to ensure representation from all the subgroups.
Cluster sampling: In a clustered sample, subgroups of the population are used as the sampling unit, rather than individual values. The population is divided into subgroups, known as clusters, and a random selection of whole clusters is included in the study.
Systematic sampling: Individual values are selected at regular intervals from the sampling frame. The intervals are chosen to ensure an adequate sample size. If you need a sample of size n from a population of size x, select every (x/n)th individual, ideally starting from a randomly chosen position.
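The four techniques above can be sketched with Python's `random` module. The population here is hypothetical: 100 records with an `id` and a made-up `region` attribute used as the stratum, and clusters are simply blocks of 10 consecutive ids. These are assumptions for illustration only.

```python
import random

random.seed(42)

# Hypothetical population: 100 people, each tagged with a region.
regions = ["north", "south", "east", "west"]
population = [{"id": i, "region": regions[i % 4]} for i in range(100)]
n = 20  # desired sample size

# Simple random sampling: every individual is equally likely to be chosen.
simple = random.sample(population, n)

# Stratified sampling: group by region, then sample equally from each stratum.
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
per_stratum = n // len(strata)
stratified = [
    p for group in strata.values() for p in random.sample(group, per_stratum)
]

# Cluster sampling: split into clusters of 10, pick whole clusters at random.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [p for cluster in chosen_clusters for p in cluster]

# Systematic sampling: random start, then every k-th individual (k = x / n).
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k][:n]

print(len(simple), len(stratified), len(cluster_sample), len(systematic))
```

Stratified sampling guarantees every region appears in the sample, whereas cluster sampling keeps only the individuals in the selected clusters, which is cheaper to collect but can be less representative.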
6 - Selection Bias
Selection bias (also called sampling bias) is a systematic error caused by a non-random sample of a population: some values are less likely to be included than others, so the resulting sample does not represent all values in a balanced or objective way.
This means that proper randomization is not achieved, so the sample obtained is not representative of the population intended to be analyzed.
In the general case, selection bias cannot be overcome with statistical analysis of existing data alone. The degree of selection bias can be assessed by examining correlations.
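To illustrate the effect, here is a small simulated example. It assumes a hypothetical population of satisfaction scores and a survey in which only people scoring above 55 respond; both the scores and the response rule are invented for the sketch.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of satisfaction scores (mean 50, std dev 10).
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Properly randomized sample: every score is equally likely to be picked.
random_sample = random.sample(population, 1000)

# Biased sample: only individuals with a score above 55 can be selected.
biased_pool = [score for score in population if score > 55.0]
biased_sample = random.sample(biased_pool, 1000)

population_mean = statistics.fmean(population)
random_mean = statistics.fmean(random_sample)
biased_mean = statistics.fmean(biased_sample)

print(f"population mean: {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample tracks the population mean closely, while the biased sample overestimates it, and taking a larger biased sample would not fix the problem.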
7 - Correlation
Correlation is, simply put, a metric that measures the extent to which variables (or features, samples, or any groups) are associated with one another. In almost any data analysis, data scientists compare two variables and examine how they relate to one another.
The following are the most widely used correlation techniques:
- Covariance
- Pearson Correlation Coefficient
- Spearman Rank Correlation Coefficient
1. Covariance
For two samples, say X and Y, let E(X) and E(Y) be the mean values of X and Y respectively, and let n be the total number of data points. The covariance of X and Y is given by:
$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \bigl(X_i - E(X)\bigr)\bigl(Y_i - E(Y)\bigr)$$
The sign of the covariance indicates the direction of the linear relationship between the variables: positive when they tend to increase together, negative when one tends to decrease as the other increases.
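A direct translation of the covariance definition into Python might look like the sketch below. It divides by n, as in the definition above; many libraries default to dividing by n - 1 (the sample covariance) instead. The data values and the `covariance` helper name are made up for illustration.

```python
def covariance(xs, ys):
    """Covariance of two equal-length sequences, dividing by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of the deviations from each mean.
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
cov_x_neg_y = covariance(x, [-v for v in y])

print("Cov(X, Y):", cov_xy)        # positive: the values tend to rise together
print("Cov(X, -Y):", cov_x_neg_y)  # negative: one rises as the other falls
```

The magnitude of the covariance depends on the units of X and Y, which is why it is hard to interpret on its own and why normalized coefficients are usually preferred.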
2. Pearson Correlation Coefficient
The Pearson Correlation Coefficient (PCC) is a statistic that also measures the linear correlation between two features. For two samples X and Y, let σX and σY be the standard deviations of X and Y respectively. The PCC of X and Y is given by:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
It has a value between -1 and +1.
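As a sketch, the PCC can be computed by dividing the covariance by the product of the two standard deviations, using the same divide-by-n convention for both. The `pearson` helper and the data are hypothetical names and values chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    return cov / (std_x * std_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

pcc_xy = pearson(x, y)
pcc_perfect = pearson(x, [2 * v + 1 for v in x])    # exact linear relation
pcc_inverse = pearson(x, [-3 * v for v in x])       # exact inverse relation

print("PCC(X, Y):", round(pcc_xy, 4))
print("perfect positive:", round(pcc_perfect, 4))
print("perfect negative:", round(pcc_inverse, 4))
```

Unlike the covariance, the result is unitless, so coefficients from different pairs of variables can be compared directly.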
3. Spearman Rank Correlation Coefficient
The Spearman Rank Correlation Coefficient (SRCC) assesses how well the relationship between two samples can be described by a monotonic function (whether linear or not), whereas the PCC assesses only linear relationships.
The Spearman rank correlation coefficient between two samples is equal to the Pearson correlation coefficient between the rank values of those two samples. A rank is the relative position of an observation within its variable when the values are ordered.
Intuitively, the Spearman rank correlation coefficient between two variables will be high when observations have similar ranks in both variables, and low when their ranks differ, approaching -1 when the ranks are fully opposed.
The Spearman rank correlation coefficient lies between -1 and +1, where:
- +1 is a perfect positive correlation
- 0 is no correlation
- -1 is a perfect negative correlation
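The definition of the Spearman coefficient as "Pearson on the ranks" can be sketched as follows. The `ranks` helper is a simplified assumption that only works when there are no tied values (tied values would normally receive averaged ranks), and the cubic data set is invented to show a monotonic but non-linear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank values."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

srcc = spearman(x, y)
pcc = pearson(x, y)

print("Spearman:", round(srcc, 4))
print("Pearson:", round(pcc, 4))
```

Because the ordering of y matches the ordering of x exactly, the Spearman coefficient is 1 even though the Pearson coefficient is below 1 for the same data.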
Thanks for reading. I plan to write more beginner-friendly posts in the future. Follow me on Medium to be notified about them. I welcome feedback and can be reached on Twitter (ramya_vidiyala) and LinkedIn (RamyaVidiyala). Happy learning!
Translated from: https://towardsdatascience.com/data-scientist-must-know-statistics-a161fa7c1bca