Normal Distribution, Gaussian Distribution, Poisson Distribution

For a detailed implementation in Python, check my GitHub repository.

Introduction

Some machine learning models, such as linear and logistic regression, tend to behave better when the variables are roughly Gaussian (normally distributed); linear regression in particular assumes normally distributed errors. One of the first steps in a statistical analysis of your data is therefore to check its distribution.

The familiar bell curve shows a normal distribution.

If your data has a Gaussian distribution, the methods that rely on that assumption are powerful and well understood.

Many data scientists report more accurate results when they transform the predictor variables.

To transform data, you apply a mathematical operation to each observation, then use the transformed data in your model.

Why do we need a normal distribution?

If a method assumes a Gaussian distribution and your data was drawn from some other distribution, the findings may be misleading or plain wrong.

Your data may not look Gaussian, or may fail a normality test, and still be transformable so that it fits a Gaussian distribution.
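As a rough illustration of "failing a normality test", the sketch below runs SciPy's D'Agostino-Pearson test (`stats.normaltest`) on a right-skewed sample and on its logarithm. The sample, its parameters, and the seed are assumptions chosen for the example, not the article's dataset.

```python
import numpy as np
from scipy import stats

# Assumed example data: a right-skewed (log-normal) sample
sample = stats.lognorm.rvs(s=0.5, scale=1000, size=1000, random_state=0)

# Null hypothesis of the test: the data come from a normal distribution
_, p_raw = stats.normaltest(sample)
_, p_log = stats.normaltest(np.log(sample))

print(f"raw data p-value: {p_raw:.3g}")   # a very small value rejects normality
print(f"log data p-value: {p_log:.3g}")
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it.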

Types of transformation

Log Transformation
Reciprocal Transformation
Square-Root Transformation
Cube Root Transformation
Exponential Transformation
Box-Cox Transformation
Yeo-Johnson Transformation

Visualizing and checking the distribution

We can plot the data to check whether it is Gaussian.

We will look at two common methods for visually inspecting a dataset to check whether it was drawn from a Gaussian distribution.

Histogram
Quantile-Quantile plot (Q-Q plot)

1. Histogram

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. It lets you inspect the data for its underlying distribution (e.g., normal distribution), outliers, skewness, and so on.
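The numbers behind a histogram can also be inspected directly. The sketch below is a minimal example on an assumed log-normal sample (not the article's dataset): it computes the bin counts with `np.histogram` and the sample skewness with `stats.skew`.

```python
import numpy as np
from scipy import stats

# Assumed example data: a right-skewed (log-normal) sample
sample = stats.lognorm.rvs(s=0.5, scale=1000, size=1000, random_state=0)

# Bin counts and bin edges, the same numbers a histogram plot would draw
counts, edges = np.histogram(sample, bins=10)

# Sample skewness: near 0 for symmetric data, positive for a long right tail
skewness = stats.skew(sample)

print(counts)
print(f"skewness: {skewness:.2f}")
```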

2. Q-Q Plot

The Q-Q plot, or quantile-quantile plot, plots two sets of quantiles against each other. A quantile is the value below which a given fraction of the data falls.

For example, the median is the quantile where 50% of the data fall below that point and 50% lie above it.
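A small sketch of the same idea with NumPy, using a made-up array of values: `np.quantile` returns the value below which the requested fraction of the data falls, so the 0.5 quantile is the median.

```python
import numpy as np

# Made-up values, for illustration only
values = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])

median = np.quantile(values, 0.5)   # half of the data lie at or below this value
q25 = np.quantile(values, 0.25)     # first quartile
q75 = np.quantile(values, 0.75)     # third quartile

print(q25, median, q75)
```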

The purpose of a Q-Q plot is to find out whether two sets of data come from the same distribution. A 45-degree reference line is drawn on the Q-Q plot; if the two data sets come from a common distribution, the points fall approximately along that line.
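The Q-Q comparison can also be summarized numerically. In this sketch (both samples and the seed are assumptions for the example), `stats.probplot` is called without a plot target, so it only returns the Q-Q points and a least-squares line fit; the correlation coefficient `r` of that fit is close to 1 when the points lie near a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10, scale=2, size=1000)
skewed_sample = rng.lognormal(mean=0, sigma=1, size=1000)

# probplot returns (theoretical quantiles, ordered data), (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for normal sample: {r_normal:.4f}")
print(f"r for skewed sample: {r_skewed:.4f}")
```

Note that the line SciPy fits here is a least-squares line through the points rather than a fixed 45-degree line.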

Data

We will use randomly generated data in this article to demonstrate all the techniques.

import numpy as np
import pandas as pd
from scipy import stats

# generate a univariate data sample
np.random.seed(142)
datas = sorted(stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000))
# convert to dataframe
data = pd.DataFrame(datas, columns=['values'])

Method to generate a graph

Each time we transform the data, we will call this method to generate a graph.

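import matplotlib.pyplot as plt

def gen_graph(value):
    # left panel: histogram, right panel: Q-Q plot against the normal distribution
    plt.figure(figsize=(15,5))
    plt.subplot(1, 2, 1)
    plt.hist(value, color='g', alpha=0.5)
    plt.subplot(1, 2, 2)
    stats.probplot(value, dist="norm", plot=plt)
    plt.show()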

Visualizing the data before transformation

gen_graph(data['values'])

1. Log Transformation

The log (logarithmic) transformation is a data transformation method in which we replace each value x with log(x). When our original continuous data do not follow the bell curve, we can log-transform them to bring them closer to “normal”. The logarithm is only defined for positive values.

data_log = np.log(data['values'])
gen_graph(data_log)
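To put a number on the effect, the following sketch compares the sample skewness before and after the log transform. It regenerates a log-normal sample with the same parameters as the article's data so that it runs on its own; the `random_state` seeding is an assumption and may not reproduce the exact values drawn above.

```python
import numpy as np
from scipy import stats

# Stand-alone sample with the article's log-normal parameters (seeding is assumed)
sample = stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000, random_state=142)

skew_before = stats.skew(sample)
skew_after = stats.skew(np.log(sample))

print(f"skewness before log: {skew_before:.3f}")
print(f"skewness after log:  {skew_after:.3f}")
```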

2. Reciprocal Transformation

The reciprocal transformation maps x to 1/x. It has a dramatic effect on the shape of the distribution and reverses the order of values that have the same sign. It can only be used for non-zero values.

data_rec = np.reciprocal(data['values'])
# or
data_rec_2 = 1/data['values']
gen_graph(data_rec)
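The order reversal is easy to see on a few made-up positive numbers: after taking reciprocals, the largest value becomes the smallest.

```python
import numpy as np

# Made-up positive values, already in increasing order
x = np.array([0.5, 1.0, 2.0, 4.0])
x_rec = 1 / x

# argsort gives the indices that would sort each array in increasing order
print(np.argsort(x))
print(np.argsort(x_rec))
```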

3. Square-Root Transformation

This transformation takes the square root of your variable, i.e. x → x^(1/2) = sqrt(x). It has a moderate effect on the distribution and is often used for data with non-constant variance. However, it is weaker than the logarithmic or cube root transforms.

data_square = np.sqrt(data['values'])
# or
data_square_2 = data['values']**(1/2)
gen_graph(data_square)
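As a hedged check of the relative strength of these transforms, the sketch below compares the sample skewness of one right-skewed sample after the square root, the cube root, and the logarithm. The sample reuses the article's log-normal parameters, with an assumed `random_state` seed so that the snippet is self-contained.

```python
import numpy as np
from scipy import stats

# Stand-alone sample with the article's log-normal parameters (seeding is assumed)
sample = stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000, random_state=142)

skews = {
    "raw": stats.skew(sample),
    "sqrt": stats.skew(np.sqrt(sample)),
    "cbrt": stats.skew(np.cbrt(sample)),
    "log": stats.skew(np.log(sample)),
}

for name, value in skews.items():
    print(f"{name:>4}: {value:.3f}")
```

For positive, right-skewed data the skewness shrinks step by step from the raw data to the square root, the cube root, and finally the logarithm.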

4. Cube Root Transformation

The cube root transformation converts x to x^(1/3).

It is useful for reducing right skewness. This is a fairly strong transformation with a substantial effect on the shape of the distribution, but it is weaker than the logarithm. Unlike the logarithm, it can also be applied to zero and negative values.

data_cube = np.cbrt(data['values'])
gen_graph(data_cube)
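A minimal sketch of the point about zero and negative values, on a made-up array: `np.cbrt` is defined for every real number, while `np.sqrt` returns NaN for negative input.

```python
import numpy as np

# Made-up values including negatives and zero
x = np.array([-27.0, -8.0, 0.0, 8.0, 27.0])

cube_roots = np.cbrt(x)           # defined for all real numbers

with np.errstate(invalid="ignore"):
    square_roots = np.sqrt(x)     # NaN for the negative entries

print(cube_roots)
print(square_roots)
```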

5. Exponential Transformation

An exponential transformation provides a useful alternative to Box and Cox's one-parameter power transformation, and has the advantage of allowing negative data values.

In particular, this transformation has been found to be quite effective at turning a skewed unimodal distribution into a nearly symmetric, normal-like distribution.

# note: this line applies a fifth-root power transform, x^(1/5)
data_expo = data['values'] ** (1/5)
gen_graph(data_expo)

6. Box-Cox Transformation

At the core of the Box-Cox transformation is an exponent, lambda (λ), which is typically varied from -5 to 5.

All values of λ are considered, and the optimal value for your data is selected; the “optimal value” is the one that results in the best approximation of a normal distribution curve. The data must be strictly positive. The transformation of Y has the form: Y(λ) = (Y^λ - 1) / λ for λ ≠ 0, and Y(λ) = log(Y) for λ = 0.

# a holds the fitted lambda
data_boxcox, a = stats.boxcox(data['values'])
gen_graph(data_boxcox)
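The sketch below writes the standard Box-Cox formula out by hand and checks it against `stats.boxcox`. The helper `boxcox_manual` is a hypothetical name introduced only for this illustration, and the sample is regenerated with the article's parameters under an assumed `random_state` seed.

```python
import numpy as np
from scipy import stats

# Stand-alone sample with the article's log-normal parameters (seeding is assumed)
sample = stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000, random_state=142)

def boxcox_manual(y, lam):
    """Hypothetical helper: the Box-Cox formula written out by hand."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1) / lam

# With lmbda given, stats.boxcox applies the formula for that fixed lambda
fixed = stats.boxcox(sample, lmbda=0.5)

# With lmbda omitted, it also returns the lambda that maximizes the log-likelihood
transformed, fitted_lambda = stats.boxcox(sample)

print(f"fitted lambda: {fitted_lambda:.3f}")
```

Because the sample is close to log-normal, the fitted λ should come out near 0, which corresponds to the log transform.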

7. Yeo-Johnson Transformation

This is a more recent transformation technique that is very similar to the Box-Cox transformation but does not require the values to be strictly positive. It can also make the distribution more symmetric.

# a holds the fitted lambda
data_yeo, a = stats.yeojohnson(data['values'])
gen_graph(data_yeo)
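To illustrate the difference in input requirements, this sketch shifts a right-skewed sample so that some values are negative, then tries both transforms. The data and seed are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Assumed example data: right-skewed values shifted so that some are negative
shifted = rng.lognormal(mean=0, sigma=0.5, size=1000) - 1.0

try:
    stats.boxcox(shifted)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False   # Box-Cox rejects non-positive input

# Yeo-Johnson accepts the same data and also returns the fitted lambda
transformed, fitted_lambda = stats.yeojohnson(shifted)

print(f"Box-Cox accepted the data: {boxcox_ok}")
print(f"Yeo-Johnson fitted lambda: {fitted_lambda:.3f}")
```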

Conclusion

In this article, we have seen different types of transformations that bring the distribution of data closer to normal. The Box-Cox and log transformations are among the techniques that most often give good results.

I hope you liked this article.


Thanks for reading. 😃

Translated from: https://medium.com/next-gen-machine-learning/normal-distribution-data-transformation-to-gaussian-distribution-405941324f53
