Normal Distribution, Gaussian Distribution, Poisson Distribution
For the detailed implementation in Python, check my GitHub repository.
Introduction
Some machine learning models, such as linear and logistic regression, are commonly described as assuming a Gaussian (normal) distribution. Strictly speaking, linear regression assumes normally distributed errors rather than normally distributed inputs, but strongly skewed variables often cause problems in practice. One of the first steps in a statistical analysis is therefore to check the distribution of your data.
The familiar bell curve shows a normal distribution.
If your data has a Gaussian distribution, the parametric methods built on that assumption are powerful and well understood.
Many data scientists report more accurate results when they transform the predictor variables.
To transform data, you perform a mathematical operation on each observation, then use the transformed data in your model.
Why do we need a normal distribution?
If a method assumes a Gaussian distribution and your data was drawn from a different distribution, the findings may be misleading or plain wrong.
Your data may not look Gaussian, or may fail a normality test, and still be transformable so that it fits a Gaussian distribution.
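The article does not say which normality test it has in mind. As a minimal sketch, assuming the Shapiro-Wilk test from SciPy and a made-up lognormal sample (both are my choices, not the author's), a test can fail on the raw data and behave much better after a transformation:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (not the article's data)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.5, size=500)
logged = np.log(skewed)  # the log of lognormal data is normal by construction

# Shapiro-Wilk: the null hypothesis is that the sample is Gaussian.
# W close to 1 supports normality; a small p-value rejects it.
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(logged)

print(f"raw data: W={stat_raw:.3f}, p={p_raw:.4g}")
print(f"log data: W={stat_log:.3f}, p={p_log:.4g}")
```

A small p-value only says the sample is unlikely to be Gaussian; with large samples even tiny deviations get flagged, so it is worth reading the test together with the plots described in the rest of the article.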
Types of transformation
- Log Transformation
- Reciprocal Transformation
- Square-Root Transformation
- Cube Root Transformation
- Exponential Transformation
- Box-Cox Transformation
- Yeo-Johnson Transformation
Visualize and check the distribution
We can plot the data to check whether it is Gaussian or not.
We will look at two common methods for visually inspecting a dataset to check if it was drawn from a Gaussian distribution.
- Histogram
- Quantile-Quantile plot (Q-Q plot)
1. Histogram
A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This lets you inspect the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
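To make the idea of a frequency distribution concrete, here is a small sketch that computes the bin counts behind a histogram without drawing anything. The sample and the choice of 10 bins are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (not the article's data)
rng = np.random.default_rng(2)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# np.histogram returns the count per bin and the bin edges
counts, edges = np.histogram(sample, bins=10)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}) {'#' * (count // 10)}")

# A long right tail shows up as positive skewness and mean > median
print("skewness:", stats.skew(sample))
print("mean:", sample.mean(), "median:", np.median(sample))
```

The text bars are a crude stand-in for the plot; the point is only that the shape you see in a histogram comes from these counts.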
2. Q-Q Plot
A Q-Q (quantile-quantile) plot shows the quantiles of two distributions plotted against each other. A quantile is the value below which a given fraction of the data falls.
For example, the median is a quantile where 50% of the data fall below that point and 50% lie above it.
The purpose of Q-Q plots is to find out if two sets of data come from the same distribution. A 45-degree reference line is plotted on the Q-Q plot; if the two data sets come from a common distribution, the points will fall approximately on that reference line. To check normality, one of the two data sets is replaced by the theoretical quantiles of a normal distribution.
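As a sketch of what sits behind a normal Q-Q plot, `scipy.stats.probplot` can be called without a plot target, in which case it just returns the numbers. The two samples below are assumptions for illustration; note that the line `probplot` fits is a least-squares line through the points, so it is a 45-degree line only when both axes use the same scale:

```python
import numpy as np
from scipy import stats

# Hypothetical samples: one normal, one right-skewed
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=10, scale=2, size=300)
skewed_sample = rng.lognormal(mean=0.0, sigma=0.8, size=300)

# osm: theoretical normal quantiles, osr: the sorted sample values
(osm, osr), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r measures how closely the points follow the fitted straight line
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

The closer `r` is to 1, the straighter the Q-Q points, which is the numeric counterpart of "the points fall on the reference line".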
Data
We will use randomly generated data in this article to demonstrate all techniques.
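import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# generate a univariate data sample (lognormal, so it is right-skewed)
np.random.seed(142)
datas = sorted(stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000))
# convert to dataframe
data = pd.DataFrame(datas, columns=['values'])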
Method to generate a graph
Each time we transform our data, we will call this method to generate a graph.
def gen_graph(value):
    # left panel: histogram; right panel: Q-Q plot against a normal distribution
    plt.figure(figsize=(15, 5))
    plt.subplot(1, 2, 1)
    plt.hist(value, color='g', alpha=0.5)
    plt.subplot(1, 2, 2)
    stats.probplot(value, dist="norm", plot=plt)
    plt.show()
Visualize data before transformation
gen_graph(data['values'])
1. Log Transformation
Log (logarithmic) transformation is a data transformation method in which we replace each value x with log(x). When our original continuous data is right-skewed and does not follow the bell curve, we can log-transform it to make it closer to "normal". The logarithm is only defined for strictly positive values.
data_log = np.log(data['values'])
gen_graph(data_log)
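A quick numeric check of the effect, as a sketch: the sample below is an assumption (a plain lognormal, not the article's exact data), and skewness is just one convenient summary of asymmetry. The last lines show `np.log1p`, a common workaround when the data contains zeros, which is my addition rather than something the article uses:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample
rng = np.random.default_rng(3)
sample = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

skew_raw = stats.skew(sample)
skew_log = stats.skew(np.log(sample))
print(f"skewness before: {skew_raw:.3f}")
print(f"skewness after log: {skew_log:.3f}")

# np.log(0) is -inf, so data with zeros needs care;
# np.log1p computes log(1 + x) and maps 0 to 0.
with_zero = np.array([0.0, 1.0, 10.0])
shifted_log = np.log1p(with_zero)
print(shifted_log)
```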
2. Reciprocal Transformation
The reciprocal transformation is defined as the transformation of x to 1/x. It has a dramatic effect on the shape of the distribution and reverses the order of values that have the same sign. The transformation can only be used for non-zero values.
data_rec = np.reciprocal(data['values'])
# or
data_rec_2 = 1 / data['values']
gen_graph(data_rec)
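Two properties of the reciprocal are easy to check on a toy array (the numbers are made up for illustration): the order of same-sign values flips, and `np.reciprocal` does integer arithmetic on integer input, so float data or an explicit cast is the safer route:

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 10.0])  # toy data, strictly increasing
recip = np.reciprocal(values)
print(recip)  # strictly decreasing: the order is reversed

# On integer arrays np.reciprocal uses integer division, so every
# value with absolute value above 1 collapses to 0.
recip_int = np.reciprocal(np.array([1, 2, 4]))
print(recip_int)

# Casting to float first (or writing 1 / x) avoids the problem
recip_safe = 1 / np.array([1, 2, 4])
print(recip_safe)
```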
3. Square-Root Transformation
Take the square root of your variable, i.e. x → x^(1/2) = sqrt(x). This has a moderate effect on the distribution and usually works for data with non-constant variance. However, it is considered weaker than the logarithmic or cube root transforms, and it requires non-negative values.
data_square = np.sqrt(data['values'])
# or
data_square_2 = data['values'] ** (1/2)
gen_graph(data_square)
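The remark about non-constant variance can be illustrated with count data. This is a sketch under my own assumption that the data is Poisson-distributed (the article's sample is lognormal): for Poisson counts the variance grows with the mean, and the square root roughly evens it out.

```python
import numpy as np

# Hypothetical count data with two very different means
rng = np.random.default_rng(4)
low = rng.poisson(lam=10, size=100_000)
high = rng.poisson(lam=100, size=100_000)

# Raw variances track the means (about 10 and about 100)
var_raw = (low.var(), high.var())
# After the square root both variances end up near 0.25
var_sqrt = (np.sqrt(low).var(), np.sqrt(high).var())

print("raw variances: ", var_raw)
print("sqrt variances:", var_sqrt)
```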
4. Cube Root Transformation
The cube root transformation involves converting x to x^(1/3).
It is useful for reducing right skewness. This is a fairly strong transformation with a substantial effect on the distribution shape, but it is weaker than the logarithm. It can also be applied to zero and negative values.
data_cube = np.cbrt(data['values'])
gen_graph(data_cube)
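The ordering "square root is weaker than cube root, which is weaker than log" can be checked with skewness on a sample. The lognormal sample here is an assumption for illustration, and the first lines show that `np.cbrt` accepts negative input:

```python
import numpy as np
from scipy import stats

# np.cbrt is defined for negative values and zero
roots = np.cbrt(np.array([-27.0, -8.0, 0.0, 8.0, 27.0]))
print(roots)

# Hypothetical right-skewed sample
rng = np.random.default_rng(6)
sample = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

skews = {
    "raw": stats.skew(sample),
    "sqrt": stats.skew(np.sqrt(sample)),
    "cbrt": stats.skew(np.cbrt(sample)),
    "log": stats.skew(np.log(sample)),
}
for name, value in skews.items():
    print(f"{name:>4}: skewness = {value:.3f}")
```

On right-skewed data like this, the remaining skewness shrinks step by step from the raw values to the log.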
5. Exponential Transformation
An exponential transformation provides a useful alternative to Box and Cox's one-parameter power transformation and has the advantage of allowing negative data values.
In particular, this transformation has been found to be quite effective at turning a skewed unimodal distribution into a nearly symmetric, normal-like distribution.
# here each value is raised to the power 1/5 (a fifth-root power transform)
data_expo = data['values'] ** (1/5)
gen_graph(data_expo)
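One caveat about the snippet above: raising to the power 1/5 with `**` works on the article's positive data, but in NumPy a fractional power of a negative number yields NaN. The signed variant below is my own workaround for illustration, not something the article describes:

```python
import numpy as np

values = np.array([-32.0, -1.0, 0.0, 1.0, 32.0])  # toy data with negatives

# Fractional powers of negative floats are NaN in NumPy
with np.errstate(invalid="ignore"):
    naive = values ** (1 / 5)
print(naive)

# Hypothetical fix: take the root of the magnitude and restore the sign
signed = np.sign(values) * np.abs(values) ** (1 / 5)
print(signed)
```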
6. Box-Cox Transformation
At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5.
All values of λ are considered, and the optimal value for your data is selected. The "optimal value" is the one that results in the best approximation of a normal distribution curve. The data must be strictly positive. The transformation of Y has the form:
y(λ) = (y^λ - 1) / λ   if λ ≠ 0
y(λ) = log(y)          if λ = 0
# a is the lambda that scipy selected for this data
data_boxcox, a = stats.boxcox(data['values'])
gen_graph(data_boxcox)
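As a sanity check on the Box-Cox definition, the sketch below compares a hand-written version of the formula with `scipy.stats.boxcox` for a fixed λ. The helper name `boxcox_manual` and the sample are assumptions made for this example:

```python
import numpy as np
from scipy import stats

def boxcox_manual(y, lmbda):
    """Hypothetical helper: Box-Cox transform for a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if lmbda == 0:
        return np.log(y)
    return (y ** lmbda - 1.0) / lmbda

# Hypothetical strictly positive sample
rng = np.random.default_rng(7)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)

# With lmbda given, stats.boxcox returns only the transformed values
half_scipy = stats.boxcox(y, lmbda=0.5)
half_manual = boxcox_manual(y, 0.5)
zero_scipy = stats.boxcox(y, lmbda=0.0)

# With lmbda omitted, scipy also returns the lambda it selected
fitted, best_lambda = stats.boxcox(y)
print("selected lambda:", best_lambda)
```

For lognormal data the selected λ should land near 0, which is the log transformation.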
7. Yeo-Johnson Transformation
This is a newer technique that is very similar to the Box-Cox transformation but does not require the values to be strictly positive. It can also make the distribution more symmetric.
# a is the lambda that scipy selected for this data
data_yeo, a = stats.yeojohnson(data['values'])
gen_graph(data_yeo)
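The difference from Box-Cox shows up as soon as the data contains non-positive values. In this sketch the sample is shifted on purpose so that part of it is negative (an assumption for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical skewed sample shifted so that many values are negative
rng = np.random.default_rng(5)
shifted = rng.lognormal(mean=0.0, sigma=0.5, size=1000) - 1.0
print("negative values:", int((shifted < 0).sum()))

# Box-Cox rejects non-positive data with a ValueError
try:
    stats.boxcox(shifted)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False
print("Box-Cox accepted the data:", boxcox_ok)

# Yeo-Johnson handles the same data and reduces the skew
yj_values, yj_lambda = stats.yeojohnson(shifted)
skew_before = stats.skew(shifted)
skew_after = stats.skew(yj_values)
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```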
Conclusion
In this article, we have seen different types of transformations that bring the distribution of data closer to normal. For right-skewed data like ours, the Box-Cox and log transformations are among the techniques that give the best results.
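To put rough numbers on that comparison, here is a sketch that applies every transformation to a sample drawn with the same recipe as the article's data and scores each result with the Shapiro-Wilk W statistic. Using W as the score and passing `random_state` instead of seeding globally are my choices, so the figures are indicative only:

```python
import numpy as np
from scipy import stats

# Same distribution parameters as the article's sample
x = stats.lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000, random_state=142)

candidates = {
    "raw": x,
    "log": np.log(x),
    "reciprocal": 1.0 / x,
    "sqrt": np.sqrt(x),
    "cbrt": np.cbrt(x),
    "power_1_5": x ** (1 / 5),
    "boxcox": stats.boxcox(x)[0],
    "yeojohnson": stats.yeojohnson(x)[0],
}

# Shapiro-Wilk W: values closer to 1 mean closer to a normal shape
w_scores = {name: stats.shapiro(values)[0] for name, values in candidates.items()}
for name, w in sorted(w_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>11}: W = {w:.4f}")
```

Which transformation wins depends on the data; on a different distribution the ranking can change, so it is worth rerunning a comparison like this on your own variables.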
Hope you like this article.
Thanks for reading. 😃
Translated from: https://medium.com/next-gen-machine-learning/normal-distribution-data-transformation-to-gaussian-distribution-405941324f53