Feature Scaling - Effect of Different Scikit-Learn Scalers: Deep Dive


Inside AI

In supervised machine learning, we calculate the value of the output variable by supplying input variable values to an algorithm. A machine learning algorithm relates the input and output variables with a mathematical function, for example:

Output variable value = (2.4 * Input Variable 1) + (6 * Input Variable 2) + 3.5

There are a few specific assumptions behind each machine learning algorithm. To build an accurate model, we need to ensure that the input data meets those assumptions. If the data fed to a machine learning algorithm does not satisfy them, the prediction accuracy of the model is compromised.

Most of the supervised algorithms in sklearn require standard normally distributed input data, centred around zero and with variance of the same order. If one input variable ranges from 1 to 10 and another from 4,000 to 700,000, the second variable's values will dominate and the algorithm will not be able to learn from the other features as expected.

In this article, I will illustrate the effect of scaling the input variables with different scalers in scikit-learn, using three different regression algorithms.

In the code below, we import the packages we will be using for the analysis. We will create the test data with the help of make_regression.

from sklearn.datasets import make_regression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import *
from sklearn.linear_model import *
import matplotlib.pyplot as plt

We will use a sample size of 100 records with three independent (input) variables. Further, we will inject three outliers using np.random.normal.

X, y, coef = make_regression(n_samples=100, n_features=3, noise=2,
                             tail_strength=0.5, coef=True, random_state=0)
X[:3] = 1 + 0.9 * np.random.normal(size=(3, 3))
y[:3] = 1 + 2 * np.random.normal(size=3)

We will print the real coefficients of the sample dataset as a reference and compare them with the predicted coefficients.

print("The real coefficients are ", coef)

We will train the algorithm with 80 records and reserve the remaining 20 samples, unseen by the algorithm, for testing the accuracy of the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

We will study the scaling effect with the scikit-learn StandardScaler, MinMaxScaler, PowerTransformer, RobustScaler and MaxAbsScaler.

regressors=[StandardScaler(), MinMaxScaler(),
            PowerTransformer(method='yeo-johnson'),
            RobustScaler(quantile_range=(25, 75)), MaxAbsScaler()]

All the regression models we will be using are listed in a list object.

models=[Ridge(alpha=1.0), HuberRegressor(), LinearRegression()]

In the code below, we scale the training and test sample input variables by calling each scaler in succession from the list defined earlier. We draw a scatter plot of the original and scaled first input variable to get an insight into the various scalings. We will see each of these plots a little later in this article.

Further, we fit each of the models with the scaled input variables from the different scalers and predict the values of the dependent variable for the test dataset.

for regressor in regressors:
    X_train_scaled = regressor.fit_transform(X_train)
    X_test_scaled = regressor.transform(X_test)
    Scaled = plt.scatter(X_train_scaled[:, 0], y_train, marker='^', alpha=0.8)
    Original = plt.scatter(X_train[:, 0], y_train)
    plt.legend((Scaled, Original), ('Scaled', 'Original'), loc='best', fontsize=13)
    plt.xlabel("Feature 1")
    plt.ylabel("Train Target")
    plt.show()
    for model in models:
        reg_lin = model.fit(X_train_scaled, y_train)
        y_pred = reg_lin.predict(X_test_scaled)
        print("The calculated coefficients with", model, "and", regressor, reg_lin.coef_)

Finally, the predicted coefficients from each model fit are printed for comparison with the real coefficients.

[Figure: Results in tabular format]

At first glance, we can see that the same regression estimator predicts different coefficient values depending on the scaler. The coefficients predicted with MaxAbsScaler and MinMaxScaler are quite far from the true coefficient values. This example shows how important the appropriate scaler is for the prediction accuracy of the model.

As a self-exploration and learning exercise, I encourage you to calculate the R² score and Root Mean Square Error (RMSE) for each training and test set combination and compare them with each other.

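A minimal sketch of that exercise (simplified here to one scaler and one model, without the injected outliers) could use sklearn.metrics:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same style of sample data as above, but without the injected outliers
X, y = make_regression(n_samples=100, n_features=3, noise=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")
```

Repeating this for every scaler and model pair from the lists above gives the comparison suggested in the exercise.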
Now that we understand the importance of scaling and of selecting a suitable scaler, we will get into the inner workings of each scaler.

Standard Scaler: It is one of the most popular scalers, used in various real-life machine learning projects. The mean and standard deviation of each input variable are determined separately. It then subtracts the mean from each data point and divides by the standard deviation, transforming the variable to zero mean and a standard deviation of one. It does not bound the values to a specific range, which can be an issue for a few algorithms.

[Figure: Standard Scaler — Original Vs Scaled Plot based on the code discussed in the article]
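
As an illustrative sketch (toy values chosen here, not the article's dataset), the transformation can be verified by hand; note that StandardScaler uses the population standard deviation, which matches NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [4.0], [7.0], [10.0]])

scaled = StandardScaler().fit_transform(X)

# Manual equivalent: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True: zero mean, unit standard deviation
```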

MinMax Scaler: All numeric values are scaled between 0 and 1 with a MinMax Scaler:

Xscaled = (X - Xmin) / (Xmax - Xmin)

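
A quick sketch (toy values, chosen so the arithmetic is easy to follow) confirming the formula against MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[4.0], [8.0], [16.0], [20.0]])  # min 4, max 20, range 16

scaled = MinMaxScaler().fit_transform(X)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Both give 0, 0.25, 0.75 and 1 for the four values
print(np.allclose(scaled, manual))
```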
MinMax scaling is strongly affected by outliers. If we have one or more extreme outliers in our dataset, the min-max scaler will squeeze the normal values closely together to accommodate the outliers within the 0 to 1 range. We saw earlier that the coefficients predicted with the MinMax scaler are approximately three times the real coefficients. I recommend not using the MinMax Scaler with datasets containing outliers.

[Figure: Robust Scaler — Original Vs Scaled Plot based on the code discussed in the article]

Robust Scaler: The robust scaler is one of the best-suited scalers for datasets with outliers. It scales the data according to the interquartile range, the middle range where most of the data points lie.

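
A sketch of the underlying computation (toy values with one extreme outlier; RobustScaler centres on the median and divides by the interquartile range):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Small-valued feature with one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler(quantile_range=(25, 75)).fit_transform(X)

# Manual equivalent: subtract the median, divide by the IQR (Q3 - Q1)
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# The normal values keep a sensible spread; only the outlier lands far out
print(np.allclose(scaled, manual))
```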
Power Transformer Scaler: The power transformer tries to make the data Gaussian-like. It attempts an optimal scaling, estimated through maximum likelihood, to stabilize variance and minimize skewness. Sometimes the power transformer fails to produce Gaussian-like results, so it is important to plot and check the scaled data.

[Figure: Power Transformer Scaler — Original Vs Scaled Plot based on the code discussed in the article]
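
A sketch of that check (a deliberately right-skewed, hypothetical log-normal feature): fit the transformer and compare skewness before and after:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # strongly right-skewed

X_scaled = PowerTransformer(method='yeo-johnson').fit_transform(X)

def skew(a):
    """Sample skewness: third standardized moment."""
    a = a.ravel()
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# Skewness should shrink towards zero after the transform
print(round(skew(X), 2), round(skew(X_scaled), 2))
```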

MaxAbs Scaler: MaxAbsScaler is best suited to scaling sparse data. It scales each feature by dividing it by that feature's maximum absolute value.

[Figure: MaxAbs Scaler — Original Vs Scaled Plot based on the code discussed in the article]

For example, if an input variable has the original values [2, -1, 0, 1], MaxAbs will scale them to [1, -0.5, 0, 0.5]. It divides each value by the largest absolute value, i.e. 2. It is not advised for datasets with large outliers.

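
The example above can be reproduced directly (one hypothetical single-feature column):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[2.0], [-1.0], [0.0], [1.0]])

# Each value is divided by the maximum absolute value of the feature, i.e. 2
scaled = MaxAbsScaler().fit_transform(X)
print(scaled.ravel())  # values: 1, -0.5, 0, 0.5
```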
We have learnt that scaling the input variables with a suitable scaler is as vital as selecting the right machine learning algorithm. A few of the scalers are quite sensitive to outliers, while others are robust. Each of the scalers in Scikit-Learn has its strengths and limitations, and we need to be mindful of them while using it.

It also highlights the importance of performing exploratory data analysis (EDA) at the outset to identify the presence or absence of outliers and other idiosyncrasies, which will guide the selection of an appropriate scaler.

You can learn more about this area in my article, 5 Advanced Visualisations for Exploratory Data Analysis (EDA).

If you would like to learn a structured approach to identifying the appropriate independent variables for accurate predictions, read my article "How to identify the right independent variables for Machine Learning Supervised".

"""Full Code"""
from sklearn.datasets import make_regression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import *
from sklearn.linear_model import *
import matplotlib.pyplot as plt
import seaborn as sns

X, y, coef = make_regression(n_samples=100, n_features=3, noise=2,
                             tail_strength=0.5, coef=True, random_state=0)
print("The real coefficients are ", coef)
X[:3] = 1 + 0.9 * np.random.normal(size=(3, 3))
y[:3] = 1 + 2 * np.random.normal(size=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

regressors = [StandardScaler(), MinMaxScaler(),
              PowerTransformer(method='yeo-johnson'),
              RobustScaler(quantile_range=(25, 75)), MaxAbsScaler()]
models = [Ridge(alpha=1.0), HuberRegressor(), LinearRegression()]

for regressor in regressors:
    X_train_scaled = regressor.fit_transform(X_train)
    X_test_scaled = regressor.transform(X_test)
    Scaled = plt.scatter(X_train_scaled[:, 0], y_train, marker='^', alpha=0.8)
    Original = plt.scatter(X_train[:, 0], y_train)
    plt.legend((Scaled, Original), ('Scaled', 'Original'), loc='best', fontsize=13)
    plt.xlabel("Feature 1")
    plt.ylabel("Train Target")
    plt.show()
    for model in models:
        reg_lin = model.fit(X_train_scaled, y_train)
        y_pred = reg_lin.predict(X_test_scaled)
        print("The calculated coefficients with", model, "and", regressor, reg_lin.coef_)

Translated from: https://towardsdatascience.com/feature-scaling-effect-of-different-scikit-learn-scalers-deep-dive-8dec775d4946
