What is Support Vector Regression?
Support vector regression (SVR) is a special kind of regression that gives you some buffer, or flexibility, around the error. How does it do that? I'm going to explain it in simple terms by showing two different graphs.
The above is a hypothetical linear regression graph. You can see that the regression line is drawn at the position with the minimum squared error. The errors are the squared differences in distance between the original data points (the points in black) and the regression line (the predicted values).
The above is the same setting with SVR (Support Vector Regression). You can observe that there are two boundaries around the regression line, forming a tube that extends a vertical distance of epsilon above and below the line. Formally, this is known as the epsilon-insensitive tube. The role of this tube is to create a buffer for the error. Specifically, all the data points within the tube are considered to have zero error from the regression line; only the points outside the tube are considered when calculating the error. For those points, the error is calculated as the distance from the data point to the boundary of the tube, rather than from the data point to the regression line (as in linear regression).
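To make the tube idea concrete, here is a minimal sketch of the epsilon-insensitive loss in Python (my own illustration, not code from the original article; the function name and the epsilon value are made up):

import numpy as np

def eps_insensitive_loss(y_true, y_pred, epsilon=1.0):
    # Zero penalty for points inside the tube; linear penalty beyond it
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)

print(eps_insensitive_loss(np.array([10.0, 12.0, 20.0]), np.array([10.5, 14.0, 15.0])))
# -> [0. 1. 4.] : the first point lies inside the tube and contributes no error

This mirrors the description above: a residual of 0.5 costs nothing, while residuals of 2 and 5 are only charged for the distance beyond the epsilon boundary.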
Why "support vector"?
Well, all the points outside of the tube are known as slack points, and they are essentially vectors in a two-dimensional space. Imagine drawing vectors from the origin to each of these points, and you can picture all the vectors in the graph. These vectors support the structure, or formation, of the tube, and hence the method is known as support vector regression. You can see this in the graph below.
Implementation in Python
Let us dive into Python, build a support vector regression model, and try to predict the salary of an employee at a (hypothetical) 6.5 level.
Before you move forward, please download the CSV data file from my GitHub Gist.
https://gist.github.com/tharunpeddisetty/3fe7c29e9e56c3e17eb41a376e666083
Once you open the link, you will find the "Download ZIP" button in the top-right corner of the window. Go ahead and download the files. You can download: 1) the Python file, and 2) the data file (.csv).
Rename the folder accordingly, store it in your desired location, and you are all set. If you are a beginner, I highly recommend opening your Python IDE and following the steps below, because I include detailed comments (the statements after #, which are not executed when you run the code) on how the code works. You can keep the actual Python file as a backup or for future reference.
Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Import Data and Define the X and Y variables
dataset = pd.read_csv('/Users/tharunpeddisetty/Desktop/Position_Salaries.csv') # add your file path
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
# iloc takes the values from the specified index locations and stores them in the assigned variable as an array
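As a quick sanity check (my addition, not in the original code), you can confirm the shapes of the extracted arrays before moving on:

print(X.shape) # (rows, 1): the single feature column, assuming the middle column is the level
print(y.shape) # (rows,): the salaries as a 1D array, before the reshape done below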
Let us look at our data and understand the variables:
This data depicts the position/level of the employee and their salaries. This is the same dataset that I used in my Decision Tree Regression article.
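If you want to inspect the data yourself before modeling, a quick look with pandas (my addition) works well here:

print(dataset.head()) # first few rows: position, level and salary
print(dataset.describe()) # summary statistics for the numeric columns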
Feature Scaling
# Feature scaling is required for SVR, since there is no concept of coefficients
print(y)
# We need to reshape y because the StandardScaler class expects a 2D array
y = y.reshape(len(y), 1)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
# Create a new scaler object, because the first one calculates the mean and
# standard deviation of X; we need different values for y
y = sc_y.fit_transform(y)
print(y)
There is no concept of coefficients in SVR as there is in linear regression, so in order to reduce the effect of high-valued features we need to scale the features, or in other words bring all the values onto one scale. We achieve this by standardizing the values. Since we have only one feature in this example, we apply it to that feature anyway. We do this using the StandardScaler() class from sklearn. For other datasets, do not forget to scale all of your features as well as the dependent variable. Also, remember to reshape y (the dependent variable, i.e., Salary), which is purely so that it can be passed through the standard scaler in Python.
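Under the hood, standardization is just z = (x - mean) / standard deviation. A small sketch (my addition) that verifies this against the statistics the scaler learned:

print(sc_X.mean_, sc_X.scale_) # mean and standard deviation learned during fit_transform
X_manual = (dataset.iloc[:, 1:-1].values - sc_X.mean_) / sc_X.scale_ # standardize by hand
print(np.allclose(X_manual, X)) # True: the manual calculation matches the scaler's output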
Training the SVR model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y.ravel()) # ravel() flattens y back to 1D, which fit() expects
Simple, isn't it? We are going to use the radial basis function as the kernel inside the SVR algorithm. This means that we are using a function called 'rbf' to map the data from one space to another. Explaining how this works is beyond the scope of this article, but you can always research it online. The choice of kernel function varies with the distribution of the data; I suggest you read about the options after implementing this basic program in Python.
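Once the model is fitted, you can also connect it back to the "support vector" discussion earlier: scikit-learn exposes which (scaled) training points ended up defining the tube. Here is a small sketch of that, plus a quick kernel experiment (my addition; epsilon and C are left at their sklearn defaults):

print(regressor.support_) # indices of the training points that act as support vectors
print(regressor.support_vectors_) # their feature values, in the scaled space

for kernel in ('linear', 'poly', 'rbf'):
    model = SVR(kernel=kernel) # try an alternative kernel on the same scaled data
    model.fit(X, y.ravel())
    print(kernel, model.score(X, y.ravel())) # R^2 on the training data, for a rough comparison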
Visualizing the SVR results
X_grid = np.arange(min(sc_X.inverse_transform(X)), max(sc_X.inverse_transform(X)), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(sc_X.inverse_transform(X), sc_y.inverse_transform(y), color = 'red')
# Predict on the scaled grid, then map the predictions back to the original salary scale
# (the reshape to 2D is needed because newer StandardScaler versions expect a 2D array)
plt.plot(X_grid, sc_y.inverse_transform(regressor.predict(sc_X.transform(X_grid)).reshape(-1, 1)), color = 'blue')
plt.title('Support Vector Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
You can see how this model fits the data. Do you think it is doing a great job? Compare these results with the other regressions implemented on the same data in my previous articles and you can see the difference, or wait until the end of this article.
Predicting the 6.5 level result using Support Vector Regression
# We also need to inverse-transform the prediction to get the final result on the
# original salary scale (reshaped to 2D for the scaler, as in the plotting code above)
print(sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1)))
Make sure you apply to new inputs all the same transformations that were applied to the initial data, so that the model can recognize the data and produce the relevant results.
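If you plan to make several predictions, you could wrap that pipeline in a small helper so the transformations always stay consistent (a hypothetical function of my own, not part of the original code):

def predict_salary(level):
    level_scaled = sc_X.transform([[level]]) # scale the input exactly like the training features
    pred_scaled = regressor.predict(level_scaled).reshape(-1, 1) # predict in the scaled space
    return sc_y.inverse_transform(pred_scaled)[0, 0] # map back to the original salary scale

print(predict_salary(6.5)) # same result as the print statement above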
Result
Let me summarize the results from the various regression models so that comparison is easier.
Support Vector Regression: 170370.0204065
Random Forest Regression: 167000 (output is not part of this code)
Decision Tree Regression: 150000 (output is not part of this code)
Polynomial Linear Regression: 158862.45 (output is not part of this code)
Linear Regression: 330378.79 (output is not part of this code)
Conclusion
You have the data in front of you. Now act as a manager and make the decision yourself: how much salary would you give an employee at the 6.5 level (consider the level to be years of experience)? You see, there is no absolute answer in data science. I cannot say that SVR performed better than the others and is therefore the best model for predicting salaries. If you ask what I think, the prediction from random forest regression feels more realistic to me than SVR's, but again, that is just my feeling. Remember that many factors come into play, such as the employee's position, the average salary for that position in the region, the employee's previous salary, and so on. So don't take my word for it even if I say the random forest result is the best one; I only said it is more realistic than the others. The final decision depends on the business case of the organization, and there is by no means a perfect model that predicts an employee's salary perfectly.
Congratulations! You have implemented support vector regression in a minimal number of lines of code. You now have a template of the code that you can apply to other datasets and observe the results. This marks the end of my articles on regression. The next stop is classification models. Thanks for reading, and happy machine learning!
Translated from: https://towardsdatascience.com/baby-steps-towards-data-science-support-vector-regression-in-python-d6f5231f3be2