Baby Steps Towards Data Science: Support Vector Regression in Python

What is Support Vector Regression?

Support vector regression is a special kind of regression that gives you a buffer, or some flexibility, with the error. How does it do that? I'm going to explain it in simple terms using 2 different graphs.

[Figure: hypothetical linear regression graph. Image by author.]

The above is a hypothetical linear regression graph. You can see that the regression line is drawn at the position that minimizes the sum of squared errors. The error for each point is the square of the distance between the original data point (points in black) and the regression line (predicted values).

[Figure: the same data fitted with SVR, showing the tube around the regression line. Image by author.]

The above is the same setting with SVR (Support Vector Regression). You can observe that there are 2 boundaries around the regression line. This is a tube extending a vertical distance of epsilon above and below the regression line. It is known as the epsilon-insensitive tube. The role of this tube is to create a buffer for the error. To be specific, all the data points within this tube are considered to have zero error from the regression line. Only the points outside this tube are considered when calculating the error. The error is calculated as the distance from the data point to the boundary of the tube, rather than from the data point to the regression line (as in linear regression).
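To make the difference between the two error definitions concrete, here is a minimal sketch. The function names, the sample values and the epsilon of 0.1 are assumptions chosen only for illustration; they are not taken from the article's dataset.

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Squared residual, as used by ordinary linear regression."""
    return (y_true - y_pred) ** 2

def epsilon_insensitive_error(y_true, y_pred, epsilon):
    """Zero inside the tube; distance to the tube boundary outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical values, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 2.0, 4.0])
epsilon = 0.1

sq = squared_error(y_true, y_pred)
eps = epsilon_insensitive_error(y_true, y_pred, epsilon)
print(sq)   # every non-zero residual contributes
print(eps)  # the first point sits inside the tube, so its error is zero
```

The first point misses the prediction by 0.05, which is less than epsilon, so it costs nothing under the tube-based error even though it has a small squared error.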

Why "support vector"?

Well, all the points outside the tube (strictly speaking, those on or outside its boundary) are known as slack points, and they are essentially vectors in a 2-dimensional space. Imagine drawing a vector from the origin to each slack point, and you can see all the vectors in the graph. These vectors support the structure, or formation, of the tube, and hence the method is known as support vector regression. The graph below illustrates this.

[Figure: vectors drawn from the origin to the slack points outside the tube.]
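The idea that only some points shape the fit can be checked on a fitted model. The following is a rough sketch on synthetic data (a noisy sine curve generated here purely for illustration, with an assumed epsilon of 0.2); it relies on the `support_` and `support_vectors_` attributes that scikit-learn's `SVR` exposes after fitting.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data, for illustration only (not the article's dataset)
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 5, 40).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=40)

model = SVR(kernel='rbf', epsilon=0.2)
model.fit(X_demo, y_demo)

# support_ holds the indices of the training points kept as support vectors
n_support = len(model.support_)

# Count the points whose residual is smaller than epsilon (inside the tube)
residuals = np.abs(y_demo - model.predict(X_demo))
inside_tube = int(np.sum(residuals < model.epsilon))

print(n_support, "support vectors out of", len(X_demo), "points")
print(inside_tube, "points lie strictly inside the tube")
```

Widening the tube (a larger epsilon) should generally leave fewer points outside it, and therefore fewer support vectors.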

Implementation in Python

Let us dive into Python, build a support vector regression model, and try to predict the salary of a (hypothetical) employee at level 6.5.

Before you move forward, please download the CSV data file from my GitHub Gist.

https://gist.github.com/tharunpeddisetty/3fe7c29e9e56c3e17eb41a376e666083
Once you open the link, you will find a "Download ZIP" button at the top right corner of the window. Go ahead and download the files.
You can download 1) the Python file and 2) the data file (.csv).
Rename the folder and store it wherever you like, and you are all set. If you are a beginner, I highly recommend that you open your Python IDE and follow the steps below, because I have written detailed comments (the statements after #, which are ignored when you run the code) on how the code works. You can keep the downloaded Python file as a backup or for future reference.

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Import Data and Define the X and y Variables

dataset = pd.read_csv('/Users/tharunpeddisetty/Desktop/Position_Salaries.csv')  # add your file path
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
# iloc takes the values from the specified index locations and stores them in the assigned variable as an array

Let us look at our data and understand the variables:

[Figure: preview of the Position_Salaries data.]

This data shows the position/level of each employee and the corresponding salary. This is the same dataset that I used in my Decision Tree Regression article.

Feature Scaling

# Feature scaling. Required for SVR, since there is no concept of coefficients
from sklearn.preprocessing import StandardScaler
print(y)
# We need to reshape y because the StandardScaler class expects a 2D array
y = y.reshape(len(y), 1)
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
# Create a separate scaler for y because the first one stores the mean and standard deviation of X. We need a different mean and standard deviation for y
y = sc_y.fit_transform(y)
print(y)

Unlike linear regression, SVR has no coefficients that can compensate for features measured on different scales. So, to keep high-valued features from dominating, we need to scale the features, or in other words bring all the values onto one scale. We achieve this by standardizing the values, that is, subtracting the mean and dividing by the standard deviation, using the StandardScaler class from sklearn. We have only one feature in this example, but we scale it anyway. For other datasets, do not forget to scale all your features and the dependent variable. Also, remember to reshape y (the dependent variable, i.e., Salary) into a 2D array, which is purely so that StandardScaler accepts it.
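As a small, self-contained illustration of what standardization does, the sketch below scales a handful of made-up salaries (hypothetical values, not rows from the CSV) and then undoes the scaling. It uses only `StandardScaler.fit_transform` and `StandardScaler.inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salaries, for illustration only
salaries = np.array([40000.0, 60000.0, 80000.0, 120000.0])

# StandardScaler expects a 2D array: one row per sample, one column per feature
salaries_2d = salaries.reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(salaries_2d)    # (value - mean) / standard deviation
restored = scaler.inverse_transform(scaled)   # back to the original units

print(scaled.ravel())
print(restored.ravel())
```

After scaling, the column has a mean of 0 and a standard deviation of 1, and `inverse_transform` recovers the original salaries. That round trip is what lets us report predictions in real currency later on.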

Training the SVR model

from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
# SVR expects a 1D target, so flatten the 2D y that we reshaped for scaling
regressor.fit(X, y.ravel())

Simple, isn't it? We are going to use the Radial Basis Function as the kernel inside the SVR algorithm. This means that we are using a function called 'rbf' to map the data from one space to another. Explaining how this works is beyond the scope of this article, but you can always read about it online. The best choice of kernel function depends on the distribution of the data. I suggest you look into the other kernels after implementing this basic program in Python.
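For readers who want a first taste of what the kernel computes, here is a minimal sketch of the standard RBF formula, exp(-gamma * ||x - z||^2), checked against scikit-learn's `rbf_kernel` helper. The helper name `rbf`, the sample points and the gamma value of 0.5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """Similarity that decays with squared distance: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Hypothetical scaled inputs and gamma value, for illustration only
points = np.array([[-1.0], [0.0], [0.5]])
gamma = 0.5

# Pairwise similarities computed by hand and by the library helper
manual = np.array([[rbf(a, b, gamma) for b in points] for a in points])
library = rbf_kernel(points, points, gamma=gamma)

print(np.round(manual, 4))
```

Each point has a similarity of 1 with itself, and the value falls toward 0 as two points move apart. This is one reason feature scaling matters for an RBF kernel: the distances are computed on the scaled values.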

Visualizing the SVR results

# Recover the original (unscaled) levels for plotting
X_orig = sc_X.inverse_transform(X)
# Build a fine grid of levels so that the curve looks smooth
X_grid = np.arange(X_orig.min(), X_orig.max(), 0.1).reshape(-1, 1)
plt.scatter(X_orig, sc_y.inverse_transform(y), color = 'red')
# predict returns a 1D array, so reshape it to 2D before inverse transforming
y_grid = regressor.predict(sc_X.transform(X_grid)).reshape(-1, 1)
plt.plot(X_grid, sc_y.inverse_transform(y_grid), color = 'blue')
plt.title('Support Vector Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
[Figure: SVR fit, salary against position level. Image by author.]

You can see how this model fits the data. Do you think it is doing a great job? Compare these results with the other regressions that I implemented on the same data in my previous articles to see the difference, or wait until the end of this article.

Predicting the level 6.5 result using Support Vector Regression

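# Scale the input the same way as the training data, predict, then reshape the
# 1D prediction to 2D and inverse transform it to get the salary in original units
print(sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1)))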

Make sure you apply the same transformations to new inputs that you applied to the training data. The model was trained on scaled values, so it can only produce meaningful results for inputs on that same scale.
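One way to avoid forgetting a transformation is to let scikit-learn apply the scaling steps for you. The sketch below is an alternative arrangement rather than the code used in this article: it wraps the feature scaler and the SVR in a pipeline and uses `TransformedTargetRegressor` to scale and unscale the target. The levels and salaries are made-up values for illustration, not the contents of the CSV.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical levels and salaries, for illustration only
levels = np.arange(1, 11, dtype=float).reshape(-1, 1)
salaries = np.array([40, 48, 58, 75, 100, 140, 190, 280, 450, 900]) * 1000.0

model = TransformedTargetRegressor(
    # Features are scaled inside the pipeline before reaching the SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    # The target is scaled for training and unscaled again on predict
    transformer=StandardScaler(),
)
model.fit(levels, salaries)

# No manual transform or inverse_transform calls are needed here
prediction = model.predict([[6.5]])
print(prediction)
```

With this arrangement the prediction comes back in the original salary units, so there is no separate inverse transform step to remember.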

Result

Let me summarize the results from the various regression models so that they are easier to compare.

Support Vector Regression: 170370.0204065

Random Forest Regression: 167000 (Output is not part of the code)

Decision Tree Regression: 150000 (Output is not part of the code)

Polynomial Linear Regression: 158862.45 (Output is not part of the code)

Linear Regression: 330378.79 (Output is not part of the code)

Conclusion

You have the data in front of you. Now, act as a manager and make the decision yourself. How much salary would you give an employee at level 6.5 (consider the level to be years of experience)? You see, there's no absolute answer in data science. I cannot say that SVR performed better than the others and is therefore the best model for predicting salaries. If you ask what I think, I feel the random forest regression prediction is more realistic than the SVR one. But again, that is just my feeling. Remember that a lot of factors come into play, such as the position of the employee, the average salary for that position in that region, and the employee's previous salary. So don't even believe me if I say the random forest result is the best one. I only said that it is more realistic than the others. The final decision depends on the business case of the organization, and there is by no means a perfect model that predicts an employee's salary perfectly.

Congratulations! You have implemented support vector regression in very few lines of code. You now have a template of the code, and you can apply it to other datasets and observe the results. This marks the end of my articles on regression. The next stop is classification models. Thanks for reading. Happy Machine Learning!

Translated from: https://towardsdatascience.com/baby-steps-towards-data-science-support-vector-regression-in-python-d6f5231f3be2
