Polynomial Regression from Scratch in Python

Polynomial regression is an improved version of linear regression. If you know linear regression, it will be simple for you. If not, I will explain the formulas in this article. There are more advanced and more efficient machine learning algorithms out there, but it is a good idea to learn linear-based regression techniques because they are simple, fast, and work with very well-known formulas, though they may not work well with a complex set of data.

Polynomial Regression Formula

Linear regression can perform well only if there is a linear correlation between the input variables and the output variable. As I mentioned before, polynomial regression is built on linear regression. If you need a refresher, it is worth reviewing linear regression before continuing.

Polynomial regression can capture the relationship between the input features and the output variable even when that relationship is not linear. It starts from the same formula as linear regression:

Y = BX + C

I am sure we all learned this formula in school. For linear regression, we write it with these symbols:

h = theta_0 + theta_1*X

Here, we get X and Y from the dataset. X is the input feature and Y is the output variable. The theta values are initialized randomly.

For polynomial regression, the formula becomes:

h = theta_0 + theta_1*X + theta_2*X^2 + theta_3*X^3 + ... + theta_n*X^n

We are adding more terms here: we take the same input feature and raise it to different powers to create more features. That way, our algorithm will be able to learn more about the data.

The powers do not have to be 2, 3, or 4. They could be 1/2, 1/3, or 1/4 as well. Then the formula will look like this:

h = theta_0 + theta_1*X + theta_2*X^(1/2) + theta_3*X^(1/3) + ...
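As a rough sketch of how raising one input to several powers creates extra features, the snippet below builds a small feature table from a single column. The helper name `make_power_features` and the sample values are assumptions made for illustration; they are not part of the walkthrough that follows.

```python
import numpy as np
import pandas as pd

def make_power_features(x, powers):
    """Return a DataFrame with a bias column plus x raised to each power.

    Hypothetical helper, for illustration only.
    """
    x = np.asarray(x, dtype=float)
    features = {"bias": np.ones_like(x)}   # column of 1s for theta_0
    for p in powers:
        features[f"x^{p}"] = np.power(x, p)
    return pd.DataFrame(features)

levels = [1, 4, 9]                          # made-up input values
integer_feats = make_power_features(levels, [1, 2, 3])
fractional_feats = make_power_features(levels, [1, 0.5])
print(integer_feats)
print(fractional_feats)
```

Fractional powers only make sense here because the inputs are non-negative; with negative inputs, `np.power` would return NaN for a power such as 0.5.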

Cost Function And Gradient Descent

The cost function gives an idea of how far the predicted hypothesis is from the actual values. The formula is:

J(theta) = (1 / (2m)) * sum over i = 1..m of (h(x_i) - y_i)^2

This equation may look complicated, but it is doing a simple calculation. First, take the difference between the hypothesis and the original output variable. Then square it to eliminate negative values, sum the squares over all training examples, and divide the total by 2 times the number of training examples.
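To make the calculation concrete, here is a small worked example on three made-up points. The function name `half_mse_cost` and the numbers are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def half_mse_cost(y_pred, y_true):
    """Sum of squared differences divided by 2m, as described in the text."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    m = len(y_true)                      # number of training examples
    return np.sum((y_pred - y_true) ** 2) / (2 * m)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
# differences: -1, 0, 2 -> squares: 1, 0, 4 -> sum: 5 -> cost: 5 / (2 * 3)
example_cost = half_mse_cost(y_pred, y_true)
print(example_cost)
```

A perfect prediction gives a cost of exactly 0, and the cost grows quadratically with the size of the errors.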

What is gradient descent? It iteratively fine-tunes our randomly initialized theta values. I am not going into the differential calculus here. If you take the partial derivative of the cost function with respect to each theta, you can derive this update rule:

theta_j := theta_j - alpha * (1/m) * sum over i = 1..m of (h(x_i) - y_i) * x_ij

In this rule, x_ij is the value of feature j for training example i, and m is the number of training examples.

Here, alpha is the learning rate. You choose the value of alpha.
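The update can also be written in vectorized form. The sketch below is one possible way to express a single gradient descent step with NumPy; the function name `gradient_step` and the toy data (which follows y = 1 + 2x exactly) are assumptions for illustration, and the implementation later in the article uses an explicit loop over columns instead.

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """One simultaneous update of every theta (vectorized sketch)."""
    m = len(y)
    error = X @ theta - y            # h(x_i) - y_i for every example
    gradient = X.T @ error / m       # partial derivative for each theta_j
    return theta - alpha * gradient

# Toy data following y = 1 + 2x; the first column of X is the bias column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(X, y, theta, alpha=0.1)
print(theta)
```

Because the data is exactly linear, the theta values should approach 1 and 2. Computing the error once and then updating all thetas from it is what makes the update simultaneous.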

Python Implementation of Polynomial Regression

Here is the step-by-step implementation of polynomial regression.

1. We will use a simple dummy dataset for this example that gives salaries for positions. Import the dataset:

import pandas as pd
import numpy as np

df = pd.read_csv('position_salaries.csv')
df.head()

2. Add the bias column for theta 0. This bias column contains only 1s, because multiplying a number by 1 does not change it.

df = pd.concat([pd.Series(1, index=df.index, name='00'), df], axis=1)
df.head()

3. Delete the 'Position' column, because it contains strings and the algorithm cannot work with strings directly. The 'Level' column already represents the positions numerically.

df = df.drop(columns='Position')

4. Define our input variable X and the output variable y. In this example, 'Level' is the input feature and 'Salary' is the output variable. We want to predict the salary for each level.

y = df['Salary']
X = df.drop(columns='Salary')
X.head()

5. Raise the 'Level' column to the powers 2 and 3 to make the 'Level1' and 'Level2' columns.

X['Level1'] = X['Level']**2
X['Level2'] = X['Level']**3
X.head()

6. Now, normalize the data. Divide each column by the maximum value of that column. That way, the values in each column are at most 1 (and between 0 and 1 here, since all values are positive). The algorithm should work even without normalization, but normalization helps it converge faster. Also, calculate m, the number of rows in the dataset.

m = len(X)
X = X/X.max()

7. Define the hypothesis function. It uses X and theta to predict 'y'.

def hypothesis(X, theta):
    # Multiply each column of X by its theta, then sum across the columns
    y1 = theta*X
    return np.sum(y1, axis=1)

8. Define the cost function, using the cost formula above:

def cost(X, y, theta):
    y1 = hypothesis(X, theta)
    # Sum of squared errors divided by 2m
    return np.sum((y1-y)**2)/(2*m)

9. Write the function for gradient descent. We will keep updating the theta values for a fixed number of epochs, aiming for the minimum cost. In each iteration, we also record the cost for later analysis.

def gradientDescent(X, y, theta, alpha, epoch):
    J = []  # cost after each epoch
    k = 0
    while k < epoch:
        y1 = hypothesis(X, theta)
        # Update every theta using the predictions from the start of the epoch
        for c in range(0, len(X.columns)):
            theta[c] = theta[c] - alpha*sum((y1-y)* X.iloc[:, c])/m
        j = cost(X, y, theta)
        J.append(j)
        k += 1
    return J, theta

10. All the functions are defined. Now, initialize theta. I am initializing it as an array of zeros, but you can use random values instead. I am choosing alpha as 0.05 and will update the theta values for 700 epochs.

theta = np.array([0.0]*len(X.columns))
J, theta = gradientDescent(X, y, theta, 0.05, 700)

11. We now have our final theta values and the cost for each iteration. Let's compute the salary predictions using the final theta.

y_hat = hypothesis(X, theta)

12. Now plot the original salary and our predicted salary against the levels.

# Notebook-only magic; remove this line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(x=X['Level'], y=y)        # original salaries
plt.scatter(x=X['Level'], y=y_hat)    # predicted salaries
plt.show()

Our prediction does not exactly follow the salary trend, but it is close. Linear regression can only return a straight line, whereas polynomial regression can produce a curve like this one. Even if the trend were not such a smooth curve, polynomial regression could learn more complex trends as well.
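One way to see the difference between a straight line and a curve is to fit both to the same curved data and compare the errors. The sketch below uses NumPy's `np.polyfit`, which solves the least-squares problem directly rather than by gradient descent, so it is a cross-check and not the method used in this article. The data is synthetic and made up for illustration; it is not the salary dataset.

```python
import numpy as np

# Synthetic, made-up data with a curved (cubic) trend
x = np.arange(1, 11, dtype=float)
y = 0.5 * x**3 - 2.0 * x**2 + 3.0 * x + 10.0

def fit_sse(degree):
    """Least-squares polynomial fit; return the sum of squared errors."""
    coeffs = np.polyfit(x, y, degree)      # coefficients, highest power first
    y_fit = np.polyval(coeffs, x)
    return float(np.sum((y - y_fit) ** 2))

sse_line = fit_sse(1)     # straight line
sse_cubic = fit_sse(3)    # cubic polynomial
print(sse_line, sse_cubic)
```

Since the data was generated from a cubic, the degree-3 fit should leave almost no error, while the straight line cannot follow the bend.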

13. Let's plot the cost we calculated in each epoch in our gradient descent function.

plt.figure()
plt.scatter(x=list(range(0, len(J))), y=J)
plt.show()

The cost fell drastically in the beginning and then the fall slowed down. In a well-behaved run, the cost should keep going down until convergence. Please feel free to try a different number of epochs and different learning rates (alpha).
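As a starting point for that experiment, the sketch below runs the same number of epochs with three learning rates and prints the first and last cost for each. The function name `run_gradient_descent`, the toy data, and the particular alpha values are assumptions for illustration; results on the salary dataset will differ.

```python
import numpy as np

def run_gradient_descent(X, y, alpha, epochs):
    """Return the cost after each epoch for a given learning rate."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(epochs):
        error = X @ theta - y
        theta = theta - alpha * (X.T @ error) / m
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs

# Made-up data: bias column, x, and x^2, with x already scaled into (0, 1]
x = np.linspace(0.1, 1.0, 10)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 + 2.0 * x + 3.0 * x**2

final_costs = {}
for alpha in (0.01, 0.05, 0.5):
    costs = run_gradient_descent(X, y, alpha, epochs=700)
    final_costs[alpha] = costs[-1]
    print(alpha, costs[0], costs[-1])
```

On this toy data, the larger learning rates in this range should end at a lower cost after the same number of epochs. A learning rate that is too large can make the cost grow instead of shrink, so it is worth plotting the cost whenever alpha changes.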

Here is the dataset: salary_data


Follow this link for the full working code: Polynomial Regression



Translated from: https://towardsdatascience.com/polynomial-regression-from-scratch-in-python-1f34a3a5f373
