Mathematical Principles of the Linear Regression Algorithm
Inside AI
Linear regression is one of the most popular algorithms, used across many fields well before the advent of computers. Today, with powerful computers, we can solve multi-dimensional linear regression problems that were not feasible earlier. Whether the regression involves one dimension or many, the basic mathematical concept is the same.
Today, with machine learning libraries such as Scikit-learn, it is possible to use linear regression in modelling without understanding the mathematical concepts behind it. In my opinion, it is essential for data scientists and machine learning professionals to understand the mathematical concepts and logic behind an algorithm before using it.
Most of us have not studied advanced mathematics and statistics, and the mathematical notation and jargon behind the algorithms can be intimidating. In this article, I will explain the math and logic behind linear regression with simplified Python code and easy math to build your understanding.
Overview
We will start with a simple linear equation with one variable and without any intercept/bias. First, we will walk through the step-by-step approach that packages like Scikit-learn take to solve linear regression. During this walkthrough, we will understand the important concept of gradient descent. Then we will look at an example of a simple linear equation with one variable and an intercept/bias.
Step 1: We will use the Python package NumPy to work with a sample dataset and Matplotlib to plot graphs for visualisation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Let us consider a simple scenario in which a single input/independent variable controls the value of the outcome/dependent variable. In the code below, we declare two NumPy arrays to hold the values of the independent and dependent variables.
Independent_Variable=np.array([1,2,3,12,15,17,20,21,5,7,9,10,3,12,15,17,20,7])
Dependent_Variable=np.array([7,14,21,84,105,116.1,139,144.15,32.6,50.1,65.4,75.4,20.8,83.4,103.15,110.9,136.6,48.7])
Step 3: Let us quickly draw a scatter plot to understand the data points.
plt.scatter(Independent_Variable, Dependent_Variable, color='green')
plt.xlabel('Independent Variable/Input Parameter')
plt.ylabel('Dependent Variable/ Output Parameter')
plt.show()
Our goal is to formulate a linear equation that predicts the dependent variable value with minimum error for a given independent/input variable.
Dependent Variable = Constant * Independent Variable
In mathematical terms, Y = constant * X
In terms of visualisation, we need to find the best-fit line that gives the minimum error across the points.
The measure of this error is known as the loss function in the machine learning world, and our goal is to minimise it.
We can calculate the loss for each assumed constant value in the equation Y = constant * X over all the data points. The goal is to find the constant for which this loss is minimum and use it to formulate the equation. Please note that in the loss function formula, "m" stands for the number of points; in the current example we have 18 points, hence 1/2m translates to 1/36. Do not be intimidated by the loss function formula: we calculate the loss as the sum of the squared differences between the calculated value and the actual value for each data point, divided by twice the number of points. We will decipher it step by step below with the help of Python code.
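Written out in plain notation (a reconstruction of the formula the paragraph above describes), the loss for a candidate constant is:

Loss = (1 / (2 * m)) * sum over all points of (yhat_i - y_i)^2, where yhat_i = constant * x_i and m is the number of data points (18 in this example).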
Step 4: To understand the core idea and math behind identifying the equation, we will consider the limited set of constant values listed in the code below and calculate the loss function for each of them.
In an actual linear regression algorithm, candidate constants are evaluated at particular intervals for the loss function calculation. Initially, the gap between two consecutive candidate constants is large; as we move closer to the actual solution, constants with smaller gaps are considered. In the machine learning world, the learning rate is the step by which the constant is increased or decreased between loss function calculations.
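As a hypothetical illustration of that idea (not code from the original article), candidate constants could first be generated with a coarse gap and then refined near the promising region:

coarse_candidates = np.arange(-5, 16, 2)      # gap of 2 between candidate constants
fine_candidates = np.arange(6.0, 7.6, 0.1)    # smaller gap of 0.1 once we are close to the solution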
m = [-5, -3, -1, 1, 3, 5, 6.6, 7, 8.5, 9, 11, 13, 15]
Step 5: In the code below, we calculate the loss function for each value of the constant (i.e. each value in the list m declared in the earlier step) over all the input and output data points.
We store the calculated loss for each constant in a NumPy array "errormargin".
errormargin = np.array([])
for slope in m:
    counter = 0
    sumerror = 0
    for x in Independent_Variable:
        yhat = slope * x                                      # predicted value for this candidate slope
        error = (yhat - Dependent_Variable[counter]) ** 2     # squared difference from the actual value
        sumerror = sumerror + error
        counter = counter + 1
    cost = sumerror / (2 * len(Independent_Variable))         # 1/(2m), with m = 18 data points
    errormargin = np.append(errormargin, cost)
Step 6: We will plot the calculated loss function against the constant values to determine the constant that best fits the data.
plt.plot(m,errormargin)
plt.xlabel("Slope Values")
plt.ylabel("Loss Function")
plt.show()
The value of the constant at which the curve reaches its lowest point is the constant with which we can formulate the equation of the line.
In our example, the curve is at its lowest point around a constant value of 6.8.
A line with this value, Y = 6.8 * X, can best fit the data points with minimum error.
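If we prefer to read the best candidate off the computed losses instead of the plot, a short check (assuming the arrays from Step 5) picks out the tested value closest to the curve's minimum:

best_slope = m[np.argmin(errormargin)]   # candidate constant with the smallest loss among the values we tested
print(best_slope)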
This approach of tracing the loss function and identifying the true values of the equation's fixed parameters at the lowest point of the loss curve is known as gradient descent. For simplicity we have considered one variable, so the loss function is a 2-dimensional curve. In the case of multiple linear regression, the gradient descent surface is multi-dimensional.
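For completeness, here is a minimal sketch (my own illustration, not the article's code) of how an iterative gradient-descent update for the single-constant model Y = constant * X would look, instead of evaluating a fixed grid of candidates:

slope = 0.0
learning_rate = 0.001
for _ in range(1000):
    predictions = slope * Independent_Variable
    # derivative of the 1/(2m) squared-error loss with respect to the slope
    gradient = np.mean((predictions - Dependent_Variable) * Independent_Variable)
    slope = slope - learning_rate * gradient
print(slope)   # should settle close to the ~6.8-7 region seen on the loss curve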
We have learnt the inner workings of calculating the coefficient of the independent variable. Next, let us learn, step by step, how to calculate both the coefficient and the intercept/bias in linear regression.
Step 1: Just like earlier, let us consider a sample set of independent and dependent variable values. These are the available input and output data points; our goal is to formulate a linear equation that predicts the dependent variable value with minimum error for a given independent/input variable.
Dependent Variable = (Coefficient * Independent Variable) + Constant
In mathematical terms, y = (Coefficient * x) + c
Please note that the coefficient is also a constant term; it is simply the constant that multiplies the independent variable in the equation.
Independent_Variable=np.array([1,2,4,3,5])
Dependent_Variable=np.array([1,3,3,2,5])
Step 2: We will assume that the initial values of the coefficient "m" and the constant "c" are both zero. After each iteration of the error calculation, we will adjust m and c using a small learning rate of 0.001. An epoch is one pass of this calculation over the entire set of available data points. As we increase the number of epochs, the solution becomes more accurate, but it consumes more time and computing power. Based on the business case, we can decide the acceptable error in the calculated values and stop the iterations accordingly.
LR=0.001
m=0
c=0
epoch=0
Step 3: In the code below, we run 1100 iterations over the available dataset and calculate the coefficient and constant values.
For each independent data point, we calculate the predicted dependent value (i.e. yhat) and then the error between the calculated and actual dependent values.
Based on this error, we update the values of the coefficient and the constant for the next iteration.
New Constant = Current Constant - (Learning Rate * Error)

New Coefficient = Current Coefficient - (Learning Rate * Error * Independent Variable Value)
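For example, on the very first data point (x = 1, actual y = 1) with m = 0 and c = 0: yhat = (0 * 1) + 0 = 0, so the error is 0 - 1 = -1; the constant becomes c = 0 - (0.001 * -1) = 0.001, and the coefficient becomes m = 0 - (0.001 * -1 * 1) = 0.001.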
while epoch < 1100:
    epoch = epoch + 1
    counter = 0
    for x in Independent_Variable:
        yhat = (m * x) + c                        # predicted value with the current coefficient and constant
        error = yhat - Dependent_Variable[counter]
        c = c - (LR * error)                      # update the constant/intercept
        m = m - (LR * error * x)                  # update the coefficient
        counter = counter + 1
We check the values of the coefficient and the constant after 1100 iterations over the available dataset.
print("The final value of m", m)
print("The final value of c", c)
Mathematically, the result can be represented as y = (0.81 * x) + 0.33.
Finally, let us compare this output with the result of the Scikit-learn linear regression algorithm.
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(Independent_Variable.reshape(-1,1), Dependent_Variable)
print(reg.coef_)
print(reg.intercept_)
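As a quick sanity check (an illustrative extra step, not part of the original article), we can compare the predictions of the two models for a new input value:

x_new = 6
print((m * x_new) + c)                    # prediction from our hand-rolled gradient-descent model
print(reg.predict(np.array([[x_new]])))   # prediction from the Scikit-learn model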
With 1100 iterations over the available dataset, the calculated values of the coefficient and the constant/bias are very close to the output of the Scikit-learn linear regression algorithm.
I hope this article gave you a firm understanding of the behind-the-scenes mathematical calculations and concepts in linear regression. We have also seen how gradient descent is applied to find the optimal solution. In the case of multiple linear regression, the math and logic remain the same; they simply scale to more dimensions.
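As a small hedged illustration of that scaling (hypothetical data, not from the article), Scikit-learn handles several input dimensions in the same way, fitting one coefficient per input:

X_multi = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])   # two input features per sample
y_multi = np.array([5, 4, 11, 10, 15])                         # constructed so that y = x1 + 2*x2
reg_multi = LinearRegression().fit(X_multi, y_multi)
print(reg_multi.coef_)        # one coefficient per feature
print(reg_multi.intercept_)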
Source: https://towardsdatascience.com/linear-regression-algorithm-under-the-hood-math-for-non-mathematicians-c228d244e3f3