The Linear Regression Algorithm Under the Hood: Advanced Math for Non-Mathematicians

Inside AI

Linear regression is one of the most popular algorithms, used in different fields well before the advent of computers. Today, with powerful computers, we can solve multi-dimensional linear regression, which was not possible earlier. In single or multi-dimensional linear regression, the basic mathematical concept is the same.

Today, with machine learning libraries like Scikit-learn, it is possible to use linear regression in modelling without understanding the mathematical concepts behind it. In my opinion, it is essential for data scientists and machine learning professionals to understand the mathematical concepts and logic behind an algorithm before using it.

Most of us have not studied advanced mathematics and statistics, and the mathematical notation and jargon behind the algorithms can look intimidating. In this article, I will explain the math and logic behind linear regression with simplified Python code and easy math to build your understanding.

Overview

We will start with a simple linear equation with one variable and without any intercept/bias. First, we will learn the step-by-step approach taken by packages like Scikit-learn to solve linear regression. During this walkthrough, we will understand the important concept of Gradient Descent. Then we will work through an example of a simple linear equation with one variable and an intercept/bias.

Step 1: We will use the Python package NumPy to work with the sample dataset, and Matplotlib to plot various graphs for visualisation.

import numpy as np
import matplotlib.pyplot as plt

Step 2: Let us consider a simple scenario in which a single input/independent variable controls the outcome/dependent variable value. In the code below, we declare two NumPy arrays to hold the values of the independent and dependent variables.

Independent_Variable=np.array([1,2,3,12,15,17,20,21,5,7,9,10,3,12,15,17,20,7])
Dependent_Variable=np.array([7,14,21,84,105,116.1,139,144.15,32.6,50.1,65.4,75.4,20.8,83.4,103.15,110.9,136.6,48.7])

Step 3: Let us quickly draw a scatter plot to understand the data points.

plt.scatter(Independent_Variable, Dependent_Variable, color='green')
plt.xlabel('Independent Variable / Input Parameter')
plt.ylabel('Dependent Variable / Output Parameter')
plt.show()

Our goal is to formulate a linear equation that can predict the dependent variable value with minimum error for a given independent/input variable.

Dependent Variable = Constant * Independent Variable

In mathematical terms, Y=constant*X

In terms of visualisation, we need to find the best-fit line, the line that gives the minimum error across the points.

This error measure is known as the loss function in the machine learning world, and our goal is to minimise it.

Loss function formula:

Loss = (1/(2m)) * Σ (yhat_i - y_i)^2

We can calculate the loss for each assumed constant value in the equation Y=constant*X across all the data points. The goal is to find the constant for which this loss is minimum and use it to formulate the equation. Please note that in the loss function formula, "m" stands for the number of points; in the current example we have 18 points, hence 1/2m translates to 1/36. Do not be intimidated by the loss function formula: we calculate the loss as the sum of the squared differences between the calculated value and the actual value for each data point, then divide it by twice the number of points. We will decipher it step by step below with the help of Python code.
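
To make the formula concrete, the loss for a single candidate constant can be written as a small helper function (a minimal sketch of my own, reusing the NumPy import and the two arrays declared above; the name compute_loss is not from the original code):

def compute_loss(slope, x, y):
    # Sum of squared differences between calculated and actual values,
    # divided by twice the number of points: (1/(2m)) * sum((yhat - y)^2)
    yhat = slope * x
    return np.sum((yhat - y) ** 2) / (2 * len(x))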

Step 4: To understand the core idea and math behind identifying the equation, we will consider the limited set of constant values mentioned in the code below and calculate the loss function for each of them.

In actual linear regression algorithms, candidate constants are evaluated at particular intervals. Initially, the gap between two consecutive constants considered for the loss calculation is bigger; as we move closer to the actual solution, constants with smaller gaps are considered. In the machine learning world, the learning rate is the step by which the constant is increased or decreased between loss calculations.

m=[-5,-3,-1,1,3,5,6.6,7,8.5,9,11,13,15]

Step 5: In the code below, we calculate the loss function for each value of the constant (i.e. the values in the list m declared in the earlier step) over all input and output data points.

We store the calculated loss for each constant in a NumPy array, "errormargin".

errormargin = np.array([])
for slope in m:
    counter = 0
    sumerror = 0
    for x in Independent_Variable:
        yhat = slope * x                                   # calculated value for this point
        error = (yhat - Dependent_Variable[counter]) ** 2  # squared difference
        sumerror = sumerror + error
        counter = counter + 1
    cost = sumerror / (2 * len(Independent_Variable))      # 1/(2m), where m is the number of points (18)
    errormargin = np.append(errormargin, cost)
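
With the helper defined earlier, the same sweep collapses to a single vectorised line (an equivalent alternative of mine, not the article's code):

errormargin = np.array([compute_loss(slope, Independent_Variable, Dependent_Variable) for slope in m])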

Step 6: We will plot the calculated loss function against the constant values to determine the actual constant value.

plt.plot(m,errormargin)
plt.xlabel("Slope Values")
plt.ylabel("Loss Function")
plt.show()

The value of the constant at which the curve reaches its lowest point is the real constant with which we can formulate the equation of the line.

In our example, the curve is at its lowest point for the constant value 6.8.

A line with this value, Y=6.8*X, can best fit the data points with minimum error.
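
Rather than reading the minimum off the plot, the best candidate can also be picked out programmatically and the line overlaid on the scatter plot (a small sketch of mine; note that np.argmin returns the best value among the discrete candidates, which may differ slightly from the 6.8 read off the smooth curve):

best_slope = m[np.argmin(errormargin)]   # candidate with the lowest computed loss
print(best_slope)

plt.scatter(Independent_Variable, Dependent_Variable, color='green')
plt.plot(Independent_Variable, 6.8 * Independent_Variable, color='red')   # the best-fit line Y=6.8*X
plt.show()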

This approach of tracing the loss function and identifying the true values of the parameters at the lowest point of the loss curve is known as gradient descent. We have considered one variable for simplicity, hence the loss function is a 2-dimensional curve. In the case of multiple linear regression, the gradient descent surface is multi-dimensional.
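
In practice, gradient descent does not evaluate every candidate constant; it follows the slope (the derivative) of the loss curve downhill from a starting guess. A minimal sketch of that idea for our one-parameter loss (my own illustration, not the article's code):

slope_estimate = 0.0
learning_rate = 0.001
for _ in range(1000):
    yhat = slope_estimate * Independent_Variable
    # Derivative of the loss with respect to the constant: (1/m) * sum((yhat - y) * x)
    gradient = np.mean((yhat - Dependent_Variable) * Independent_Variable)
    slope_estimate = slope_estimate - learning_rate * gradient   # step downhill
print(slope_estimate)   # settles close to the value read off the curve above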

We have learnt the inner workings of calculating the coefficient of the independent variable. Next, let us learn, step by step, how to calculate the coefficient and the intercept/bias in linear regression.

Step 1: Just like earlier, let us consider a sample set of independent and dependent variable values. These are the available input and output data points; our goal is to formulate a linear equation that can predict the dependent variable value with minimum error for a given independent/input variable.

Dependent Variable = (Coefficient*Independent Variable)+ Constant

In mathematical terms, y=(Coefficient*x)+ c

Please note that the coefficient is also a constant term; it is the one multiplied with the independent variable in the equation.

Independent_Variable=np.array([1,2,4,3,5])
Dependent_Variable=np.array([1,3,3,2,5])

Step 2: We will assume initial values of zero for the coefficient "m" and the constant "c". After each iteration of the error calculation, we will adjust m and c, scaled by a small learning rate of 0.001. An epoch is one pass of this calculation over the entire set of available data points. As we increase the number of epochs, the solution becomes more accurate, but it consumes more time and computing power. Based on the business case, we can decide the acceptable error in the calculated values and stop the iterations once we reach it.

LR=0.001
m=0
c=0
epoch=0

Step 3: In the code below, we run 1100 epochs over the available dataset and calculate the coefficient and constant values.

For each independent data point, we calculate the dependent value (i.e. yhat) and then the error between the calculated and actual dependent values.

Based on the error, we change the values of the coefficient and the constant for the next iteration, as shown below.

New coefficient = Current coefficient - (Learning Rate * Error * Independent Variable Value)

New constant = Current constant - (Learning Rate * Error)

while epoch < 1100:
    epoch = epoch + 1
    counter = 0
    for x in Independent_Variable:
        yhat = (m * x) + c                      # calculated dependent value
        error = yhat - Dependent_Variable[counter]
        c = c - (LR * error)                    # update the constant/bias
        m = m - (LR * error * x)                # update the coefficient
        counter = counter + 1
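
Step 2 mentioned that we can also stop once the change becomes acceptably small, instead of fixing the number of epochs. One way to sketch that variant (my own, with a hypothetical tolerance value):

tolerance = 1e-6                 # hypothetical acceptable change in the coefficient
m_previous = float('inf')
while abs(m - m_previous) > tolerance:
    m_previous = m
    counter = 0
    for x in Independent_Variable:
        yhat = (m * x) + c
        error = yhat - Dependent_Variable[counter]
        c = c - (LR * error)
        m = m - (LR * error * x)
        counter = counter + 1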

We check the values of the coefficient and constant after 1100 epochs over the available dataset.

print("The final value of  m", m)
print("The final value of c", c)

Mathematically, it can be represented as y=(0.81*x)+0.33
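
As a quick sanity check (a usage example of my own, not from the article), we can predict a value with the learned parameters:

x_new = 4
print((m * x_new) + c)   # roughly (0.81 * 4) + 0.33 = 3.57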

Finally, let us compare the earlier output with the result of the Scikit-learn linear regression algorithm.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(Independent_Variable.reshape(-1,1), Dependent_Variable)
print(reg.coef_)
print(reg.intercept_)
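
To compare the two approaches directly, we can predict the same point with both sets of parameters (a short snippet of my own; reg is the fitted Scikit-learn model from above):

x_new = np.array([[4]])          # Scikit-learn expects a 2-D array
print(reg.predict(x_new))        # Scikit-learn's prediction
print((m * 4) + c)               # prediction from our gradient descent parameters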

After 1100 epochs over the available dataset, the calculated values of the coefficient and constant/bias are very close to the output of the Scikit-learn linear regression algorithm.

I hope this article gave you a firm understanding of the behind-the-scenes mathematical calculations and concepts in linear regression. We have also seen how gradient descent is applied to find the optimal solution. In the case of multiple linear regression, the math and logic remain the same; it simply scales to more dimensions.

Translated from: https://towardsdatascience.com/linear-regression-algorithm-under-the-hood-math-for-non-mathematicians-c228d244e3f3
