Dive into Logistic Regression with Python

Classification techniques are an essential part of machine learning and data science applications; by some estimates, roughly 70% of machine learning problems are classification problems. Many classification algorithms are available, but logistic regression is among the most common and is a useful method for binary (0–1) classification. Another category is multi-class classification, which handles outcome variables with more than two classes. The Iris dataset is a famous multi-class example; classifying articles is another, and it also calls for NLP skills.

Logistic regression can be applied to many classification problems, such as credit card approval, diabetes prediction, whether a given customer will purchase a particular product or churn to a competitor, whether a user will click on a given advertisement link, and many more.

Logistic regression is one of the simplest and most commonly used machine learning algorithms for binary (0–1) classification. It is easy to implement and can serve as a baseline for any binary classification problem. Its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.

Here is what we will cover in this tutorial:

  • Introduction to Logistic Regression.

  • Linear Regression vs. Logistic Regression.

  • Maximum Likelihood Estimation vs. Ordinary Least Squares.

  • Logistic Regression under the hood.

  • Logistic Regression in Scikit-learn.

  • Confusion Matrix for Model Evaluation.

  • Advantages and Disadvantages.

Logistic Regression:

Logistic regression is a statistical learning method for predicting a two-class (0–1) target. The target variable is dichotomous, meaning it has only two possible classes. For example, it can be used for cancer detection. Logistic regression outputs probabilities, and from those probabilities we can predict the corresponding classes.

It can be viewed as a generalization of linear regression to a categorical target variable. Instead of modeling the target directly, it models the log of the odds (the logit) as a linear function of the features. Logistic regression thus predicts the probability that a binary event occurs by means of the logit function.
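
To make the log-odds statement concrete, the relationship can be written out as below. This is the standard textbook form rather than something taken from the original post; the symbols p (probability of class 1), the coefficients β and the features x_1 to x_K are notation introduced here for illustration.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K)}}
```

Solving the left-hand equation for p gives the right-hand one, so a linear model for the log-odds always yields a probability strictly between 0 and 1.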

Linear Regression Equation:

In statistics, econometrics, and machine learning, a linear regression model is a regression model that seeks to establish a linear relationship between one variable, called the explained variable, and one or more variables, called explanatory variables. It is also referred to simply as a linear model.

We consider the model for individual i. For each individual, the explained variable is written as a linear function of the explanatory variables.

y_i = β_0 + β_1 x_{i,1} + … + β_K x_{i,K} + ε_i

where y_i and the x_{i,j} are fixed, the β_j are the coefficients to be estimated, and ε_i represents the error.

Sigmoid Function:

The sigmoid function, also called the logistic function, gives an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1, which can be read as a probability. As the input goes to positive infinity, the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0. If the output of the sigmoid function is greater than 0.5, we classify the outcome as 1 or YES (the default threshold in logistic regression), and if it is less than 0.5, we classify it as 0 or NO. For example, if the output is 0.75, we can say in terms of probability that there is a 75 percent chance that the patient will suffer from cancer.
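
As a rough illustration of the mapping and the 0.5 threshold described above, here is a minimal sketch in plain Python. The function names `sigmoid` and `classify` are made up for this example and are not part of any library.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number into the interval between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 when the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.4f}  class={classify(z)}")
```

An input of exactly 0 gives 0.5, which this sketch assigns to class 0; the text leaves that boundary case open, so treat the tie-breaking rule as a choice made here.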

Properties of Logistic Regression:

  • The dependent variable in logistic regression follows a Bernoulli distribution.

  • Estimation is done through maximum likelihood.

  • There is no R-squared; model fit is assessed through concordance and the KS statistic.

Linear Regression Vs Logistic Regression:

Linear regression gives you a continuous output, whereas logistic regression gives a discrete output. Examples of continuous output are house prices and stock prices. Examples of discrete output are predicting whether a patient has cancer and predicting whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.

Maximum Likelihood Estimation Vs. Least Square Method:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Ordinary least squares estimates are computed by fitting the regression line that minimizes the sum of squared deviations between the line and the given data points (the least-squares error). Both methods can be used to estimate the parameters of a linear regression model. MLE assumes a probability distribution for the data (a joint probability mass or density function), while OLS needs no stochastic assumptions in order to minimize the distance.

Figure: fitting a regression line on given data points.
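
To show what maximum likelihood estimation looks like for logistic regression, here is a small NumPy sketch that minimizes the mean negative log-likelihood with plain gradient descent. It is an illustration under stated assumptions, not the solver scikit-learn uses: the data is synthetic, and the function names, learning rate, and iteration count are arbitrary choices made for this example.

```python
import numpy as np


def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Estimate weights by minimising the mean negative log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y)          # gradient of the mean NLL
        w -= lr * grad
    return w


def mean_nll(w, X, y):
    """Mean negative log-likelihood of Bernoulli labels under weights w."""
    X = np.column_stack([np.ones(len(X)), X])
    z = X @ w
    # log(1 + e^z) - y*z, written with logaddexp for numerical stability
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


# Synthetic data (an assumption for illustration): one feature,
# labels drawn with true intercept -0.5 and true slope 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
true_p = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x[:, 0])))
y = (rng.random(500) < true_p).astype(float)

w_hat = fit_logistic_mle(x, y)
print("estimated [intercept, slope]:", w_hat)
print("mean NLL at zero weights:", mean_nll(np.zeros(2), x, y))
print("mean NLL at the estimate:", mean_nll(w_hat, x, y))
```

Unlike OLS, there is no closed-form solution here, which is why the weights are found iteratively.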

Types of Logistic Regression:

  • Multinomial Logistic Regression: The target variable has three or more nominal categories, such as predicting the type of wine.

  • Binary Logistic Regression: The target variable has only two possible outcomes, such as Spam or Not Spam.

  • Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.

Model building in Scikit-learn:

Scikit-learn is a free Python library for machine learning. It is developed by many contributors, notably from the academic world, including French higher education and research institutes such as Inria. It includes estimators for random forests, logistic regression, support vector machines, and other classification algorithms. It is designed to work with other free Python libraries, including NumPy and SciPy.

Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Now let's build the diabetes prediction model. Here, you are going to predict diabetes using a logistic regression classifier. First, load the required Pima Indians Diabetes dataset.
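
The original code for this step was an image that did not survive, so the following is only a sketch of how the loading step might look with pandas. The file name `diabetes.csv`, the column names, and the helper `load_diabetes` are assumptions about the Kaggle download; check them against your copy. A two-row stand-in is used so the sketch runs without the file.

```python
import io

import pandas as pd

# Assumed column layout of the Kaggle file (verify against your download).
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_diabetes(source) -> pd.DataFrame:
    """Read the CSV from a path or file-like object and check its columns."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


# Illustrative stand-in rows so the sketch runs without a download;
# with the real data you would call load_diabetes("diabetes.csv") instead.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,"
    "BMI,DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
pima = load_diabetes(sample_csv)
print(pima.head())
```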

Figure: the first five rows of the dataframe.

Selecting Features:

Here, you need to divide the given columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables, or predictors).
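
A minimal sketch of this split with pandas is shown here. The tiny DataFrame and its column names (in particular `Outcome` as the target) are assumptions standing in for the real diabetes data.

```python
import pandas as pd

# Tiny stand-in frame; column names are assumed to match the Kaggle CSV.
pima = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

target_col = "Outcome"
feature_cols = [c for c in pima.columns if c != target_col]

X = pima[feature_cols]   # independent variables (features)
y = pima[target_col]     # dependent variable (target)

print(X.shape, y.shape)
```

Selecting features by name rather than by position keeps the code readable if the column order of the file changes.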


Splitting Data:

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Let's split the dataset using the function train_test_split(). You need to pass three arguments: the features, the target, and the test set size. Additionally, you can set random_state so that the random split is reproducible.


Here, the dataset is split into two parts in a ratio of 75:25: 75% of the data will be used for model training and 25% for model testing.
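
The 75:25 split could look roughly like this with scikit-learn's `train_test_split`. The arrays are random stand-ins, and their size (768 rows, 8 features) and the seed `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 768 rows and 8 feature columns (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# test_size=0.25 keeps 75% for training; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 768 rows, a 25% test set contains 192 rows and the training set 576.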

Model Development and Prediction:

First, import the LogisticRegression class and create a logistic regression classifier object by calling LogisticRegression().

Then, fit your model on the training set using fit() and make predictions on the test set using predict().
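
These two steps might be sketched as follows. The data comes from `make_classification` rather than the diabetes file, and `max_iter=1000` is a choice made here to give the solver room to converge; neither is taken from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the diabetes features (an assumption).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logreg = LogisticRegression(max_iter=1000)  # create the classifier object
logreg.fit(X_train, y_train)                # learn weights from the training set

y_pred = logreg.predict(X_test)               # hard 0/1 class labels
y_proba = logreg.predict_proba(X_test)[:, 1]  # estimated probability of class 1

print(y_pred[:10])
print(y_proba[:10].round(3))
```

`predict()` returns class labels, while `predict_proba()` exposes the underlying probabilities, which is useful if you want a threshold other than 0.5.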


Model Evaluation using Confusion Matrix:

A confusion matrix is a table used to evaluate the performance of a classification model, and it also lets you visualize how an algorithm performs. Its basis is the number of correct and incorrect predictions, counted class by class.

Figures: the confusion matrix layout and the output matrix.

Here, you can see the confusion matrix as an array object. The matrix is 2×2 because this is a binary classification model with two classes, 0 and 1. Diagonal values are correct predictions, while off-diagonal values are incorrect predictions. In the output, 119 and 36 are correct predictions, and 26 and 11 are incorrect predictions.
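
To show how such an array is produced and read, here is a small sketch using scikit-learn's `confusion_matrix`. The ten labels are hand-made for illustration and are not the diabetes results.

```python
from sklearn.metrics import confusion_matrix

# Small hand-made labels (hypothetical, not the diabetes results).
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)

# scikit-learn's layout: rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = cnf_matrix.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The diagonal (TN and TP) holds the correct predictions; the off-diagonal cells (FP and FN) hold the errors.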

Visualizing Confusion Matrix using Heatmap:

A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving the reader obvious visual cues about how the phenomenon is clustered or varies over space.

Here, you will visualize the confusion matrix using a heatmap.
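
One possible way to draw such a heatmap is sketched below with Matplotlib alone; the original post's plotting code was an image and may well have used a different library. The counts are the four values quoted in the text, and their arrangement in scikit-learn's layout (rows actual, columns predicted) is an assumption, as are the colormap and the output file name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Counts quoted in the text; the cell arrangement is an assumption.
cnf_matrix = np.array([[119, 11], [26, 36]])
class_names = ["0", "1"]

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cnf_matrix, cmap="YlGnBu")  # colour encodes the count
fig.colorbar(im, ax=ax)

ax.set_xticks([0, 1])
ax.set_xticklabels(class_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
ax.set_title("Confusion matrix")

# Write each count inside its cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cnf_matrix[i, j]), ha="center", va="center")

fig.savefig("confusion_matrix.png", bbox_inches="tight")
```

In a notebook you would drop the `matplotlib.use("Agg")` line and display the figure directly instead of saving it.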

Figure: heatmap of the confusion matrix (output).

Confusion Matrix Evaluation Metrics:

Let's evaluate the model using model evaluation metrics such as accuracy, precision, and recall.


You got a classification rate (accuracy) of about 80%, which is generally considered good.

Precision: Precision is about being precise, i.e., how often the model is correct when it predicts the positive class. In this case, when the logistic regression model predicts that a patient has diabetes, the patient actually has it 76% of the time.

Recall: Recall measures how many of the actual positives the model finds. Of the patients in the test set who have diabetes, the logistic regression model identifies 58%.
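
The three figures can be reproduced by hand from the four counts quoted in the confusion matrix discussion. In this sketch, assigning 11 to false positives and 26 to false negatives is an inference from the reported 76% precision and 58% recall, not something the text states outright.

```python
# Counts quoted in the text; which error count is FP and which is FN
# is inferred from the reported precision and recall.
tn, fp, fn, tp = 119, 11, 26, 36

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.807 precision=0.766 recall=0.581
```

With real predictions, `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics` compute the same quantities directly from `y_test` and `y_pred`.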

Advantages:

Logistic regression is efficient and straightforward: it does not require much computing power, is easy to implement and easy to interpret, and is widely used by data analysts and scientists. It also does not strictly require feature scaling, although scaling can help regularized or iterative solvers converge. Logistic regression provides a probability score for each observation.

Disadvantages:

Logistic regression does not cope well with a large number of categorical features or variables, and it is vulnerable to overfitting. Because its decision boundary is linear, it cannot solve non-linear problems directly, which is why non-linear features need to be transformed. It also performs poorly when the independent variables are not correlated with the target variable or are highly correlated with each other.
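
The point about transforming non-linear features can be illustrated with a toy problem. This sketch uses scikit-learn's `make_circles` data and `PolynomialFeatures`; the dataset, the degree, and all parameter values are choices made for this example and do not come from the original post.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Logistic regression on the raw coordinates.
linear_model = LogisticRegression().fit(X_train, y_train)

# Same model after adding the squared and cross terms x1^2, x1*x2, x2^2.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(X_train, y_train)

lin_acc = linear_model.score(X_test, y_test)
poly_acc = poly_model.score(X_test, y_test)
print(f"raw features: {lin_acc:.2f}  degree-2 features: {poly_acc:.2f}")
```

The model is still linear in its inputs; the transformation simply gives it inputs in which the ring boundary becomes a straight line.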

Conclusion:

In this tutorial, you covered a lot of details about logistic regression. You have learned what logistic regression is, how to build a model, how to visualize results, and some of the theoretical background. You also covered some basic concepts such as the sigmoid function, maximum likelihood, and the confusion matrix. With that said, see you in the next article, and don't forget to keep learning.

Translated from: https://medium.com/analytics-vidhya/dive-into-logistic-regression-with-python-48911f37f8ee
