Logistic Regression in Python
Classification techniques are an essential part of machine learning and data science applications. Approximately 70% of problems in machine learning are classification problems. Many classification methods are available, but logistic regression is the most common and is a useful method for solving binary (0-1) classification problems. Another category is multi-class classification, which handles problems where the outcome variable has more than two classes. The Iris dataset is a well-known example of multi-class classification; another is classifying articles, which requires NLP skills.
Logistic regression can be used for many classification problems, such as credit card approval, diabetes prediction, whether a given customer will purchase a particular product or churn to a competitor, whether a user will click on a given advertisement link, and many more.
Logistic regression is one of the simplest and most commonly used machine learning algorithms for binary (0-1) classification. It is easy to implement and can serve as a baseline for any binary classification problem. Its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.
To break it down, here is what we are going to learn in this tutorial:
- Introduction to Logistic Regression
- Linear Regression vs. Logistic Regression
- Maximum Likelihood Estimation vs. Ordinary Least Squares Method
- Logistic Regression under the hood
- Logistic Regression in Scikit-learn
- Confusion Matrix for Model Evaluation
- Advantages and Disadvantages
Logistic Regression:
Logistic regression is a statistical learning method for predicting a two-class (0-1) outcome. The target variable is dichotomous, meaning there are only two possible classes. For example, it can be used for cancer detection. Logistic regression outputs probabilities, and from those probabilities we can predict the corresponding classes.
It can be viewed as an adaptation of linear regression to a categorical target variable: it uses the log of the odds as the dependent variable. Logistic regression predicts the probability of occurrence of a binary event using the logit function.
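The log-odds relationship can be written out as a short formula. This is a sketch in assumed notation: the coefficient symbols $\beta_j$, the feature count $K$, and the probability symbol $p$ are not taken from the original text.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K)}}
```

The left-hand form says the log of the odds is linear in the features; the right-hand form is the same statement solved for the probability $p$.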
Linear Regression Equation:
In statistics, econometrics, and machine learning, a linear regression model is a regression model that seeks to establish a linear relationship between one variable, called the explained variable, and one or more variables, called explanatory variables. It is also referred to as a linear model.
We consider the model for individual i. For each individual, the explained variable is written as a linear function of the explanatory variables.
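In the standard textbook form, the model for individual i can be written as follows. The coefficient symbols $\beta_j$ and the number of explanatory variables $K$ are assumed notation, not taken from the original text.

```latex
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_K x_{i,K} + \varepsilon_i
```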
where y_i and the x_{i,j} are fixed and ε_i represents the error.
Sigmoid Function:
The sigmoid function, also called the logistic function, gives an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1, which can be read as a probability. As the input goes to positive infinity, the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES (the default threshold in logistic regression), and if it is less than 0.5, we can classify it as 0 or NO. For example, if the output is 0.75, we can say in terms of probability that there is a 75 percent chance that the patient will suffer from cancer.
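The behaviour described above can be checked with a few lines of NumPy. This is a minimal sketch: the helper names `sigmoid` and `classify` are illustrative, and treating an output of exactly 0.5 as class 0 is an assumption, since the text only covers values above and below the threshold.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 where the sigmoid output exceeds the threshold, else 0."""
    return (sigmoid(z) > threshold).astype(int)


z = np.array([-10.0, 0.0, 10.0])
probs = sigmoid(z)    # close to 0, exactly 0.5, close to 1
labels = classify(z)  # class label for each input

print(probs)
print(labels)
```

Large negative inputs give outputs near 0, large positive inputs give outputs near 1, and an input of 0 sits exactly at the 0.5 threshold.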
Properties of Logistic Regression:
- The dependent variable in logistic regression follows a Bernoulli distribution.
- Estimation is done through maximum likelihood.
- There is no R-squared; model fit is assessed through measures such as concordance and the KS statistic.
Linear Regression Vs Logistic Regression:
Linear regression gives you a continuous output, whereas logistic regression gives a discrete output. Examples of continuous output are house prices and stock prices. Examples of discrete output are predicting whether a patient has cancer and predicting whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.
Maximum Likelihood Estimation Vs. Least Square Method:
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
Ordinary least squares estimates are computed by fitting a regression line to the given data points such that the sum of the squared deviations is minimized (least square error). Both methods can be used to estimate the parameters of a linear regression model. MLE assumes a joint probability mass function for the data, while OLS does not require any stochastic assumptions to minimize the distance.
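To make maximum likelihood concrete for logistic regression, here is a rough sketch that fits a one-feature model by plain gradient ascent on the Bernoulli log-likelihood. Everything in it is an assumption made for illustration: the synthetic data, the "true" coefficients, the learning rate, and the helper names `log_likelihood` and `fit_mle`. Library implementations use more sophisticated solvers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(w, X, y):
    """Bernoulli log-likelihood of labels y under coefficients w."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def fit_mle(X, y, lr=0.1, n_iter=2000):
    """Maximize the log-likelihood by gradient ascent, starting from zeros."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (y - sigmoid(X @ w))  # gradient of the log-likelihood
        w += lr * gradient / len(y)
    return w


# Synthetic data (illustrative only): an intercept column plus one feature.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_w = np.array([-0.5, 2.0])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

w_hat = fit_mle(X, y)
print("estimated coefficients:", w_hat)
print("log-likelihood at zeros:", log_likelihood(np.zeros(2), X, y))
print("log-likelihood at estimate:", log_likelihood(w_hat, X, y))
```

The fitted coefficients give a higher log-likelihood than the all-zero starting point, which is the sense in which they make the observed labels "most probable" under the model.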
Types of Logistic Regression:
- Multinomial Logistic Regression: the target variable has three or more nominal categories, such as predicting the type of wine.
- Binary Logistic Regression: the target variable has only two possible outcomes, such as Spam or Not Spam.
- Ordinal Logistic Regression: the target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
Model building in Scikit-learn:
Scikit-learn is a free Python library for machine learning. It is developed by many contributors, particularly from academia, including French higher education and research institutes such as Inria. It includes functions for estimating random forests, logistic regressions, other classification algorithms, and support vector machines. It is designed to work alongside other free Python libraries, including NumPy and SciPy.
Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Now let's build the diabetes prediction model. Here, you will predict diabetes using a logistic regression classifier. First, load the Pima Indians Diabetes dataset.
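Loading the data might look like the sketch below. The helper `load_pima` is hypothetical, the column names are assumed to match the header of the Kaggle CSV (check your download), and the two rows in the in-memory sample are made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Column names assumed to match the Kaggle CSV header.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_pima(source):
    """Read the diabetes CSV from a file path or a file-like object."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df


# Stand-in for the real file: two made-up rows in the same layout.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,80,32.0,0.45,33,0\n"
    "5,160,80,35,0,36.5,0.60,51,1\n"
)
pima = load_pima(sample_csv)
print(pima.head())
```

With the downloaded file in place, the call would be something like `load_pima("diabetes.csv")`, where the file name is an assumption about how the download was saved.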
Selecting Features:
Here, you need to divide the given columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables or predictors).
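One way to separate the columns is sketched below. The column names are assumed to follow the Kaggle CSV header, and the small DataFrame is filled with random placeholder values so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Assumed column names; the real dataset may label them differently.
feature_cols = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age",
]
target_col = "Outcome"

# Placeholder data standing in for the loaded dataset.
rng = np.random.default_rng(0)
pima = pd.DataFrame(rng.random((5, len(feature_cols))), columns=feature_cols)
pima[target_col] = [0, 1, 0, 1, 1]

X = pima[feature_cols]  # independent variables (features)
y = pima[target_col]    # dependent variable (target)

print(X.shape, y.shape)
```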
Splitting Data:
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let's split the dataset using the function train_test_split(). You need to pass three arguments: the features, the target, and the test set size. Additionally, you can set random_state so that the random selection of records is reproducible.
Here, the dataset is split into two parts in a ratio of 75:25: 75% of the data is used for model training and 25% for model testing.
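A 75:25 split can be sketched as follows. The data here is a random placeholder, and its size (768 rows, 8 features) is an assumption chosen to resemble the diabetes dataset; `random_state=0` is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# test_size=0.25 keeps 75% for training; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Running the split twice with the same `random_state` returns the same rows in each part, which makes later results comparable between runs.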
Model Development and Prediction:
First, import the LogisticRegression class and create a logistic regression classifier object using LogisticRegression().
Then, fit your model on the training set using fit() and perform prediction on the test set using predict().
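The two steps above can be sketched like this. The data comes from scikit-learn's `make_classification` as a synthetic stand-in for the diabetes features, and `max_iter=1000` is an assumed setting to give the solver room to converge; neither is taken from the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 768 samples, 8 features, binary target.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create the classifier, fit on the training set, predict on the test set.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Probability of class 1 for each test row, if scores are needed.
y_prob = logreg.predict_proba(X_test)[:, 1]

print(y_pred[:10])
print(y_prob[:10])
```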
Model Evaluation using Confusion Matrix:
A confusion matrix is a table that is used to evaluate the performance of a classification model. You can also use it to visualize the performance of an algorithm. The basis of a confusion matrix is the number of correct and incorrect predictions, summed up class-wise.
Here, you can see the confusion matrix in the form of an array object. The dimension of this matrix is 2*2 because this model is a binary classifier. You have two classes, 0 and 1. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions. In the output, 119 and 36 are correct predictions, and 26 and 11 are incorrect predictions.
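The matrix can be reproduced from label arrays with scikit-learn. The sketch below rebuilds labels that match the counts quoted above; treating 11 as false positives and 26 as false negatives is an inference from the precision and recall figures reported in this tutorial, not something the text states directly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuilt labels: 130 actual negatives followed by 62 actual positives.
y_true = np.array([0] * 130 + [1] * 62)
# Predictions: 119 true negatives, 11 false positives,
# 26 false negatives, 36 true positives (assignment inferred).
y_pred = np.array([0] * 119 + [1] * 11 + [0] * 26 + [1] * 36)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("correct:", tn + tp, "incorrect:", fp + fn)
```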
Visualizing Confusion Matrix using Heatmap:
A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.
Here, you will visualize the confusion matrix using a heatmap.
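A heatmap of the matrix can be drawn as sketched below. This version uses matplotlib's `imshow`, which is a swapped-in choice rather than the tutorial's own plotting code; the colormap, axis labels, and output file name are all assumptions, as is the placement of 11 and 26 in the off-diagonal cells.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless

import matplotlib.pyplot as plt
import numpy as np

# Confusion matrix counts quoted in this tutorial (rows: actual, columns: predicted).
cm = np.array([[119, 11], [26, 36]])

fig, ax = plt.subplots()
image = ax.imshow(cm, cmap="YlGnBu")
fig.colorbar(image, ax=ax)

# Label the two classes on both axes.
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
ax.set_title("Confusion matrix")

# Write the count inside each cell.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")

fig.savefig("confusion_matrix.png")
```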
Confusion Matrix Evaluation Metrics:
Let's evaluate the model using model evaluation metrics such as accuracy, precision, and recall.
Here, you got a classification rate of 80%, which is considered good accuracy.
Precision: Precision is about being precise, i.e., how often the model is correct when it makes a positive prediction. In this case, when the logistic regression model predicted that a patient has diabetes, it was correct 76% of the time.
Recall: Recall is the share of actual positives that the model finds. Of the patients in the test set who have diabetes, the logistic regression model identifies 58%.
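These three figures can be recomputed from the confusion matrix counts quoted earlier. The sketch rebuilds label arrays that match those counts; assigning 11 to false positives and 26 to false negatives is an inference that makes the numbers agree with the reported precision and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 119 true negatives, 11 false positives, 26 false negatives, 36 true positives.
y_true = np.array([0] * 130 + [1] * 62)
y_pred = np.array([0] * 119 + [1] * 11 + [0] * 26 + [1] * 36)

accuracy = accuracy_score(y_true, y_pred)    # (119 + 36) / 192
precision = precision_score(y_true, y_pred)  # 36 / (36 + 11)
recall = recall_score(y_true, y_pred)        # 36 / (36 + 26)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

Accuracy counts both diagonal cells, precision looks only at the rows predicted positive, and recall looks only at the rows that are actually positive.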
Advantages:
Logistic regression is efficient and straightforward: it does not require high computation power, is easy to implement and easy to interpret, and is widely used by data analysts and scientists. It also does not strictly require feature scaling. Logistic regression provides a probability score for observations.
Disadvantages:
Logistic regression does not handle a large number of categorical features/variables well. It is vulnerable to overfitting. It also cannot solve non-linear problems directly, which is why non-linear features need to be transformed. Logistic regression will not perform well with independent variables that are not correlated with the target variable or that are very similar or correlated to each other.
Conclusion:
In this tutorial, you covered a lot of details about logistic regression. You have learned what logistic regression is, how to build a model, how to visualize the results, and some of the theoretical background. You also covered some basic concepts such as the sigmoid function, maximum likelihood, and the confusion matrix. With that said, see you in the next article, and don't forget to keep learning.
Translated from: https://medium.com/analytics-vidhya/dive-into-logistic-regression-with-python-48911f37f8ee