Predicting Heart Disease Using Regression Analysis

According to the Centers for Disease Control and Prevention, heart disease is the leading cause of death for both men and women in the United States and around the globe. Researchers and statisticians can use several data mining techniques to help health care professionals identify heart disease and its potential causes. Significant risk factors associated with heart disease include age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise.

In this DataCamp project, my objective is to build a regression model and run statistical tests to assess how strongly clinical factors are associated with heart disease and with a higher probability of developing it. I use multiple and logistic regression together with data exploration in ggplot and dplyr. The project uses the Cleveland heart disease dataset.

Here's a glimpse of the dataset:

[Image: the first five rows of the Cleveland heart disease dataset]

Data Dictionary

The dataset has 14 columns, described below:

a. Age: A continuous variable giving the age of the person in years.

b. Sex: A discrete variable giving the sex of the person, where 0 = Female and 1 = Male.

c. CP (Chest Pain type): A discrete variable giving the chest pain type, where 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic.

d. Trestbps: A continuous variable giving the resting blood pressure in mm Hg.

e. Cholesterol: A continuous variable giving the serum cholesterol in mg/dl.

f. FBS: A discrete variable that compares the person's fasting blood sugar with 120 mg/dl. If fasting blood sugar > 120 mg/dl then 1 = true, else 0 = false.

g. RestECG: A discrete variable giving the resting ECG result, where 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy.

h. Thalach: A continuous variable giving the maximum heart rate achieved.

i. Exang: A discrete variable indicating exercise-induced angina, where 1 = Yes and 0 = No.

j. Oldpeak: A continuous variable giving the ST depression induced by exercise relative to rest.

k. Slope: A discrete variable giving the slope of the peak exercise ST segment, where 1 = up-sloping; 2 = flat; 3 = down-sloping.

l. ca: Treated as continuous; the number of major vessels colored by fluoroscopy, ranging from 0 to 3.

m. Thal: A discrete variable for thalassemia, where 3 = normal; 6 = fixed defect; 7 = reversible defect.

n. Class: A discrete variable giving the diagnosis class, where 0 = no presence and 1 to 4 rank how likely the person is to have heart disease, from least likely (1) to most likely (4).

Data Wrangling

Since the outcome variable class has more than two levels, I used mutate() to create a new binary variable, hd: any class value greater than 0 becomes 1, and 0 stays 0. I also relabeled the sex levels (originally 1 and 0) as Male and Female for clarity.
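The recoding described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the rows are made-up stand-ins for the Cleveland data, and base R is used in place of dplyr's mutate() so that the snippet runs without extra packages.

```r
# Hypothetical toy rows standing in for the Cleveland data (not real records)
hd_data <- data.frame(
  sex   = c(1, 0, 1, 0, 1),
  class = c(0, 2, 1, 0, 4)
)

# Collapse the 0-4 diagnosis into a binary outcome:
# 0 stays 0, anything above 0 becomes 1
hd_data$hd <- ifelse(hd_data$class > 0, 1, 0)

# Relabel sex (0/1) as a factor with readable levels
hd_data$sex <- factor(hd_data$sex, levels = c(0, 1),
                      labels = c("Female", "Male"))

print(hd_data)
```

Keeping Female as the first factor level makes it the reference category, which is consistent with the sexMale coefficient that appears in the model output later in the article.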

Statistical Tests

I ran statistical tests to check which predictor variables are closely related to heart disease. Depending on the data type (continuous or discrete), I used a t-test or a chi-squared test to derive p-values.

In this project, I examined how sex, age, and thalach are related to heart disease, as shown below:

→ Sex: Since sex is a binary variable in this dataset, the chi-squared test is the appropriate test. Here's the output of chisq.test() for the relationship between sex and hd (the outcome variable):

data:  hd_data$sex and hd_data$hd
X-squared = 22.043, df = 1, p-value = 2.667e-06

→ Age: Since age is a continuous variable, I used t.test() to assess the relationship between age and hd:

data:  hd_data$age by hd_data$hd
t = -4.0303, df = 300.93, p-value = 7.061e-05

→ Thalach: I used t.test() again to assess the relationship between thalach and hd:

data:  hd_data$thalach by hd_data$hd
t = 7.8579, df = 272.27, p-value = 9.106e-14
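As a rough sketch of how these tests are called, the snippet below runs chisq.test() and t.test() on a small made-up data frame. The toy data frame and its values are hypothetical, so its p-values say nothing about the Cleveland data; only the function calls mirror the ones named above.

```r
# Hypothetical toy data (made up for illustration; not the Cleveland records)
toy <- data.frame(
  sex     = factor(c("Male", "Male", "Male", "Male", "Male", "Male",
                     "Female", "Female", "Female", "Female", "Female", "Female")),
  thalach = c(120, 132, 141, 150, 128, 160, 165, 172, 158, 149, 170, 155),
  hd      = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0)
)

# Two categorical variables: chi-squared test on their contingency table.
# With so few rows, R warns that the approximation may be inaccurate.
sex_test <- chisq.test(toy$sex, toy$hd)

# Continuous variable compared across the two outcome groups:
# Welch two-sample t-test (the default for t.test)
thalach_test <- t.test(toy$thalach ~ toy$hd)

print(sex_test$p.value)
print(thalach_test$p.value)
```

The formula form, thalach ~ hd, is what produces the "by" wording in the output header, as in "hd_data$thalach by hd_data$hd".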

Graphical visualization of the associations (Because every picture tells a better story!)

I plotted boxplots for the continuous variables, age and thalach (max heart rate).

I plotted a bar plot for sex, since it's a binary variable in this dataset.

→ The plots and statistical tests show that all three chosen clinical variables (age, sex, thalach) are significantly associated with the outcome, since the p-value is below 0.001 for every test.

Fitting Logistic Regression Model with all 3 variables

I fitted a logistic regression model here since there are three predictor variables and one binary outcome variable. This model helps us determine the effect that max heart rate (thalach), age, and sex have on the likelihood that an individual has heart disease.

model <- glm(data = hd_data, hd ~ age + sex + thalach, family = "binomial")

# extract the model summary
summary(model)

Here's the output after fitting the model:

Call:
glm(formula = hd ~ age + sex + thalach, family = "binomial", 
    data = hd_data)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2250  -0.8486  -0.4570   0.9043   2.1156  
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.111610   1.607466   1.936   0.0529 .  
age          0.031886   0.016440   1.940   0.0524 .  
sexMale      1.491902   0.307193   4.857 1.19e-06 ***
thalach     -0.040541   0.007073  -5.732 9.93e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
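The estimates in this table are on the log-odds scale, so one common way to read them is to exponentiate them into odds ratios. The sketch below does that with the coefficients copied from the output above; with the fitted model object at hand, exp(coef(model)) would be the equivalent call. This interpretation step is an addition for illustration and is not part of the original write-up.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(age = 0.031886, sexMale = 1.491902, thalach = -0.040541)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)

print(round(odds_ratios, 3))
```

An odds ratio above 1 (age, sexMale) means higher odds of hd = 1 as the predictor increases, while a value below 1 (thalach) means lower odds, holding the other predictors fixed.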

This logistic regression model can be used to predict the probability that a person has heart disease given their age, sex, and max heart rate. We can also translate the predicted probability into a decision rule for clinical use by defining a cutoff value on the probability scale. For instance, if a 45-year-old female patient with a max heart rate of 150 walks in, we can find the predicted probability of heart disease by creating a new data frame called newdata.

The model gives a heart disease probability of 0.177 for a 45-year-old female with a max heart rate of 150, which indicates a low risk of heart disease.
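To make the 0.177 figure concrete, the prediction can be reproduced by hand from the coefficients in the model summary. This is a sketch of the arithmetic, not the project's code; the variable names are illustrative.

```r
# Coefficients copied from the summary(model) output reported earlier
intercept <- 3.111610
b_age     <- 0.031886
b_male    <- 1.491902
b_thalach <- -0.040541

# The patient from the text: 45 years old, female (sexMale = 0),
# max heart rate 150
log_odds <- intercept + b_age * 45 + b_male * 0 + b_thalach * 150

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_hd <- plogis(log_odds)

print(round(p_hd, 3))
```

With the fitted model available, the same number should come from predict(model, newdata, type = "response"), where newdata is a data frame such as data.frame(age = 45, sex = "Female", thalach = 150).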

Evaluating model performance

While these predictive models can be used to predict the probability of an event occurring, it is vital to check a model's accuracy before relying on its predicted values. Some of the core metrics that can be used to evaluate this model are described below:

Accuracy: One of the most straightforward metrics; it tells us the proportion of all predictions that are correct.
Classification Error Rate: Calculated as 1 - Accuracy.
Area under the ROC curve (AUC): One of the most widely used evaluation metrics. It is popular because it is independent of changes in the proportion of responders. It ranges from 0 to 1; the closer it gets to 1, the better the model performance.
Confusion Matrix: An N × N matrix, where N is the number of outcome levels. It reports the number of false positives, false negatives, true positives, and true negatives.
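The four metrics above can be sketched in base R as follows. The outcome and probability vectors are hypothetical values chosen for illustration, the 0.5 cutoff is an assumption, and the AUC is computed through the rank-sum (Mann-Whitney) identity rather than with whatever package the original project used.

```r
# Hypothetical observed outcomes and predicted probabilities (illustrative only)
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70)

# Decision rule: classify as 1 when the probability reaches the cutoff
cutoff    <- 0.5
predicted <- ifelse(prob >= cutoff, 1, 0)

# Confusion matrix: rows are predictions, columns are observed outcomes
confusion <- table(predicted = predicted, actual = actual)

# Accuracy is the share of cases on the diagonal;
# the classification error rate is its complement
accuracy   <- sum(diag(confusion)) / sum(confusion)
error_rate <- 1 - accuracy

# AUC via the rank-sum identity: the chance that a random positive case
# receives a higher predicted probability than a random negative case
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(confusion)
print(c(accuracy = accuracy, error_rate = error_rate, auc = auc))
```

On this toy input, six of the eight cases are classified correctly, so accuracy is 0.75 and the error rate is 0.25. Moving the cutoff changes the confusion matrix and accuracy but leaves the AUC untouched, since AUC depends only on how the probabilities are ranked.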

Result

From the evaluation output, we can see that the model has an overall accuracy of 0.71. Some cases were also misclassified, as shown in the confusion matrix. We can improve the existing model by including other relevant predictors from the dataset.

You can find the entire code of this project on my GitHub.

Disclaimer: This project was done solely for educational purposes and to solidify my understanding of data mining techniques. It is not intended to be used for diagnosing actual heart patients. Please consult your healthcare practitioner for professional advice.

Original article: https://medium.com/swlh/predicting-heart-disease-using-regression-analysis-486401cd0a47
