Predicting Heart Disease Using Regression Analysis

According to the Centers for Disease Control and Prevention, heart disease is the leading cause of death for both men and women in the United States and around the globe. Researchers and statisticians can use several data mining techniques to help health care professionals identify heart disease and its potential causes. Significant risk factors associated with heart disease include age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise.

In this project from DataCamp, my objective is to build a regression model and run statistical tests to assess how strongly clinical factors are associated with heart disease and how they relate to a higher probability of having it. I implement multiple and logistic regression approaches, together with data exploration in ggplot and dplyr. The project uses the Cleveland heart disease dataset.

Here is a glimpse of the dataset:

[Figure: the first five rows of the Cleveland heart disease dataset]

Data Dictionary

The dataset has 14 columns, described below:

a. Age: A continuous variable giving the person's age in years.

b. Sex: A discrete variable giving the person's sex, where 0 = Female and 1 = Male.

c. CP (chest pain type): A discrete variable giving the chest pain type, where 1 = Typical angina; 2 = Atypical angina; 3 = Non-anginal pain; 4 = Asymptomatic.

d. Trestbps: A continuous variable giving the resting blood pressure in mm Hg.

e. Cholesterol: A continuous variable giving the serum cholesterol in mg/dl.

f. FBS: A discrete variable indicating whether the person's fasting blood sugar exceeds 120 mg/dl, where 1 = true and 0 = false.

g. RestECG: A discrete variable giving the resting ECG result, where 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy.

h. Thalach: A continuous variable giving the maximum heart rate achieved.

i. Exang: A discrete variable indicating exercise-induced angina, where 1 = Yes and 0 = No.

j. Oldpeak: A continuous variable giving the ST depression induced by exercise relative to rest.

k. Slope: A discrete variable giving the slope of the peak exercise ST segment, where 1 = up-sloping; 2 = flat; 3 = down-sloping.

l. Ca: A numeric variable giving the number of major vessels colored by fluoroscopy, ranging from 0 to 3.

m. Thal: A discrete variable describing thalassemia, where 3 = normal; 6 = fixed defect; 7 = reversible defect.

n. Class: A discrete variable giving the diagnosis class, where 0 = no heart disease present and 1 to 4 rank the likelihood of heart disease from least likely (1) to most likely (4).

Data Wrangling

Since the outcome variable class has more than two levels, I created a new variable hd using mutate() to represent a binary 0/1 outcome: any value greater than 0 becomes 1, and all 0 values stay 0. I also relabeled the sex levels (originally 1 and 0) as Male and Female for clarity.
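The recoding described above can be sketched as follows. This is a minimal illustration, assuming a tiny made-up data frame in place of the real Cleveland data, and it uses base R so that it runs without extra packages. With dplyr loaded, the same step could be written with mutate(), as the article does.

```r
# Hypothetical stand-in for the Cleveland data: only the two columns used here.
hd_data <- data.frame(
  sex   = c(1, 0, 1, 0, 1, 1),
  class = c(0, 2, 1, 0, 4, 0)
)

# Binary outcome: any class above 0 counts as heart disease (1), otherwise 0.
hd_data$hd <- ifelse(hd_data$class > 0, 1, 0)

# Relabel sex (originally 0/1) as a factor with readable levels.
hd_data$sex <- factor(hd_data$sex, levels = c(0, 1), labels = c("Female", "Male"))

print(hd_data)
```

Storing sex as a factor also means a later glm() call reports the coefficient as sexMale, with Female as the reference level.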

Statistical Tests

I ran statistical tests to check which predictor variables are closely related to heart disease. Depending on the data type, I used a t-test (continuous predictors) or a chi-squared test (discrete predictors) to obtain p-values.

In this project, I examined how sex, age and thalach are related to heart disease, as shown below:

→ Sex: Since sex is a binary variable in this dataset, the chi-squared test is the appropriate test. Here is the output of chisq.test() assessing the relationship between sex and hd (the outcome variable):

data:  hd_data$sex and hd_data$hd
X-squared = 22.043, df = 1, p-value = 2.667e-06

→ Age: Since age is a continuous variable, I used t.test() to assess the relationship between age and hd.

data:  hd_data$age by hd_data$hd
t = -4.0303, df = 300.93, p-value = 7.061e-05

→ Thalach: I used t.test() again to assess the relationship between thalach and hd.

data:  hd_data$thalach by hd_data$hd
t = 7.8579, df = 272.27, p-value = 9.106e-14
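For readers who want to try these calls, here is a minimal sketch of the three tests. It assumes randomly generated data with the same column names as the article (sex, age, thalach, hd), so the statistics it prints will not match the output shown above.

```r
set.seed(42)  # synthetic data only; results will differ from the article's

n <- 200
hd_data <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age     = round(rnorm(n, mean = 54, sd = 9)),
  thalach = round(rnorm(n, mean = 150, sd = 23)),
  hd      = rbinom(n, size = 1, prob = 0.45)
)

# Binary predictor vs. binary outcome: chi-squared test of independence.
hd_sex <- chisq.test(hd_data$sex, hd_data$hd)

# Continuous predictor vs. binary outcome: Welch two-sample t-test.
hd_age     <- t.test(hd_data$age ~ hd_data$hd)
hd_thalach <- t.test(hd_data$thalach ~ hd_data$hd)

print(hd_sex)
print(hd_age)
print(hd_thalach)
```

t.test() applies the Welch correction by default, which is why the degrees of freedom in the article's output are fractional.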

Graphical visualization of the associations (Because every picture tells a better story!)

I plotted boxplots for the continuous variables, age and thalach (max heart rate).

[Figure: Age vs. hd]

[Figure: Thalach (max heart rate) vs. hd]

I plotted a bar plot for sex, since it is a binary variable in this dataset.

[Figure: Sex vs. hd]

→ The plots above and the statistical tests show that all three chosen clinical variables (age, sex, thalach) are significantly associated with the outcome, since the p-value is below 0.001 in every test.
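Plots of this kind could be produced along the following lines. This is only a sketch: it assumes synthetic data with the article's column names, and it uses base R graphics instead of the ggplot code the author used, so that it runs without additional packages. The hd_labelled column is a name introduced here for illustration.

```r
set.seed(1)  # synthetic data only; the plots will not reproduce the article's

n <- 200
hd_data <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age     = round(rnorm(n, mean = 54, sd = 9)),
  thalach = round(rnorm(n, mean = 150, sd = 23)),
  hd      = rbinom(n, size = 1, prob = 0.45)
)

# Readable labels for the binary outcome (hypothetical helper column).
hd_data$hd_labelled <- ifelse(hd_data$hd == 0, "No disease", "Disease")

# Boxplots for the continuous predictors, split by outcome.
boxplot(age ~ hd_labelled, data = hd_data,
        xlab = "Heart disease", ylab = "Age (years)")
boxplot(thalach ~ hd_labelled, data = hd_data,
        xlab = "Heart disease", ylab = "Max heart rate")

# Grouped bar plot for the binary predictor: counts of each outcome by sex.
sex_by_hd <- table(hd_data$hd_labelled, hd_data$sex)
barplot(sex_by_hd, beside = TRUE, legend.text = TRUE,
        xlab = "Sex", ylab = "Count")
```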

Fitting Logistic Regression Model with all 3 variables

I fitted a logistic regression model here, since there are three predictor variables and one binary outcome variable. The model helps us determine the effect that max heart rate (thalach), age and sex have on the likelihood that an individual has heart disease.

model <- glm(data = hd_data, hd ~ age + sex + thalach, family = "binomial")

# extract the model summary
summary(model)

Here is the output after fitting the model:

Call:
glm(formula = hd ~ age + sex + thalach, family = "binomial",
    data = hd_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2250  -0.8486  -0.4570   0.9043   2.1156

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.111610   1.607466   1.936   0.0529 .
age          0.031886   0.016440   1.940   0.0524 .
sexMale      1.491902   0.307193   4.857 1.19e-06 ***
thalach     -0.040541   0.007073  -5.732 9.93e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

This logistic regression model can be used to predict the probability that a person has heart disease given their age, sex and max heart rate. We can also translate the predicted probability into a decision rule for clinical use by defining a cutoff value on the probability scale. For instance, if a 45-year-old female patient with a max heart rate of 150 walks in, we can find the predicted probability of heart disease by creating a new data frame called newdata with these values and asking the model for a prediction.

The model gives a heart disease probability of 0.177 for a 45-year-old female with a max heart rate of 150, which indicates a low risk of heart disease.
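The 0.177 figure can be checked by hand from the coefficients in the model summary. The sketch below plugs the patient's values into the fitted linear predictor and applies the inverse logit. The variable names are chosen here for illustration, and Female is taken as the reference level because the summary reports a sexMale coefficient.

```r
# Coefficients copied from the model summary reported in the article.
coefs <- c(intercept = 3.111610,
           age       = 0.031886,
           sexMale   = 1.491902,
           thalach   = -0.040541)

# The patient from the text: 45-year-old female, max heart rate 150.
patient_age     <- 45
patient_is_male <- 0    # Female is the reference level, so this term drops out
patient_thalach <- 150

# Linear predictor on the log-odds scale.
log_odds <- coefs[["intercept"]] +
  coefs[["age"]]     * patient_age +
  coefs[["sexMale"]] * patient_is_male +
  coefs[["thalach"]] * patient_thalach

# Inverse logit turns log-odds into a probability.
p_hd <- plogis(log_odds)
print(round(p_hd, 3))   # 0.177

# With the fitted model object available, the equivalent call would be:
# newdata <- data.frame(age = 45, sex = "Female", thalach = 150)
# predict(model, newdata, type = "response")
```

The log-odds work out to about -1.53, and 1 / (1 + exp(1.53)) is roughly 0.177, in agreement with the value quoted above.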

Evaluating model performance

While predictive models like this can estimate the probability of an event occurring, it is vital to check a model's accuracy before relying on its predicted values. Some of the core metrics for evaluating this model are described below:

  1. Accuracy: One of the most straightforward metrics; it is the proportion of all predictions that are correct.

  2. Classification Error Rate: Calculated as 1 - Accuracy.

  3. Area under the ROC curve (AUC): One of the most sought-after evaluation metrics. It is popular because it is independent of changes in the proportion of responders. It ranges from 0 to 1; the closer it is to 1, the better the model performs.

  4. Confusion Matrix: An N*N matrix, where N is the number of outcome levels. It reports the number of false positives, false negatives, true positives, and true negatives.

[Figure: Model performance metrics with the output]
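The four metrics above can be computed with base R alone, as in the sketch below. It makes several assumptions: the data are synthetic (generated from a made-up logistic relationship), the 0.5 cutoff is chosen only for illustration, and AUC is computed from the rank-sum identity instead of a dedicated package. None of the numbers it prints correspond to the article's results.

```r
set.seed(7)  # synthetic data only; metrics will not match the article's

n <- 300
age     <- round(rnorm(n, mean = 54, sd = 9))
thalach <- round(rnorm(n, mean = 150, sd = 23))
sex     <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Made-up relationship used only to generate a plausible binary outcome.
lp <- 3 + 0.03 * age + 1.5 * (sex == "Male") - 0.04 * thalach
hd <- rbinom(n, size = 1, prob = plogis(lp))
hd_data <- data.frame(age, sex, thalach, hd)

model <- glm(hd ~ age + sex + thalach, data = hd_data, family = "binomial")

# Predicted probabilities, then a decision rule with an assumed 0.5 cutoff.
pred_prob <- predict(model, type = "response")
pred_hd   <- ifelse(pred_prob >= 0.5, 1, 0)

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(actual    = hd_data$hd,
                   predicted = factor(pred_hd, levels = c(0, 1)))

accuracy    <- mean(pred_hd == hd_data$hd)
class_error <- 1 - accuracy

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a random
# positive case receives a higher score than a random negative case.
pos <- pred_prob[hd_data$hd == 1]
neg <- pred_prob[hd_data$hd == 0]
r   <- rank(c(pos, neg))
auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(confusion)
print(c(accuracy = accuracy, class_error = class_error, auc = auc))
```

Accuracy and the confusion matrix depend on the chosen cutoff, while AUC summarizes performance across all cutoffs.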

Result

From the above output, we can see that the model has an overall accuracy of 0.71. Some cases were misclassified, as the confusion matrix shows. The model could be improved by including other relevant predictors from the dataset.

You can find the entire code of this project on my GitHub.

Disclaimer: This project was done solely for educational purposes and to solidify my understanding of data mining techniques. It is not intended to be used for diagnosing actual heart patients. Please consult your healthcare practitioner for professional advice.

Original article: https://medium.com/swlh/predicting-heart-disease-using-regression-analysis-486401cd0a47
