项目背景

背景描述

本数据集涵盖了与睡眠和日常习惯有关的诸多变量。如性别、年龄、职业、睡眠时间、睡眠质量、身体活动水平、压力水平、BMI类别、血压、心率、每日步数、以及是否有睡眠障碍等细节。

数据集的主要特征：综合睡眠指标：
探索睡眠持续时间、质量和影响睡眠模式的因素。
生活方式因素：分析身体活动水平、压力水平和 BMI 类别。
心血管健康：检查血压和心率测量值。
睡眠障碍分析：确定失眠和睡眠呼吸暂停等睡眠障碍的发生。
数据集列：人员 ID：
每个人的标识符。
性别：人员的性别（男性/女性）。
年龄：人员的年龄（以年为单位）。
职业：人的职业或职业。
睡眠持续时间（小时）：该人每天睡眠的小时数。
睡眠质量（量表：1-10）：对睡眠质量的主观评分，范围从1到10。
身体活动水平（分钟/天）：该人每天进行身体活动的分钟数。
压力水平（等级：1-10）：对人所经历的压力水平的主观评级，范围从1到10。
BMI 类别：人的 BMI 类别（例如，体重不足、正常、超重）。
血压（收缩压/舒张压）：人的血压测量值，表示收缩压超过舒张压。
心率（bpm）：人的静息心率（以每分钟心跳数为单位）。
每日步数：此人每天的步数。
睡眠障碍：人体内是否存在睡眠障碍（无、失眠、睡眠呼吸暂停）。

有关睡眠障碍专栏的详细信息：

类型	说明
无	个体没有表现出任何特定的睡眠障碍。
失眠	个人难以入睡或保持睡眠状态，导致睡眠不足或质量差。
睡眠呼吸暂停	个人在睡眠期间呼吸暂停，导致睡眠模式中断和潜在的健康风险。

睡眠健康和生活方式数据集包括400行和13列，涵盖了与睡眠和日常习惯相关的广泛变量。它包括性别、年龄、职业、睡眠持续时间、睡眠质量、身体活动水平、压力水平、身体质量指数分类、血压、心率、每日步数以及是否存在睡眠障碍等详细信息。
在这里，我们将使用已经可供使用的“Sleep _ health _ and _ life style _ dataset . CSV”数据库，下面您将看到数据的分析、数据的处理以及使用机器模型的学习分类来实现我们的目标。

导入库

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from yellowbrick.classifier import ConfusionMatrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder

导入数据

df = pd.read_csv('/kaggle/input/sleep-health-and-lifestyle-dataset/Sleep_health_and_lifestyle_dataset.csv', sep = ',')

这里我们可以看到，我们有分类和连续变量，我们也可以看到，我们没有空值。
在这里插入图片描述

Person ID（个人ID）:每个人的标识符。
Gender（性别）:人的性别(男/女)。
Age（年龄）:以年为单位的人的年龄。
Occupation（职业）:人的职业或专业。
Sleep Duration (hours)（睡眠持续时间(小时)）:人每天睡眠的小时数。
Quality of Sleep (scale: 1-10)（睡眠质量(等级:1-10):对睡眠质量的主观评价，范围从1到10。
Physical Activity Level (minutes/day)（身体活动水平(分钟/天)）:个人每天从事身体活动的分钟数。
Stress Level (scale: 1-10)（压力水平(等级:1-10)）:对人所经历的压力水平的主观评级，范围从1到10。
BMI Category（身体质量指数类别）:个人的身体质量指数类别(例如，体重不足、正常、超重)。
Blood Pressure (systolic/diastolic)（血压(收缩压/舒张压)）:个人的血压测量值，表示为收缩压/舒张压。
Heart Rate (bpm)（心率(bpm)）:人的静息心率，单位为每分钟心跳数。
Daily Steps（每日步数）:此人每天走的步数。
Sleep Disorder（睡眠障碍）:人是否存在睡眠障碍(无、失眠、睡眠呼吸暂停)。

在这里插入图片描述

数据分析

corr = df.corr().round(2)
plt.figure(figsize = (25,20))
sns.heatmap(corr, annot = True, cmap = 'YlOrBr')

在这里插入图片描述
分类变量。
在这里，当我们查看分类变量时，我们可以看到我们的数据在男性和女性之间分布良好，查看身体质量指数，我们可以看到大多数人在正常和超重之间，当我们查看我们的目标变量时，我们可以看到大多数人没有睡眠问题，那些有睡眠问题的人在失眠和睡眠呼吸暂停之间分布良好。

plt.figure(figsize = (20, 10))plt.subplot(2, 2, 1)
plt.gca().set_title('Variable Gender')
sns.countplot(x = 'Gender', palette = 'Set2', data = df)plt.subplot(2, 2, 2)
plt.gca().set_title('Variable BMI Category')
sns.countplot(x = 'BMI Category', palette = 'Set2', data = df)plt.subplot(2, 2, 3)
plt.gca().set_title('Variable Sleep Disorder')
sns.countplot(x = 'Sleep Disorder', palette = 'Set2', data = df)

在这里插入图片描述
看看职业一栏，我们可以看到我们有一些优势职业。

连续变量。
当我们看我们的分类变量时，我们可以看到，我们在大多数变量中没有一个模式，几乎所有的变量在一些数据中显示平衡，而在另一些数据中不显示平衡。

plt.figure(figsize = (15, 18))plt.subplot(4, 2, 1)
sns.histplot(x = df['Age'], kde = False)plt.subplot(4, 2, 2)
sns.histplot(x = df['Sleep Duration'], kde = False)plt.subplot(4, 2, 3)
sns.histplot(x = df['Quality of Sleep'], kde = False)plt.subplot(4, 2, 4)
sns.histplot(x = df['Physical Activity Level'], kde = False)plt.subplot(4, 2, 5)
sns.histplot(x = df['Stress Level'], kde = False)plt.subplot(4, 2, 6)
sns.histplot(x = df['Heart Rate'], kde = False)plt.subplot(4, 2, 7)
sns.histplot(x = df['Daily Steps'], kde = False)

在这里插入图片描述
查看箱线图，我们可以确认我们没有需要处理的异常值

plt.title("Boxplot Age", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Age"])

在这里插入图片描述

plt.title("Boxplot Sleep Duration", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Sleep Duration"])

在这里插入图片描述

plt.title("Boxplot Quality of Sleep", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Quality of Sleep"])

在这里插入图片描述

plt.title("Boxplot Physical Activity Level", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Physical Activity Level"])

在这里插入图片描述

plt.title("Boxplot Stress Level", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Stress Level"])

在这里插入图片描述

plt.title("Boxplot Heart Rate", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Heart Rate"])

在这里插入图片描述

plt.title("Boxplot Daily Steps", fontdict = {'fontsize': 20})
sns.boxplot(x=df["Daily Steps"])

在这里插入图片描述
双变量分析。
当我们比较我们的目标变量和分类变量时，我们可以看到有趣的模式，例如女性比男性有更多的睡眠问题，同样的，当我们看身体质量指数变量时，超重的人更有可能有睡眠问题，正常体重的人通常没有任何问题。
当我们看病人的职业时，我们可以看到一个非常有趣的事情，教授，护士和售货员更有可能有睡眠问题，当我们看律师，医生和工程师时，我们可以看到他们一般没有任何问题。

plt.figure(figsize = (20, 10))
plt.suptitle("Analysis Of Variable Sleep Disorder",fontweight="bold", fontsize=20)plt.subplot(2, 1, 1)
plt.gca().set_title('Variable Gender')
sns.countplot(x = 'Gender', hue = 'Sleep Disorder', palette = 'Set2', data = df)plt.subplot(2, 1, 2)
plt.gca().set_title('Variable BMI Category')
sns.countplot(x = 'BMI Category', hue = 'Sleep Disorder', palette = 'Set2', data = df)

在这里插入图片描述

plt.figure(figsize = (20, 10))plt.subplot(2, 1, 1)
plt.gca().set_title('Variable Occupation')
sns.countplot(x = 'Occupation', hue = 'Sleep Disorder', palette = 'Set2', data = df)plt.subplot(2, 1, 2)
plt.gca().set_title('Variable Blood Pressure')
sns.countplot(x = 'Blood Pressure', hue = 'Sleep Disorder', palette = 'Set2', data = df)

在这里插入图片描述

plt.figure(figsize = (25, 20))
plt.suptitle("Analysis Of Variable Sleep Disorder",fontweight="bold", fontsize=20)plt.subplot(4,2,1)
sns.boxplot(x="Sleep Disorder", y="Age", data=df)plt.subplot(4,2,2)
sns.boxplot(x="Sleep Disorder", y="Sleep Duration", data=df)plt.subplot(4,2,3)
sns.boxplot(x="Sleep Disorder", y="Quality of Sleep", data=df)plt.subplot(4,2,4)
sns.boxplot(x="Sleep Disorder", y="Physical Activity Level", data=df)plt.subplot(4,2,5)
sns.boxplot(x="Sleep Disorder", y="Stress Level", data=df)plt.subplot(4,2,6)
sns.boxplot(x="Sleep Disorder", y="Heart Rate", data=df)plt.subplot(4,2,7)
sns.boxplot(x="Sleep Disorder", y="Daily Steps", data=df)

在这里插入图片描述

模型构建。

这里，我们将删除不会在模型中使用的Person ID变量。

df = df.drop('Person ID', axis = 1)

hot = pd.get_dummies(df[['Gender', 'Occupation', 'BMI Category', 'Blood Pressure']])
df = pd.concat([df, hot], axis = 1)
df = df.drop(['Gender', 'Occupation', 'BMI Category', 'Blood Pressure'], axis = 1)

X = df.drop('Sleep Disorder', axis = 1)
X = X.values
y = df['Sleep Disorder']

标准缩放器
在这里，我们将使用StandardScaler将我们的数据放在相同的比例中

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)

将数据转换为训练e测试，这里我们将使用30%的数据来测试机器学习模型。

from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X_standard, y, test_size = 0.3, random_state = 0)

朴素贝叶斯
运行高斯模型。
这里我们将使用朴素贝叶斯模型，我们将使用我们的正态数据测试高斯模型。
在我们的第一个模型中，我们有一个非常差的结果，只有53%的准确率，虽然它只能很好地预测有问题的人，但它在预测没有问题的人时结果很差。

from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)
previsoes = naive_bayes.predict(X_test)cm = ConfusionMatrix(naive_bayes)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

在这里插入图片描述

score_naive_gaussian = 0.530973451327433

决策图表
这里我们将使用决策树模型，我们将测试熵和基尼系数的计算。
在这里，我们应用GridSearch来检查哪些是可以使用的最佳指标。

parameters = {'max_depth': [3, 4, 5, 6, 7, 9, 11],'min_samples_split': [2, 3, 4, 5, 6, 7],'criterion': ['entropy', 'gini']}model = DecisionTreeClassifier()
gridDecisionTree = RandomizedSearchCV(model, parameters, cv = 3, n_jobs = -1)
gridDecisionTree.fit(X_train, y_train)print('Mín Split: ', gridDecisionTree.best_estimator_.min_samples_split)
print('Max Nvl: ', gridDecisionTree.best_estimator_.max_depth)
print('Algorithm: ', gridDecisionTree.best_estimator_.criterion)
print('Score: ', gridDecisionTree.best_score_)

在这里插入图片描述
运行决策树。
现在，在我们的决策树模型中，与朴素贝叶斯相比，我们有了非常大的改进，我们有89.38%的准确性，该模型能够很好地预测3个类别。

score_tree = 0.8938053097345132

columns = df.drop('Sleep Disorder', axis = 1).columns
feature_imp = pd.Series(decision_tree.feature_importances_, index = columns).sort_values(ascending = False)
feature_imp

BMI Category_Normal 0.450011
Physical Activity Level 0.227875
Age 0.109282
BMI Category_Normal Weight 0.090082
Sleep Duration 0.041911
Stress Level 0.038346
Blood Pressure_142/92 0.017656
Occupation_Teacher 0.013466
Occupation_Lawyer 0.005720
Occupation_Engineer 0.002862
Quality of Sleep 0.002789
Gender_Female 0.000000
Blood Pressure_130/85 0.000000
Blood Pressure_122/80 0.000000
Blood Pressure_125/80 0.000000
Blood Pressure_125/82 0.000000
Blood Pressure_126/83 0.000000
Blood Pressure_128/84 0.000000
Blood Pressure_128/85 0.000000
Blood Pressure_129/84 0.000000
Blood Pressure_130/86 0.000000
Blood Pressure_120/80 0.000000
Blood Pressure_131/86 0.000000
Blood Pressure_132/87 0.000000
Blood Pressure_135/88 0.000000
Blood Pressure_135/90 0.000000
Blood Pressure_139/91 0.000000
Blood Pressure_140/90 0.000000
Blood Pressure_140/95 0.000000
Blood Pressure_121/79 0.000000
Blood Pressure_118/76 0.000000
Blood Pressure_119/77 0.000000
Gender_Male 0.000000
Occupation_Accountant 0.000000
Occupation_Doctor 0.000000
Occupation_Manager 0.000000
Occupation_Nurse 0.000000
Occupation_Sales Representative 0.000000
Occupation_Salesperson 0.000000
Occupation_Scientist 0.000000
Occupation_Software Engineer 0.000000
Daily Steps 0.000000
Heart Rate 0.000000
BMI Category_Obese 0.000000
BMI Category_Overweight 0.000000
Blood Pressure_115/78 0.000000
Blood Pressure_117/76 0.000000
Blood Pressure_118/75 0.000000
Blood Pressure_115/75 0.000000
dtype: float64
随机森林
这里我们将使用随机森林模型，我们将测试熵和基尼系数的计算。
应用网格搜索

from sklearn.ensemble import RandomForestClassifierparameters = {'max_depth': [3, 4, 5, 6, 7, 9, 11],'min_samples_split': [2, 3, 4, 5, 6, 7],'criterion': ['entropy', 'gini']}model = RandomForestClassifier()
gridRandomForest = RandomizedSearchCV(model, parameters, cv = 5, n_jobs = -1)
gridRandomForest.fit(X_train, y_train)print('Algorithm: ', gridRandomForest.best_estimator_.criterion)
print('Score: ', gridRandomForest.best_score_)
print('Mín Split: ', gridRandomForest.best_estimator_.min_samples_split)
print('Max Nvl: ', gridRandomForest.best_estimator_.max_depth)

在这里插入图片描述
运行随机森林。
在这里，在随机森林模型中，我们设法提高了更多，我们获得了90.26%的准确性。

random_forest = RandomForestClassifier(n_estimators = 100, min_samples_split = 5, max_depth= 5,  criterion = 'gini', random_state = 0)
random_forest.fit(X_train, y_train)
previsoes = random_forest.predict(X_test)cm = ConfusionMatrix(random_forest)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

在这里插入图片描述

score_random = 0.9026548672566371

检查模型中最重要的变量

feature_imp_random = pd.Series(random_forest.feature_importances_, index = columns).sort_values(ascending = False)
feature_imp_random

MI Category_Normal 0.131351
BMI Category_Overweight 0.130821
Blood Pressure_140/95 0.092496
Age 0.086894
Sleep Duration 0.073755
Occupation_Nurse 0.060407
Physical Activity Level 0.055393
Heart Rate 0.049097
Daily Steps 0.048757
Stress Level 0.041599
Quality of Sleep 0.030586
Occupation_Salesperson 0.026217
Blood Pressure_135/90 0.025475
Blood Pressure_130/85 0.020394
Gender_Male 0.015391
Blood Pressure_125/80 0.014591
Occupation_Doctor 0.013854
Occupation_Engineer 0.012983
Occupation_Teacher 0.012427
Gender_Female 0.011748
BMI Category_Normal Weight 0.008467
BMI Category_Obese 0.006766
Blood Pressure_120/80 0.005265
Occupation_Accountant 0.003363
Occupation_Lawyer 0.002648
Blood Pressure_132/87 0.002587
Occupation_Sales Representative 0.002245
Blood Pressure_130/86 0.001697
Blood Pressure_128/85 0.001614
Blood Pressure_128/84 0.001585
Blood Pressure_142/92 0.001332
Blood Pressure_131/86 0.001149
Blood Pressure_139/91 0.000985
Blood Pressure_129/84 0.000805
Blood Pressure_140/90 0.000802
Blood Pressure_135/88 0.000744
Blood Pressure_126/83 0.000704
Blood Pressure_118/75 0.000622
Occupation_Software Engineer 0.000531
Blood Pressure_115/75 0.000520
Occupation_Scientist 0.000515
Blood Pressure_121/79 0.000504
Blood Pressure_117/76 0.000260
Blood Pressure_115/78 0.000051
Blood Pressure_119/77 0.000001
Blood Pressure_122/80 0.000000
Blood Pressure_118/76 0.000000
Blood Pressure_125/82 0.000000
Occupation_Manager 0.000000
dtype: float64

额外的树
这里我们将使用额外的树模型，我们将测试熵和基尼系数的计算。
应用网格搜索

from sklearn.ensemble import ExtraTreesClassifierparameters = {'max_depth': [3, 4, 5, 6, 7, 9, 11],'min_samples_split': [2, 3, 4, 5, 6, 7],'criterion': ['entropy', 'gini']}model = ExtraTreesClassifier()
gridExtraTrees = RandomizedSearchCV(model, parameters, cv = 3, n_jobs = -1)
gridExtraTrees.fit(X_train, y_train)print('Algorithm: ', gridExtraTrees.best_estimator_.criterion)
print('Score: ', gridExtraTrees.best_score_)
print('Mín Split: ', gridExtraTrees.best_estimator_.min_samples_split)
print('Max Nvl: ', gridExtraTrees.best_estimator_.max_depth)

在这里插入图片描述

score_extra = 0.8849557522123894

kNN
这里我们将使用K-Neighbors模型，我们将使用GridSearch模型来找出在该模型中使用的最佳指标。
在这里，我们将使用GridSearch来找出在该模型中使用的最佳指标。

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()k_list = list(range(1,10))
k_values = dict(n_neighbors = k_list)
grid = GridSearchCV(knn, k_values, cv = 2, scoring = 'accuracy', n_jobs = -1)
grid.fit(X_train, y_train)grid.best_params_, grid.best_score_

运行K-Neighbors。
虽然我们的结果稍差，但它仍然是一个很好的模型，准确率为88.49%。

knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
previsoes = knn.predict(X_test)cm = ConfusionMatrix(knn)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

在这里插入图片描述

score_knn = 0.8849557522123894

逻辑回归
这里我们将使用线性回归模型。
我们设法得到了一个更好的结果，在逻辑回归模型中，我们有91.11%的准确率。

from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(random_state = 1, max_iter=10000)
logistic.fit(X_train, y_train)
previsoes = logistic.predict(X_test)cm = ConfusionMatrix(logistic)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

在这里插入图片描述

logistic_normal = 0.911504424778761

adaboost算法
这里我们将使用AdaBoost模型，我们将使用GridSearch模型来找出在该模型中使用的最佳指标。
应用网格搜索

from sklearn.ensemble import AdaBoostClassifierparameters = {'learning_rate': [0.01, 0.02, 0.05, 0.07, 0.09, 0.1, 0.3, 0.001, 0.005],'n_estimators': [300, 500]}model = AdaBoostClassifier()
gridAdaBoost = RandomizedSearchCV(model, parameters, cv = 2, n_jobs = -1)
gridAdaBoost.fit(X_train, y_train)print('Learning Rate: ', gridAdaBoost.best_estimator_.learning_rate)
print('Score: ', gridAdaBoost.best_score_)

在这里插入图片描述
运行Ada Boost。
这里，在AdaBoost模型中，我们设法保持与模型相同的质量，准确率为91.15%。

ada_boost = AdaBoostClassifier(n_estimators = 500, learning_rate =  0.02, random_state = 0)
ada_boost.fit(X_train, y_train)
previsoes = ada_boost.predict(X_test)cm = ConfusionMatrix(ada_boost)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)

在这里插入图片描述

score_ada_scaler = 0.911504424778761

结论

我们的数据不是很多，，我们的样本总共只有13列374个数据，另一件使我们的工作更容易的事情是，我们没有空值，所以我们不需要执行处理。
当进行数据分析时，我们可以看到我们的数据之间有很多相关性，但由于我们几乎没有可用的数据，所以没有删除这些数据，当我们具体查看我们的分类变量时，我们可以看到我们的数据在性别变量方面很平衡，当我们查看我们的目标变量时，我们可以看到我们的大部分数据没有睡眠问题，那些有睡眠问题的数据在两个类别之间很好地平衡(睡眠呼吸暂停和失眠)，当我们查看连续变量时，我们没有发现它们之间的模式，查看箱线图时，我发现没有必要处理异常值，数据分布良好。
当我们观察双变量分析比较我们的目标变量和我们的解释数据时，我们已经得出了一些结论，女性更有可能有睡眠问题，当我们看身体质量指数变量时，我们可以看到正常体重的人通常没有问题，超重的人通常更有可能有睡眠问题。当我们看可变职业时，有趣的是看到一些职业比其他职业更容易有睡眠问题，另一个引起我注意的变量是年龄变量，老年人更容易有睡眠问题。