[Data Analysis in Practice] Predicting Pet Adoption Status

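Contents

Dataset
- Dataset description
- Features
- Uses
- Note
Pet adoption prediction
- Environment setup
- Exploring the DataFrame
- Data preprocessing
- Machine learning
  - Data preprocessing:
  - Model training and evaluation:
  - Ensemble learning: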


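Dataset

Dataset description

The pet adoption dataset offers a broad look at the factors that may influence how likely a pet is to be adopted from a shelter. It contains detailed records of pets available for adoption, covering a range of characteristics and attributes.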

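Features

PetID: Unique identifier for each pet.
PetType: Type of pet (e.g., dog, cat, bird, rabbit).
Breed: Specific breed of the pet.
AgeMonths: Age of the pet in months.
Color: Color of the pet.
Size: Size category of the pet (small, medium, large).
WeightKg: Weight of the pet in kilograms.
Vaccinated: Vaccination status (0 = not vaccinated, 1 = vaccinated).
HealthCondition: Health condition (0 = healthy, 1 = has a medical condition).
TimeInShelterDays: Time the pet has spent in the shelter, in days.
AdoptionFee: Adoption fee charged for the pet, in US dollars.
PreviousOwner: Whether the pet had a previous owner (0 = no, 1 = yes).
AdoptionLikelihood: Likelihood of the pet being adopted (0 = unlikely, 1 = likely).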

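Uses

This dataset is well suited to data scientists and analysts who want to understand and predict pet adoption trends. It can be used for:

Predictive modeling to estimate the likelihood that a pet is adopted.
Analyzing how individual factors affect adoption rates.
Developing strategies to raise adoption rates at shelters.

Note

The dataset is intended to support research and initiatives focused on raising pet adoption rates and helping more pets find their forever homes.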

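Pet adoption prediction

Environment setup

This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python. For example, here are a few useful packages to load: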

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

Input data files are available in the read-only "…/Input/" directory. For example, running the following (by clicking Run or pressing Shift+Enter) lists all files under the input directory:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Load the dataset:

df = pd.read_csv('/kaggle/input/predict-pet-adoption-status-dataset/pet_adoption_data.csv')

Exploring the DataFrame

Run df.head() to display the first five rows.

Next, define a helper that prints a summary of the DataFrame:

def get_df_info(df):
    print("\n\033[1mShape of DataFrame:\033[0m ", df.shape)
    print("\n\033[1mColumns in DataFrame:\033[0m ", df.columns.to_list())
    print("\n\033[1mData types of columns:\033[0m\n", df.dtypes)

    print("\n\033[1mInformation about DataFrame:\033[0m")
    df.info()

    print("\n\033[1mNumber of unique values in each column:\033[0m")
    for col in df.columns:
        print(f"\033[1m{col}\033[0m: {df[col].nunique()}")

    print("\n\033[1mNumber of null values in each column:\033[0m\n", df.isnull().sum())
    print("\n\033[1mNumber of duplicate rows:\033[0m ", df.duplicated().sum())
    print("\n\033[1mDescriptive statistics of DataFrame:\033[0m\n", df.describe().transpose())

# Call the function
get_df_info(df)

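The output (a screenshot in the original post) lists the shape, column names, data types, unique-value counts, null counts, duplicate-row count, and descriptive statistics.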

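Data preprocessing

1. Drop the 'PetID' column, which is only an identifier

df = df.drop('PetID', axis=1)

2. Split the DataFrame into features (X) and target (y)

X = df.drop('AdoptionLikelihood', axis=1)
y = df['AdoptionLikelihood']

3. One-hot encode the categorical variables in X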

X = pd.get_dummies(X)
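
To make the encoding step concrete, here is a minimal sketch of what pd.get_dummies does to a small frame. The three rows and their category values are hypothetical and are not taken from the dataset; only the column names come from the feature list.

```python
import pandas as pd

# Hypothetical three-row sample; column names follow the dataset description,
# the values are made up for illustration.
sample = pd.DataFrame({
    "PetType": ["Dog", "Cat", "Dog"],
    "Size": ["Small", "Large", "Medium"],
    "AgeMonths": [12, 30, 7],
})

# Text columns are expanded into one indicator column per category;
# numeric columns such as AgeMonths pass through unchanged.
encoded = pd.get_dummies(sample)
print(encoded.columns.tolist())
```

Each text column becomes as many indicator columns as it has distinct values, so high-cardinality columns such as Breed can widen X considerably.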

Machine learning

Import the required libraries:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import VotingClassifier, StackingClassifier

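The function apply_models takes the features (X) and target labels (y) as input and performs the following tasks:

Data preprocessing:

Split the data into training and test sets.
Check for class imbalance and, if needed, apply SMOTE (oversampling) to the training set.
Scale the features with StandardScaler.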
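
The imbalance check in the second step can be sketched in isolation. The helper name needs_smote and the two toy label arrays are assumptions for illustration; the rule itself (more than two classes, or a minority-to-majority ratio below 0.1) mirrors the condition used in apply_models.

```python
import numpy as np

def needs_smote(y_train, ratio_threshold=0.1):
    """Hypothetical helper mirroring the imbalance rule used in apply_models."""
    class_counts = np.bincount(y_train)  # number of samples per integer label
    ratio = class_counts.min() / class_counts.max()
    return bool(len(class_counts) > 2 or ratio < ratio_threshold)

# Toy label arrays (made up): one balanced, one heavily skewed.
balanced = np.array([0] * 50 + [1] * 50)
skewed = np.array([0] * 95 + [1] * 5)

print(needs_smote(balanced))  # ratio 50/50 = 1.0
print(needs_smote(skewed))    # ratio 5/95, below the 0.1 threshold
```

Note that np.bincount expects non-negative integer labels, which holds for the 0/1 AdoptionLikelihood target.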

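Model training and evaluation:

Define a set of machine learning classification models.
Train each model on the training data.
Evaluate each model on the test data using accuracy and weighted F1 score.
Print a detailed report for each model (accuracy, confusion matrix, classification report).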
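
The train-and-evaluate loop described above can be sketched on a reduced scale. This version assumes synthetic data from make_classification as a stand-in for the pet adoption features, and uses only two of the models so it runs with scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the pet adoption dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# Map each model name to its (accuracy, weighted F1) on the test split.
model_performance = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="weighted")
    model_performance[name] = (accuracy, f1)
    print(f"{name}: accuracy={accuracy:.3f}, weighted F1={f1:.3f}")
```

Storing the scores in a dictionary keyed by model name is what makes the later ranking step a one-line sort.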

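Ensemble learning:

Identify the three best-performing models by F1 score.
Build two ensemble models (a voting classifier and a stacking classifier) from the top three models.
Evaluate the ensemble models on the test data using accuracy, the confusion matrix, and the classification report.
In short, the function explores a range of classification models, identifies the best performers, and tries to improve on them through ensemble learning.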
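
The ranking and ensembling steps can be sketched separately from the full function. This is an illustration under assumptions: the data is synthetic, only four scikit-learn candidates are used, and names such as candidates and top_3 are chosen for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the pet adoption dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Score every candidate with weighted F1 on the test split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

# Keep the three best names and pair them with their estimators.
top_3 = sorted(scores, key=scores.get, reverse=True)[:3]
estimators = [(name, candidates[name]) for name in top_3]

# Hard voting takes the majority label; stacking trains a final estimator
# (logistic regression by default) on the base models' predictions.
voting_clf = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)
stacking_clf = StackingClassifier(estimators=estimators).fit(X_train, y_train)

voting_f1 = f1_score(y_test, voting_clf.predict(X_test), average="weighted")
stacking_f1 = f1_score(y_test, stacking_clf.predict(X_test), average="weighted")
print(top_3, round(voting_f1, 3), round(stacking_f1, 3))
```

Both ensemble classes clone and refit the estimators they are given, so passing already-fitted models is harmless: the ensembles train their own copies.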

The full function:

def apply_models(X, y):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Check for class imbalance
    class_counts = np.bincount(y_train)
    if len(class_counts) > 2 or np.min(class_counts) / np.max(class_counts) < 0.1:
        print("Class imbalance detected. Applying SMOTE...")
        # Apply SMOTE (class imbalance)
        smote = SMOTE(random_state=42)
        X_train, y_train = smote.fit_resample(X_train, y_train)

    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Fit the scaler on the training data and transform both training and test data
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Define the models
    models = {
        'LogisticRegression': LogisticRegression(),
        'SVC': SVC(),
        'DecisionTree': DecisionTreeClassifier(),
        'RandomForest': RandomForestClassifier(),
        'ExtraTrees': ExtraTreesClassifier(),
        'AdaBoost': AdaBoostClassifier(),
        'GradientBoost': GradientBoostingClassifier(),
        'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
        'LightGBM': LGBMClassifier(),
        'CatBoost': CatBoostClassifier(verbose=0)
    }

    # Initialize a dictionary to hold the performance of each model
    model_performance = {}

    # Apply each model
    for model_name, model in models.items():
        print(f"\n\033[1mClassification with {model_name}:\033[0m\n{'-' * 30}")

        # Fit the model to the training data
        model.fit(X_train, y_train)

        # Make predictions on the test data
        y_pred = model.predict(X_test)

        # Calculate the accuracy and f1 score
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # Store the performance in the dictionary
        model_performance[model_name] = (accuracy, f1)

        # Print the accuracy score
        print("\033[1m**Accuracy**:\033[0m\n", accuracy)

        # Print the confusion matrix
        print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))

        # Print the classification report
        print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

    # Sort the models based on f1 score and pick the top 3
    top_3_models = sorted(model_performance.items(), key=lambda x: x[1][1], reverse=True)[:3]
    print("\n\033[1mTop 3 Models based on F1 Score:\033[0m\n", top_3_models)

    # Extract the model names and classifiers for the top 3 models
    top_3_model_names = [model[0] for model in top_3_models]
    top_3_classifiers = [models[model_name] for model_name in top_3_model_names]

    # Create a Voting Classifier with the top 3 models
    print("\n\033[1mInitializing Voting Classifier with top 3 models...\033[0m\n")
    voting_clf = VotingClassifier(estimators=list(zip(top_3_model_names, top_3_classifiers)), voting='hard')
    voting_clf.fit(X_train, y_train)
    y_pred = voting_clf.predict(X_test)
    print("\n\033[1m**Voting Classifier Evaluation**:\033[0m\n")
    print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
    print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
    print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))

    # Create a Stacking Classifier with the top 3 models
    print("\n\033[1mInitializing Stacking Classifier with top 3 models...\033[0m\n")
    stacking_clf = StackingClassifier(estimators=list(zip(top_3_model_names, top_3_classifiers)))
    stacking_clf.fit(X_train, y_train)
    y_pred = stacking_clf.predict(X_test)
    print("\n\033[1m**Stacking Classifier Evaluation**:\033[0m\n")
    print("\033[1m**Accuracy**:\033[0m\n", accuracy_score(y_test, y_pred))
    print("\n\033[1m**Confusion Matrix**:\033[0m\n", confusion_matrix(y_test, y_pred))
    print("\n\033[1m**Classification Report**:\033[0m\n", classification_report(y_test, y_pred))