Explain Python Machine Learning Models with SHAP Library


  • 11 September 2021
  • Muhammad Fawi
  • Machine Learning

Using the SHapley Additive exPlanations (SHAP) Library to Explain Python ML Models

Almost always after developing an ML model, we find ourselves in a position where we need to explain it. Even when the model is very good, it is still a black box that needs to be deciphered. Explaining a model is an important step in a data science project that we usually overlook. The SHAP library makes explaining Python machine learning models, even deep learning ones, easy, with intuitive visualizations. It also shows feature importances and how each feature affects the model output.

Here we are going to explore some of SHAP’s power in explaining a Logistic Regression model.

We will use the Bank Marketing dataset[1] to predict whether a customer will subscribe to a term deposit.

Data Exploration

We will start by importing all necessary libraries and reading the data. We will use the smaller dataset in the bank-additional zip file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import zipfile
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, auc, roc_curve

zf = zipfile.ZipFile("bank-additional.zip")
df = pd.read_csv(zf.open("bank-additional/bank-additional.csv"), sep = ";")

df.shape
# (4119, 21)

Let’s look closely at the data and its structure. We will not go deep into the exploratory data analysis step; however, we will see what the data looks like and compute some summary and descriptive stats.

df.isnull().sum().sum() # no NAs
# 0

## looking at numeric variables summary stats
df.describe()

Let’s have a quick look at how the object variables are distributed between the two classes, yes and no.

## counts
df.groupby("y").size()
# y
# no     3668
# yes     451
# dtype: int64

num_cols = list(df.select_dtypes(np.number).columns)
print(num_cols)
# ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

obj_cols = list(df.select_dtypes(object).drop("y", axis = 1).columns)
print(obj_cols)
# ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']

df[obj_cols + ["y"]].groupby("y").agg(["nunique"])
# (nunique per column)
# y    job  marital  education  default  housing  loan  contact  month  day_of_week  poutcome
# no    12        4          8        3        3     3        2     10            5         3
# yes   12        4          7        2        3     3        2     10            5         3

The categorical variables seem to be similarly distributed between the two classes.
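To check that more directly, a normalized crosstab shows the share of each class within every category. A minimal sketch using job (any column in obj_cols works the same way); categories whose yes share deviates strongly from the overall rate of roughly 11% (451 of 4119) would carry more signal:

## share of no/yes within each job category
pd.crosstab(df["job"], df["y"], normalize = "index")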

This is admittedly a quick and shallow analysis, but EDA is beyond the scope of this post.

Feature Preprocessing

Now it is time to prepare the features for the LR model: scale the numeric variables and one-hot encode the categorical ones. We will use ColumnTransformer to apply different preprocessors to different columns and wrap everything in a pipeline.

## change classes to float
df["y"] = np.where(df["y"] == "yes", 1., 0.)

## the pipeline
scaler = Pipeline(steps = [
    ## there are no NAs anyway
    ("imputer", SimpleImputer(strategy = "median")),
    ("scaler", StandardScaler())
])

encoder = Pipeline(steps = [
    ("imputer", SimpleImputer(strategy = "constant", fill_value = "missing")),
    ("onehot", OneHotEncoder(handle_unknown = "ignore")),
])

preprocessor = ColumnTransformer(
    transformers = [
        ("num", scaler, num_cols),
        ("cat", encoder, obj_cols)
    ])

pipe = Pipeline(steps = [("preprocessor", preprocessor)])

Split the data into train and test sets, fit the pipeline on the train data, and transform both train and test.

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("y", axis = 1), df.y,
    stratify = df.y,
    random_state = 13,
    test_size = 0.25)

X_train = pipe.fit_transform(X_train)
X_test = pipe.transform(X_test)

Back to the exploratory phase for a moment. A good way to visualize one-hot encoded data, sparse matrices of 1s and 0s, is imshow(). We will look at the last contact month column, which is now converted into several columns with a 1 in the month when the contact happened. The plot will also be split between yes and no.

First let’s get the new feature names from the pipeline.

## getting feature names from the pipeline
nums = pipe["preprocessor"].transformers_[0][2]
## note: scikit-learn >= 1.0 renames get_feature_names to get_feature_names_out
obj = list(pipe["preprocessor"].transformers_[1][1]["onehot"].get_feature_names(obj_cols))
fnames = nums + obj

len(fnames) ## new number of columns due to one hot encoding
# 62

Let’s now visualize!

from matplotlib.colors import ListedColormap

print([i for i in obj if "month" in i])
# ['month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep']

## filter the train data down to the month columns
tr = X_train[:, [True if "month" in i else False for i in fnames]]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 7))
fig.suptitle("Subscription per Contact Month", fontsize = 20)
cmapmine1 = ListedColormap(["w", "r"], N = 2)
cmapmine2 = ListedColormap(["w", "b"], N = 2)
ax1.imshow(tr[y_train == 0.0], cmap = cmapmine1, interpolation = "none", extent = [3, 6, 9, 12])
ax1.set_title("Not Subscribed")
ax2.imshow(tr[y_train == 1.0], cmap = cmapmine2, interpolation = "none", extent = [3, 6, 9, 12])
ax2.set_title("Subscribed")
plt.show()

Of course, we would need to sort the columns in calendar order and add labels to make the plot more readable, but this is enough to quickly visualize sparse matrices of 1s and 0s. A rough sketch of that cleanup follows.
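A minimal sketch of the tidier version, assuming the month_* feature names printed above (January and February never occur in this dataset):

## reorder the month columns chronologically and label the ticks
months = ["mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
idx = [fnames.index("month_" + m) for m in months]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 7))
for ax, mask, cmap, title in [(ax1, y_train == 0.0, cmapmine1, "Not Subscribed"),
                              (ax2, y_train == 1.0, cmapmine2, "Subscribed")]:
    ax.imshow(X_train[:, idx][mask], cmap = cmap, interpolation = "none", aspect = "auto")
    ax.set_xticks(range(len(months)))
    ax.set_xticklabels(months)
    ax.set_title(title)
plt.show()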

Model Development

Now it is time to develop the model and fit it.

clf = LogisticRegression(
    solver = "newton-cg", max_iter = 50, C = .1, penalty = "l2"
)
clf.fit(X_train, y_train)
# LogisticRegression(C=0.1, max_iter=50, solver='newton-cg')

Now we will look at the model’s AUC and then set the threshold we will use to predict the test data.

y_pred_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, ls = "--", label = "LR AUC = %0.2f" % roc_auc)
plt.plot([0, 1], [0, 1], c = "r", label = "No Skill AUC = 0.5")
plt.legend(loc = "lower right")
plt.ylabel("true positive rate")
plt.xlabel("false positive rate")
plt.show()

The model shows a very good AUC. Let’s now find the threshold that gives the best trade-off between recall and precision.

precision, recall, threshold = precision_recall_curve(
    y_test, y_pred_proba)

## precision and recall carry one extra trailing element (the (1, 0) point);
## drop it so they align with threshold
tst_prt = pd.DataFrame({
    "threshold": threshold,
    "recall": recall[:-1],
    "precision": precision[:-1]
})

tst_prt_melted = pd.melt(tst_prt, id_vars = ["threshold"],
                         value_vars = ["recall", "precision"])

sns.lineplot(x = "threshold", y = "value",
             hue = "variable", data = tst_prt_melted)

We can spot that 0.3 can be a very good threshold.
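Rather than only eyeballing the curve, we can also pick the threshold that maximizes F1, the harmonic mean of precision and recall, from the same arrays. A minimal sketch (the exact optimum depends on the split, so treat 0.3 as a visual approximation of it):

## F1 score at every candidate threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
print("threshold maximizing F1: %.2f" % threshold[np.argmax(f1)])

Let’s test 0.3 on the test data.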

y_pred = np.zeros(len(y_test))
y_pred[y_pred_proba >= 0.3] = 1.

print("Accuracy: %.2f%%" % (100 * accuracy_score(y_test, y_pred)))
print("Precision: %.2f%%" % (100 * precision_score(y_test, y_pred)))
print("Recall: %.2f%%" % (100 * recall_score(y_test, y_pred)))
# Accuracy: 91.65%
# Precision: 61.54%
# Recall: 63.72%
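Since confusion_matrix is already imported, we can also inspect the raw counts behind these percentages; an optional quick check:

## rows are actual classes, columns are predicted classes
pd.DataFrame(confusion_matrix(y_test, y_pred),
             index = ["actual_no", "actual_yes"],
             columns = ["pred_no", "pred_yes"])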

Great! The model is performing well. Maybe it can be enhanced, but for now let’s try to explain how it behaves with SHAP.

Model Explanation and Feature Importance

Introducing SHAP

From SHAP’s documentation: SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

In brief, leaving the math aside, this is how it works. When we pass a model and a training dataset, a base value is calculated: the average model output over the training dataset. Then shap values are calculated for each feature of each example. Each feature, through its shap values, pushes the model output away from that base value. In a binary classification model, features that push the model output above the base value contribute to the positive class, while features contributing to the negative class push the output below the base value.

Let’s see what this looks like. First we define our explainer and calculate the shap values.

explainer = shap.Explainer(clf, X_train, feature_names = np.array(fnames))
shap_values = explainer(X_test)
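A quick sanity check on the additivity described above: for any row, the base value plus the sum of that row’s shap values should reproduce the model output. For a linear model, shap’s explainer works in log-odds (margin) space by default, so the reconstruction should match decision_function; a sketch, assuming that default:

## base value + sum of shap values should equal the model's log-odds output for row i
i = 0
print(shap_values[i].base_values + shap_values[i].values.sum())
print(clf.decision_function(X_test[i:i + 1])[0])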

Now let’s visualize how this works with an example.

Individual Visualization

## we init JS once in our session
shap.initjs()

ind = np.argmax(y_test == 0)
print("actual is:", y_test.values[ind], "while pred is:", y_pred[ind])
shap.plots.force(shap_values[ind])
# actual is: 0.0 while pred is: 0.0

We can see how the (scaled) observed values of duration, number of employees, 3-month euribor, and contact via telephone = 1 push the model output below the base value (-3.03), resulting in a negative example. Meanwhile, last contact in June rather than May and a scaled consumer price index of 1.53 pushed to the right but couldn’t beat the blue force.

We can also look at the same example as a waterfall graph, which represents the cumulative sum of how the shap values add up to move the model output away from the base value.

shap.plots.waterfall(shap_values[ind])

We can see the collision between the features pushing left and right until we reach the output. The numbers on the left side are the actual observations in the data, while the numbers inside the graph are the shap values of each feature for this example.

Let’s look at a positive example using the same two graphs.

ind = np.argmax(y_test == 1)
print("actual is:", y_test.values[ind], "while pred is:", y_pred[ind])
shap.plots.force(shap_values[ind])
# actual is: 1.0 while pred is: 1.0

shap.plots.waterfall(shap_values[ind])

It is now obvious how the values contribute to the positive class. We can see from the two examples that high duration contributes to the positive class while low duration contributes to the negative one. The opposite holds for number of employees: high nr_employed contributes to the negative class and low nr_employed contributes to the positive one.

Collective Visualization

We saw how the force plot shows the features explaining the model output; however, it covers only one observation. We will now look at the same force plot but for multiple observations at the same time.

shap.force_plot(explainer.expected_value, shap_values.values, X_test, feature_names = fnames)

This plot (interactive in the notebook) is the same as the individual force plot; just imagine multiple force plots rotated 90 degrees and stacked side by side, one per example. A heatmap can also be viewed to see the effect of each feature on each example.

shap.plots.heatmap(shap_values)

The heatmap shows the shap value of each feature for each example in the data. Above the map, the model output per example is shown as the small line plot going above and below the base line.

Another very useful graph is the beeswarm. It gives an overview of which features are most important to the model. Like the heatmap, it plots the shap values of every feature for every sample, and it sorts the features by the sum of their shap value magnitudes over all examples.

shap.plots.beeswarm(shap_values)

We can see that duration is the most important variable, and that high duration increases the probability of the positive class, subscription in our example, while a high number of employees decreases the probability of subscription.
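To drill into a single feature from the beeswarm, shap also provides a dependence scatter plot; for duration it should show the same trend of higher (scaled) values pushing toward subscription (this indexes the Explanation object by feature name, which recent shap versions support):

## shap values of duration plotted against its observed (scaled) values
shap.plots.scatter(shap_values[:, "duration"])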

We can also get the mean of the absolute shap values for each feature and plot a bar chart.

shap.plots.bar(shap_values)

Fantastic! We have seen how SHAP can help explain our logistic regression model with very useful visualizations. The library can explain many kinds of models, including neural networks, and the project’s GitHub repo has many notebook examples.
