fairmodels: Let's fight with biased Machine Learning models

TL;DR

The R package fairmodels facilitates bias detection through model visualizations. It implements a few mitigation strategies that can reduce bias. It enables easy-to-use checks of fairness metrics and comparison between different Machine Learning (ML) models.

Long version

Bias mitigation is an important topic in the Machine Learning (ML) fairness field. For Python users, there are algorithms already implemented, well explained, and described (see AIF360). fairmodels provides an implementation of a few popular, effective bias mitigation techniques, ready to make your model fairer.

I have a biased model, now what?

Having a biased model is not the end of the world. There are lots of ways to deal with it. fairmodels implements various algorithms to help you tackle the problem. First, I must describe the difference between pre-processing and post-processing algorithms.

  • Pre-processing algorithms work on data before the model is trained. They try to mitigate the bias between the privileged and unprivileged subgroups through inference from the data.

  • Post-processing algorithms change the output of a model (explained with DALEX) so that it does not favor the privileged subgroup so much.

How do these algorithms work?

In this section, I will briefly describe how these bias mitigation techniques work. Code for more detailed examples, along with some of the visualizations used here, can be found in this vignette.

Pre-processing

Disparate impact remover (Feldman et al., 2015)

(image by author) Disparate impact removing: the blue and red distributions are transformed into a “middle” distribution.

This algorithm works on numeric, ordinal features. It changes the column values so that the distributions for the unprivileged (blue) and privileged (red) subgroups are close to each other. In general, we would like our algorithm to judge not on the raw value of the feature but rather on percentiles (e.g., hiring the best 20% of applicants from each subgroup). The algorithm works by finding the distribution that minimizes the earth mover's distance. In simple words, it finds the “middle” distribution and changes the values of this feature for each subgroup.

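A minimal sketch of how this could look with fairmodels is shown below; the hours_per_week column is used purely for illustration, and lambda = 1 means a full repair (smaller values interpolate between the original and the repaired distribution):

library(fairmodels)

data("adult")
protected <- adult$sex

# move the per-group distributions of a numeric feature towards
# a common "middle" distribution (lambda = 1 -> full repair)
fixed_adult <- disparate_impact_remover(data = adult,
                                        protected = protected,
                                        features_to_transform = "hours_per_week",
                                        lambda = 1)

# a model would then be trained on fixed_adult instead of adult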
Reweighting (Kamiran et al., 2012)

(image by author) In this mock-up example, S = 1 is the privileged subgroup. There is a weight for each unique combination of S and y.

Reweighting is a simple but effective tool for minimizing bias. The algorithm looks at the protected attribute and at the real label. Then, it calculates the probability of assigning the favorable label (y = 1) under the assumption that the protected attribute and y are independent. Of course, if there is bias, they will be statistically dependent. The algorithm then divides this theoretical probability by the true, empirical probability of the event. That is how a weight is created. With these two vectors (the protected variable and y) we can compute a weight for each observation in the data and pass the weights to the model. Simple as that. However, some models do not have a weights parameter and therefore cannot benefit from this method.

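A rough sketch of this workflow, mirroring the data preparation used later in this post (gbm accepts case weights, so it can consume the output of reweight() directly):

library(fairmodels)
library(gbm)

data("adult")
adult$salary <- as.numeric(adult$salary) - 1
protected <- adult$sex
adult <- adult[colnames(adult) != "sex"] # protected attribute kept outside the data

# one weight per observation: the theoretical probability of the (S, y) pair
# under independence divided by its empirical probability
weights <- reweight(protected = protected, y = adult$salary)

set.seed(1)
gbm_weighted <- gbm(salary ~ .,
                    data = adult,
                    weights = weights,
                    distribution = "bernoulli")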
Resampling (Kamiran et al., 2012)

(image by author) Uniform sampling. Circles denote duplication, and x's denote omission of an observation.

Resampling is closely related to the previous method, as it implicitly uses reweighting to calculate how many observations must be omitted or duplicated in a particular case. Imagine there are two groups, deprived (S = 0) and favored (S = 1). This method duplicates observations from the deprived subgroup when the label is positive and omits observations with a negative label. The opposite is then performed on the favored group. There are two types of resampling implemented: uniform and preferential. Uniform randomly picks observations (as in the picture), whereas preferential uses probabilities to pick/omit observations close to the cutoff (the default is 0.5).

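A sketch of both variants; resample() returns indices of the observations to train on (some duplicated, some omitted), and the preferential variant additionally needs predicted probabilities from some initial model, denoted here by a hypothetical probs vector:

library(fairmodels)

data("adult")
adult$salary <- as.numeric(adult$salary) - 1
protected <- adult$sex

# uniform resampling: indices of observations to keep
uniform_indexes <- resample(protected = protected,
                            y = adult$salary)
adult_uniform <- adult[uniform_indexes, ]

# preferential resampling would additionally take type = "preferential"
# and probs (scores of some initial model, not created in this sketch):
# preferential_indexes <- resample(protected = protected,
#                                  y = adult$salary,
#                                  type = "preferential",
#                                  probs = probs)

# ... train the model on adult_uniform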
Post-processing

Post-processing takes place after creating an explainer. To create an explainer, we need a model and DALEX. A gbm model will be trained on the adult dataset to predict whether a certain person earns more than 50k annually.

library(gbm)
library(DALEX)
library(fairmodels)

data("adult")
adult$salary <- as.numeric(adult$salary) - 1
protected <- adult$sex
adult <- adult[colnames(adult) != "sex"] # sex not specified

# making model
set.seed(1)
gbm_model <- gbm(salary ~ ., data = adult, distribution = "bernoulli")

# making explainer
gbm_explainer <- explain(gbm_model,
                         data = adult[, -1],
                         y = adult$salary,
                         colorize = FALSE)

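Before applying any post-processing, it may be useful to quantify the bias of this raw model first; an illustrative sketch using the protected vector defined above:

fobject_raw <- fairness_check(gbm_explainer,
                              protected = protected,
                              privileged = "Male",
                              label = "gbm_raw")
plot(fobject_raw) # Fairness Check plot with parity loss of several metrics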
Reject Option based Classification (pivot) (Kamiran et al., 2012)

(image by author) Red: privileged, blue: unprivileged. If a predicted probability falls within (cutoff - theta, cutoff + theta), it is moved (pivoted) to the opposite side of the cutoff.

ROC pivot is implemented based on Reject Option based Classification. The algorithm switches labels if an observation is from the unprivileged group and lies to the left of the cutoff; the opposite is performed for the privileged group. There is, however, an assumption that the observation must be close (in terms of probability) to the cutoff, so the user must provide a value theta telling the algorithm how close an observation must be to the cutoff for the switch. But there is a catch. If only the labels were changed, the DALEX explainer would have a hard time properly calculating the performance of the model. For that reason, in the fairmodels implementation of this algorithm, it is the probabilities that are switched (pivoted). They are simply moved to the other side of the cutoff, keeping an equal distance to it.

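A sketch reusing the gbm_explainer created above, with theta = 0.05 assumed here as the closeness margin:

# probabilities within (cutoff - theta, cutoff + theta) are moved
# to the other side of the cutoff, at the same distance from it
gbm_explainer_pivoted <- roc_pivot(gbm_explainer,
                                   protected = protected,
                                   privileged = "Male",
                                   cutoff = 0.5,
                                   theta = 0.05)

fairness_check(gbm_explainer_pivoted,
               protected = protected,
               privileged = "Male",
               label = "gbm_roc_pivot")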
Cutoff manipulation

(image by author) plot(ceteris_paribus_cutoff(fobject, cumulated = TRUE))

Cutoff manipulation can be a great way to minimize the bias in a model. We simply choose the metrics and the subgroup for which the cutoff will change. The plot shows where the minimum is; for that cutoff value, the parity loss will be the lowest. How do we create a fairness_object with a different cutoff for a certain subgroup? It is easy!

fobject <- fairness_check(gbm_explainer,
                          protected = protected,
                          privileged = "Male",
                          label = "gbm_cutoff",
                          cutoff = list(Female = 0.35))

Now the fairness_object (fobject) is a structure with the specified cutoff, and it will affect both the fairness metrics and the performance.

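To see the effect, the object can simply be printed or plotted; a short sketch:

print(fobject) # parity loss of fairness metrics with the manipulated cutoff
plot(fobject)  # Fairness Check plot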
The tradeoff between fairness and accuracy

If we want to mitigate bias, we must be aware of the possible drawbacks of this action. Let's say that Statistical Parity is the most important metric for us. Lowering the parity loss of this metric will (probably) result in an increase in False Positives, which will cause the accuracy to drop. For this example (which you can find here), a gbm model was trained and then treated with different bias mitigation techniques.

(image by author)

The more we try to mitigate the bias, the lower the accuracy we get. This is natural for this metric, and the user should be aware of it.

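One way to look at this tradeoff in fairmodels is to gather several explainers in a single fairness_check() and then plot performance against fairness. The sketch below assumes a second, hypothetical explainer called gbm_explainer_mitigated (e.g. built on resampled data or after roc_pivot()) and, if I recall the API correctly, the performance_and_fairness() helper:

# gbm_explainer_mitigated is a hypothetical second explainer
fobject_all <- fairness_check(gbm_explainer, gbm_explainer_mitigated,
                              protected = protected,
                              privileged = "Male")

paf <- performance_and_fairness(fobject_all,
                                fairness_metric = "STP",
                                performance_metric = "accuracy")
plot(paf) # parity loss of Statistical Parity vs. accuracy for each model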
Summary

The debiasing methods implemented in fairmodels are certainly worth trying. They are flexible, most of them are suited to any model, and, most of all, they are easy to use.

What to read next?

  • Blog post about introduction to fairness, problems, and solutions

  • Blog post about fairness visualization

Learn more

  • Check the package’s GitHub website for more details

  • Tutorial on full capabilities of the fairmodels package

  • Tutorial on bias mitigation techniques

Translated from: https://towardsdatascience.com/fairmodels-lets-fight-with-biased-machine-learning-models-f7d66a2287fc
