Do Not Use Decision Tree Like This

As one of the most popular classic machine learning algorithms, the Decision Tree stands out for its explainability, which makes it much more intuitive than the others. In one of my previous articles, I introduced the basic idea and mechanism of a Decision Tree model, demonstrating it with ID3, one of the most classic algorithms for training a Decision Tree classification model.

If you are not that familiar with Decision Trees, it is highly recommended to check out the above article before reading into this one.

To understand Decision Trees intuitively, ID3 is indeed a good place to start. However, it is probably not a good idea to use it in practice. In this article, I’ll introduce a commonly used algorithm for building Decision Tree models: C4.5.

Drawbacks of the Classic ID3 Algorithm

Photo by aitoff on Pixabay

Before we can demonstrate the major drawbacks of the ID3 algorithm, let’s have a look at its major building blocks. Basically, the important ones are Entropy and Information Gain.

Recap of Entropy

Here is the formula of Entropy:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)$$

The set “X” is everything in the node’s set, and “xᵢ” refers to the specific decision of each sample. Therefore, “P(xᵢ)” is the probability that a sample in the set carries the decision “xᵢ”.

(The example training dataset: features “Weather”, “Temperature” and “Wind Level”, and the decision of whether to go out for running.)

Let’s use the same training dataset as an example. Suppose that we have an internal node in our Decision Tree with “weather = rainy”. It can be seen that the final decisions of its two samples are both “No”. Then, we can easily calculate the entropy of this node as follows:

$$H(\text{rainy}) = -\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2} = 0 \quad (\text{taking } 0\log_2 0 = 0)$$

Basically, the probability of being “No” is 2/2 = 1, whereas the probability of being “Yes” is 0/2 = 0.
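
To make this recap concrete, here is a minimal Python sketch of the entropy calculation; the `entropy` function and the toy label lists are my own illustrations, not code from the original article:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels.

    Implements H(X) = -sum_i P(x_i) * log2(P(x_i)); classes with
    probability 0 never appear in the Counter, so the 0 * log2(0) = 0
    convention is handled implicitly.
    """
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

# The "weather = rainy" node above: both samples decided "No".
print(entropy(["No", "No"]))    # 0.0 -- the outcome is certain
print(entropy(["No", "Yes"]))   # 1.0 -- maximum uncertainty for two classes
```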

Recap of Information Gain

On top of the concept of Entropy, we can calculate the Information Gain, which is the basic criterion for deciding whether a feature should be used to split a node.

For example, we have three features: “Weather”, “Temperature” and “Wind Level”. When we start to build our Decision Tree using ID3, how can we decide which one of them should be used as the root node?

ID3 uses Information Gain as the criterion. The rule is: select the feature with the maximum Information Gain among all candidates. Here is the formula for calculating Information Gain:

$$IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{|T_v|}{|T|} \, H(T_v)$$

where

  • “T” is the set of samples at the parent node, and “a” is the feature being evaluated; “T_v” is the subset of “T” in which “a” takes the value “v”
  • The notation “|T|” means the size of the set

Using the same example, when we calculate the Information Gain for “Weather = Rainy”, we also need to take its child nodes’ entropies into account. The specific derivation and calculation steps can be found in the article shared in the introduction.
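
As an illustration rather than the article’s own code, here is a sketch of the Information Gain computation; the `entropy` helper is repeated so the snippet runs on its own, and the `weather` / `go_run` columns are hypothetical stand-ins for the training table:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(T, a) = H(T) - sum over values v of |T_v| / |T| * H(T_v)."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)        # T_v: labels of rows where a == v
    weighted_child_entropy = sum(
        len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted_child_entropy

# Hypothetical columns standing in for the article's training table:
weather = ["Sunny", "Rainy", "Rainy", "Sunny", "Cloudy", "Cloudy"]
go_run  = ["Yes",   "No",    "No",    "Yes",   "Yes",    "No"]
print(information_gain(weather, go_run))    # ~0.667 for this toy data
```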

Major Drawbacks of Using Information Gain

The major drawback of using Information Gain as the criterion for choosing which feature becomes the root/next node is that it tends to favour features with more unique values.

But why? Let me demonstrate it using an extreme scenario. Let’s say we have got the training set with one more feature: “Date”.

(The same training dataset with one extra column, “Date”, whose value is unique for every row.)

You might say that the feature “Date” should not be considered in this case, because intuitively it will not help us decide whether we should go out for running. Yes, you’re right. In practice, however, the datasets we need to classify can be much more complicated, and we may not understand all the features, so we cannot always tell whether a feature makes sense. Here, I will just use “Date” as an example.

Now, let’s calculate the Information Gain for “Date”. We can start by calculating the entropy for one of the dates, such as “2020-01-01”.

$$H(\text{Date} = \text{2020-01-01}) = -\frac{1}{1}\log_2\frac{1}{1} = 0$$

Since there is only 1 row for each date, the final decision must be either “Yes” or “No”. So, the entropy must be 0! In terms of information theory, it is equivalent to saying:

The date tells us nothing, because there is only one outcome, which is certain. So, there is no “uncertainty” at all.

Similarly, the entropies of all the other dates are 0, too.

Now, let’s calculate the entropy of the feature “Date” itself, that is, the entropy of its value distribution.

$$H(\text{Date}) = -\sum_{i=1}^{n} \frac{1}{n}\log_2\frac{1}{n} = \log_2 n \quad (\text{where } n \text{ is the number of rows})$$

Wow, that is a pretty large number compared to the entropies we have seen so far. Keep it in mind, because it will come back later as the penalty term in C4.5. For now, let’s calculate the Information Gain of “Date”.

$$IG(T, \text{Date}) = H(T) - \sum_{v} \frac{1}{n} \times 0 = H(T)$$

Unsurprisingly, the Information Gain of “Date” comes out as the entropy of the parent node itself, the maximum possible value, because all of its child nodes have entropies of 0.
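
To see this numerically, the short sketch below (helpers repeated from the earlier snippets, labels made up) confirms that an all-unique column such as “Date” attains exactly the parent node’s entropy as its Information Gain:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(feature_values, labels):
    subsets = defaultdict(list)
    for v, l in zip(feature_values, labels):
        subsets[v].append(l)
    return entropy(labels) - sum(
        len(s) / len(labels) * entropy(s) for s in subsets.values())

go_run = ["No", "No", "Yes", "Yes", "No", "Yes"]     # made-up decisions
dates  = [f"2020-01-0{d}" for d in range(1, 7)]      # one unique date per row
print(entropy(go_run))                   # 1.0 -- entropy of the parent node
print(information_gain(dates, go_run))   # 1.0 -- every child entropy is 0
```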

If we calculate the Information Gain for the other three features (you can find the details in the article linked in the introduction), they are:

  • Information Gain of Weather is 0.592
  • Information Gain of Temperature is 0.522
  • Information Gain of Wind Level is 0.306

Obviously, the Information Gain of “Date” is overwhelmingly larger than the others. In fact, it reaches the theoretical maximum, the parent node’s entropy, so no other feature can ever beat it, no matter how large the training dataset grows. On top of that, don’t forget that the feature “Date” actually makes no sense for deciding whether we should go out for running, yet it is selected as the “best” feature for the root node.

Even funnier, after we decide to use “Date” as our root node, we’re done :)

(The resulting Decision Tree: “Date” as the root node, with each date value leading directly to a leaf decision.)

We end up with the Decision Tree shown above. This is because the feature “Date” is “too good”: if we use it as the root node, each of its values directly tells us whether we should go out for running. None of the other features are needed.

Image by Clker-Free-Vector-Images on Pixabay

Yes, you may be making a face like this fish right now, and so am I.

Fixing the Information Gain Limitation

Photo by jarmoluk on Pixabay

The easiest fix for the Information Gain limitation in the ID3 algorithm comes from another Decision Tree algorithm called C4.5. Its basic idea for reducing this issue is to use the Information Gain Ratio rather than the raw Information Gain.

Specifically, the Information Gain Ratio simply adds a penalty to the Information Gain by dividing it by the split information of the feature, that is, the entropy of the feature’s own value distribution (exactly the “entropy of the date itself” that we computed above).

$$GainRatio(T, a) = \frac{IG(T, a)}{SplitInfo(T, a)}, \quad SplitInfo(T, a) = -\sum_{v \in vals(a)} \frac{|T_v|}{|T|}\log_2\frac{|T_v|}{|T|}$$

In other words,

$$\text{Information Gain Ratio} = \frac{\text{Information Gain}}{\text{Split Information}}$$

Therefore, if we use C4.5 rather than ID3, the Information Gain Ratio of the feature “Date” will be as follows:

$$GainRatio(T, \text{Date}) = \frac{H(T)}{\log_2 n}$$

which, unlike the raw Information Gain, shrinks as the number of rows n grows.

Well, it is indeed still the largest one compared to the other features, but don’t forget that this is a truly extreme example, in which every value of the feature “Date” covers only one row. In practice, the Information Gain Ratio is usually enough to avoid most of the scenarios where Information Gain would introduce bias.
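
Here is how the fix looks in the same sketch style (helpers repeated so the block runs standalone; the data is hypothetical). Note how the split-information penalty crushes the all-unique “Date” column while a sensible feature survives:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(feature_values, labels):
    subsets = defaultdict(list)
    for v, l in zip(feature_values, labels):
        subsets[v].append(l)
    return entropy(labels) - sum(
        len(s) / len(labels) * entropy(s) for s in subsets.values())

def gain_ratio(feature_values, labels):
    """Information Gain divided by the split information, i.e. the
    entropy of the feature's own value distribution."""
    split_info = entropy(feature_values)
    if split_info == 0:            # single-valued feature: useless as a split
        return 0.0
    return information_gain(feature_values, labels) / split_info

go_run  = ["No", "No", "Yes", "Yes", "No", "Yes"]
dates   = [f"2020-01-0{d}" for d in range(1, 7)]               # all unique
weather = ["Rainy", "Rainy", "Sunny", "Sunny", "Cloudy", "Sunny"]

print(gain_ratio(dates, go_run))     # ~0.387: 1.0 penalised by log2(6)
print(gain_ratio(weather, go_run))   # ~0.685: the sensible feature now wins
```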

Other Improvements in C4.5

Photo by silviarita on Pixabay

In my opinion, the Information Gain Ratio is the most significant improvement from ID3 to C4.5. Nevertheless, there are a few more improvements in C4.5 that you should know.

PEP (Pessimistic Error Pruning)

If you are not familiar with the concept of “pruning” a Decision Tree, again, you may want to check out my previous article linked in the introduction of this article.

PEP is another significant improvement in C4.5. Specifically, it prunes the tree in a top-down manner. For every internal node, the algorithm estimates the error rate of the subtree, then compares it with the error rate the node would have after pruning the branch, and decides on that basis whether the branch should be kept (a simplified sketch follows the list below).

Some characteristics of PEP:

  1. It is one of the post-pruning methods.
  2. It prunes the tree without depending on a validation dataset.
  3. It is usually quite good at avoiding overfitting, and consequently improves performance when classifying unseen data.
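
The exact statistics differ slightly across descriptions of PEP, so take the following as a simplified sketch of the commonly cited rule, with a made-up Node structure: a subtree is replaced by a leaf when the leaf’s continuity-corrected error is within one standard error of the subtree’s.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    errors_as_leaf: int   # training errors if this node were collapsed to a leaf
    n: int                # number of training samples reaching this node
    children: list = field(default_factory=list)

def n_leaves(node):
    return 1 if not node.children else sum(n_leaves(c) for c in node.children)

def subtree_errors(node):
    if not node.children:
        return node.errors_as_leaf
    return sum(subtree_errors(c) for c in node.children)

def pep_prune(node):
    """Pessimistic Error Pruning, applied top-down."""
    if not node.children:
        return
    # Continuity-corrected error of the subtree: half a sample per leaf.
    e_subtree = subtree_errors(node) + 0.5 * n_leaves(node)
    std_err = math.sqrt(e_subtree * (node.n - e_subtree) / node.n)
    e_leaf = node.errors_as_leaf + 0.5
    if e_leaf <= e_subtree + std_err:
        node.children = []    # prune: a single leaf is not significantly worse
    else:
        for child in node.children:
            pep_prune(child)
```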

Discretising Continuous Features

C4.5 supports continuous values, so we are not limited to categorical values such as “Low”, “Medium” and “High”. Instead, C4.5 automatically detects the threshold on a continuous feature that generates the maximum Information Gain Ratio, and then splits the node using that threshold.
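
As a sketch of that idea (helpers repeated, temperatures and labels invented for illustration): candidate thresholds are the midpoints between consecutive distinct values, and the binary split with the highest Information Gain Ratio wins.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    subsets = defaultdict(list)
    for v, l in zip(feature_values, labels):
        subsets[v].append(l)
    ig = entropy(labels) - sum(
        len(s) / len(labels) * entropy(s) for s in subsets.values())
    split_info = entropy(feature_values)
    return ig / split_info if split_info > 0 else 0.0

def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive distinct values
    and keep the binary split with the highest Gain Ratio."""
    distinct = sorted(set(values))
    best_t, best_score = None, -1.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        split = ["<=" if v <= t else ">" for v in values]
        score = gain_ratio(split, labels)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up continuous temperatures instead of "Low" / "Medium" / "High":
temps  = [5.0, 8.0, 12.0, 18.0, 22.0, 27.0]
go_run = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, go_run))    # picks 10.0, between 8.0 and 12.0
```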

Summary

Photo by Bessi on Pixabay

In this article, I have illustrated why ID3 is not ideal. The major reason is that its criterion, Information Gain, is significantly biased towards features with larger numbers of distinct values.

The solution comes with another Decision Tree algorithm called C4.5. It evolves Information Gain into the Information Gain Ratio, which reduces the impact of features with large numbers of distinct values.

Again, if you feel that you need more context and basic knowledge about Decision Trees, please check out my previous article.

Translated from: https://towardsdatascience.com/do-not-use-decision-tree-like-this-369769d6104d
