Do Not Use Decision Tree Like This

As one of the most popular classic machine learning algorithms, the Decision Tree stands out for its explainability, which makes it much more intuitive than most others. In one of my previous articles, I introduced the basic idea and mechanism of a Decision Tree model, demonstrating it with an algorithm called ID3, one of the most classic algorithms for training a Decision Tree classification model.

If you are not that familiar with Decision Trees, it is highly recommended to check out the article above before reading this one.

To understand Decision Trees intuitively, ID3 is indeed a good starting point. However, it is probably not a good idea to use it in practice. In this article, I'll introduce a commonly used algorithm for building Decision Tree models: C4.5.

Drawbacks of the Classic ID3 Algorithm


Before demonstrating the major drawbacks of the ID3 algorithm, let's have a look at its major building blocks: Entropy and Information Gain.

Recap of Entropy

Here is the formula of Entropy:


H(X) = − Σᵢ P(xᵢ) · log₂ P(xᵢ)

The set “X” is everything in the node, and “xᵢ” refers to the specific decision of each sample. Therefore, “P(xᵢ)” is the probability that a sample in the node carries a certain decision.

[Figure: the example training dataset, with the features Weather, Temperature and Wind Level and the decision of whether to go out for a run]

Let's use the same training dataset as an example. Suppose that we have an internal node in our Decision Tree with “Weather = Rainy”. It can be seen that the final decisions of both samples in it are “No”. Then, we can easily calculate the entropy of this node as follows:

H(Weather = Rainy) = − (2/2) · log₂(2/2) − (0/2) · log₂(0/2) = − 1 · log₂(1) = 0

Basically, the probability of being “No” is 2/2 = 1, whereas the probability of being “Yes” is 0/2 = 0. By convention, the 0 · log₂(0) term is treated as 0, so the entropy of the node is 0.
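To make this recap concrete, here is a minimal sketch of the entropy calculation in Python. The helper name and the toy label lists are mine for illustration, not from the original article:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of decision labels, e.g. ["Yes", "No"]."""
    total = len(labels)
    # Counter skips classes with zero count, which matches the
    # convention that 0 * log2(0) is treated as 0.
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )

# The "Weather = Rainy" node from the example: both samples are "No".
print(entropy(["No", "No"]))   # 0.0, no uncertainty at all
print(entropy(["Yes", "No"]))  # 1.0, maximum uncertainty for two classes
```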

Recap of Information Gain

On top of the concept of Entropy, we can calculate the Information Gain, which is the basic criterion for deciding whether a feature should be used as a node to split on.

For example, we have three features: “Weather”, “Temperature” and “Wind Level”. When we start building our Decision Tree using ID3, how can we decide which one of them should be used as the root node?

ID3 uses Information Gain as the criterion. The rule is: select the feature with the maximum Information Gain among all candidates. Here is the formula for calculating Information Gain:

IG(T, a) = H(T) − Σᵥ (|Tᵥ| / |T|) · H(Tᵥ)

where


  • “T” is the parent node and “a” is the feature being evaluated; “Tᵥ” is the child node containing the samples for which “a” takes the value “v”
  • The notation “|T|” means the size of the set, i.e. the number of samples in it

Using the same example, when we calculate the Information Gain for “Weather”, we also need to take the entropies of its child nodes, such as “Weather = Rainy”, into account. The specific derivation and calculation can be found in the article shared in the introduction.
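Continuing the sketch, a minimal Information Gain helper might look like the following. The tiny dataset is made up for illustration and is not the article's original table:

```python
def information_gain(rows, labels, feature):
    """IG(T, a) = H(T) - sum over v of (|T_v| / |T|) * H(T_v)."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for value in set(row[feature] for row in rows):
        child = [label for row, label in zip(rows, labels) if row[feature] == value]
        weighted_child_entropy += (len(child) / total) * entropy(child)
    return entropy(labels) - weighted_child_entropy

# A tiny made-up training set in the spirit of the example:
rows = [
    {"Weather": "Sunny"}, {"Weather": "Sunny"},
    {"Weather": "Rainy"}, {"Weather": "Rainy"},
]
labels = ["Yes", "Yes", "No", "No"]
print(information_gain(rows, labels, "Weather"))  # 1.0: a perfect split here
```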

Major Drawbacks of Using Information Gain

The major drawback of using Information Gain as the criterion for determining which feature becomes the root/next node is that it tends to favour features with more unique values.

But why? Let me demonstrate it using an extreme scenario. Let's say we have the training set with one more feature: “Date”.

[Figure: the example training dataset with an extra “Date” feature, where every row has a distinct date]

You might say that the feature “Date” should not be considered in this case, because intuitively it will not be helpful in deciding whether we should go out for a run. Yes, you're right. In practice, however, we may have much more complicated datasets to classify, and we may not understand all the features, so we may not always be able to determine whether a feature makes sense or not. Here, I will just use “Date” as an example.

Now, let's calculate the Information Gain for “Date”. We can start by calculating the entropy for one of the dates, such as “2020-01-01”.

H(Date = 2020-01-01) = − (1/1) · log₂(1/1) = 0

Since there is only 1 row for each date, the final decision must be either “Yes” or “No”. So the entropy must be 0! In terms of information theory, this is equivalent to saying:

The date tells us nothing, because there is only one possible outcome, which is certain. So there is no “uncertainty” at all.

Similarly, the entropies of all the other dates are 0, too.

Now, let's calculate the entropy of the feature “Date” itself, that is, the entropy of its own value distribution. Since each of the n dates appears exactly once:

H(Date) = − Σᵢ (1/n) · log₂(1/n) = log₂(n)

Wow, that is a pretty large number compared to the entropies of the other features, and it grows with the size of the training set. Keep it in mind; it will come back later. Now, let's calculate the Information Gain of “Date”.

IG(T, Date) = H(T) − Σᵥ (1/n) · 0 = H(T)

Unsurprisingly, because all of its child nodes have entropies of 0, the Information Gain of “Date” is the entire entropy of the parent node, which is the maximum any feature could possibly achieve.

If we calculate the Information Gain for the other three features (you can find the details in the article linked in the introduction), they are:

  • Information Gain of Weather is 0.592
  • Information Gain of Temperature is 0.522
  • Information Gain of Wind Level is 0.306

Obviously, no other feature can beat the Information Gain of “Date”, and this bias gets worse the more distinct values a feature has. And don't forget: the feature “Date” actually makes no sense for deciding whether we should go out for a run, yet it is selected as the “best” feature for the root node.
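We can reproduce this bias with the helpers sketched above. Giving every row a unique “Date” value makes each child node pure, so the feature's Information Gain swallows the parent's entire entropy (the dates below are illustrative):

```python
# Attach a unique date to every row: each child holds one sample, entropy 0.
for i, row in enumerate(rows):
    row["Date"] = f"2020-01-{i + 1:02d}"

print(entropy(labels))                         # the parent entropy H(T)
print(information_gain(rows, labels, "Date"))  # identical: the maximum possible
```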

Even funnier, after we decide to use “Date” as our root node, we're done :)

[Figure: the resulting one-level Decision Tree, with “Date” as the root node and one leaf per date]

We end up with the Decision Tree shown above. This is because the feature “Date” is “too good”: if we use it as the root node, each of its attribute values alone tells us whether we should go out for a run, so none of the other features is needed.

[Image: a gaping fish, by Clker-Free-Vector-Images on Pixabay]

Yes, you may be making a face like this fish right now. So am I.

Fix the Information Gain Limitation


The easiest fix for the Information Gain limitation in the ID3 algorithm comes from another Decision Tree algorithm called C4.5. Its basic idea for reducing the issue is to use the Information Gain Ratio rather than the Information Gain.

Specifically, the Information Gain Ratio simply adds a penalty to the Information Gain by dividing it by the split information of the feature, which is the entropy of the feature's own value distribution, exactly the quantity we calculated for “Date” above.

GainRatio(T, a) = IG(T, a) / SplitInfo(a)

In other words,


GainRatio(T, a) = IG(T, a) / (− Σᵥ (|Tᵥ| / |T|) · log₂(|Tᵥ| / |T|))

Therefore, if we use C4.5 rather than ID3, the Information Gain Ratio of the feature “Date” will be as follows.

GainRatio(T, Date) = IG(T, Date) / SplitInfo(Date) = H(T) / log₂(n)

Well, in this extreme example it may still compare well against the other features, but don't forget that every attribute value of “Date” has only one row here, and the log₂(n) penalty keeps growing with the size of the training set. In practice, the Information Gain Ratio is quite enough to avoid most of the scenarios in which the Information Gain would cause bias.
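Here is how the penalty could look in code, reusing the helpers from above. The split information is the entropy of the feature's value distribution, so a unique-valued feature like “Date” is divided by log₂(n):

```python
def split_information(rows, feature):
    """Entropy of the feature's own value distribution, SplitInfo(a)."""
    total = len(rows)
    counts = Counter(row[feature] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(rows, labels, feature):
    return information_gain(rows, labels, feature) / split_information(rows, feature)

# "Date" is heavily penalised: its split information is log2(4) = 2 here,
# and it keeps growing with the dataset, while Weather's stays small.
print(gain_ratio(rows, labels, "Date"))     # 1.0 / 2.0 = 0.5
print(gain_ratio(rows, labels, "Weather"))  # 1.0 / 1.0 = 1.0
```

On this toy set, “Weather” now beats “Date”, which is exactly the behaviour we want.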

Other Improvements of C4.5


In my opinion, using the Information Gain Ratio is the most significant improvement from ID3 to C4.5. Nevertheless, there are more improvements in C4.5 that you should know about.

PEP (Pessimistic Error Pruning)

If you are not familiar with the concept of “pruning” a Decision Tree, again, you may want to check out my previous article linked in the introduction of this one.

PEP is another significant improvement in C4.5. Specifically, it prunes the tree in a top-down manner. For every internal node, the algorithm calculates the error rate of the subtree, then the error rate it would have after pruning the branch, and compares the two to decide whether the branch should be kept (see the sketch after the list below).

Some characteristics of PEP:


  1. It is one of the post-pruning methods.
  2. It prunes the tree without depending on a validation dataset.
  3. It is usually quite good at avoiding overfitting, and consequently improves performance when classifying unseen data.
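As a rough illustration of the decision at each node, here is a simplified sketch of the usual PEP rule: add half an error per leaf as a continuity correction, and prune when the collapsed leaf is not significantly worse than the subtree. The function and the numbers are illustrative, not taken from the article:

```python
from math import sqrt

def should_prune(subtree_errors, num_leaves, leaf_errors, num_samples):
    """Pessimistic Error Pruning decision for one internal node (simplified).

    subtree_errors: training misclassifications made by the subtree
    leaf_errors:    misclassifications if the node were collapsed to a leaf
    """
    subtree_pessimistic = subtree_errors + 0.5 * num_leaves
    leaf_pessimistic = leaf_errors + 0.5
    # One standard error of the subtree's pessimistic error count.
    se = sqrt(subtree_pessimistic * (num_samples - subtree_pessimistic) / num_samples)
    return leaf_pessimistic <= subtree_pessimistic + se

# A node covering 20 samples: the subtree has 4 leaves and makes 2 errors;
# collapsing it to a single leaf would make 3 errors. PEP prunes it.
print(should_prune(subtree_errors=2, num_leaves=4, leaf_errors=3, num_samples=20))
```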

Discretising Continuous Features

C4.5 supports continuous values, so we are not limited to categorical values such as “Low”, “Medium” and “High”. Instead, C4.5 automatically detects the threshold of a continuous feature that generates the maximum Information Gain Ratio, and then splits the node using that threshold.
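A minimal sketch of the threshold search, reusing the earlier helpers: try the midpoint between each pair of adjacent distinct values and keep the one with the best Information Gain Ratio. The temperatures and labels are hypothetical:

```python
def best_threshold(values, labels):
    """Return the (threshold, gain_ratio) pair with the best ratio."""
    best = (None, -1.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        # Binarise the feature at this candidate threshold.
        rows = [{"f": "<=" if v <= threshold else ">"} for v in values]
        ratio = gain_ratio(rows, labels, "f")
        if ratio > best[1]:
            best = (threshold, ratio)
    return best

# Hypothetical temperatures with the go-for-a-run decision:
temps = [5, 8, 12, 18, 22, 25]
decisions = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(temps, decisions))  # (15.0, 1.0): a perfect split
```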

Summary


In this article, I have illustrated why ID3 is not ideal: the criterion it uses, Information Gain, is significantly biased towards features with larger numbers of distinct values.

The solution is given by another Decision Tree algorithm called C4.5. It evolves the Information Gain into the Information Gain Ratio, which reduces the advantage that features with large numbers of distinct values would otherwise have.

Again, if you feel that you need more context and background on Decision Trees, please check out my previous article.

Translated from: https://towardsdatascience.com/do-not-use-decision-tree-like-this-369769d6104d
