Class Imbalance Problem

Reposted from: http://www.chioka.in/class-imbalance-problem/#comment-202282

What is the Class Imbalance Problem?

It is the problem in machine learning where the number of examples of one class (the positive class) is far smaller than the number of examples of the other class (the negative class). This problem is extremely common in practice and can be observed in various disciplines including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, facial recognition, etc.

Why is it a problem?

Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise. This is best illustrated below with an example.

Given a dataset of transaction data, we would like to find out which transactions are fraudulent and which are genuine. It is very costly to the e-commerce company if a fraudulent transaction goes through, as this damages our customers' trust in us and costs us money. So we want to catch as many fraudulent transactions as possible.

If there is a dataset consisting of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine. The reason is easily explained by the numbers. Suppose the machine learning algorithm has two possible outputs as follows:

  1. Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.
  2. Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

If the classifier’s performance is determined by the number of mistakes, then clearly Model 1 is better, as it makes only 17 mistakes in total while Model 2 makes 102. However, as we want to minimize the number of fraudulent transactions going through, we should pick Model 2 instead, which makes only 2 mistakes on fraudulent transactions. Of course, this comes at the expense of more genuine transactions being classified as fraudulent, but that is a cost we can bear for now. A generic machine learning algorithm, however, will simply pick Model 1 over Model 2, which is a problem. In practice, this means we will let a lot of fraudulent transactions go through even though we could have stopped them by using Model 2. That translates to unhappy customers and money lost for the company.

How to tell the machine learning algorithm which is the better solution?

To tell the machine learning algorithm (or the researcher) that Model 2 is better than Model 1, we need a way of measuring performance that reflects which mistakes actually matter. For that, we need better metrics than simply counting the number of mistakes made.

We introduce the concept of True Positive, True Negative, False Positive and False Negative:

  • True Positive (TP) – An example that is positive and is classified correctly as positive
  • True Negative (TN) – An example that is negative and is classified correctly as negative
  • False Positive (FP) – An example that is negative but is classified wrongly as positive
  • False Negative (FN) – An example that is positive but is classified wrongly as negative

Based on the above, we can also define the True Positive Rate, True Negative Rate, False Positive Rate and False Negative Rate:

  • True Positive Rate (TPR) = TP / (TP + FN)
  • True Negative Rate (TNR) = TN / (TN + FP)
  • False Positive Rate (FPR) = FP / (FP + TN)
  • False Negative Rate (FNR) = FN / (FN + TP)

With these new metrics, let’s compare them with the conventional metric of counting the number of mistakes made, using the example above. First, we use the old metric to calculate the error (number of mistakes divided by the total number of transactions):

  • Model 1: (7 FN + 10 FP) / 10010 = 17 / 10010 ≈ 0.17% error
  • Model 2: (2 FN + 100 FP) / 10010 = 102 / 10010 ≈ 1.02% error

As illustrated above, Model 1 looks like it has a lower error (≈0.17%) than Model 2 (≈1.02%), but we know that Model 2 is the better one, as it makes fewer false negatives (i.e., it maximizes true positives). Now let’s see what the performance of Model 1 and Model 2 looks like with the new metrics:

  • Model 1: TPR = 3/10 = 30%, FNR = 7/10 = 70%, FPR = 10/10000 = 0.1%, TNR = 9990/10000 = 99.9%
  • Model 2: TPR = 8/10 = 80%, FNR = 2/10 = 20%, FPR = 100/10000 = 1.0%, TNR = 9900/10000 = 99.0%

Now we can see that the false negative rate of Model 1 is 70% while the false negative rate of Model 2 is just 20%, so Model 2 is clearly the better classifier. These are the metrics we (and the machine learning algorithm) should use in order to pick the better model.
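To make the comparison concrete, here is a minimal Python sketch that computes these rates for the two models from their confusion-matrix counts. The counts are taken from the example above; the helper function itself is purely illustrative.

```python
def rates(tp, fn, fp, tn):
    """Compute the error rate and the four rates from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "error": (fn + fp) / total,   # fraction of all transactions misclassified
        "tpr": tp / (tp + fn),        # true positive rate
        "fnr": fn / (tp + fn),        # false negative rate
        "fpr": fp / (fp + tn),        # false positive rate
        "tnr": tn / (fp + tn),        # true negative rate
    }

# Model 1: misses 7 of 10 frauds, flags 10 of 10000 genuine transactions
model_1 = rates(tp=3, fn=7, fp=10, tn=9990)
# Model 2: misses 2 of 10 frauds, flags 100 of 10000 genuine transactions
model_2 = rates(tp=8, fn=2, fp=100, tn=9900)

print(model_1)  # error ≈ 0.0017, fnr = 0.7
print(model_2)  # error ≈ 0.0102, fnr = 0.2
```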

How to mitigate this problem?

Now that we know what the Class Imbalance Problem is and why it is a problem, we need to know how to deal with it.

We can roughly classify the approaches into two major categories: sampling based approaches and cost function based approaches.

Cost function based approaches

The intuition behind cost function based approaches is that if we think one false negative is worse than one false positive, we count that one false negative as, e.g., 100 false negatives instead. Since a false negative is now 100 times as costly as a false positive, the machine learning algorithm will try to make fewer false negatives than false positives (since they are cheaper). For example, in the case of SVM, the generic (soft-margin) formula is:

\[
\min_{w}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \varepsilon_i
\]

where w is the normal vector to the hyperplane, ε_i is the error (slack) of each data instance, C is a cost constant and n is the number of data instances. To assign different costs to false negatives and false positives, we can modify the formula as follows:

\[
\min_{w}\ \frac{1}{2}\lVert w \rVert^2 + C^{+} \sum_{i=1}^{n^{+}} \varepsilon_i + C^{-} \sum_{j=1}^{n^{-}} \varepsilon_j
\]

where C+ is the cost constant for positive cases, C- is the cost constant for negative cases, n+ is the total number of positive cases and n- is the total number of negative cases. Without diving too deep into the formula, this is just an example of how one may assign different costs to the positive and negative classes.
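In practice, many libraries expose this idea directly through per-class weights, so you rarely need to modify the objective by hand. Below is a minimal sketch assuming scikit-learn; the 100:1 weight and the toy data are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 1000 genuine (label 0) vs. 10 fraudulent (label 1) points.
X = np.vstack([rng.normal(0, 1, size=(1000, 2)),
               rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 1000 + [1] * 10)

# class_weight scales the penalty C per class, which plays the role of using a
# larger C+ for the positive (fraud) class than C- for the negative class.
clf = SVC(kernel="linear", C=1.0, class_weight={0: 1, 1: 100})
clf.fit(X, y)

# class_weight="balanced" would instead weight classes inversely to their frequencies.
```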

Sampling based approaches

This can be roughly classified into three categories:

  1. Oversampling, by adding more of the minority class so it has more effect on the machine learning algorithm
  2. Undersampling, by removing some of the majority class so it has less effect on the machine learning algorithm
  3. Hybrid, a mix of oversampling and undersampling

However, these approaches have clear drawbacks, as explained below.

Undersampling

By undersampling, we risk removing some majority class instances that are more representative, thus discarding useful information. This can be illustrated as follows:

[Figure: decision boundaries before and after undersampling]

Here the green line is the ideal decision boundary we would like to have, and blue is the actual result. The left side shows the result of applying a generic machine learning algorithm without undersampling. On the right, we undersampled the negative class but removed some informative negative instances, which tilted the blue decision boundary and caused some negative instances to be wrongly classified as positive.
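As a minimal sketch of random undersampling in plain NumPy (function and variable names are illustrative, not from the original post), we simply keep a random subset of the majority class and all of the minority class:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y, majority_label=0, keep_ratio=0.1):
    """Keep a random fraction of the majority class and all other instances."""
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    kept = rng.choice(majority_idx,
                      size=int(len(majority_idx) * keep_ratio),
                      replace=False)              # the rest of the majority class is discarded
    idx = np.concatenate([kept, minority_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

The discarded rows are exactly where the risk lies: if they happened to sit near the true decision boundary, the information they carried is gone.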

Oversampling

With oversampling, simply duplicating the minority class can lead the classifier to overfit to a few examples, as illustrated below:

[Figure: decision boundary before and after naive oversampling]

The left-hand side shows the data before oversampling, whereas the right-hand side shows the data after oversampling has been applied. On the right, the thick positive signs indicate multiple repeated copies of that data instance. The machine learning algorithm sees these cases many times and therefore overfits to these examples specifically, resulting in the blue decision boundary shown above.
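Naive oversampling is the mirror image of the previous sketch: sample the minority class with replacement so each minority instance appears many times (again, names and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1, factor=100):
    """Duplicate minority rows (sampling with replacement) until the minority class is `factor` times larger."""
    minority_idx = np.flatnonzero(y == minority_label)
    extra = rng.choice(minority_idx,
                       size=len(minority_idx) * (factor - 1),
                       replace=True)              # exact copies -> risk of overfitting
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]
```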

Hybrid approach

By combining the undersampling and oversampling approaches, we get the advantages, but also the drawbacks, of both approaches as illustrated above, so it is still a tradeoff.
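In practice, a hybrid resampling step is usually assembled from existing components rather than written by hand. Here is a hedged sketch assuming the imbalanced-learn library; the exact sampling_strategy semantics may differ between versions, and the ratios chosen are arbitrary:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 10000 genuine (0) vs. 10 fraudulent (1) points.
X = np.vstack([rng.normal(0, 1, size=(10000, 2)),
               rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 10000 + [1] * 10)

# First oversample the minority class up to 10% of the majority size,
# then undersample the majority class down to a 2:1 ratio.
X_mid, y_mid = RandomOverSampler(sampling_strategy=0.1, random_state=0).fit_resample(X, y)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X_mid, y_mid)
```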

More recent approaches to the problem

In 2002, a sampling based algorithm called SMOTE (Synthetic Minority Over-sampling Technique) was introduced to address the class imbalance problem. It is one of the most widely adopted approaches due to its simplicity and effectiveness. It combines oversampling and undersampling, but the oversampling is done not by replicating the minority class but by constructing new minority class instances via an algorithm.

In traditional oversampling, the minority class is replicated exactly. In SMOTE, new minority instances are constructed in the following way:

For each minority instance, SMOTE finds its k nearest minority-class neighbours, picks one of them at random, and creates a new synthetic instance at a randomly chosen point on the line segment between the two.

[Figure: constructing a synthetic minority instance between a point and one of its nearest neighbours]

The intuition behind this construction is that oversampling causes overfitting because repeated instances cause the decision boundary to tighten around them. Instead, we create “similar” examples. To the machine learning algorithm, these newly constructed instances are not exact copies, which softens the decision boundary. This can be illustrated as follows:

[Figure: decision boundary after SMOTE — the synthetic minority instances soften the boundary]

As a result, the classifier is more general and does not overfit.
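A minimal sketch of the construction idea (not the full algorithm from the SMOTE paper; the function name, neighbour count k and number of synthetic points are illustrative) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_minority, n_synthetic=100, k=5):
    """Create synthetic minority points by interpolating between nearest neighbours."""
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # k nearest minority neighbours of x (excluding x itself)
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()                        # random position along the segment
        synthetic.append(x + gap * (X_minority[j] - x))
    return np.array(synthetic)
```

In practice you would typically reach for an existing implementation, such as the SMOTE class in the imbalanced-learn library, rather than rolling your own.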

Even more recent approaches

Interested readers may look into more recent literature on RUSBoost, SMOTEBagging and Underbagging, which were proposed after SMOTE and are regarded as promising approaches.

However, SMOTE is still very popular due to its simplicity.

Summary

The Class Imbalance Problem is a common problem in machine learning, caused by a disproportionate number of instances per class in practice. To compare models, we use alternative metrics (True Positive, True Negative, False Positive, False Negative and their corresponding rates) instead of simply counting the number of mistakes (accuracy).

Due to its prevalence, there are many approaches for dealing with this problem. They can be generally classified into two major categories: 1) sampling based and 2) cost function based. Sampling based approaches can be broken into three major categories: a) oversampling, b) undersampling, and c) a hybrid of oversampling and undersampling.

Now, armed with the what, why and how, you can start dealing with real-world datasets that exhibit this problem.

