How Should We Aggregate Classification Predictions?

If you are reading this, then you have probably tried to predict who will survive the Titanic shipwreck. This Kaggle competition is a canonical example of machine learning, and a rite of passage for any aspiring data scientist. What if, instead of predicting who will survive, you only had to predict how many will survive? Or what if you had to predict the average age of survivors, or the sum of the fares that the survivors paid?

There are many applications where classification predictions need to be aggregated. For example, a customer churn model may generate probabilities that a customer will churn, but the business may be interested in how many customers are predicted to churn, or how much revenue will be lost. Similarly, a model may give a probability that a flight will be delayed, but we may want to know how many flights will be delayed, or how many passengers are affected. Hong (2013) lists a number of other examples from actuarial assessment to warranty claims.

Most binary classification algorithms estimate probabilities that examples belong to the positive class. If we treat these probabilities as known values (rather than estimates), then the number of positive cases is a random variable with a Poisson Binomial probability distribution. (If the probabilities were all the same, the distribution would be Binomial.) Similarly, the sum of two-value random variables where one value is zero and the other value some other number (e.g. age, revenue) is distributed as a Generalized Poisson Binomial. Under these assumptions we can report mean values as well as prediction intervals. In summary, if we had the true classification probabilities, then we could construct the probability distributions of any aggregate outcome (number of survivors, age, revenue, etc.).
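
As a minimal sketch of the idea above, the exact Poisson Binomial distribution can be built up one passenger at a time. The function names and the five probabilities below are hypothetical and chosen only for illustration; this is not the `poibin` implementation, just a plain dynamic-programming version that treats the probabilities as known.

```python
def poisson_binomial_pmf(probs):
    """Return pmf where pmf[k] = P(exactly k successes) for independent trials."""
    pmf = [1.0]  # with zero trials, zero successes is certain
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this trial is a failure
            nxt[k + 1] += mass * p      # this trial is a success
        pmf = nxt
    return pmf


def central_interval(pmf, level=0.95):
    """Lower and upper quantiles of a discrete pmf for a central interval."""
    lo_q = (1.0 - level) / 2.0
    hi_q = 1.0 - lo_q
    cum, lo, hi = 0.0, None, len(pmf) - 1
    for k, mass in enumerate(pmf):
        cum += mass
        if lo is None and cum >= lo_q:
            lo = k
        if cum >= hi_q:
            hi = k
            break
    return lo, hi


# Hypothetical survival probabilities for five passengers (illustrative only).
probs = [0.9, 0.8, 0.5, 0.2, 0.1]
pmf = poisson_binomial_pmf(probs)
expected = sum(k * mass for k, mass in enumerate(pmf))  # equals sum(probs)
interval = central_interval(pmf)
print(expected, interval)
```

With equal probabilities the same function reproduces the Binomial distribution, which is a quick way to sanity-check it.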

Of course, the classification probabilities we obtain from machine learning models are just estimates. Therefore, treating the probabilities as known values may not be appropriate. (Essentially, we would be ignoring the sampling error in estimating these probabilities.) However, if we are interested only in the aggregate characteristics of survivors, perhaps we should focus on estimating parameters that describe the probability distributions of these aggregate characteristics. In other words, we should recognize that we have a numerical prediction problem rather than a classification problem.

I compare two approaches to getting aggregate characteristics of Titanic survivors. The first is to classify and then aggregate. I estimate three popular classification models and then aggregate the resulting probabilities. The second approach is a regression model to estimate how aggregate characteristics of a group of passengers affect the share that survives. I evaluate each approach using many random splits of test and train data. The conclusion is that many classification models do poorly when the classification probabilities are aggregated.

1. Classify and Aggregate Approach

Let’s use the Titanic data to estimate three different classifiers. The logistic model will use only age and passenger class as predictors; Random Forest and XGBoost will also use sex. I train the models on the 891 passengers in Kaggle’s training data and evaluate the predictions on the 418 passengers in the test data. (I obtained the labels for the test set to be able to evaluate my models.)

Performance of classification algorithms on aggregate prediction.

The logistic model with only age and passenger class as predictors has an AUC of 0.67. Random Forest and XGBoost, which also use sex, reach a very respectable AUC of around 0.8. Our task, however, is to predict how many passengers will survive. We can estimate this by adding up the probabilities that a passenger will survive. Interestingly, of the three classifiers, the logistic model was the closest to the actual number of survivors despite having the lowest AUC. It is also worth noting that a naive estimate based on the share of survivors in the training data did best of all.
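
The two estimates compared here are simple to write down. The sketch below uses hypothetical helper names and made-up inputs, not the actual model outputs: one function sums predicted probabilities, the other scales the training-set survival share to the size of the test set.

```python
def expected_count(probs):
    """Aggregate classifier output: the sum of the predicted probabilities."""
    return sum(probs)


def naive_count(train_labels, n_test):
    """Naive estimate: training-set share of positives times test-set size."""
    share = sum(train_labels) / len(train_labels)
    return share * n_test


# Made-up inputs for illustration only.
test_probs = [0.2, 0.7, 0.6]
train_labels = [1, 0, 0, 1]

print(expected_count(test_probs))
print(naive_count(train_labels, n_test=10))
```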

Given the probabilities of survival for each passenger in the test set, the number of passengers that will survive is a random variable that follows a Poisson Binomial distribution. The mean of this random variable is the sum of the individual probabilities. The percentiles of this distribution can be obtained using the `poibin` R package developed by Hong (2013). A similar package for Python is under development. The percentiles can also be obtained through brute force by simulating 10,000 different sets of outcomes for the 418 passengers in the test set. The 2.5th and 97.5th percentiles can be interpreted as a 95% prediction interval: if the probabilities were correct, the actual number of survivors would fall within this interval with 95% probability.
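
The brute-force route can be sketched in a few lines. The probabilities below are a hypothetical, evenly spaced grid rather than real model output, and the helper names are assumptions; the point is only to show how 10,000 simulated outcome sets yield empirical percentiles.

```python
import random


def simulate_counts(probs, n_sims=10_000, seed=0):
    """Draw n_sims sets of Bernoulli outcomes and return the success counts."""
    rng = random.Random(seed)
    return [sum(1 for p in probs if rng.random() < p) for _ in range(n_sims)]


def percentile(sorted_values, q):
    """Simple empirical quantile of pre-sorted values for q in [0, 1]."""
    idx = min(len(sorted_values) - 1, max(0, int(q * len(sorted_values))))
    return sorted_values[idx]


# Hypothetical probabilities for 418 passengers, spread evenly over [0.1, 0.9].
probs = [0.1 + 0.8 * i / 417 for i in range(418)]

counts = sorted(simulate_counts(probs))
mean_count = sum(counts) / len(counts)
lower, upper = percentile(counts, 0.025), percentile(counts, 0.975)
print(mean_count, lower, upper)
```

The simulated mean should sit close to the sum of the probabilities, and the two percentiles bracket it.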

Prediction intervals using Poisson Binomial and Generalized Poisson Binomial percentiles.

The interval based on the Random Forest probabilities missed the actual number of survivors by a wide margin. It is worth noting that the width of the interval is not necessarily based on the accuracy of the individual probabilities. Instead, it depends on how far those individual probabilities are from 0.5. Probabilities close to 0.9 or 0.1 rather than 0.5 mean that there is a lot less uncertainty as to how many passengers will survive. A good discussion of forecast reliability versus sharpness is here.
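
The reason is that the variance of a sum of independent zero/one outcomes is the sum of p(1 - p) over passengers, which is largest at p = 0.5. A small sketch with two made-up groups that have the same expected count illustrates this (the function name is hypothetical):

```python
import math


def count_sd(probs):
    """Standard deviation of the number of successes, treating probs as known."""
    return math.sqrt(sum(p * (1.0 - p) for p in probs))


# Two hypothetical groups of 100 passengers, both with an expected count of 50.
uncertain = [0.5] * 100             # every probability sits at 0.5
sharp = [0.9] * 50 + [0.1] * 50     # probabilities far from 0.5

print(count_sd(uncertain))  # about 5
print(count_sd(sharp))      # about 3
```

The sharper probabilities give a narrower interval whether or not they are accurate, which is why a narrow interval can still miss the actual count.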

While the number of survivors is a sum of zero/one random variables (Bernoulli trials), we may also be interested in predicting other aggregate characteristics of the survivors, e.g. the total fare paid by the survivors. This measure is a sum of two-value random variables where one value is zero (the passenger did not survive) and the other is the fare that the passenger paid. Zhang, Hong and Balakrishnan (2018) call the probability distribution of this sum Generalized Poisson Binomial. As with the Poisson Binomial, Hong co-wrote an R package, GPB, that makes computing the probability distributions straightforward. Once again, simulating the distribution is an alternative to using the packages to compute percentiles.
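
A simulation of the total fare looks almost the same as a simulation of the survivor count; each passenger contributes either zero or their fare. The probabilities and fares below are invented for illustration, and the function name is an assumption, not part of the GPB package.

```python
import random


def simulate_totals(probs, values, n_sims=10_000, seed=0):
    """Simulate sums where item i contributes values[i] with probability probs[i]."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        totals.append(sum(v for p, v in zip(probs, values) if rng.random() < p))
    return sorted(totals)


# Invented survival probabilities and fares for four passengers.
probs = [0.9, 0.6, 0.3, 0.2]
fares = [80.0, 30.0, 15.0, 8.0]

# The mean of the sum is available in closed form: sum of p_i * v_i.
expected_total = sum(p * v for p, v in zip(probs, fares))

totals = simulate_totals(probs, fares)
lower = totals[int(0.025 * len(totals))]
upper = totals[int(0.975 * len(totals))]
print(expected_total, lower, upper)
```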

2. Aggregate Regression Approach

If we only care about the aggregate characteristics of survivors, then we really have a numerical prediction problem. The simplest estimate of the share of survivors in the test set is the share of survivors in the training set — it is the naive estimate from the previous section. This estimate is probably unbiased and efficient if the characteristics of passengers in the test and train sets are identical. If not, then we would want an estimate of the share of survivors conditional on the characteristics of the passengers.

The issue is that we don’t have the data to estimate how aggregate characteristics of a group of passengers affect the share that survived. After all, the Titanic hit the iceberg only once. Perhaps in other applications such as customer churn, we may have new data every month.

In the Titanic case I resort to simulating many different training data sets by re-sampling the original training data set. I calculate the average characteristics of each simulated data set to estimate how these characteristics affect the share that will survive. I then take the average characteristics of passengers in the test set and predict how many will survive in the test set. There are many different ways one could summarize the aggregate characteristics. I use the share of passengers in first class, the share of passengers under the age of 10 and the share of female passengers. Not surprisingly, the samples of passengers that have more women, children and first class passengers have a higher share of survivors.
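
The procedure can be sketched as follows. Everything in this example is an assumption made for illustration: the passengers are synthetic (not the Titanic data), only two aggregate features are used instead of three, resampling is done with replacement, and the regression is ordinary least squares solved with a small hand-written routine.

```python
import random


def make_passengers(n, rng):
    """Synthetic stand-in data: (female, first_class, survived) per passenger."""
    rows = []
    for _ in range(n):
        female = rng.random() < 0.35
        first = rng.random() < 0.25
        p_survive = 0.15 + 0.5 * female + 0.2 * first  # made-up relationship
        rows.append((female, first, rng.random() < p_survive))
    return rows


def aggregate(rows):
    """Return ([1, share female, share first class], share survived)."""
    n = len(rows)
    feats = [1.0, sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]
    return feats, sum(r[2] for r in rows) / n


def ols(X, y):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(X[0])
    A = []
    for i in range(k):
        row = [sum(x[i] * x[j] for x in X) for j in range(k)]
        row.append(sum(x[i] * t for x, t in zip(X, y)))
        A.append(row)
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)
train = make_passengers(891, rng)
test = make_passengers(418, rng)

# One regression observation per resampled training set.
X, y = [], []
for _ in range(500):
    sample = [rng.choice(train) for _ in range(len(train))]
    feats, share = aggregate(sample)
    X.append(feats)
    y.append(share)

coefs = ols(X, y)
test_feats, actual_share = aggregate(test)
predicted_share = sum(b * x for b, x in zip(coefs, test_feats))
predicted_survivors = predicted_share * len(test)
print(coefs, predicted_survivors, actual_share * len(test))
```

Each resampled data set contributes a single row to the regression, so the model learns from variation in group composition rather than from individual passengers.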

Results of a regression of the share of survivors on aggregate passenger characteristics using 500 simulated training sets.

Applying the above equation to the aggregate characteristics of the test data, I predict 162 survivors against an actual count of 158, with a prediction interval of 151 to 173. Thus, the regression approach worked quite well.

3. How Do the Two Approaches Compare?

So far, we evaluated the two approaches using only one test set. In order to compare the two approaches more systematically, I re-sampled from the union of the original train and test data set to create five hundred new train and test data sets. I then applied the two approaches five hundred times and calculated the mean square error of each approach across these five hundred samples. The graphs below show the relative performance of each approach.
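
A rough sketch of such an evaluation loop is below. It assumes the re-sampling is a random split of the pooled data without replacement, uses illustrative labels only (no passenger features), and compares the naive estimate with a deliberately poor constant guess standing in for a second method; the function and predictor names are hypothetical.

```python
import random


def mse_over_splits(labels, n_train, predictors, n_splits=500, seed=0):
    """Mean squared error of each predictor's test-set count over random splits."""
    rng = random.Random(seed)
    sq_err = {name: 0.0 for name in predictors}
    for _ in range(n_splits):
        pool = labels[:]
        rng.shuffle(pool)
        train, test = pool[:n_train], pool[n_train:]
        actual = sum(test)
        for name, predict in predictors.items():
            sq_err[name] += (predict(train, len(test)) - actual) ** 2
    return {name: total / n_splits for name, total in sq_err.items()}


# Each predictor maps (training labels, test size) to a predicted count.
predictors = {
    "naive": lambda train, n_test: n_test * sum(train) / len(train),
    "half": lambda train, n_test: 0.5 * n_test,  # placeholder baseline
}

# Illustrative pooled labels: 1 = survived, 0 = did not survive.
labels = [1] * 500 + [0] * 809
results = mse_over_splits(labels, n_train=891, predictors=predictors)
print(results)
```

A classifier-based method would plug in as another entry in `predictors`, provided the rows carried features as well as labels.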

Evaluation of various approaches to aggregate prediction using 500 random train and test splits.

Among the classification models, the logistic model did best (had the lowest MSE). XGBoost is a relatively close second. Random Forest is way off. The accuracy of aggregate predictions depends crucially on the accuracy of the estimated probabilities. The logistic regression directly estimates the probability of survival. Similarly, XGBoost optimizes a logistic loss function. Therefore, both provide a decent estimate of probabilities. In contrast, Random Forest estimates probabilities as shares of trees that classified the example as success. As pointed out by Olson and Wyner (2018), the share of trees that classified the example as a success has nothing to do with the probability that the example will be a success. (For the same reason, calibration plots for Random Forest tend to be poor.) Although Random Forest can deliver a high AUC, the estimated probabilities are inappropriate for aggregation.

The aggregate regression model had the lowest MSE of all the approaches, beating even the logistic classification model. The naive predictions are handicapped in this evaluation because the share of survivors in the test data is not independent of the share of survivors in the train data. If we happen to have many survivors in the train set, we will naturally have fewer survivors in the test set. Even with this handicap, naive predictions handily beat XGBoost and Random Forest.

4. Conclusion

If we only need aggregate characteristics, estimating and aggregating individual classification probabilities seems like more trouble than is needed. In many cases, the share of survivors in the train set is a pretty good estimate of the share of survivors in the test set. Customer churn rate this month is probably a pretty good estimate of churn rate next month. More complicated models are worth building if we want to understand what drives survival or churn. It is also worth building more complicated models when our training data has very different characteristics than the test data, and when these characteristics affect survival or churn. Still, even in these cases, it is clear that using methods that are optimized for individual classifications could be inferior to methods optimized for a numerical prediction when a numerical prediction is needed.

You can find the R code behind this note here.

Translated from: https://towardsdatascience.com/how-should-we-aggregate-classification-predictions-2f204e64ede9
