Why More Data Is Not Always Better

Over the past few years, there has been a growing consensus that the more data one has, the better the eventual analysis will be.

However, just as humans can become overwhelmed by too much information, so can machine learning models.

Hotel Cancellations as an Example

I was thinking about this issue recently while reflecting on a side project I have been working on for the past year: Predicting Hotel Cancellations with Machine Learning.

Having written numerous articles about the topic on Medium, I find it clear that the landscape for the hospitality industry has changed fundamentally in the past year.

The growing emphasis on “staycations”, or local holidays, fundamentally changes the assumptions that any machine learning model should make when predicting hotel cancellations.

The original data from Antonio, Almeida and Nunes (2016) used datasets from Portuguese hotels with a response variable indicating whether the customer had cancelled their booking or not, along with other information on that customer such as country of origin, market segment, etc.

In the two datasets in question, approximately 55-60% of all customers were international customers.

However, let’s assume the following scenario for a moment. This time next year, hotel occupancy is back to normal levels, but the vast majority of customers are domestic, in this case from Portugal. For the purposes of this example, let’s assume the extreme scenario in which 100% of customers are domestic.

Such a shift would radically affect the ability of any previously trained model to accurately forecast cancellations. Let’s take an example.
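Before turning to the model results, a quick check of the customer mix can flag this kind of mismatch between training data and the population the model will actually face. The sketch below is only an illustration, not the author's code: the `country` field, the `PRT` code for Portugal, and the toy records are all assumptions made for the example.

```python
from collections import Counter


def domestic_share(bookings, home_country="PRT"):
    """Return the fraction of bookings whose customer is from home_country."""
    if not bookings:
        raise ValueError("bookings must not be empty")
    counts = Counter(b["country"] for b in bookings)
    return counts[home_country] / len(bookings)


# Hypothetical toy records; real data would come from the H1/H2 files.
training_bookings = (
    [{"country": "PRT"}] * 4 + [{"country": "GBR"}] * 3 + [{"country": "FRA"}] * 3
)
# The extreme all-domestic scenario described in the text.
expected_bookings = [{"country": "PRT"}] * 10

train_share = domestic_share(training_bookings)
expected_share = domestic_share(expected_bookings)
print(f"domestic share in training data: {train_share:.0%}")
print(f"domestic share expected in use:  {expected_share:.0%}")
```

A large gap between the two shares is a hint that the training set may not represent the customers the model will be asked about.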

Classification Using an SVM Model

An SVM model was originally used to predict hotel cancellations. The model was trained on one dataset (H1), and its predictions were then compared against a test set (H2), using the feature data from that test set. The response variable is categorical (1 = booking cancelled by customer, 0 = booking not cancelled by customer).

Here are the results, shown as a confusion matrix and classification report for each of three scenarios.

Scenario 1: Trained on H1 (full dataset), tested on H2 (full dataset)

[[25217 21011]
 [ 8436 24666]]
              precision    recall  f1-score   support

           0       0.75      0.55      0.63     46228
           1       0.54      0.75      0.63     33102

    accuracy                           0.63     79330
   macro avg       0.64      0.65      0.63     79330
weighted avg       0.66      0.63      0.63     79330

Overall accuracy comes in at 63%, while recall for the positive class (cancellations) comes in at 75%. To clarify, recall in this instance means that, of all the bookings that were actually cancelled, the model correctly identifies 75%.
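These figures can be recovered directly from the Scenario 1 confusion matrix. The short sketch below assumes the usual layout in which rows are the true classes and columns are the predicted classes (consistent with the support column in the report); the function name is made up for illustration.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy plus precision and recall for the cancelled class.

    Rows are the true classes, columns are the predicted classes,
    in the order [not cancelled (0), cancelled (1)].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_cancelled": tp / (tp + fn),
        "precision_cancelled": tp / (tp + fp),
    }


scenario_1 = [[25217, 21011], [8436, 24666]]
scenario_1_metrics = metrics_from_confusion(scenario_1)
for name, value in scenario_1_metrics.items():
    print(f"{name}: {value:.2f}")
```

Recall divides the correctly flagged cancellations by all actual cancellations (24666 out of 33102), whereas precision divides them by all bookings the model flagged as cancellations (24666 out of 45677).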

Now let’s see what happens when we train the SVM model on the full training set, but only include domestic customers from Portugal in our test set.

Scenario 2: Trained on H1 (full dataset), tested on H2 (domestic only)

[[10879     0]
 [20081     0]]
              precision    recall  f1-score   support

           0       0.35      1.00      0.52     10879
           1       0.00      0.00      0.00     20081

    accuracy                           0.35     30960
   macro avg       0.18      0.50      0.26     30960
weighted avg       0.12      0.35      0.18     30960

Accuracy has dropped dramatically to 35%, while recall for the cancellation class has dropped to 0%: the model predicted “not cancelled” for every booking in the test set, so it identified none of the actual cancellations. The performance in this instance is clearly very poor.
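To put the 35% figure in context, it can be compared with a trivial baseline that always predicts the most common class in the test set. The arithmetic below uses only the Scenario 2 confusion matrix; the baseline comparison is an addition for illustration and is not part of the original analysis.

```python
# Confusion matrix from Scenario 2: rows are true classes, columns are predictions.
scenario_2 = [[10879, 0], [20081, 0]]

actual_not_cancelled = sum(scenario_2[0])
actual_cancelled = sum(scenario_2[1])
total = actual_not_cancelled + actual_cancelled

# Accuracy of the SVM: correct predictions sit on the diagonal.
model_accuracy = (scenario_2[0][0] + scenario_2[1][1]) / total

# Accuracy of a trivial baseline that always predicts the more common class.
baseline_accuracy = max(actual_not_cancelled, actual_cancelled) / total

print(f"model accuracy:    {model_accuracy:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Because most domestic bookings in this test set were cancelled, always predicting “cancelled” would have scored about 65%, so the model trained on the full dataset does worse here than no model at all.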

Scenario 3: Trained on H1 (domestic only), tested on H2 (domestic only)

However, what if the training set were modified to only include customers from Portugal and the model trained once again?

[[ 8274  2605]
 [ 6240 13841]]
              precision    recall  f1-score   support

           0       0.57      0.76      0.65     10879
           1       0.84      0.69      0.76     20081

    accuracy                           0.71     30960
   macro avg       0.71      0.72      0.70     30960
weighted avg       0.75      0.71      0.72     30960

Accuracy is back up to 71%, while recall for the cancellation class is at 69%. Using less, but more relevant, data in the training set has allowed the SVM model to predict cancellations across the test set much more accurately.
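The effect can be reproduced in miniature with a toy experiment. The sketch below does not use the hotel data: it generates synthetic bookings with a single feature whose relationship to cancellation is deliberately reversed between domestic and international customers, an exaggerated assumption chosen so the effect is easy to see. The linear-kernel `SVC` from scikit-learn is likewise an assumption, not necessarily the author's configuration.

```python
import numpy as np
from sklearn.svm import SVC


def make_bookings(n, domestic, rng):
    """Generate toy bookings with one feature whose link to cancellation
    is reversed between domestic and international customers."""
    # Keep values away from zero so each group is cleanly separable.
    magnitude = rng.uniform(0.5, 2.0, size=n)
    sign = rng.choice([-1.0, 1.0], size=n)
    x = (magnitude * sign).reshape(-1, 1)
    y = (x[:, 0] < 0).astype(int) if domestic else (x[:, 0] > 0).astype(int)
    return x, y


rng = np.random.default_rng(0)
x_dom, y_dom = make_bookings(400, domestic=True, rng=rng)
x_int, y_int = make_bookings(600, domestic=False, rng=rng)
x_test, y_test = make_bookings(300, domestic=True, rng=rng)  # domestic-only test set

# Model A: trained on everything (60% international, 40% domestic).
full_model = SVC(kernel="linear").fit(
    np.vstack([x_dom, x_int]), np.concatenate([y_dom, y_int])
)
# Model B: trained on the domestic subset only.
domestic_model = SVC(kernel="linear").fit(x_dom, y_dom)

full_accuracy = full_model.score(x_test, y_test)
domestic_accuracy = domestic_model.score(x_test, y_test)
print(f"trained on full data:     {full_accuracy:.2f}")
print(f"trained on domestic only: {domestic_accuracy:.2f}")
```

In this toy setup the model trained on the larger, mixed dataset follows the international majority and misjudges domestic bookings, while the model trained on the smaller domestic subset does not. Real data will rarely be this clear-cut, but the direction of the effect matches Scenarios 2 and 3.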

If the Data Is Wrong, Model Results Will Also Be Wrong

More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality.

This was cited by a Columbia Business School study as an issue in the 2016 U.S. presidential election, where the polls had put Clinton in a firm lead over Trump. However, it turned out that there were many “secret Trump voters” who had not been accounted for in the polls, and this had skewed the results towards a predicted Clinton win.

I’m not from the U.S. and am neutral on the subject, by the way. I simply use this as an example to illustrate that even data we often think of as “big” can still contain inherent biases and may not be representative of what is actually going on.

Instead, the choice of data needs to be scrutinised as much as model selection, if not more so. Is the inclusion of certain data relevant to the problem that we are trying to solve?

Going back to the hotel example, including international customer data in the training set did not enhance our model when the goal was to predict cancellations across the domestic customer base.

Conclusion

There is an increasing push to gather more data across all domains. While more data in and of itself is not a bad thing, it should not be assumed that blindly introducing more data into a model will improve its accuracy.

Rather, data scientists still need the ability to determine the relevance of such data to the problem at hand. From this point of view, model selection becomes somewhat of an afterthought. If the data is representative of the problem that you are trying to solve in the first instance, then even simpler machine learning models will generate strong predictive results.

Many thanks for reading, and feel free to leave any questions or feedback in the comments below.

If you are interested in taking a deeper look at the hotel cancellation example, you can find my GitHub repository here.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

Translated from: https://towardsdatascience.com/why-more-data-is-not-always-better-de96723d1499
