OpenCLIP Paper Reading Notes

Taking a look at the OpenCLIP paper:
Learning Transferable Visual Models From Natural Language Supervision
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.

CNNs trained to predict words in image captions learn useful image representations

learn image representations from text

I'm curious: how was CLIP evaluated on OCR?
How were CLIP's training samples prepared? 400 million (image, text) pairs: how do you assemble a sample set at that scale?

The paper claims that this kind of pre-training lets zero-shot CLIP rival models built with supervised learning. I'd put a question mark there: on domain-specific business data it doesn't seem sufficient, does it?

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”

MS-COCO and Visual Genome are high-quality crowd-labeled datasets, but they are small by modern standards, with approximately 100,000 training photos each

YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet.

we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet

We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText
Note the emphasis on class balance in the training samples.
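
As a rough illustration of that balancing rule, here is a minimal sketch; the paper does not publish its pipeline, so the `(query, image_url, text)` record format is invented for illustration:

```python
from collections import defaultdict

MAX_PER_QUERY = 20_000  # the paper's cap: up to 20,000 (image, text) pairs per query

def class_balance(pairs):
    """pairs: iterable of (query, image_url, text) records (hypothetical format).
    Keeps at most MAX_PER_QUERY pairs per query, approximating class balance."""
    counts = defaultdict(int)
    for query, image_url, text in pairs:
        if counts[query] < MAX_PER_QUERY:
            counts[query] += 1
            yield query, image_url, text
```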

we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric
Indeed: training on a dataset of this scale takes a daunting amount of time, so improving training efficiency is key.

Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective
This finding is interesting: it means we don't need to accurately predict each image's exact text caption, which would be far too hard.
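
The paper's Figure 3 gives numpy-style pseudocode for this objective; below is a PyTorch rendering of the same symmetric contrastive (InfoNCE-style) loss, a sketch assuming the two encoders already produce batch-aligned embeddings:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """image_emb, text_emb: [n, d] embeddings of n paired (image, text) samples.
    logit_scale: the learned temperature, stored as exp(t) in CLIP."""
    image_emb = F.normalize(image_emb, dim=-1)            # L2-normalize
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()       # [n, n] scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, labels)              # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)          # text -> image direction
    return (loss_i + loss_t) / 2
```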

Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance
Generative models come up again here. For representation learning we now have three families: supervised CNNs, contrastive learning (CLIP), and generative models (Stable Diffusion).

We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
On a dataset this large (even larger than ImageNet), initializing from pre-trained ImageNet weights or a pre-trained text encoder really is unnecessary.

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling
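
That prediction step, written out as a minimal PyTorch sketch (the temperature value and tensor shapes are illustrative; CLIP learns τ during pre-training):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_emb, class_text_embs, temperature=0.01):
    """image_emb: [d] embedding of one image; class_text_embs: [num_classes, d]
    embeddings of the class-name texts. Returns a probability per class."""
    img = F.normalize(image_emb, dim=-1)        # L2-normalize the input
    txt = F.normalize(class_text_embs, dim=-1)  # L2-normalize the "weights"
    logits = (img @ txt.t()) / temperature      # temperature-scaled cosine similarities
    return logits.softmax(dim=-1)               # multinomial distribution over classes
```

Seen this way, the class-name embeddings act exactly as the L2-normalized, bias-free weight matrix of the logistic regression classifier the quote describes.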

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found using the prompt template “A photo of a {label}.” to be a good default that helps specify that the text is about the content of the image. This often improves performance over the baseline of using only the label text
The root cause of this issue is that the model doesn't truly understand language; now that GPT-4 can, presumably there's room for a breakthrough here?
Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non-exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.
This kind of prompt customization is a sure win for performance.
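
Those per-task customizations amount to a small template table. A sketch (the dataset keys are my own labels; the templates are the ones quoted above):

```python
# Templates quoted in the paper; dictionary keys are hypothetical.
PROMPTS = {
    "oxford_pets":   "A photo of a {label}, a type of pet.",
    "food101":       "A photo of {label}, a type of food.",
    "fgvc_aircraft": "A photo of a {label}, a type of aircraft.",
    "eurosat":       "a satellite photo of a {label}.",
    "default":       "A photo of a {label}.",
}

def class_texts(dataset: str, labels: list[str]) -> list[str]:
    """Expand each class name into a full-sentence prompt for the given dataset."""
    template = PROMPTS.get(dataset, PROMPTS["default"])
    return [template.format(label=label) for label in labels]
```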

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as “A photo of a big {label}.” and “A photo of a small {label}.”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions
Here ensembling is used to boost performance.
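
A sketch of embedding-space ensembling; `encode_text` and `tokenize` stand in for a CLIP text encoder and tokenizer, and the template list includes the two quoted above:

```python
import torch
import torch.nn.functional as F

TEMPLATES = [
    "A photo of a big {label}.",
    "A photo of a small {label}.",
    "A photo of a {label}.",
]

@torch.no_grad()
def ensemble_class_embeddings(encode_text, tokenize, labels):
    """Average each class's prompt embeddings in embedding space and renormalize.
    The resulting [num_classes, d] matrix can be cached once, so the amortized
    inference cost matches a single-prompt classifier."""
    rows = []
    for label in labels:
        tokens = tokenize([t.format(label=label) for t in TEMPLATES])
        emb = F.normalize(encode_text(tokens), dim=-1)     # [num_templates, d]
        rows.append(F.normalize(emb.mean(dim=0), dim=-1))  # average, then renormalize
    return torch.stack(rows)
```

Averaging in embedding space rather than probability space is precisely what makes this single cached matrix possible.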

Put plainly, it beats supervised learning on:
1. More training data
2. A wider variety of tasks
3. Learning semantic information through text, not just spatial information

we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), and self-driving related tasks such as German traffic sign recognition (GTSRB) and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks.
This feels a bit like GPT-4: good generality, but not all-capable, especially on complex tasks. I suspect it's mainly a data problem, though it may also be that current model architectures can't handle complex tasks directly, so the tasks need to be decomposed into simpler subtasks. What's undeniable is that it can speed up annotation of business data.

First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.
Right: the difficulty of few-shot learning is that you don't know which features the model has associated with the final label, so you need more samples for the model to find the correct mapping.
Zero-shot here rests on large-scale pre-training, so it is still different from few-shot.
I'm optimistic about large models pre-trained on large-scale data.


Actually, isn't this evaluation a bit problematic? How do you assess stability? These are only results on a constructed test set, not an evaluation over large amounts of data, let alone analysis in real business scenarios. I think having more data per class is no bad thing: it strengthens the feature-to-label connection, and in particular helps the model capture the right features.

If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve
In terms of fitting capacity: the performance ceiling that supervised learning can reach is also the ceiling for what zero-shot transfer can achieve.
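
The standard way to measure that upper bound is a "linear probe": fit a logistic regression on frozen CLIP image features. A sketch with scikit-learn; the paper uses L-BFGS logistic regression with a swept L2 penalty, so the `C` value here is just a placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """train_feats/test_feats: [n, d] frozen CLIP image features."""
    # L2-normalize, matching the inputs of the zero-shot classifier
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    clf = LogisticRegression(C=3.16, max_iter=1000)  # placeholder C; the paper sweeps it
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```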

Learning general representation ability from large-scale data and doing supervised training on a specific dataset are not in conflict.

Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size
Interesting point: it says that in recent years the performance of deep learning systems has become predictable from quantities like training compute and dataset size.
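
Such scaling laws are usually fit as a power law, error ≈ a · compute^(-b), i.e. a straight line in log-log space. A tiny sketch with invented measurements, just to make the "predictable" claim concrete:

```python
import numpy as np

# Hypothetical (compute, error) measurements for illustration only.
compute = np.array([1e18, 1e19, 1e20, 1e21])
error = np.array([0.42, 0.33, 0.26, 0.20])

# Fit log(error) = slope * log(compute) + log_a; slope is the (negative) exponent.
slope, log_a = np.polyfit(np.log(compute), np.log(error), 1)
print(f"error ≈ {np.exp(log_a):.3g} * compute^({slope:.3f})")
```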
