ALBEF: Skim-Reading Notes
- Title
- Links
- Motivation
- How to solve it? (Contribution)
- Model
- Experiments
- Pre-training Datasets
- Downstream tasks
- Ablation Experiment
Title
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
Links
Paper link
Motivation
Most multimodal models use a transformer encoder to jointly encode visual tokens (region-based image features) and text tokens. Once an object detector is used,