1 Title
Hierarchical Text-Conditional Image Generation with CLIP Latents (Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen)
2 Conclusion
Contrastive models like CLIP have been shown to learn robust representations of
images that capture both semantics and style. To leverage these representations for
image generation, this study proposes a two-stage model: a prior that generates a CLIP
image embedding given a text caption, and a decoder that generates an image
conditioned on the image embedding.
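The two-stage structure described above can be sketched with toy stand-ins. Everything below is hypothetical: the prior and decoder are placeholder functions (a fixed linear map and a `tanh` reshape), not the paper's learned diffusion models, and the embedding size is shrunk for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 4  # toy CLIP embedding size; the real model uses a much larger dimension

def prior(text_embedding):
    """Toy stand-in for the prior: maps a CLIP text embedding to a
    predicted CLIP image embedding (here, just a fixed linear map)."""
    W = np.eye(EMB_DIM)  # hypothetical "learned" weights
    return W @ text_embedding

def decoder(image_embedding):
    """Toy stand-in for the diffusion decoder: maps a CLIP image
    embedding to pixels (here, a deterministic 2x2 'image')."""
    return np.tanh(image_embedding).reshape(2, 2)

text_emb = rng.normal(size=EMB_DIM)  # pretend CLIP text embedding of a caption
img_emb = prior(text_emb)            # stage 1: caption -> CLIP image embedding
image = decoder(img_emb)             # stage 2: image embedding -> image
print(image.shape)  # (2, 2)
```

The point of the sketch is only the data flow: the caption never reaches the decoder directly; it is routed through the predicted CLIP image embedding.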
3 Good Sentences
1. We use only spatial convolutions in the model (i.e., no attention layers) and at inference time directly apply the model at the target resolution, observing that it readily generalizes to the higher resolution. We found no benefit from conditioning the upsamplers on the caption, and use unconditional ADMNets [11] with no guidance. (A possible direction for future improvement: attention layers could be added here.)
2. Although we train a prior to generate CLIP image embeddings from captions, the prior is not strictly necessary for caption-to-image generation. For instance, our decoder can condition on both CLIP image embeddings and captions, but the CLIP image embedding is dropped 5% of the time during training in order to enable classifier-free guidance. (The prior is not strictly necessary for text-to-image generation.)
3. Compared to GLIDE, we qualitatively observe that unCLIP is able to generate more diverse images while leveraging the guidance technique to improve sample quality. To understand why, consider Figure 9 where we increase guidance scale for both GLIDE and unCLIP. For GLIDE, the semantics (camera angle, color, size) converge as we increase guidance scale, whereas for unCLIP the semantic information of the scene is frozen in the CLIP image embedding and therefore does not collapse when guiding the decoder. (The advantage of unCLIP over GLIDE.)
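Sentence 2 above mentions dropping the CLIP image embedding 5% of the time to enable classifier-free guidance. A minimal sketch of that mechanism, with hypothetical stand-in arrays for the model's noise predictions (`eps_cond`, `eps_uncond` are not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
DROP_PROB = 0.05  # the paper drops the CLIP image embedding 5% of the time

def maybe_drop_embedding(image_emb, p=DROP_PROB):
    """During training, replace the conditioning embedding with zeros with
    probability p, so the model also learns an unconditional mode."""
    if rng.random() < p:
        return np.zeros_like(image_emb)
    return image_emb

def guided_noise_prediction(eps_cond, eps_uncond, scale):
    """Classifier-free guidance at sampling time:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# toy example: guidance pushes the prediction past the conditional one
eps_c = np.array([1.0, 0.0])
eps_u = np.array([0.0, 0.0])
print(guided_noise_prediction(eps_c, eps_u, scale=3.0))  # [3. 0.]
```

Because the conditioning that guidance amplifies is only the decoder's input, not the CLIP image embedding itself, the scene semantics stay fixed while guidance sharpens the sample, which is the diversity argument made in sentence 3.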
This paper combines two ideas, zero-shot capability and diffusion models, for text-conditional image generation. The work proposes a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on that image embedding.
The first thing worth noting is CLIP's ability to break free of predefined label sets, i.e., zero-shot classification: its labels are flexible. Two labels make a binary classification task, ten labels make a ten-way task, and the number of classes never needs to be fixed in advance. When using guidance, unCLIP, unlike GLIDE, avoids the collapse problem (where, as the guidance scale increases, sample diversity shrinks until the generated images all look nearly identical). But CLIP has its own weakness: it tends to confuse attribute binding when multiple objects are present, and unCLIP does even worse here, with a more severe attribute-binding problem.
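The flexible-label point above can be sketched as CLIP-style zero-shot classification: score an image embedding against the text embedding of each candidate label by cosine similarity and pick the best match. The embeddings below are hypothetical 2-D toys, not real CLIP outputs.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs):
    """CLIP-style zero-shot classification: cosine similarity between the
    image embedding and each label's text embedding, then argmax.
    The label set is arbitrary: 2 labels gives a binary task, 10 labels a
    ten-way task, with no retraining and no predefined class count."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = label_embs @ image_emb
    return int(np.argmax(sims))

# hypothetical embeddings: the image is closest to label 1
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
img = np.array([0.1, 0.9])
print(zero_shot_classify(img, labels))  # 1
```

Swapping in a different `labels` matrix changes the task entirely, which is exactly the flexibility that "breaking predefined labels" refers to.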