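Contents
- Background
- 1 Core Idea
- 2 Method
- 2.1 Problem Formulation
- 2.2 Data Engineering
- 2.2.1 Defining the Image-Editing Task Categories
- 2.2.2 Generating the Instruction Set
- 2.2.3 Generating the Image Pairs
- 3 Results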
Paper: https://emu-edit.metademolab.com/assets/emu_edit.pdf
Project web: https://emu-edit.metademolab.com/
Code: not open-sourced yet
Background
After releasing Emu, Meta recently published two impressive follow-up works: EmuEdit and EmuVideo. This post summarizes the techniques behind EmuEdit.
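1 Core Idea
The authors formulate instruction-based image editing as a generative task and solve it with a diffusion model. There are two main contributions:
- They define in detail the tasks that instruction-based image editing covers, and design an efficient method for building a high-quality dataset.
- To improve the model's understanding of the instruction, they introduce a learnable task embedding. They also propose task inversion, a training method that extends the model to a new task with only a small amount of data (similar in spirit to textual inversion).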
2 Method
2.1 Problem Formulation
As noted above, the authors formulate a family of instruction-based image editing tasks as a generative task and solve it with a diffusion model. Concretely, the task is this: given a reference image and a text instruction, output an image that satisfies both conditions. It follows that the training data must consist of at least triples, $\mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i}) \mid i = 1, \cdots, N\}$, where
$c_I$: the reference image (image condition)
$c_T$: the reference text (text condition)
$x$: the target image
The diffusion-model training objective can then be written as:
$$\min_{\theta} \mathbb{E}_{y, \epsilon, t} \left[ \Vert \epsilon - \epsilon_{\theta}(z_t, t, E(c_I), c_T) \Vert_2^2 \right] \tag{1}$$
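Unlike the classic classifier-free setup, there is an additional reference-image condition $E(c_I)$ here. The conditions are injected as follows:
- Following InstructPix2Pix, the image condition is fed in at the input layer by concatenating it along the channel dimension.
- Following the classifier-free setup, the text condition is injected through cross-attention.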
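The channel-wise concatenation and the noise-prediction loss of Eq. (1) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: latents are plain lists of channels, `toy_eps_model` is a hypothetical stand-in for the U-Net $\epsilon_\theta$, the sizes `C` and `P` are arbitrary, and the text condition (cross-attention) is left out entirely.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [[a * z + b * e for z, e in zip(zc, ec)] for zc, ec in zip(z0, eps)]


def concat_channels(z_t, image_latent):
    """Channel-wise concatenation; both inputs are lists of channels."""
    return z_t + image_latent


def toy_eps_model(unet_input, n_channels):
    """Hypothetical stand-in for epsilon_theta: echoes the noisy-latent channels."""
    return unet_input[:n_channels]


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (Eq. 1 up to a constant factor)."""
    sq = [(p - e) ** 2 for pc, ec in zip(eps_pred, eps) for p, e in zip(pc, ec)]
    return sum(sq) / len(sq)


rng = random.Random(0)
C, P = 4, 6  # toy sizes: 4 latent channels with 6 values each


def random_latent():
    return [[rng.gauss(0.0, 1.0) for _ in range(P)] for _ in range(C)]


z0 = random_latent()            # latent of the target image x
image_latent = random_latent()  # E(c_I), the encoded reference image
eps = random_latent()           # sampled Gaussian noise

z_t = add_noise(z0, eps, alpha_bar_t=0.5)
unet_input = concat_channels(z_t, image_latent)  # now has 2 * C channels
loss = noise_prediction_loss(toy_eps_model(unet_input, C), eps)
```

The only structural point the sketch makes is that the denoiser's first layer sees twice as many input channels as an unconditional model would.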
In their experiments, the authors found that a model trained this way does not identify the intended task accurately enough, as the figure below shows. They therefore introduce a learnable task embedding to strengthen the model's understanding of the task. The objective becomes:
$$\min_{\theta, v_1, \dots, v_k} \mathbb{E}_{\hat{y}, \epsilon, t} \left[ \Vert \epsilon - \epsilon_{\theta}(z_t, t, E(c_I), c_T, v_i) \Vert_2^2 \right] \tag{2}$$
To optimize this objective, each element of the training set must be a quadruple, $\mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v_j^{i}) \mid i = 1, \cdots, N\}$, where $v_j$ identifies the task category the example belongs to. The noise-prediction network accordingly takes the task embedding $v_i$ as an extra condition. The authors inject it by adding it to the time-step embedding and through cross-attention. This design also keeps the model extensible: when a new task arrives, the objective becomes
$$\min_{v_{\mathrm{new}}} \mathbb{E}_{y, \epsilon, t} \left[ \Vert \epsilon - \epsilon_{\theta}(z_t, t, E(c_I), c_T, v_{\mathrm{new}}) \Vert_2^2 \right] \tag{3}$$
Here only the newly added task embedding is trained; all other parameters are frozen. The authors call this task inversion (analogous to textual inversion).
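The idea behind Eq. (3), optimizing a single new embedding while everything else stays fixed, can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: `predict` is a scalar stand-in for the denoiser, `w_frozen` plays the role of the frozen parameters $\theta$, and the data is synthetic.

```python
def predict(x, w, v):
    """Toy stand-in for the denoiser: frozen weight w plus an additive task embedding v."""
    return w * x + v


def task_inversion(samples, w, steps=200, lr=0.1):
    """Gradient descent on the new task embedding only; w is never updated."""
    v_new = 0.0  # the single trainable parameter
    for _ in range(steps):
        # d/dv of the mean squared error between prediction and target
        grad = sum(2.0 * (predict(x, w, v_new) - y) for x, y in samples) / len(samples)
        v_new -= lr * grad
    return v_new


w_frozen = 2.0
# Synthetic "new task" data generated with a hidden embedding of 0.7.
samples = [(x, 2.0 * x + 0.7) for x in (-1.0, 0.0, 1.0, 2.0)]
v_new = task_inversion(samples, w_frozen)
```

Because only one small vector is learned, a handful of examples is enough to fit it, which is the property the authors rely on when extending the model to unseen tasks.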
At inference time, users do not need to supply a task index: the authors trained a task-index predictor based on Flan-T5-XL that predicts the task index from the user's instruction.
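The inference-time flow (instruction, then predicted task index, then embedding lookup) might look like the sketch below. The keyword matcher is only a placeholder for the fine-tuned Flan-T5-XL classifier, and the task labels, keywords, and embedding values are all hypothetical.

```python
# Hypothetical subset of task labels; the real model uses the paper's full task list.
TASKS = ["add", "remove", "style"]

# Placeholder keywords standing in for the fine-tuned Flan-T5-XL predictor.
KEYWORDS = {
    "add": ("add", "include", "insert", "put"),
    "remove": ("remove", "delete", "erase"),
    "style": ("style", "painting", "sketch"),
}

# Hypothetical learned task embeddings v_1..v_k (2-dimensional for readability).
task_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]


def predict_task_index(instruction):
    """Return the index of the first task whose keywords appear in the instruction."""
    words = instruction.lower().split()
    for idx, task in enumerate(TASKS):
        if any(keyword in words for keyword in KEYWORDS[task]):
            return idx
    return 0  # arbitrary fallback for this sketch


def select_task_embedding(instruction):
    """Look up the task embedding that conditions the denoiser."""
    return task_embeddings[predict_task_index(instruction)]
```

For instance, `select_task_embedding("remove the dog")` returns the second embedding, so the user never has to name the task explicitly.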
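Conceptually, the method above is not hard to come up with. The strong results reported in the paper depend largely on the training dataset. The next section looks at how the authors build a high-quality dataset efficiently.
2.2 Data Engineering
As noted earlier, training an image-editing diffusion model requires at least triples $\mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i}) \mid i = 1, \cdots, N\}$, where $c_I$ is the reference image (image condition), $c_T$ is the reference text (text condition), and $x$ is the target image. Building such a dataset by hand is very expensive, open-source datasets are not large enough, and the large synthetic datasets that exist fall short on diversity and quality. The question is therefore how to build a high-quality, large-scale, highly diverse image-editing dataset cheaply. To support task inversion, the new dataset should consist of quadruples $\mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v_j^{i}) \mid i = 1, \cdots, N\}$, where $v_j$ identifies the task category the example belongs to.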
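As a rough picture of what one record of $\mathcal{D}_{new}$ could look like in code, here is a small sketch. The class name, field names, and file paths are hypothetical; images are represented by paths only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """One quadruple (c_T, c_I, x, v_j); all field names are illustrative."""
    instruction: str   # c_T: the text condition
    input_image: str   # c_I: path to the reference image
    target_image: str  # x: path to the edited target image
    task_index: int    # j: which task embedding v_j this sample uses


def count_per_task(samples):
    """Count samples per task index, e.g. to inspect the task mix of the dataset."""
    counts = {}
    for sample in samples:
        counts[sample.task_index] = counts.get(sample.task_index, 0) + 1
    return counts


dataset = [
    EditSample("include a hat", "cat.png", "cat_hat.png", task_index=0),
    EditSample("add flowers in the windows", "plane.png", "plane_flowers.png", task_index=0),
    EditSample("remove the mojito", "cat.png", "cat_no_mojito.png", task_index=1),
]
task_counts = count_per_task(dataset)
```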
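2.2.1 Defining the Image-Editing Task Categories
The authors group image editing into three broad categories, Region-based Editing, Free-Form Editing, and Vision tasks, each containing several sub-tasks. The figure below shows what each image-editing task does.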
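2.2.2 Generating the Instruction Set
Task definition: given an image caption and an editing task, output a new caption that reflects the edit.
- Input: image caption + edit task
- Output: an edit instruction, which should contain: 1) the edit command; 2) the object of the edit (edited object); 3) the new image caption; 4) the original object (original object). Section 7.2 of the paper mentions this last field, but the example in Section 7.1 does not include it. It should be present, because the later mask extraction cannot be performed without it.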
For example, for the Add image-editing task:
Input: {"image_caption": "Beautiful cat with mojito sitting in a cafe on the street", "task": "Add"}
Output: {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", "original object": "cat"}
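Since an LLM does not always return well-formed JSON, a pipeline like this one presumably needs a validation step before a response is used. The sketch below is one way to do it, not something described in the paper: `parse_edit_response` and `REQUIRED_FIELDS` are hypothetical names, and "original object" is treated as optional because the paper's own example omits it.

```python
import json

# Fields the prompt asks the model to return.
REQUIRED_FIELDS = ("edit", "edited object", "output")


def parse_edit_response(raw):
    """Parse the model's reply; return the dict, or None if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    return data


reply = (
    '{"edit": "include a hat", "edited object": "hat", '
    '"output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", '
    '"original object": "cat"}'
)
parsed = parse_edit_response(reply)
```

Replies that fail the check can simply be dropped and regenerated, which keeps malformed samples out of the dataset.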
The authors implement this with in-context learning. Their prompt construction is shown below. (The LLM they use is a fine-tuned 70B Llama 2; I tried ChatGPT and it produced similar results.)
import random

import torch
from random import choice, shuffle

# Few-shot examples for the "Add" task: one user caption and one JSON answer each.
few_shot_examples = [
    """[INST]User: "Beautiful cat with mojito sitting in a cafe on the street"[/INST] Assistant: {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street"}""",
    """[INST]User: "robot playing chess at home."[/INST] Assistant: {"edit": "add a cheerful smiling face.", "edited object": "robot", "output": "robot playing chess at home with a cheerful smiling face."}""",
    """[INST]User: "A cute creature sits at the beach."[/INST] Assistant: {"edit": "set a dog besides the creature", "edited object": "dog", "output": "A cute creature and a dog sit at the beach."}""",
    """[INST]User: "Superhero on the street in sunny day working on his tablet."[/INST] Assistant: {"edit": "put a vintage tie on the superhero.", "edited object": "tie", "output": "Superhero with a vintage tie on the street in sunny day working on his tablet."}""",
    """[INST]User: "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy"[/INST] Assistant: {"edit": "together with a dog on the left", "edited object": "dog", "output": "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy, dog on the left"}""",
    """[INST]User: "horse on a red Boat Near Mountains During Golden Hour"[/INST] Assistant: {"edit": "give the horse sunglassess", "edited object": "sunglassess", "output": "horse with sunglassess on a red Boat Near Mountains During Golden Hour"}""",
    """[INST]User: "An animal family on studio background."[/INST] Assistant: {"edit": "make them hold a teddy bear.", "edited object": "teddy bear", "output": "An animal family holding a teddy bear on studio background."}""",
    """[INST]User: "Baked Salmon With Bell Peppers"[/INST] Assistant: {"edit": "insert kale pesto to the dish", "edited object": "kale pesto", "output": "Baked Salmon With Kale Pesto And Bell Peppers"}""",
    """[INST]User: "An airplaine is flying in the sky in rainy day."[/INST] Assistant: {"edit": "add flowers in the windows", "edited object": "flowers", "output": "An airplaine with flowers in the windows is flying in the sky in rainy day."}""",
    """[INST]User: "photo of mountains and trees"[/INST] Assistant: {"edit": "position a castle between the trees", "edited object": "castle", "output": "photo of mountains, trees and castle between the trees"}""",
    """[INST]User: "Little bunny in the park"[/INST] Assistant: {"edit": "Make the bunny play with a kite.", "edited object": "kite", "output": "Little bunny playing with a kite in the park"}""",
    """[INST]User: "Attic Bedroom With Large Ceilings"[/INST] Assistant: {"edit": "decorate the room with beautiful chandeliers", "edited object": "chandeliers", "output": "Attic Bedroom With Beautiful Chandeliers on Large Ceilings"}""",
    """[INST]User: "Wedding rings and yellow flower on a red background"[/INST] Assistant: {"edit": "place a bird by the yellow flower", "edited object": "bird", "output": "Wedding rings, a bird, and yellow flower on a red background"}""",
    """[INST]User: "Tree Near the lake in the morning"[/INST] Assistant: {"edit": "Give it autumn leaves on top", "edited object": "leaves", "output": "Tree with autumn leaves on top Near the lake in the morning"}""",
    """[INST]User: "robot and alien sitting on hanging bridge at daytime"[/INST] Assistant: {"edit": "make them hold three books.", "edited object": "threebooks", "output": "robot and alien holding three books while sitting on hanging bridge at daytime"}""",
    """[INST]User: "Skogafoss waterfall in the south of Iceland"[/INST] Assistant: {"edit": "Set a colorful rainbow in the backhground!", "edited object": "rainboe", "output": "Skogafoss waterfall with a colorful rainbow in the south of Iceland"}""",
    """[INST]User: "Polar Bear with rubber gloves pushing shopping carts"[/INST] Assistant: {"edit": "Make it wear a coat", "edited object": "coat", "output": "Polar Bear with a coat pushing shopping carts"}""",
]


def get_content_instruction(new_prompt):
    # Pick one verb at random so the wording of the request varies between prompts.
    optional_verbs = choice(["include", "place", "position", "set", "incorporate", "alongside", " give", "put", "insert", "together with", "with", "make", "integrate", "have", "append", " make", "add", "include"])

    # system message
    system_message = """<<SYS>>You are an assistant that only speaks JSON. Do not write normal text. The assistant answer is JSON with the following string fields: 'edit', 'edited object','output'. Here is the latest conversation between Assistant and User.<</SYS>>"""

    # introduction message (literal braces are doubled because this is an f-string)
    intro_message = f"""[INST]User: Hi, My job to take a given caption ('input') and to output the following: an instruction for {optional_verbs} an object to the image ('edit'), the object to {optional_verbs} ('edited object'), and the caption with the object ('output'). Please help me do it. I will give you the 'input', and you will help. When you reply, use the following format: {{"edit": '<instruction>', 'edited object': '<object>', 'output': '<caption>'}}[/INST]Assistant: Sure, I'd be happy to help! Please provide the actual input caption you'd like me to read and I'll assist you with writing an instruction to {optional_verbs} an object to the image, writing the added object and writing the caption with the object."""

    # Reseed, then shuffle a copy of the examples and keep 60% of them.
    # Working on a copy leaves the module-level list untouched between calls.
    random.seed(torch.randint(1 << 32, ()).item())
    examples = list(few_shot_examples)
    shuffle(examples)
    examples = examples[:int(len(examples) * 0.6)]

    prompt = system_message + intro_message + "".join(examples)
    # add the test prompt
    prompt = prompt + f"[INST]User: {new_prompt}[/INST]"
    return prompt
2.2.3 Generating the Image Pairs
The steps above give us $c_T$, $c_I$, and $v_j$ of the quadruple $(c_T, c_I, x, v_j)$. $c_T$ also carries additional information, such as the edited object and the new image caption, for example:
{"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", "original object": "cat"}
What remains is to produce the corresponding paired image ($x$) from these conditions.
Goal: given the input image and the instruction information, generate the paired image ($x$) such that, outside the edited region, $x$ differs from $c_I$ as little as possible.
$$\begin{aligned} &\max \ \mathrm{SIM}(\mathrm{Instruction}, c_I^{\mathrm{Edit}}) \\ &\min \ \mathrm{Dist}(x, c_I^{\mathrm{notEdit}}) \end{aligned} \tag{4}$$
To meet this goal, the authors propose mask-based attention control, which is roughly a combination of DiffEdit and P2P (Prompt-to-Prompt). It consists of the following steps.
Given:
- $\mathrm{Cap_{ori}}$: the image caption.
Example: Beautiful cat with mojito sitting in a cafe on the street
- $\mathrm{Img_{ori}}$: the image generated from the image caption by a diffusion model.
- $\mathrm{Cap_{edit}}$: the edited image caption.
Example: Beautiful cat wearing a hat with mojito sitting in a cafe on the street
- $\mathrm{Obj_{ori}}$: the original object in the image caption.
Example: cat
- $\mathrm{Obj_{edit}}$: the edited object.
Example: hat
STEP 1: Extract masks. $\mathrm{Obj_{ori}}$ and $\mathrm{Img_{ori}}$ are fed into a SAM + DINO model, which yields three kinds of mask:
- A precise mask, produced directly by SAM + DINO.
- The precise mask after dilation followed by Gaussian blur.
- The bounding box of the precise mask.
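The three mask variants can be sketched on a small binary grid. This is a simplified illustration with assumed details: the precise mask is hard-coded rather than produced by SAM + DINO, the dilation radius is arbitrary, and a 3x3 binomial kernel stands in for the Gaussian blur, whose parameters the post does not give.

```python
def dilate(mask, radius=1):
    """Binary dilation with a square (2 * radius + 1) structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def blur(mask):
    """3x3 binomial blur (weights 1-2-1 per axis, zero padding), a small Gaussian stand-in."""
    k = (1, 2, 1)
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += k[dy + 1] * k[dx + 1] * mask[ny][nx]
            out[y][x] = total / 16.0
    return out


def bounding_box_mask(mask):
    """Fill the tight bounding box around all non-zero pixels."""
    h, w = len(mask), len(mask[0])
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return [[0] * w for _ in range(h)]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    return [[1 if y0 <= y <= y1 and x0 <= x <= x1 else 0 for x in range(w)] for y in range(h)]


# Hypothetical precise mask (an L-shaped object on a 5x5 grid).
precise = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
soft = blur(dilate(precise))       # variant 2: dilated, then blurred
box = bounding_box_mask(precise)   # variant 3: bounding box
```

The three variants trade precision for slack: the looser masks give the generator more room when the added object (a hat, say) extends beyond the original object's outline.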
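STEP 2: Generate the image with mask-based attention control. First, the cross-attention control of P2P injects the attention maps of the tokens common to both captions; then the result is blended with the original according to the mask, as in DiffEdit.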
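The mask-based blending half of this step reduces to a per-pixel interpolation, sketched below on plain nested lists. The function name and the toy values are illustrative; in a real pipeline the blend would typically be applied to latents during denoising, and the cross-attention injection from P2P is not shown here.

```python
def blend_with_mask(edited, original, mask):
    """Per-pixel blend: mask * edited + (1 - mask) * original."""
    return [
        [m * e + (1.0 - m) * o for e, o, m in zip(edited_row, original_row, mask_row)]
        for edited_row, original_row, mask_row in zip(edited, original, mask)
    ]


edited = [[10.0, 10.0, 10.0]]    # values from the edited generation
original = [[0.0, 0.0, 0.0]]     # values from the original image
mask = [[1.0, 0.5, 0.0]]         # 1 = inside the edit region, 0 = outside
blended = blend_with_mask(edited, original, mask)
```

Where the mask is 0 the original content is copied through unchanged, which is what keeps $x$ close to $c_I$ outside the edited region; a soft mask gives a smooth transition at the boundary.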
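STEP 3: Filter the images. The steps above produce three candidate target images, and only the best one is kept. The filtering rules are:
- Use CLIP filtering metrics and keep the most relevant candidate.
- Keep the candidate whose depth map has the smallest L1 distance to that of the input image.
- …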
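One possible way to combine the first two rules is sketched below. How the paper actually combines or thresholds its filters is not covered in this post, so the ranking rule (highest CLIP score first, depth distance as tie-breaker), the field names, and all scores are assumptions; the scores would come from a CLIP model and a depth estimator in practice.

```python
def depth_l1(depth_a, depth_b):
    """Mean absolute difference between two depth maps of equal size."""
    diffs = [abs(a - b) for row_a, row_b in zip(depth_a, depth_b) for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def select_best(candidates):
    """Keep the candidate with the highest CLIP score; break ties by smaller depth distance."""
    return min(candidates, key=lambda c: (-c["clip_score"], c["depth_l1"]))


# Hypothetical scores for the three candidates, one per mask variant.
candidates = [
    {"name": "precise_mask", "clip_score": 0.30, "depth_l1": 0.12},
    {"name": "dilated_mask", "clip_score": 0.34, "depth_l1": 0.20},
    {"name": "bbox_mask", "clip_score": 0.34, "depth_l1": 0.15},
]
best = select_best(candidates)
example_distance = depth_l1([[1.0, 2.0]], [[1.5, 1.0]])
```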
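The data-construction details for each type of edit are given in Section 7.2.3 of the paper.
The resulting proportions of training data per task are as follows: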
3 Results
EmuEdit achieves state-of-the-art results on several test sets. The authors also released a new benchmark built around EmuEdit: https://huggingface.co/datasets/facebook/emu_edit_test_set
Some impressive results: