ECCV 2024 | AIGC Paper Roundup (Image Generation, Video Generation, 3D Generation, etc.) with Paper Links and Open-Source Code [Continuously Updated]


Awesome-ECCV2024-AIGC

A Collection of Papers and Codes for ECCV2024 AIGC

A collection of AIGC-related papers and code from ECCV 2024, organized below.

Stars, forks, and PRs are welcome!
Updated first on GitHub: Awesome-ECCV2024-AIGC
Zhihu: https://zhuanlan.zhihu.com/p/706699484

Please credit the source when referencing or reposting.

ECCV 2024 website: https://eccv.ecva.net/

List of accepted ECCV papers:

Full ECCV paper archive:

Conference dates: September 29 to October 4, 2024

Acceptance notification: 2024

【Contents】

  • 1. Image Generation (Image Synthesis)
  • 2. Image Editing
  • 3. Video Generation (Video Synthesis)
  • 4. Video Editing
  • 5. 3D Generation (3D Synthesis)
  • 6. 3D Editing
  • 7. Multi-Modal Large Language Models
  • 8. Others

1. Image Generation (Image Synthesis)

Accelerating Diffusion Sampling with Optimized Time Steps

  • Paper: https://arxiv.org/abs/2402.17376
  • Code: https://github.com/scxue/DM-NonUniform

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

  • Paper: https://arxiv.org/abs/2406.18958
  • Code: https://github.com/open-mmlab/AnyControl

A Watermark-Conditioned Diffusion Model for IP Protection

  • Paper:
  • Code: https://github.com/rmin2000/WaDiff

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

  • Paper: https://arxiv.org/abs/2404.04544
  • Code: https://github.com/gwang-kim/BeyondScene

ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image

  • Paper: https://arxiv.org/abs/2402.11849
  • Code:

Data Augmentation for Saliency Prediction via Latent Diffusion

  • Paper:
  • Code: https://github.com/IVRL/AugSal

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics

  • Paper: https://arxiv.org/abs/2310.17316
  • Code: https://github.com/EnVision-Research/Defect_Spectrum

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

  • Paper:
  • Code: https://github.com/murphytju/DiffFAS

DiffiT: Diffusion Vision Transformers for Image Generation

  • Paper: https://arxiv.org/abs/2312.02139
  • Code: https://github.com/NVlabs/DiffiT

Large-scale Reinforcement Learning for Diffusion Models

  • Paper: https://arxiv.org/abs/2401.12244
  • Code: https://github.com/pinterest/atg-research/tree/main/joint-rl-diffusion

MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation

  • Paper: https://arxiv.org/abs/2405.05806
  • Code: https://github.com/csyxwei/MasterWeaver

Memory-Efficient Fine-Tuning for Quantized Diffusion Model

  • Paper:
  • Code: https://github.com/ugonfor/TuneQDM

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

  • Paper: https://arxiv.org/abs/2403.10983
  • Code: https://github.com/kongzhecn/OMG

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

  • Paper: https://arxiv.org/abs/2403.09176
  • Code: https://github.com/byeongjun-park/Switch-DiT

2. Image Editing

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

  • Paper: https://arxiv.org/abs/2312.03594
  • Code: https://github.com/open-mmlab/PowerPaint

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

  • Paper: https://arxiv.org/abs/2403.06976
  • Code: https://github.com/TencentARC/BrushNet

FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

  • Paper:
  • Code: https://github.com/kookie12/FlexiEdit

StableDrag: Stable Dragging for Point-based Image Editing

  • Paper: https://arxiv.org/abs/2403.04437
  • Code:

TinyBeauty: Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

  • Paper: https://arxiv.org/abs/2403.15033
  • Code: https://github.com/TinyBeauty/TinyBeauty

3. Video Generation (Video Synthesis)

Audio-Synchronized Visual Animation

  • Paper: https://arxiv.org/abs/2403.05659
  • Code: https://github.com/lzhangbj/ASVA

Dyadic Interaction Modeling for Social Behavior Generation

  • Paper: https://arxiv.org/abs/2403.09069
  • Code: https://github.com/Boese0601/Dyadic-Interaction-Modeling

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

  • Paper: https://arxiv.org/abs/2404.01647
  • Code: https://github.com/tanshuai0219/EDTalk

FreeInit: Bridging Initialization Gap in Video Diffusion Models

  • Paper: https://arxiv.org/abs/2312.07537
  • Code: https://github.com/TianxingWu/FreeInit

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

  • Paper: https://arxiv.org/abs/2405.20222
  • Code: https://github.com/MyNiuuu/MOFA-Video

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

  • Paper: https://arxiv.org/abs/2310.01324
  • Code:

4. Video Editing

Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

  • Paper: https://arxiv.org/abs/2403.13745
  • Code: https://github.com/G-U-N/Be-Your-Outpainter

DragAnything: Motion Control for Anything using Entity Representation

  • Paper: https://arxiv.org/abs/2403.07420
  • Code: https://github.com/showlab/DragAnything

5. 3D Generation (3D Synthesis)

EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

  • Paper: https://arxiv.org/abs/2405.00915
  • Code: https://github.com/ymxlzgy/echoscene

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

  • Paper: https://arxiv.org/abs/2305.16037
  • Code: https://github.com/ibrahimethemhamamci/GenerateCT

GVGEN: Text-to-3D Generation with Volumetric Representation

  • Paper:
  • Code: https://github.com/SOTAMak1r/GVGEN

Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

  • Paper: https://arxiv.org/abs/2403.07487
  • Code: https://github.com/steve-zeyu-zhang/MotionMamba

ParCo: Part-Coordinating Text-to-Motion Synthesis

  • Paper: https://arxiv.org/abs/2403.18512
  • Code: https://github.com/qrzou/ParCo

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

  • Paper: https://arxiv.org/abs/2311.17050
  • Code: https://github.com/Yzmblog/SurfD

6. 3D Editing

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

  • Paper: https://arxiv.org/abs/2312.00732
  • Code: https://github.com/lkeab/gaussian-grouping

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

  • Paper: https://arxiv.org/abs/2404.03736
  • Code: https://github.com/JarrentWu1031/SC4D

Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

  • Paper: https://arxiv.org/abs/2403.10050
  • Code: https://github.com/slothfulxtx/Texture-GS

7. Multi-Modal Large Language Models

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

  • Paper: https://arxiv.org/abs/2403.06764
  • Code: https://github.com/pkunlp-icler/FastV

ControlCap: Controllable Region-level Captioning

  • Paper: https://arxiv.org/abs/2401.17910
  • Code: https://github.com/callsys/ControlCap

DriveLM: Driving with Graph Visual Question Answering

  • Paper: https://arxiv.org/abs/2312.14150
  • Code: https://github.com/OpenDriveLab/DriveLM

Elysium: Exploring Object-level Perception in Videos via MLLM

  • Paper: https://arxiv.org/abs/2403.16558
  • Code: https://github.com/Hon-Wong/Elysium

Empowering Multimodal Large Language Model as a Powerful Data Generator

  • Paper:
  • Code: https://github.com/zhaohengyuan1/Genixer

GiT: Towards Generalist Vision Transformer through Universal Language Interface

  • Paper: https://arxiv.org/abs/2403.09394
  • Code: https://github.com/Haiyang-W/GiT

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

  • Paper: https://arxiv.org/abs/2311.17600
  • Code: https://github.com/UCSC-VLAA/vllm-safety-benchmark

Long-CLIP: Unlocking the Long-Text Capability of CLIP

  • Paper: https://arxiv.org/abs/2403.15378
  • Code: https://github.com/beichenzbc/Long-CLIP

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

  • Paper: https://arxiv.org/abs/2403.14624
  • Code: https://github.com/ZrrSkywalker/MathVerse

Merlin: Empowering Multimodal LLMs with Foresight Minds

  • Paper: https://arxiv.org/abs/2312.00589
  • Code: https://github.com/Ahnsun/merlin

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

  • Paper: https://arxiv.org/abs/2403.11755
  • Code: https://github.com/jmiemirza/Meta-Prompting

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

  • Paper:
  • Code: https://github.com/isXinLiu/MM-SafetyBench

PointLLM: Empowering Large Language Models to Understand Point Clouds

  • Paper: https://arxiv.org/abs/2308.16911
  • Code: https://github.com/OpenRobotLab/PointLLM

R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

  • Paper: https://arxiv.org/abs/2403.04924
  • Code: https://github.com/lxa9867/r2bench

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

  • Paper:
  • Code: https://github.com/AI-Application-and-Integration-Lab/SAM4MLLM

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

  • Paper: https://arxiv.org/abs/2311.12793
  • Code: https://github.com/ShareGPT4Omni/ShareGPT4V

ST-LLM: Large Language Models Are Effective Temporal Learners

  • Paper: https://arxiv.org/abs/2404.00308
  • Code: https://github.com/TencentARC/ST-LLM

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

  • Paper: https://arxiv.org/abs/2404.00384
  • Code: https://github.com/shjo-april/TTD

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

  • Paper: https://arxiv.org/abs/2311.17136
  • Code: https://github.com/TIGER-AI-Lab/UniIR

8. Others

Continuously updated~
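Since the list tracks dozens of arXiv links and is updated continuously, pulling the arXiv IDs out of entries programmatically can make maintenance (e.g. deduplication or link checking) easier. A minimal sketch in Python using only the standard library; the function name and sample entry are illustrative, not part of this list:

```python
import re

# arXiv IDs in this list follow the modern YYMM.NNNNN scheme.
ARXIV_RE = re.compile(r"arxiv\.org/abs/(\d{4}\.\d{4,5})")

def extract_arxiv_ids(text):
    """Return all arXiv IDs found in a chunk of the list, in order."""
    return ARXIV_RE.findall(text)

sample = """
Accelerating Diffusion Sampling with Optimized Time Steps
  - Paper: https://arxiv.org/abs/2402.17376
  - Code: https://github.com/scxue/DM-NonUniform
"""
print(extract_arxiv_ids(sample))  # → ['2402.17376']
```

Duplicate IDs reported by such a script would flag copy-paste mistakes between adjacent entries.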


Related Collections

  • Awesome-CVPR2024-AIGC
  • Awesome-AIGC-Research-Groups
  • Awesome-Low-Level-Vision-Research-Groups
  • Awesome-CVPR2024-CVPR2021-CVPR2020-Low-Level-Vision
  • Awesome-ECCV2020-Low-Level-Vision
