GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
Yiwen Chen*1,2 Zilong Chen*3,5 Chi Zhang2 Feng Wang3 Xiaofeng Yang2
Yikai Wang3 Zhongang Cai4 Lei Yang4 Huaping Liu3 Guosheng Lin**1,2
1S-Lab, Nanyang Technological University
2School of Computer Science and Engineering, Nanyang Technological University
3Department of Computer Science and Technology, Tsinghua University
4SenseTime Research 5ShengShu
4SenseTime Research 5ShengShu
Abstract
3D editing plays a crucial role in many areas such as gaming and virtual reality. Traditional 3D editing methods, which rely on representations like meshes and point clouds, often fall short in realistically depicting complex scenes. On the other hand, methods based on implicit 3D representations, like Neural Radiance Field (NeRF), render complex scenes effectively but suffer from slow processing speeds and limited control over specific scene areas. In response to these challenges, our paper presents GaussianEditor, an innovative and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D representation. GaussianEditor enhances precision and control in editing through our proposed Gaussian semantic tracing, which traces the editing target throughout the training process. Additionally, we propose Hierarchical Gaussian Splatting (HGS) to achieve stabilized and fine results under stochastic generative guidance from 2D diffusion models. We also develop editing strategies for efficient object removal and integration, a challenging task for existing methods. Our comprehensive experiments demonstrate GaussianEditor's superior control, efficacy, and rapid performance, marking a significant advancement in 3D editing.
Figure 1: Results of GaussianEditor. GaussianEditor offers swift, controllable, and versatile 3D editing. A single editing session only takes 5-10 minutes. Please note our precise editing control, where only the desired parts are modified. Taking the “Make the grass on fire” example from the first row of the figure, other objects in the scene such as the bench and tree remain unaffected.
** Corresponding author.
* The first two authors contributed equally to this work.
1 Introduction
In the evolving field of computer vision, the development of user-friendly 3D representations and editing algorithms is a key objective. Such technologies are vital in various applications, ranging from digital gaming to the growing MetaVerse. Traditional 3D representations like meshes and point clouds have been preferred due to their interactive editing capabilities. However, these methods face challenges in accurately rendering complex 3D scenes.
The recent rise of implicit 3D representations, exemplified by the Neural Radiance Field (NeRF) [28], represents a paradigm shift in 3D scene rendering. NeRF's capacity for high-fidelity rendering, coupled with its implicit nature that offers significant expansibility, marks a substantial improvement over conventional approaches [2, 55, 32]. This dual advantage has placed a significant focus on the NeRF framework in 3D editing [45, 46, 12, 57, 31], establishing it as a foundational approach for a considerable duration. However, NeRF's reliance on high-dimensional multilayer perceptron (MLP) networks for scene data encoding presents limitations. It restricts direct modification of specific scene parts and complicates tasks like inpainting and scene composition. This complexity extends to the training and rendering processes, hindering practical applications.
In light of these challenges, our research is focused on developing an advanced 3D editing algorithm. This algorithm aims for flexible and rapid editing of 3D scenes, integrating both implicit editing, like text-based editing, and explicit control, such as bounding box usage for specific area modifications. To achieve these goals, we choose Gaussian Splatting (GS) [15] for its real-time rendering and explicit point cloud-like representations.
However, editing Gaussian Splatting (GS) [15] faces distinct challenges. A primary issue is the absence of efficient methods to accurately identify target Gaussians, which is crucial for precise controllable editing. Moreover, it has been observed in [7, 44, 52] that optimizing Gaussian Splatting (GS) using highly random generative guidance like Score Distillation Sampling [36] poses significant challenges. One possible explanation is that, unlike implicit representations buffered by neural networks, GS is directly affected by the randomness in loss. Such direct exposure results in unstable updates, as the properties of Gaussians are directly changed during training. Besides, each training step of GS may involve updates to a vast number of Gaussian points. This process occurs without the moderating influence of neural network-style buffering mechanisms. As a result, the excessive fluidity of the 3D GS scene hinders its ability to converge to finely detailed results like implicit representations when trained with generative guidance.
To counter these issues, in this work, we propose GaussianEditor, a novel, swift, and highly controllable 3D editing algorithm for Gaussian Splatting. GaussianEditor can fulfill various high-quality editing needs within minutes. A key feature of our method is the introduction of Gaussian semantic tracing, which enables precise control over Gaussian Splatting (GS). Gaussian semantic tracing consistently identifies the Gaussians requiring editing at every moment during training. This contrasts with traditional 3D editing methods that often depend on static 2D or 3D masks. Such masks become less effective as the geometries and appearances of 3D models evolve during training. Gaussian semantic tracing is achieved by unprojecting 2D segmentation masks into 3D Gaussians and assigning each Gaussian a semantic tag. As the Gaussians evolve during training, these semantic tags enable the tracking of the specific Gaussians targeted for editing. Our Gaussian tracing algorithm ensures that only the targeted areas are modified, enabling precise and controllable editing.
Additionally, to tackle the significant challenge of Gaussian Splatting (GS) struggling to fit fine results under highly random generative guidance, we propose a novel GS representation: hierarchical Gaussian splatting (HGS). In HGS, Gaussians are organized into generations based on their sequence in multiple densification processes during training. Gaussians formed in earlier densification stages are deemed older generations and are subject to stricter constraints, aimed at preserving their original state and thus reducing their mobility. Conversely, those formed in later stages are considered younger generations and are subjected to fewer or no constraints, allowing for more adaptability. HGS’s design effectively moderates the fluidity of GS by imposing restrictions on older generations while preserving the flexibility of newer generations. This approach enables continuous optimization towards better outcomes, thereby simulating the buffering function achieved in implicit representations through neural networks. Our experiments also demonstrate that HGS is more adept at adapting to highly random generative guidance.
Finally, we have specifically designed a 3D inpainting algorithm for Gaussian Splatting (GS). As demonstrated in Fig. 1, we have successfully removed specific objects from scenes and seamlessly integrated new objects into designated areas. For object removal, we developed a specialized local repair algorithm that efficiently eliminates artifacts at the intersection of the object and the scene. For adding objects, we first request users to provide a prompt and a 2D inpainting mask for a particular view of the GS. Subsequently, we employ a 2D inpainting method to generate a single-view image of the object to be added. This image is then transformed into a coarse 3D mesh using image-to-3D conversion techniques. The 3D mesh is subsequently converted into the HGS representation and refined. Finally, this refined representation is concatenated into the original GS. The entire inpainting process described above is completed within 5 minutes.
GaussianEditor offers swift, controllable, and versatile 3D editing. A single editing session typically only takes 5-10 minutes, significantly faster than previous editing processes. Our contributions can be summarized in four aspects:
1. We have introduced Gaussian semantic tracing, enabling more detailed and effective editing control.
2. We propose Hierarchical Gaussian Splatting (HGS), a novel GS representation capable of converging more stably to refined results under highly random generative guidance.
3. We have specifically designed a 3D inpainting algorithm for Gaussian Splatting, which allows swift removal and addition of objects.
4. Extensive experiments demonstrate that our method surpasses previous 3D editing methods in terms of effectiveness, speed, and controllability.
2 Related Works
2.1 3D Representations
Various 3D representations have been proposed to address diverse 3D tasks. The groundbreaking work, Neural Radiance Fields (NeRF) [28], employs volumetric rendering and has gained popularity for enabling 3D optimization with only 2D supervision. However, optimizing NeRF can be time-consuming, despite its wide usage in 3D reconstruction [21, 6, 3, 13] and generation [35, 22] tasks.
While efforts have been made to accelerate NeRF training [29, 40], these approaches primarily focus on the reconstruction setting, leaving the generation setting less optimized. The common technique of spatial pruning does not effectively speed up the generation setting.
Recently, 3D Gaussian splatting [15] has emerged as an alternative 3D representation to NeRF, showcasing impressive quality and speed in 3D and 4D reconstruction tasks [15, 26, 50, 47, 51]. It has also attracted considerable research interest in the field of generation [7, 44, 52]. Its efficient differentiable rendering implementation and model design facilitate fast training without the need for spatial pruning.
In this work, we pioneer the adaptation of 3D Gaussian splatting to 3D editing tasks, aiming to achieve swift and controllable 3D editing, harnessing the advantages of this representation for the first time in this context.
2.2 3D Editing
Editing neural fields is inherently challenging due to the intricate interplay between their shape and appearance. EditNeRF [24] stands as a pioneering work in this domain, as they edit both the shape and color of neural fields by conditioning them on latent codes. Additionally, some works [45, 46, 10, 1] leverage CLIP models to facilitate editing through the use of text prompts or reference images.
Another line of research focuses on predefined template models or skeletons to support actions like re-posing or re-rendering within specific categories [33, 30]. Geometry-based methods [54, 49, 48, 20] translate neural fields into meshes and synchronize mesh deformation with implicit fields. Additionally, 3D editing techniques involve combining 2D image manipulation, such as inpainting, with neural fields training [23, 19].
Concurrent works [31, 57] leverage static 2D and 3D masks to constrain the edit area of NeRF. However, these approaches have their limitations because the training of 3D models is a dynamic process, and static masks cannot effectively constrain it. In contrast, our research employs Gaussian semantic tracing to track the target Gaussian throughout the entire training process.
3 Preliminary
3.1 3D Gaussian Splatting
GS (Gaussian Splatting) [15] represents an explicit 3D scene using point clouds, where Gaussians are employed to depict the scene's structure. In this representation, every Gaussian is defined by a center point, denoted as $x$, and a covariance matrix $\Sigma$. The center point $x$ is commonly known as the Gaussian's mean value:

$$G(x) = e^{-\frac{1}{2} x^{T} \Sigma^{-1} x}. \tag{1}$$
The covariance matrix $\Sigma$ can be decomposed into a rotation matrix $\mathbf{R}$ and a scaling matrix $\mathbf{S}$ for differentiable optimization:

$$\Sigma = \mathbf{R}\mathbf{S}\mathbf{S}^{T}\mathbf{R}^{T}, \tag{2}$$
where the calculation of the gradient flow is detailed in [15].
For rendering new viewpoints, the method of splatting, as described in [53], is utilized for positioning the Gaussians on the camera planes. This technique, originally presented in [58], involves a viewing transformation denoted by $W$ and the Jacobian $J$ of the affine approximation of the projective transformation. Using these, the covariance matrix $\Sigma'$ in camera coordinates is determined as follows:

$$\Sigma' = J W \Sigma W^{T} J^{T}. \tag{3}$$
To summarize, each Gaussian point in the model is characterized by a set of attributes: its position, denoted as $x \in \mathbb{R}^{3}$, its color represented by spherical harmonics coefficients $c \in \mathbb{R}^{k}$ (where $k$ indicates the degrees of freedom), its opacity $\alpha \in \mathbb{R}$, a rotation quaternion $q \in \mathbb{R}^{4}$, and a scaling factor $s \in \mathbb{R}^{3}$.
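As a concrete illustration (a minimal NumPy sketch of our own, not code from the paper), the covariance of Eq. 2 can be assembled from a Gaussian's rotation quaternion $q$ and scaling factor $s$:

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix R."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Sigma = R S S^T R^T (Eq. 2), with S = diag(s)."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

Sigma = covariance(q=np.array([1.0, 0.0, 0.0, 0.0]),   # identity rotation
                   s=np.array([0.1, 0.2, 0.3]))        # per-axis scales
assert np.allclose(Sigma, Sigma.T)                     # symmetric, as required
```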
(Figure 2 panels, left to right: Before Editing; During Editing.)
Figure 2: Illustration of Gaussian semantic tracing. Prompt: Turn him into an old lady. The red mask in the images represents the projection of the Gaussians that will be updated and densified. The dynamic change of the masked area during the training process, as driven by the updating of Gaussians, ensures consistent effectiveness throughout the training duration. Despite starting with potentially inaccurate segmentation masks due to 2D segmentation errors, Gaussian semantic tracing still guarantees high-quality editing results.
Particularly, for every pixel, the color and opacity of all Gaussians are calculated based on the Gaussian's representation as described in Eq. 1. The blending process of $N$ ordered points overlapping a pixel follows a specific formula:

$$C = \sum_{i \in N} c_{i} \alpha_{i} \prod_{j=1}^{i-1} (1 - \alpha_{j}), \tag{4}$$
where $c_{i}$ and $\alpha_{i}$ signify the color and density of a given point, respectively. These values are determined by a Gaussian with a covariance matrix $\Sigma$, which is then scaled by optimizable per-point opacity and spherical harmonics (SH) color coefficients.
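A minimal sketch of the compositing rule in Eq. 4 (illustrative only; it assumes the per-point colors $c_i$ and opacities $\alpha_i$ overlapping the pixel have already been evaluated and depth-sorted, nearest first):

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending of N depth-ordered points (Eq. 4):
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    C = np.zeros(3)
    transmittance = 1.0            # running prod_{j<i} (1 - alpha_j)
    for c_i, a_i in zip(colors, alphas):
        C += c_i * a_i * transmittance
        transmittance *= 1.0 - a_i
    return C

# Two points over one pixel, nearest first: the front point dominates.
pixel = composite_pixel(
    colors=[np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])],
    alphas=[0.6, 0.9])
```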
3.2 Diffusion-based Editing Guidance
Recent advancements have seen numerous works elevating 2D diffusion processes to 3D, applying these processes extensively in the realm of 3D editing. Broadly, these works can be categorized into two types. The first type [36, 31, 57, 42, 27, 8], exemplified by Dreamfusion’s [36] introduction of SDS loss, involves feeding the noised rendering of the current 3D model, along with other conditions, into a 2D diffusion model [39]. The scores generated by the diffusion model then guide the direction of model updates. The second type [12, 43, 37, 5] focuses on conducting 2D editing based on given prompts for the multiview rendering of a 3D model. This approach creates a multi-view 2D image dataset, which is then utilized as a training target to provide guidance for the 3D model.
Our work centers on leveraging the exemplary properties of Gaussian Splatting's explicit representation to enhance 3D editing. Consequently, we do not design specific editing guidance mechanisms but instead directly employ the guidance methods mentioned above. Both types of guidance can be applied in our method. For simplicity, we denote the guidance universally as $D$. Given the parameters of a 3D model, $\Theta$, along with the rendered camera pose $c$ and prompt $p$, the editing loss from the 2D diffusion prior can be formulated as follows:

$$\mathcal{L}_{\text{Edit}} = D(\Theta; p, c). \tag{5}$$
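The following hedged sketch shows how such guidance can drive optimization of the Gaussians; `renderer` and `guidance` are placeholders for a differentiable GS rasterizer and either type of 2D diffusion guidance, not specific APIs:

```python
def editing_step(gaussians, optimizer, renderer, guidance, camera, prompt):
    """One optimization step against Eq. 5: render the current Gaussians
    from a sampled camera and backpropagate the 2D diffusion editing loss
    into the Gaussian attributes (PyTorch-style training loop)."""
    image = renderer(gaussians, camera)   # differentiable GS rasterization
    loss = guidance(image, prompt)        # L_Edit = D(Theta; p, c), scalar tensor
    optimizer.zero_grad()
    loss.backward()                       # gradients reach position/color/opacity
    optimizer.step()
    return float(loss.detach())
```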
4 Method
We define the task of 3D editing on Gaussian Splatting (GS) as follows: Given a prompt $p$ and a 3D scene represented by 3D Gaussians, denoted by $\Theta$, where each $\Theta_{i} = \{x_{i}, c_{i}, \alpha_{i}, q_{i}, s_{i}\}$ represents the parameters of the $i$-th Gaussian as detailed in Sec. 3.1, the objective is to achieve an edited 3D Gaussians, referred to as $\Theta^{e}$, that aligns with or adheres to the specifications of the prompt $p$.
We then introduce our novel framework for performing editing tasks on GS. We first introduce Gaussian semantic tracing in Sec. 4.1, along with a new representation method known as Hierarchical Gaussian Splatting (HGS) in Sec. 4.2. The GS semantic tracing enables precise segmentation and tracing within GS, facilitating controllable editing operations. Compared to the standard GS, the HGS representation demonstrates greater robustness against randomness in generative guidance and is more adept at accommodating a diverse range of editing scenarios. Additionally, we have specifically designed 3D inpainting for GS, which encompasses object removal and addition (Sec. 4.3).
4.1 Gaussian Semantic Tracing
Previous works [31, 57] in 3D editing usually utilize static 2D or 3D masks to apply loss only within the masked pixels, thus constraining the editing process to only edit the desired area. However, this method has limitations. As 3D representations dynamically change during training, static segmentation masks would become inaccurate or even ineffective. Furthermore, the use of static masks to control gradients in NeRF editing poses a significant limitation, as it confines the editing strictly within the masked area. This restriction prevents the edited content from naturally extending beyond the mask, thus ’locking’ the content within a specified spatial boundary.
Even with the implementation of semantic NeRF [56], the gradient control is still only effective at the very beginning of the training since the ongoing updates to NeRF lead to a loss of accuracy in the semantic field.
(Figure 3 panels: Original GS; “A Garfield cat on the bench”; Novel view 1; Novel view 2.)
Figure 3: 3D inpainting for object incorporation. GaussianEditor is capable of adding objects at specified locations in a scene, given a 2D inpainting mask and a text prompt from a single view. The whole process takes merely five minutes.
To address the aforementioned issue, we have chosen Gaussian Splatting (GS) as our 3D representation due to its explicit nature. This allows us to directly assign semantic labels to each Gaussian point, thereby facilitating semantic tracing in 3D scenes.
Specifically, we enhance the 3D Gaussians $\Theta$ by adding a new attribute $m$, where $m_{ij}$ represents the semantic Gaussian mask for the $i$-th Gaussian point and the $j$-th semantic label. With this attribute, we can precisely control the editing process by selectively updating only the target 3D Gaussians. During the densification process, newly densified points inherit the semantic label of their parent point. This ensures that we have an accurate 3D semantic mask at every moment throughout the training process.
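A minimal sketch of this label inheritance (our own illustration, with hypothetical array layouts): children created by densification copy their parent's semantic tag, so the 3D semantic mask stays valid as the point set grows:

```python
import numpy as np

def densify_with_labels(positions, labels, parent_idx, offsets):
    """Clone/split the Gaussians at `parent_idx`; each child is placed near
    its parent and inherits the parent's semantic tag, keeping the 3D
    semantic mask accurate after densification."""
    child_positions = positions[parent_idx] + offsets
    child_labels = labels[parent_idx]            # semantic label inheritance
    return (np.concatenate([positions, child_positions]),
            np.concatenate([labels, child_labels]))

positions = np.random.randn(4, 3)                # toy scene with 4 Gaussians
labels = np.array([0, 1, 1, 0])                  # per-Gaussian semantic tags
positions, labels = densify_with_labels(
    positions, labels,
    parent_idx=np.array([1, 2]),                 # densify the two class-1 points
    offsets=np.random.randn(2, 3) * 0.01)
assert labels.tolist() == [0, 1, 1, 0, 1, 1]     # children keep class 1
```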
As illustrated in Fig. 2, Gaussian semantic tracing enables continuous tracking of each Gaussian's categories during training, adjusting to their evolving properties and numbers. This feature is vital, as it permits selective application of gradients, densification and pruning of Gaussians linked to the specified category. Additionally, it facilitates training solely by rendering the target object, significantly speeding up the process in complex scenes. The semantic Gaussian mask $m$ functions as a dynamic 3D segmentation mask, evolving with the training, allowing content to expand freely in space. This contrasts with NeRF, where content is restricted to a fixed spatial area.
Next, we discuss Gaussian Splatting unprojection, the method we propose to obtain the semantic Gaussian mask $m$. For a set of 3D Gaussians $\Theta$, we render them from multiple viewpoints to generate a series of renderings $\mathcal{I}$. These renderings are then processed using 2D segmentation techniques [18] to obtain 2D segmentation masks $\mathcal{M}$, with each $\mathcal{M}^{j}$ representing the $j$-th semantic label.
(Figure 4 panels: Original View; Removed; Inpainted.)
Figure 4: 3D inpainting for object removal. Typically, removing the target object based on a Gaussian semantic mask generates artifacts at the interface between the target object and the scene. To address this, we generate a repaired image using a 2D inpainting method and employ Mean Squared Error (MSE) loss for supervision. The whole process takes merely two minutes.
To obtain the semantic label for each Gaussian, we unproject the posed 2D semantic labels back to the Gaussians with inverse rendering. Concretely, we maintain a weight and a counter for each Gaussian. For pixel $\boldsymbol{p}$ on the semantic maps, we unproject the semantic label back to the Gaussians that affect it by

$$w_{i}^{j} = \sum_{\boldsymbol{p}} o_{i}(\boldsymbol{p}) \cdot T_{i}^{j}(\boldsymbol{p}) \cdot \mathcal{M}^{j}(\boldsymbol{p}), \tag{6}$$
where $w_{i}^{j}$ represents the weight of the $i$-th Gaussian for the $j$-th semantic label, while $o_{i}(\boldsymbol{p})$, $T_{i}^{j}(\boldsymbol{p})$, and $\mathcal{M}^{j}(\boldsymbol{p})$ denote the opacity, transmittance from pixel $\boldsymbol{p}$, and semantic mask of pixel $\boldsymbol{p}$ for the $i$-th Gaussian, respectively. After updating all the Gaussian weights and counters, we determine whether a Gaussian belongs to the $j$-th semantic class based on whether its average weight exceeds a manually set threshold.
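A sketch of this unprojection (illustrative; it assumes the rasterizer can report, per pixel, each contributing Gaussian's opacity and transmittance, and the threshold value is a hand-tuned assumption, as in the text):

```python
import numpy as np

def unproject_semantic_mask(num_gaussians, contributions, mask, threshold=0.7):
    """Accumulate w_i^j = sum_p o_i(p) * T_i^j(p) * M^j(p) over pixels (Eq. 6)
    and label Gaussian i for class j if its average weight passes a threshold.

    contributions: iterable of ((row, col), gaussian_id, opacity, transmittance)
                   tuples reported by the rasterizer for every pixel it touches
    mask:          2D boolean array, the 2D segmentation mask M^j
    """
    weights = np.zeros(num_gaussians)   # per-Gaussian weight w_i^j
    counts = np.zeros(num_gaussians)    # per-Gaussian pixel counter
    for (row, col), gid, opacity, transmittance in contributions:
        counts[gid] += 1
        if mask[row, col]:
            weights[gid] += opacity * transmittance
    avg = weights / np.maximum(counts, 1)
    return avg > threshold              # boolean semantic Gaussian mask
```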
The entire labeling process is remarkably fast, typically taking less than a second. Once this semantic label assignment is completed, the entire Gaussian scene is parsed, making a variety of operations possible. These include manually changing colors, moving the Gaussians of a specific category, and deleting certain categories. Notably, 2D diffusion guidance often struggles to effectively edit small objects in complex scenes. Thanks to Gaussian semantic tracing, we can now render these small objects independently and input them into the 2D diffusion model, thereby achieving more precise supervision.
4.2 Hierarchical Gaussian Splatting
The effectiveness of vanilla GS [17] in reconstruction tasks lies in the high-quality initialization provided by point clouds derived from Structure-from-Motion (SfM) [41], coupled with stable supervision from ground truth datasets.
However, the scenario changes in the field of generation. In previous work involving GS in text-to-3D and image-to-3D [44, 7, 52], GS has shown limitations when facing the randomness of generative guidance due to its nature as a point cloud-like representation. This instability in GS is mainly due to their direct exposure to the randomness of loss functions, unlike neural network-based implicit representations. GS models, which update a large number of Gaussian points each training step, lack the memorization and moderating ability of neural networks. This leads to erratic updates and prevents GS from achieving the detailed results seen in neural network-based implicit representations, as GS’s excessive fluidity hampers its convergence in generative training.
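To make the generational constraint concrete, the following is one plausible realization (our own hedged sketch, not necessarily the exact loss used by the paper): each Gaussian records the densification round in which it was created, and a penalty anchored to its state at creation time is weighted more heavily for older generations:

```python
import torch

def hgs_anchor_loss(params, anchors, generation, num_rounds, base_weight=1.0):
    """Generation-weighted anchor penalty: older Gaussians (created in earlier
    densification rounds) are pulled harder toward their recorded state, while
    the youngest remain nearly unconstrained.

    params, anchors: (N, D) tensors of current / anchored Gaussian attributes
    generation:      (N,) long tensor, densification round of creation
    """
    # Normalized age in [0, 1]: oldest generation -> 1, newest -> ~0.
    age = (num_rounds - generation).float() / max(num_rounds, 1)
    drift = ((params - anchors) ** 2).sum(dim=-1)   # squared drift per Gaussian
    return (base_weight * age * drift).mean()

# Usage: add to the editing objective, e.g. L = L_Edit + lambda * anchor term.
params = torch.randn(100, 3, requires_grad=True)    # e.g. Gaussian positions
anchors = params.detach().clone()                   # states at creation time
generation = torch.randint(0, 5, (100,))
loss = hgs_anchor_loss(params, anchors, generation, num_rounds=5)
```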