Title
SAM-Med2D
Introduction
Medical image segmentation plays a crucial role in analyzing medical images by identifying and delineating various tissues, organs, or regions of interest. Accurate segmentation helps physicians precisely identify and localize pathological regions, enabling more accurate diagnosis and treatment. In addition, quantitative and qualitative analysis of medical images provides comprehensive insight into the morphology, structure, and function of different tissues or organs, facilitating disease research and discovery. However, owing to the characteristics of medical images, such as their numerous imaging modalities, complex tissue and organ structures, and scarce annotated data, most existing methods are limited to specific modalities, organs, or lesions. This limitation hinders the generalization ability and adaptability of these algorithms, making them difficult to apply across diverse clinical scenarios.
Recently, the trend toward large-scale models has attracted widespread attention across the field of artificial intelligence. The emergence of general-purpose AI models such as ChatGPT [2], ERNIE Bot [3], DINO, SegGPT, and SAM [8] has made it possible to solve multiple tasks with a single model. As the latest large-scale vision model, SAM allows users to generate masks for specific regions of interest through interactive clicks, bounding boxes, or natural-language prompts. Its zero-shot and few-shot capabilities on natural images have drawn significant attention across many fields.
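To make this interactive prompting concrete, below is a minimal sketch using the publicly released segment-anything package (https://github.com/facebookresearch/segment-anything); the checkpoint path, the image, and all coordinates are placeholders, not values from the paper.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# `image` must be an HxWx3 uint8 RGB array, e.g. a rendered CT slice.
image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder image
predictor.set_image(image)

# One foreground click (label 1) combined with a bounding-box prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[256, 256]]),   # (x, y) pixel coordinates
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    box=np.array([128, 128, 384, 384]),    # XYXY bounding box
    multimask_output=False,
)
```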
Abstract
The Segment Anything Model (SAM) represents a state-of-the-art research advancement in natural image segmentation, achieving impressive results with input prompts such as points and bounding boxes. However, our evaluation and recent research indicate that directly applying the pretrained SAM to medical image segmentation does not yield satisfactory performance. This limitation primarily arises from the significant domain gap between natural images and medical images. To bridge this gap, we introduce SAM-Med2D, the most comprehensive study to date on applying SAM to medical 2D images. Its comprehensiveness manifests in three aspects: a comprehensive analysis in collecting the largest medical dataset, the most comprehensive study of various fine-tuning options, and the most comprehensive evaluation of performance. Specifically, we first collect and curate approximately 4.6M images and 19.7M masks from public and private datasets, constructing a large-scale medical image segmentation dataset encompassing various modalities and objects. Then, we comprehensively fine-tune SAM on this dataset and turn it into SAM-Med2D. Unlike previous methods that adopt only bounding-box or point prompts as the interactive segmentation approach, we adapt SAM to medical image segmentation through more comprehensive prompts involving bounding boxes, points, and masks. We additionally fine-tune the encoder and decoder of the original SAM to obtain a well-performing SAM-Med2D, leading to the most comprehensive fine-tuning strategy to date. Finally, we conduct a comprehensive evaluation and analysis to investigate the performance of SAM-Med2D in medical image segmentation across various modalities, anatomical structures, and organs. Concurrently, we validate the generalization capability of SAM-Med2D on 9 datasets from the MICCAI 2023 challenges. Overall, our approach demonstrates significantly superior performance and generalization capability compared to SAM. Our code can be found at https://github.com/uni-medical/SAM-Med2D.
METHOD
3.1 Incorporation of Medical Knowledge into SAM
Recent research has reaffirmed the pivotal role of training-data volume in the learning capacity of large models [7, 8, 23]. By learning from larger-scale data, models can acquire richer domain-specific knowledge and adapt better to various application scenarios. Though trained on over 1B masks, SAM achieves suboptimal performance in the realm of medical image analysis due to the significant domain gap between natural images and medical data. To address this gap, we have collected and curated the largest medical image segmentation dataset to date. This dataset is composed of numerous public and private datasets, ensuring comprehensive coverage and diversity. Figure 3 (b) illustrates the dataset's 10 different imaging modalities and their corresponding data proportions. To enhance the visual presentation, we use logarithmic scaling to visualize the differences in quantity. Based on anatomical structures and the presence of lesions, we categorize the dataset into head and neck, thorax, abdomen, pelvis, and lesions (Figure 3 (c)). Additionally, we curated and consolidated 31 main organs from the 271 labels in these datasets, as depicted in Figure 3 (a). This covers almost all object types in currently available public datasets, addressing SAM's deficiency in medical domain knowledge.
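As an aside on the log-scaled presentation mentioned above, a minimal sketch of that kind of plot is shown below; the modality names and counts here are invented placeholders, not the actual dataset statistics.

```python
import matplotlib.pyplot as plt

# Placeholder modality counts (illustrative only, not the real statistics).
modalities = {"CT": 2_000_000, "MR": 1_500_000, "Ultrasound": 300_000,
              "Endoscopy": 50_000, "Dermoscopy": 8_000}

fig, ax = plt.subplots()
ax.bar(list(modalities.keys()), list(modalities.values()))
ax.set_yscale("log")  # log scaling compresses large differences in counts
ax.set_ylabel("Number of images (log scale)")
ax.set_title("Modality distribution (placeholder data)")
plt.show()
```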
CONCLUSION
In this study, we obtain SAM-Med2D by fine-tuning SAM on a large-scale medical image dataset, which significantly improves performance on various medical image segmentation tasks. We employed two explicit prompt strategies to generate masks for quantitative and qualitative comparisons. At equal resolution, fine-tuning only the mask decoder (FT-SAM) achieved an improvement of 11.93% in the Bbox prompt mode, while the fully fine-tuned SAM-Med2D achieved a 17.67% improvement. Surprisingly, our approach demonstrated overwhelming superiority under the 1 pt prompt (18.94% vs. 70.01%).
Furthermore, SAM-Med2D exhibited excellent generalization capabilities in both prompt modes, indicating its practical value in the medical field. We conducted a comprehensive evaluation of the model along different dimensions of the data. From an anatomical perspective, at a resolution of 1024×1024, SAM had advantages over FT-SAM in the chest, abdomen, and other regions, while SAM-Med2D outperformed all other methods in overall segmentation performance. Regarding different modalities, SAM demonstrated good generalization when the target modality resembled natural-image attributes. We compared the two fine-tuning methods on more than 30 major organs, and our SAM-Med2D achieved better results on 24 of them, with a maximum improvement of 6.95% over FT-SAM. Additionally, our generalization experiments on 9 publicly available datasets demonstrated the strong domain transferability of models pretrained on large-scale datasets. While the Bbox prompt always outperformed the 1 pt prompt, adding more points significantly improved the segmentation results, eventually surpassing even the Bbox mode.
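For reference, the Dice score quoted throughout these results is the standard overlap metric, Dice(P, G) = 2|P ∩ G| / (|P| + |G|); a minimal implementation (our own sketch, not the authors' evaluation code) follows.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks: 2|P∩G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```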
Figures
Figure 1: Comparison between examples in SA-1B (a) and in our dataset (b). SA-1B consists of 11M natural images and their corresponding 1129M masks. Our dataset consists of 4.6M medical images and their corresponding 19.7M masks.
Figure 2: Results of interactive segmentation using SAM in various medical scenarios.
Figure 3: Overview of the dataset used in this study. (a) A total of 31 major organs, along with their corresponding anatomical structures; an asterisk (*) denotes the presence of lesion labels within the dataset. (b) The distribution of modalities and their corresponding proportions in the dataset (scaled logarithmically). (c) The number of images and masks categorized by anatomical structure, along with the totals for the entire dataset.
Figure 4: The pipeline of SAM-Med2D. We freeze the image encoder and incorporate learnable adapter layers in each Transformer block to acquire domain-specific knowledge in the medical field. We fine-tune the prompt encoder using point, Bbox, and mask information, while updating the parameters of the mask decoder through interactive training.
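As a rough sketch of the adapter idea in this pipeline (a generic bottleneck adapter with a residual connection; the authors' exact design may differ), each otherwise-frozen Transformer block gains a small MLP whose parameters are the only encoder parameters that receive gradients.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(dim=768)
tokens = torch.randn(1, 196, 768)  # dummy ViT patch tokens
out = adapter(tokens)              # same shape, residually updated

# Freezing the backbone so only adapters (and the decoder) train, e.g.:
# for p in sam.image_encoder.parameters():
#     p.requires_grad = False
```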
Figure 5: (a) Comparison from the perspective of anatomical structures. (b) Comparison from the perspective of different modalities. (c) Comparison of segmentation performance between FT-SAM and our SAM-Med2D across 31 organs.
Figure 6: Qualitative comparison of the segmentation results of SAM-Med2D and SAM. The first three rows depict the segmentation results for different modalities, while the last three rows illustrate the segmentation results for different anatomical structures.
Figure 7: The fusion of segmentation results for multiple target regions within a single image. For clarity of presentation, we visualize only the results of the Bbox prompt and the 1 pt prompt.
Tables
Table 1: Comparison of SAM fine-tuning models. Our SAM-Med2D is a comprehensive fine-tuning method that supports multiple prompts on medical images to generate masks.
Table 2: Quantitative comparison of different methods on the test set.
Table 3: Segmentation performance in point prompt mode. The left values represent the Dice scores of different models under the 1 pt prompt. The numbers in parentheses indicate the Dice score increment after the 5 pts prompt, with red indicating improvement and green indicating decline.
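One plausible way to simulate the 1 pt → 5 pts protocol, following common interactive-segmentation practice (the paper's exact sampling rule may differ), is to draw each additional click from the error region of the current prediction.

```python
import numpy as np

def next_click(pred: np.ndarray, gt: np.ndarray):
    """Sample one corrective click from the misclassified region.

    Returns ((y, x), label): label 1 for a missed foreground pixel
    (false negative), label 0 for a spurious one (false positive).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    fn = np.logical_and(gt, ~pred)   # foreground the model missed
    fp = np.logical_and(pred, ~gt)   # background predicted as foreground
    region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
    if region.sum() == 0:
        return None  # prediction already matches the ground truth
    ys, xs = np.nonzero(region)
    i = np.random.randint(len(ys))
    return (int(ys[i]), int(xs[i])), label
```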
Table 4: Generalization validation on 9 MICCAI2023 datasets, where "*" denotes SAM-Med2D without adapter layer parameters.