Multi-task Video Enhancement for Dental Interventions

MICCAI 2022

Abstract

A microcamera firmly attached to a dental handpiece allows dentists to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates low light, noise, blur, and camera shake that degrade visual comfort. To this end, we introduce a novel deep network for multi-task video enhancement that enables macro-visualization of dental scenes. In particular, the network jointly leverages video restoration and temporal alignment in a multi-scale manner for effective video enhancement. Our experiments on videos of natural teeth in phantom scenes demonstrate that the proposed network achieves state-of-the-art results in multiple tasks with near real-time processing. We release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dental video dataset with multi-task labels, to facilitate further research in relevant video processing applications.

Related Work

UberNet [9] and cross-stitch networks [16] are encoder-focused architectures that propagate task outputs across scales in the encoder.

PAD-Net [27] and PAP-Net [29] are decoder-focused networks that use multi-modal distillation to fuse the outputs of task heads for the final dense predictions, but only at a single scale.

MTI-Net [24], which is most similar to our architecture, extends the decoder fusion by propagating task-specific features bottom-up across multiple scales through the encoder.

Instead of propagating task features through scale-specific distillation modules across scales to the encoder, our network simultaneously propagates task outputs to the encoder and to the task heads in the decoder. Furthermore, these networks make dense task predictions on static images, whereas we extend our network to videos.

Contribution

(i) a novel application of a microcamera in computer-aided dental intervention for continuous tooth macro-visualization during drilling (surprisingly, this one is a hardware contribution; oh well),

(ii) a new, asymmetrically annotated dataset of natural teeth in phantom scenes, with pairs of compromised- and good-quality frames captured using a beam splitter,

(iii) a novel deep network for video processing that propagates task outputs to the encoder and decoder across multiple scales to model task interactions, and

(iv) demonstration that an instantiated model effectively addresses multi-task video enhancement in our application by matching and surpassing state-of-the-art results of single-task networks in near real-time.

Method

Exploit interactions between different tasks to improve video enhancement.

Video enhancement tasks are interrelated. For example:
--Aligning video frames helps deblurring.
--Denoising and deblurring reveal image features that aid motion estimation.
This interdependence can be exploited by designing a multi-task model.

MOST-Net is a multi-output, multi-scale, multi-task network architecture. Its goal is to model task interactions between the encoder and decoder across multiple scales. The network produces outputs for multiple tasks (their number denoted by T) at multiple scales (denoted by s), e.g., the output of task i at scale s is written O_i^s.

Propagation:

  • Within a scale: task outputs are propagated inside the current scale.
  • Across scales: task outputs are upsampled from lower scales and propagated to the decoder layers and task branches of the higher scales.

Constraints:

Task outputs at neighboring scales are constrained to be consistent, i.e. O_i^s = u_i(O_i^(s+1)), where u_i denotes some operator, for instance, the upsampling operator for segmentation or the scaling operator for homography estimation.
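To make these cross-scale operators concrete, here is a minimal sketch (my own, not from the paper; function names are made up) of how a lower-scale segmentation can be upsampled and how a homography can be rescaled so that outputs at neighboring scales stay consistent:

```python
import torch
import torch.nn.functional as F

def upsample_segmentation(mask_lo: torch.Tensor) -> torch.Tensor:
    """u_i for segmentation: 2x bilinear upsampling of a (B, 1, H, W) mask."""
    return F.interpolate(mask_lo, scale_factor=2, mode="bilinear", align_corners=False)

def scale_homography(H_lo: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """u_i for homography: rescale a (B, 3, 3) homography from a lower to a higher scale
    by conjugating with the similarity transform S = diag(factor, factor, 1)."""
    S = torch.diag(torch.tensor([factor, factor, 1.0], dtype=H_lo.dtype, device=H_lo.device))
    S_inv = torch.diag(torch.tensor([1.0 / factor, 1.0 / factor, 1.0], dtype=H_lo.dtype, device=H_lo.device))
    return S @ H_lo @ S_inv

# Example: map outputs from scale s+1 up to scale s.
mask_s = upsample_segmentation(torch.rand(1, 1, 64, 64))   # -> (1, 1, 128, 128)
H_s = scale_homography(torch.eye(3).unsqueeze(0))          # -> (1, 3, 3)
```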

Problem Statement

The model must jointly solve video restoration, teeth segmentation, and motion estimation, and it is trained under an assumed degradation model for the input frames.

T = 3 with O1: video restoration, O2: segmentation, O3: homography estimation.

The video stream generates observations B_{t−P}, …, B_t, where t is the time index and P > 0 is a scalar value referring to the number of past frames.

The problem is to (1) estimate a clean frame, (2) a binary teeth segmentation mask, and (3) approximate the inter-frame motion by a homography matrix; at scale s = 1 the joint output of the three tasks is denoted by the triplet (R_t, M_t, H_t).



Let x correspond to a pixel location. To simulate the degradation of the input video (blur and noise), the degraded image at s = 1 is generated from the clean frame using per-pixel blur kernels k_{x,t} of size K and an additive noise term, i.e. B_t(x) = (k_{x,t} * R_t)(x) + n_{x,t}.
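A minimal sketch of this degradation model (my own simplification: a single spatially uniform blur kernel instead of the per-pixel kernels, plus Gaussian noise standing in for the noise maps):

```python
import torch
import torch.nn.functional as F

def degrade(clean: torch.Tensor, kernel: torch.Tensor, noise_std: float = 0.02) -> torch.Tensor:
    """Simulate B_t from R_t: blur with a KxK kernel, then add noise.

    clean:  (B, 3, H, W) clean frame R_t in [0, 1]
    kernel: (K, K) blur kernel (assumed spatially uniform here, unlike the per-pixel k_{x,t})
    """
    c = clean.shape[1]
    k = kernel[None, None].repeat(c, 1, 1, 1)                   # one kernel per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(clean, (pad,) * 4, mode="replicate"), k, groups=c)
    noise = noise_std * torch.randn_like(blurred)               # stand-in for the noise maps n_{x,t}
    return (blurred + noise).clamp(0, 1)

# Example with a 5x5 box blur.
B_t = degrade(torch.rand(1, 3, 128, 128), torch.full((5, 5), 1.0 / 25))
```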

We assume multiple independently moving objects are present in the considered scenes, while our task is to estimate only the motion related to the object of interest (i.e. the teeth), which lies in the region indicated by the non-zero values of mask M; this motion constraint is imposed for all t and all pixel locations x (∀t ∀x) inside the mask.

Training

This part defines the loss function and optimization objective of the multi-task, multi-scale deep model.

Dataset

Loss Function

The total loss sums over N (the number of samples), T (the number of tasks), and S (the number of scales), i.e. N * T * S terms in total:

L(Θ) = Σ_{n=1..N} Σ_{i=1..T} Σ_{s=1..S} λ_i · L_i(O_i^s(n), Ô_i^s(n))

where λ_i are task-balancing weights and Ô_i^s(n) denotes the ground truth for task i at scale s.

Types of loss functions

The model learns the parameters Θ by minimizing the total loss, so that the predictions of all tasks at all scales are optimized jointly. The optimization has to account for the interplay between tasks and the synergy across scales (the core idea of multi-task, multi-scale learning).
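A minimal PyTorch sketch of such a combined objective (my own illustration; the per-task criteria used here, L1 for restoration, binary cross-entropy for segmentation, and L1 on the homography parameters, are assumptions rather than the paper's exact choices):

```python
import torch
import torch.nn.functional as F

# Assumed per-task criteria (illustrative only).
def restoration_loss(pred, gt):   return F.l1_loss(pred, gt)
def segmentation_loss(pred, gt):  return F.binary_cross_entropy_with_logits(pred, gt)
def homography_loss(pred, gt):    return F.l1_loss(pred, gt)

TASK_LOSSES = [restoration_loss, segmentation_loss, homography_loss]
# λ1, λ2, λ3 from the Setup section; the mapping to tasks O1..O3 in this order is assumed.
LAMBDAS = [2e-4, 5e-5, 1.0]

def multi_task_multi_scale_loss(outputs, targets):
    """outputs/targets: lists over scales s = 1..S, each a tuple (R, M, H) of tensors."""
    total = 0.0
    for preds_s, gts_s in zip(outputs, targets):              # sum over scales S
        for i, (pred, gt) in enumerate(zip(preds_s, gts_s)):  # sum over tasks T
            total = total + LAMBDAS[i] * TASK_LOSSES[i](pred, gt)
    return total
```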

I still feel I have not fully understood this multi-task learning part; I will look at some other papers.

Structure

MOST-Net enables refinement of lower scale segmentations by upsampling and inputting them at the task-specific branches of higher scales.

Encoders

MOST-Net extracts features from two input frames Bt−1 and Bt independently at three scales. In other words, the model processes the input at multiple scales simultaneously.

U-shaped downsampling: features are extracted via 3 × 3 convolutions with strides of 1, 2, 2 for s = 1, 2, 3, followed by ReLU activations and 5 residual blocks [4] at each scale. The residual connections are augmented with an additional branch of convolutions in the Fast Fourier domain. The output channel dimension at scale s is 2^(s+4).
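A rough sketch of one encoder scale under this description (my own reading; the exact residual-block design and Fourier branch in the paper may differ):

```python
import torch
import torch.nn as nn

class FFTResBlock(nn.Module):
    """Residual block with an extra branch of convolutions in the Fourier domain (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(True),
                                     nn.Conv2d(ch, ch, 3, padding=1))
        # The Fourier branch operates on concatenated real/imaginary parts (1x1 convs assumed).
        self.freq = nn.Sequential(nn.Conv2d(2 * ch, 2 * ch, 1), nn.ReLU(True),
                                  nn.Conv2d(2 * ch, 2 * ch, 1))

    def forward(self, x):
        spat = self.spatial(x)
        f = torch.fft.rfft2(x, norm="ortho")
        f = self.freq(torch.cat([f.real, f.imag], dim=1))
        re, im = f.chunk(2, dim=1)
        freq = torch.fft.irfft2(torch.complex(re, im), s=x.shape[-2:], norm="ortho")
        return x + spat + freq

class EncoderScale(nn.Module):
    """3x3 conv (stride 1, 2, 2 for s = 1, 2, 3) + ReLU + 5 residual blocks; 2^(s+4) channels."""
    def __init__(self, s, in_ch):
        super().__init__()
        out_ch = 2 ** (s + 4)
        stride = 1 if s == 1 else 2
        self.stem = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(True))
        self.blocks = nn.Sequential(*[FFTResBlock(out_ch) for _ in range(5)])

    def forward(self, x):
        return self.blocks(self.stem(x))

# Example: first scale takes the RGB frame and outputs 32 channels at full resolution.
feat_s1 = EncoderScale(s=1, in_ch=3)(torch.rand(1, 3, 128, 128))
```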

At each scale, the features extracted from Bt−1 and Bt are concatenated, and a channel attention mechanism [30] follows to fuse them into a single feature map.

MOST-Net uses the homography outputs from lower scales to warp the encoder features from the previous time step so that they are aligned with the current frame.
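A sketch of such homography-based feature warping (my own implementation via a normalized grid and F.grid_sample; not the paper's code):

```python
import torch
import torch.nn.functional as F

def warp_with_homography(feat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Warp features (B, C, h, w) with homographies H (B, 3, 3) given in pixel coordinates."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype, device=feat.device),
                            torch.arange(w, dtype=feat.dtype, device=feat.device), indexing="ij")
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(1, -1, 3)  # (1, h*w, 3)
    warped = (H @ pts.transpose(1, 2)).transpose(1, 2)                          # (B, h*w, 3)
    warped = warped[..., :2] / warped[..., 2:].clamp(min=1e-8)
    # Convert pixel coordinates to the [-1, 1] range expected by grid_sample.
    grid_x = 2.0 * warped[..., 0] / (w - 1) - 1.0
    grid_y = 2.0 * warped[..., 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Example: the identity homography leaves the features unchanged.
out = warp_with_homography(torch.rand(2, 32, 64, 64), torch.eye(3).expand(2, 3, 3))
```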

Decoders

Encoder features are passed to the expanding blocks scale-wise via skip connections.

At the lowest scale (s = 3), the encoder features are passed directly to a stack of two residual blocks with 128 output channels. Transposed convolutions with stride 2 are then used twice to recover the resolution.

At higher scales (s < 3), the encoder features are first concatenated with the upsampled decoder features and convolved with 3 × 3 kernels to halve the number of channels (why halve them?). Subsequently, they are propagated through two residual blocks with 64 and 32 output channels each. The residual block outputs constitute scale-specific shared backbones. Lightweight task-specific branches follow to estimate the dense outputs. Specifically, one 3 × 3 convolution estimates the segmentation mask and two 3 × 3 convolutions, separated by ReLU, yield the restored frame at each scale.
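A sketch of the scale-specific shared backbone with its lightweight task branches, as I read the text (channel sizes are taken from the description; the residual blocks are simplified to plain conv stages and the head assignment is my interpretation):

```python
import torch
import torch.nn as nn

class ScaleDecoderHead(nn.Module):
    """Shared backbone (two conv stages with 64 and 32 channels, standing in for the residual
    blocks) followed by lightweight task branches for the mask and the restored frame."""
    def __init__(self, in_ch):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(True),
        )
        self.mask_head = nn.Conv2d(32, 1, 3, padding=1)      # one 3x3 conv -> mask logits
        self.restore_head = nn.Sequential(                   # two 3x3 convs with ReLU -> restored frame
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, fused_features):
        shared = self.backbone(fused_features)
        return self.restore_head(shared), self.mask_head(shared)

# Example usage at one scale (in_ch is a placeholder value).
R_s, M_s = ScaleDecoderHead(in_ch=96)(torch.rand(1, 96, 128, 128))
```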

At each scale, homography estimation modules estimate 4 offsets, related one-to-one to homographies via the Direct Linear Transformation (DLT) as in [5,12]. The motion-gated attention modules multiply the features with the segmentations to filter out context irrelevant to the motion of the teeth. The channel dimensionality is then halved by a 3 × 3 convolution, while a second one extracts features from the restored output. The concatenation of the two streams forms the features used for homography estimation.

Homography Estimation Module: At each scale, these fused features are employed to predict the offsets with shallow downstream networks. Offsets predicted at lower scales are transformed back to homographies and cascaded bottom-up [12] to refine those at higher scales.

Similarly to [5], we use blocks of 3 × 3 convolutions coupled with ReLU, batch normalization and max-pooling to reduce the spatial size of the features. Before the regression layer, a 0.2 dropout is applied. For s = 1, the convolution output channels are 64, 128, 256, 256 and 256. For s = 2, 3 the network depth is cropped from the second and third layers onwards, respectively.
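For intuition, a sketch of turning the 4 predicted corner offsets into a homography via the DLT (my own minimal implementation using a batched linear solve; [5,12] describe the actual formulation used in the paper):

```python
import torch

def offsets_to_homography(offsets: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """offsets: (B, 4, 2) displacements of the four image corners; returns (B, 3, 3) homographies."""
    b = offsets.shape[0]
    src = torch.tensor([[0.0, 0.0], [w - 1.0, 0.0], [w - 1.0, h - 1.0], [0.0, h - 1.0]],
                       dtype=offsets.dtype, device=offsets.device).expand(b, 4, 2)
    dst = src + offsets
    # Build the 8x8 DLT system A p = rhs for p = (h11..h32), with h33 fixed to 1.
    rows, rhs = [], []
    for i in range(4):
        x, y = src[:, i, 0], src[:, i, 1]
        u, v = dst[:, i, 0], dst[:, i, 1]
        zero, one = torch.zeros_like(x), torch.ones_like(x)
        rows.append(torch.stack([x, y, one, zero, zero, zero, -u * x, -u * y], dim=-1))
        rows.append(torch.stack([zero, zero, zero, x, y, one, -v * x, -v * y], dim=-1))
        rhs.extend([u, v])
    A = torch.stack(rows, dim=1)                     # (B, 8, 8)
    bvec = torch.stack(rhs, dim=1).unsqueeze(-1)     # (B, 8, 1)
    p = torch.linalg.solve(A, bvec).squeeze(-1)      # (B, 8)
    ones = torch.ones(b, 1, dtype=p.dtype, device=p.device)
    return torch.cat([p, ones], dim=1).reshape(b, 3, 3)

# Zero offsets give (numerically) the identity homography.
H = offsets_to_homography(torch.zeros(1, 4, 2), h=256, w=320)
```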

Task-Specific Branches

I added this part myself (with help from GPT), since I had not worked with multi-task learning before; it is here to aid understanding.

Each task (restoration/colorization, segmentation, motion estimation) is handled by a separate branch of the network. These branches can be seen in the figure as the paths where F1, F2, F3 (the features at different scales) are passed through different processing stages (e.g., motion-gated attention, channel attention, homography estimation) to produce task-specific outputs, such as the restored frame Rt, the mask Mt, and the homography Ht.

The network is optimized for multiple tasks by sharing features across the task-specific branches, while each branch focuses on a particular task's output (restoration, segmentation, motion estimation). The losses corresponding to each task are computed separately and combined in the final objective, which allows the model to learn multiple tasks simultaneously while sharing common feature representations.

Experiment

Dataset

Vident-lab: a dataset for multi-task video processing of phantom dental scenes - Open Research Data - Bridge of Knowledge

  • Frame-to-Frame (F2F) Training:

    • An image denoiser is trained frame-to-frame on static video fragments recorded with camera C1. Applying the trained denoiser to the noisy frames yields denoised frames and their noise maps.
  • Denoising Process:

    • The noisy frames are first denoised using the trained model. The denoised frames are then temporally interpolated and averaged over a temporal window to generate a blurry effect, which simulates realistic motion blur.
  • Adding Noise:

    • After the blur synthesis, the noise maps are added back to the blurry frames to form the input video frames (B). (Concretely, the denoised frames are temporally interpolated [19] 8 times and averaged over a temporal window of 17 frames to synthesize realistic blur.) The noise maps represent the original noise that would have been present in the actual noisy frames. A sketch of this synthesis is shown after this list.
  • Colorization: registration of frames between the two different modalities C1 and C2:

    • To generate the output video frames (R), frames from camera C1 are colorized using a process where frames from a second camera (C2) are mapped to create the ground-truth frames.
    • Specifically, the frames from C1 are colorized based on data from C2 to form the colorized video frames. This helps overcome the difficulty of aligning frames between the two cameras and create accurate pixel-to-pixel correspondences.
  • Color Mapping Network:

    • A color mapping (CM) network is learned to predict the parameters of 3D functions that map the dental scene colors of camera C2 to camera C1. This network achieves precise color mapping and ensures accurate spatial correspondence between frames B and R.
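As referenced above, a hedged sketch of how the blurry, noisy inputs B might be synthesized from denoised frames (my own approximation; temporal_interpolate is a hypothetical stand-in for the frame interpolation method [19]):

```python
import torch

def temporal_interpolate(frames: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Stand-in for learned frame interpolation: simple linear blending between neighbors.
    frames: (N, 3, H, W) -> roughly (N * factor, 3, H, W)."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            w = k / factor
            out.append((1 - w) * a + w * b)
    out.append(frames[-1])
    return torch.stack(out)

def synthesize_blurry_input(denoised: torch.Tensor, noise_maps: torch.Tensor, window: int = 17) -> torch.Tensor:
    """Average interpolated frames over a temporal window, then add back the noise maps."""
    dense = temporal_interpolate(denoised, factor=8)
    blurry = dense[:window].mean(dim=0, keepdim=True)    # one blurry frame from a 17-frame window
    return (blurry + noise_maps[:1]).clamp(0, 1)

B = synthesize_blurry_input(torch.rand(4, 3, 64, 64), torch.randn(4, 3, 64, 64) * 0.02)
```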

Segmentation masks and homographies

HRNet48 [22], pretrained on ImageNet, is fine-tuned on our annotations to automatically segment the teeth in the remaining frames of all three sets. We compute optical flows between consecutive clean frames with RAFT [23]. Motion fields are cropped with the teeth masks Mt to discard other moving objects, such as the dental bur or the suction tube, as we are interested in stabilizing the videos with respect to the teeth. Subsequently, a partial affine homography H is fitted by RANSAC to the segmented motion field.
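A sketch of this annotation step as I understand it (my own pipeline using OpenCV's estimateAffinePartial2D; the RAFT flow and the HRNet48 masks are assumed to be computed already):

```python
import numpy as np
import cv2

def fit_partial_affine_from_flow(flow: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit a partial affine transform to the optical flow inside the teeth mask by RANSAC.

    flow: (H, W, 2) motion field between consecutive clean frames (e.g. from RAFT)
    mask: (H, W) binary teeth segmentation (e.g. from the fine-tuned HRNet48)
    Returns a 3x3 matrix with the 2x3 affine part in the top rows.
    """
    ys, xs = np.nonzero(mask)                                   # keep only teeth pixels
    src = np.stack([xs, ys], axis=-1).astype(np.float32)
    dst = src + flow[ys, xs].astype(np.float32)                 # where each teeth pixel moved to
    A, _inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                              ransacReprojThreshold=1.0)
    H = np.eye(3, dtype=np.float32)
    H[:2, :] = A
    return H

# Example with a synthetic constant translation of +2 px in x inside a dummy full mask.
flow = np.zeros((64, 64, 2), np.float32); flow[..., 0] = 2.0
H = fit_partial_affine_from_flow(flow, np.ones((64, 64), np.uint8))
```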

Setup

We train, validate, and test all methods on our dataset (Tab. 1). In all MOST-Net training runs, we set λ1, λ2, λ3 to 2 × 10−4, 5 × 10−5 and 1 for balancing tasks in Eq. 4.

Training data are augmented by horizontal and vertical flips with 0.5 probability, random channel perturbations, and color jittering, following [31].

Batch size 16, Adam optimizer, learning rate 1e−4, decayed to 1e−6 with cosine annealing.
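Such a training configuration could be set up as follows (a minimal sketch; the model and the number of epochs are placeholders not stated in these notes):

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)    # placeholder for the MOST-Net model
num_epochs = 100                               # placeholder; not stated in these notes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

for epoch in range(num_epochs):
    x = torch.rand(16, 3, 64, 64)              # stand-in batch of size 16
    loss = model(x).mean()                     # stand-in for the multi-task loss of Eq. 4
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```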

Implemented in PyTorch 1.10 (FP32). The inference speed is reported in frames per second (FPS) on an NVIDIA RTX 5000 GPU.

Results
