FlashSpeech、ID-Animator、TalkingGaussian、FlowMap、CutDiffusion

本文首发于公众号:机器感知

FlashSpeech、ID-Animator、TalkingGaussian、FlowMap、CutDiffusion

图片

Gradient Guidance for Diffusion Models: An Optimization Perspective

图片

Diffusion models have demonstrated empirical successes in various applications and can be adapted to task-specific needs via guidance. This paper introduces a form of gradient guidance for adapting or fine-tuning diffusion models towards user-specified optimization objectives. We study the theoretic aspects of a guided score-based sampling process, linking the gradient-guided diffusion model to first-order optimization. We show that adding gradient guidance to the sampling process of a pre-trained diffusion model is essentially equivalent to solving a regularized optimization problem, where the regularization term acts as a prior determined by the pre-training data. Diffusion models are able to learn data's latent subspace, however, explicitly adding the gradient of an external objective function to the sample process would jeopardize the structure in generated samples. To remedy this issue, we consider a modified form of gradient guidance based on a forward prediction loss, ......

FlashSpeech: Efficient Zero-Shot Speech Synthesis

图片

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio qualit......

SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose  Estimation

图片

Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an M......

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

图片

Generating high fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case finetuning or usually missing the identity details in video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates decoupled human attribute and action captioning technique from a constructed facial image pool. Based on this p......

From Parts to Whole: A Unified Reference Framework for Controllable  Human Image Generation

图片

Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask informa......

TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via  Gaussian Splatting

图片

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging the point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring to learn the difficult appearance change like previous methods. Due to this simplification, precise facial motions can be synthesized while keeping a highly intact facial feature. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this c......

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization

图片

We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose ......

FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient  Descent

图片

This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the dow......

Rethinking LLM Memorization through the Lens of Adversarial Compression

图片

Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage. One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information. The answer hinges, to a large degree, on $\textit{how we define memorization}$. In this work, we propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs -- a given string from the training data is considered memorized if it can be elicited by a prompt shorter than the string itself. In other words, these strings can be "compressed" with the model by computing adversarial prompts of fewer tokens. We outline the limitations of existing notions of memorization and show how the ACR overcomes these challenges by (i) offering an adversarial view to measuring memorization, especially for monitoring unlearning and compliance; and (ii) allow......

CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation  Method

图片

Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous almighty advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process, and fewer inference patches required; (3) cheap GPU cost resulting from patch-wise inference and fewer patch......

Taming Diffusion Probabilistic Models for Character Control

图片

We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first mode......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/pingmian/3215.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

《MATLAB科研绘图与学术图表绘制从入门到精通》示例:绘制婴儿性别比例饼图

在MATLAB 中可以使用 pie 函数来创建饼图。饼图是一种展示不同部分占总体的相对比例的图表。 本示例从“婴儿出生数据.csv”文件读取婴儿出生数据,然后计算男性和女性婴儿的数量,使用MATLAB绘制饼图。 配套图书链接:https://item.jd.com…

AI图书推荐:AI驱动的图书写作工作流—从想法构思到变现

《AI驱动的图书写作工作流—从想法到变现》(AI-Driven Book Creation: From Concept to Cash)是Martynas Zaloga倾力打造的一本实用指南,它巧妙地将写作艺术与人工智能前沿技术相结合。此书不仅揭示了AI在图书出版领域的无限潜力,…

应用层协议 -- HTTPS 协议

目录 一、了解 HTTPS 协议 1、升级版的 HTTP 协议 2、理解“加密” 二、对称加密 1、理解对称加密 2、对称加密存在的问题 三、非对称加密 1、理解非对称加密 2、中间人攻击 3、CA 证书和数字签名 四、总结 一、了解 HTTPS 协议 1、升级版的 HTTP 协议 HTTPS 也是…

fatal: unable to access ‘https://github.com/alibaba/flutter_boost.git/

Git error. Command: git fetch stdout: stderr: fatal: unable to access ‘https://github.com/alibaba/flutter_boost.git/’: Failed to connect to github.com port 443 after 75005 ms: Couldn’t connect to server exit code: 128 GitHub (国际型)代码 分发平台/托管平…

Mycat(一)入门概述

文章目录 概述作用原理 Mycat1.x 与 Mycat2 功能对比1.x 与 2.0 功能对比图 Mycat2 相关概念概念描述 配置文件1、服务(server)2、用户(user)3、数据源(datasource)4、集群(cluster)…

车企的数智化“内功”,大模型帮修炼

文|白 鸽 编|王一粟 时隔4年回归的北京车展,遇上了中国智能汽车的热潮。 开年价格战的持续洗礼,不仅让一众中国车企都慌得一批,也让全球巨头特斯拉也面临一季度销量大跌局面。 与此同时,智能汽车还在…

C++初识内存管理和模版

目录 前言 1.C/C内存分布 2. C的内存管理方式 2.1 new/delete操作内置类型 2. new和delete操作自定义类型 3. operator new和operator delete函数 4. new和delete的实现原理 4.1 内置类型 4.2 自定义类型 5. malloc/free和new/delete的区别 6. 初识模版 6.1 泛型编…

ERROR: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

今天本来想在A服务器上传文件给B服务器的结果发现明明给root用户设置了密码就是远程登陆不了,后来才发现在容器中很多服务都是没有的,所以刚安装后忘记了修改配置文件,导致远程登陆失败。 报错: 解决方法: 在/etc/ssh…

申请高德地图,报错INVALID_USER_SCODE处理

配置上key后 报错: 解决:将应用类型修改为出行,问题解决 扩展:应用申请 进入应用管理,创建新应用(这里我选了导航,就报了上边的错误) 新应用中添加 key,服务平台选择…

static和extern关键字详解

目录 创作不易,如对您有帮助,还望一键三连,谢谢!!! 回顾 1.作用域和声明周期 1.1作用域 1.2生命周期 2.static和extern 2.1extern 2.2static 2.2-1static修饰局部变量 2.2-2static修饰全局变量 创…

解决在服务器中减少删除大文件夹耗时太久的问题

在数据驱动的现代商业环境中,企业对服务器的高效运作有着极高的依赖性。然而,IT管理员们常常面临一个棘手的问题:删除服务器上的大型文件夹过程缓慢,这不仅降低了工作效率,还可能对用户体验造成负面影响。本文将介绍一…

2024 年 Rust 开发者路线图

Rust 近年来因其对性能、安全性和并发性的关注而广受欢迎。作为一名开发人员,掌握 Rust 可以为各种机会打开大门,包括 Web 开发。 在 github 上发现了这个优秀的路线图,由 Anshul Goyal 创建,它提供了一条全面的路径,概…

MIEC CS172(Prolog)

Chapter 1 and 2 Fact Facts: Facts are statements that areassumed to be true. The dot ‘.’ character must come at the end of a fact. Example: We want to tell “John likes Mary” : English interpretation The standard form of fact in Prolog Likes (john, ma…

怎么用AI绘画进行人物修复?

用过AI绘画生成人物图片的朋友们是不是都碰到过这样的问题:诡异的造型、崩坏的五官、离谱的手指头、乱七八糟的背景...指望AI一次性生成百分百完美的图貌似有点难啊。 现在AI绘画有了【脸部修复】【手部修复】功能,就能够轻松解决这些的问题了&#xff0…

Facebook的时间机器:回溯社交媒体的历史

1. 社交媒体的起源与早期模式 社交媒体的历史可以追溯到互联网的早期发展阶段。在Web 1.0时代,互联网主要是一个信息发布平台,用户主要是被动地接收信息。但随着Web 2.0的兴起,互联网逐渐转变为一个互动和参与的平台,社交媒体应运…

2024.4.23 关于 LoadRunner 性能测试工具详解 —— VUG

目录 引言 LoadRunner 三大组件之间的关系 LoadRunner 脚本录制 启动并访问 WebTours 脚本录制 编译 运行(回放) LoadRunner 脚本加强 事务插入 插入集合点 插入检查点 参数化 ​编辑 打印日志 引言 问题: 此处为啥选择使用 Lo…

JdbcTemplate详解

1 概述 为了使JDBC更加易于使用,Spring在JDBC API上定义了一个抽象层,以此建立一个JDBC存取框架。 作为Spring JDBC框架的核心,JDBC模板的设计目的是为不同类型的JDBC操作提供模板方法,通过这种方式,可以在尽可能保留…

Linux安装Docker的多版本PHP和多版本MySQL共存

1: 先安装docker 安装完后执行,权限设置 sudo usermod -aG docker $USER或者sudo usermod -aG docker kentrl#添加当前用户到Docker用户组中 sudo newgrp docker#更新用户组数据,必须执行否则无效 sudo systemctl restart docker 先看目录结构: 2:按照目录结构挂载磁盘,…

【Qt常用控件】—— QWidget 核心属性

目录 (一)控件概述 1.1 关于控件体系的发展 (二)QWidget 核心属性 2.1 核心属性概览 2.2 enabled 2.3 geometry 2.4 windowTitle 2.5 windowIcon 2.6 windowOpacity 2.7 cursor 2.8 font 2.9 toolTip 2.10 focus…

Esko Ukkonen: On-line Construction of Suffix Trees

Esko Ukkonen: On-line Construction of Suffix Trees 文章目录 Esko Ukkonen: On-line Construction of Suffix Trees一、后缀树的概念及应用【详见刘方州同学报告】1.1 字典树 Trie1.2 后缀树 Suffix Tree2 后缀树的应用 二、朴素后缀树构造方法及问题三、线性时间内后缀树在…