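LLM Quantization, High-Fidelity Image-to-Video Generation, Multi-modal Co-Speech Motion Generation, High-Resolution Image Synthesis, Low-Light Image/Video Enhancement, Relative Camera Pose Estimation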

This article was first published on the WeChat official account 机器感知 (Machine Perception).

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

Large language models (LLMs) have proven far superior to conventional methods in various tasks. However, their expensive computation and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence, in this work, we explore an important question: can we design a data-independent quantization method for LLMs that guarantees their generalization performance? We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observations indicate that two factors, outliers in the weights and the quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves performance comparable to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel, so the quantized model can be obtained in a few minutes even for LLMs with over 100B parameters. To the best of our knowledge, this is the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting, and our algorithm runs over 10 times faster than data-dependent methods.
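The two ideas named in the abstract, keeping a small fraction of outlier weights untouched and tuning the quantization range against reconstruction error, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the top-magnitude outlier rule, the grid search over the scale (standing in for whatever optimizer the paper uses), and the names `easyquant_like`, `outlier_frac` and `n_grid` are all assumptions made for illustration.

```python
import numpy as np

def quantize_dequantize(w, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def easyquant_like(w, bits=4, outlier_frac=0.01, n_grid=100):
    """Keep the largest-magnitude weights in full precision and search the
    scale that minimises the reconstruction error of the remaining weights.
    Hypothetical helper: the outlier rule and the grid search are assumptions."""
    shape = np.shape(w)
    flat = np.asarray(w, dtype=np.float64).ravel()
    k = int(outlier_frac * flat.size)      # number of weights kept as outliers
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        mask[np.argsort(np.abs(flat))[-k:]] = True
    inliers = flat[~mask]
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(inliers).max() / qmax    # plain min-max scale as the starting point
    best_scale, best_err = base, np.inf
    for ratio in np.linspace(0.3, 1.0, n_grid):  # shrink the range step by step
        s = base * ratio
        err = np.mean((quantize_dequantize(inliers, s, bits) - inliers) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    w_hat = flat.copy()
    w_hat[~mask] = quantize_dequantize(inliers, best_scale, bits)
    return w_hat.reshape(shape), mask.reshape(shape), best_scale
```

Because the search needs only the weights themselves, no calibration data enters the procedure, which is the data-independent property the abstract emphasizes.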

Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

Image-to-video (I2V) generation often struggles to maintain high fidelity in open domains. Traditional image animation techniques primarily focus on specific domains such as faces or human poses, making them difficult to generalize to open domains. Several recent I2V frameworks based on diffusion models can generate dynamic content for open-domain images but fail to maintain fidelity. We found that the two main causes of low fidelity are the loss of image details and noise prediction biases during the denoising process. To this end, we propose an effective method that can be applied to mainstream video diffusion models. The method achieves high fidelity by supplementing more precise image information and rectifying the predicted noise. Specifically, given an input image, our method first adds noise to the input image latent to keep more details, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases. Our method is tuning-free and plug-and-play. The experimental results demonstrate the effectiveness of our approach in improving the fidelity of generated videos. For more image-to-video generated results, please refer to the project website: https://noise-rectification.github.io.
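The abstract does not give the rectification formula, so the following NumPy sketch only illustrates one plausible reading of "add noise to the image latent, then denoise with rectification": because the noise added to the image latent is known, a biased model prediction can be pulled back toward it. The linear blending rule and the names `rectify_noise`, `weight` and `alpha_bar_t` are assumptions, not the paper's definitions.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    """Standard forward diffusion step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def rectify_noise(eps_pred, eps_added, weight):
    """Blend the model's noise prediction with the noise that was actually added.
    weight = 1 trusts the known noise fully, weight = 0 keeps the raw prediction.
    (Assumed rule for illustration only.)"""
    return weight * eps_added + (1.0 - weight) * eps_pred

def predict_z0(z_t, eps, alpha_bar_t):
    """Invert the forward step to get the clean-latent estimate implied by eps."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```

In this toy setting a constant bias in `eps_pred` translates directly into an error in the recovered latent, and any `weight` above zero shrinks that error proportionally.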

Zero-LED: Zero-Reference Lighting Estimation Diffusion Model for Low-Light Image Enhancement

Diffusion model-based low-light image enhancement methods rely heavily on paired training data, which limits their broad application. Meanwhile, existing unsupervised methods lack effective bridging capabilities for unknown degradation. To address these limitations, we propose a novel zero-reference lighting estimation diffusion model for low-light image enhancement called Zero-LED. It utilizes the stable convergence ability of diffusion models to bridge the gap between low-light domains and real normal-light domains and successfully alleviates the dependence on paired training data via zero-reference learning. Specifically, we first design an initial optimization network to preprocess the input image and implement bidirectional constraints between the diffusion model and the initial optimization network through multiple objective functions. Subsequently, the degradation factors of the real-world scene are optimized iteratively to achieve effective light enhancement. In addition, we explore a frequency-domain based and semantically guided appearance reconstruction module that encourages feature alignment of the recovered image at a fine-grained level and satisfies subjective expectations. Finally, extensive experiments demonstrate the superiority of our approach over other state-of-the-art methods and its stronger generalization capabilities. We will release the source code upon acceptance of the paper.

MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model

The body movements accompanying speech aid speakers in expressing their ideas. Co-speech motion generation is one of the important approaches for synthesizing realistic avatars. Due to the intricate correspondence between speech and motion, generating realistic and diverse motion is a challenging task. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on the diffusion model to ensure both the authenticity and diversity of generated motion. We propose a progressive fusion strategy to enhance inter-modal and intra-modal interaction, efficiently integrating multi-modal information. Specifically, we employ a masked style matrix based on emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, aiming to learn both inter-modal and intra-modal features. In addition, we propose a geometric loss to enforce the coherence of the joints' velocity and acceleration across frames. Given input speech and editable identity and emotion, our framework generates vivid, diverse, and style-controllable motion of arbitrary length. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods in both upper-body and the more challenging full-body generation.
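A loss on joint velocity and acceleration is commonly built from finite differences over the frame axis. The NumPy sketch below shows that general pattern; the exact form, weighting and norm used in MMoFusion are not stated in the abstract, so the mean-squared-error formulation and the name `geometric_loss` are assumptions.

```python
import numpy as np

def geometric_loss(pred, target):
    """Velocity + acceleration consistency between two motion sequences.
    pred, target: arrays of shape (frames, joints, 3) holding joint positions.
    (Assumed MSE form; equal weighting of both terms is a hypothetical choice.)"""
    # First-order difference over frames approximates joint velocity.
    vel_p, vel_t = np.diff(pred, axis=0), np.diff(target, axis=0)
    # Second-order difference over frames approximates joint acceleration.
    acc_p, acc_t = np.diff(pred, n=2, axis=0), np.diff(target, n=2, axis=0)
    vel_loss = np.mean((vel_p - vel_t) ** 2)
    acc_loss = np.mean((acc_p - acc_t) ** 2)
    return vel_loss + acc_loss
```

Since only frame-to-frame differences enter the loss, a constant positional offset is not penalised while frame-level jitter is, which is why such a term is usually added on top of a position loss rather than used alone.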

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and that lower validation loss correlates with improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
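The straight-line connection between data and noise, and the idea of sampling training timesteps non-uniformly, can be written down in a few lines. In the NumPy sketch below, the interpolation and velocity target follow the usual rectified flow formulation; the logit-normal timestep sampler is used as one illustrative way to concentrate samples on intermediate noise levels, since the abstract does not name the distribution. The function names are hypothetical.

```python
import numpy as np

def sample_timesteps(rng, n, mean=0.0, std=1.0):
    """Logit-normal sampling: t = sigmoid(u) with u ~ N(mean, std^2),
    which favours mid-range t over the endpoints (assumed choice)."""
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def interpolate(x0, eps, t):
    """Straight-line path between data x0 (t = 0) and noise eps (t = 1)."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over feature dims
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    """Constant velocity d z_t / dt along the straight path; the regression target."""
    return eps - x0
```

A training step under these assumptions would draw `t`, form `interpolate(x0, eps, t)`, and regress the network output onto `velocity_target(x0, eps)`.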

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.
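To make the notion of balancing a solved and a learned pose concrete, the NumPy sketch below blends two estimates with a scalar weight. It is only an illustration: in FAR the balance is produced by a Transformer, and the abstract does not say how the two poses are combined. The fixed weight `w`, the quaternion interpolation, and the convention that the solver supplies a unit translation direction while the learned estimate supplies the metric scale are all assumptions.

```python
import numpy as np

def blend_rotations(q_solver, q_learned, w):
    """Normalised linear interpolation between two unit quaternions.
    w = 1 returns the solver's rotation, w = 0 the learned one (hypothetical rule)."""
    q_solver = np.asarray(q_solver, dtype=float)
    q_learned = np.asarray(q_learned, dtype=float)
    if np.dot(q_solver, q_learned) < 0:  # q and -q encode the same rotation
        q_learned = -q_learned
    q = w * q_solver + (1.0 - w) * q_learned
    return q / np.linalg.norm(q)

def blend_translations(t_dir_solver, t_learned, w):
    """Blend translation directions, then apply the learned metric scale.
    Assumes t_learned is non-zero and the two directions are not opposite."""
    t_learned = np.asarray(t_learned, dtype=float)
    scale = np.linalg.norm(t_learned)
    d = w * np.asarray(t_dir_solver, dtype=float) + (1.0 - w) * t_learned / scale
    return scale * d / np.linalg.norm(d)
```

The split reflects the trade-off described in the abstract: a correspondence-based solver recovers translation only up to scale, whereas a direct prediction can carry absolute scale at lower precision.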

A Spatio-temporal Aligned SUNet Model for Low-light Video Enhancement

Distortions caused by low-light conditions are not only visually unpleasant but also degrade the performance of computer vision tasks. Restoration and enhancement have proven to be highly beneficial. However, there are only a limited number of enhancement methods explicitly designed for videos acquired in low-light conditions. We propose a Spatio-Temporal Aligned SUNet (STA-SUNet) model using a Swin Transformer as a backbone to capture low-light video features and exploit their spatio-temporal correlations. The STA-SUNet model is trained on a novel, fully registered dataset (BVI), which comprises dynamic scenes captured under varying light conditions. It is further compared against various other models on three test datasets. The model demonstrates superior adaptivity across all datasets, obtaining the highest PSNR and SSIM values. It is particularly effective in extreme low-light conditions, yielding fairly good visualisation results.
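For readers unfamiliar with the PSNR metric mentioned above, it is defined as 10 * log10(MAX^2 / MSE) between a reference frame and a restored frame. A minimal NumPy version is sketched below; the function name and the 8-bit default for `max_value` are illustrative choices, not taken from the paper's evaluation code.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```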

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
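The core mechanic of vector quantization with one codebook per attribute can be shown in a toy NumPy sketch. This is a strong simplification and not the NaturalSpeech 3 codec: here a latent vector is simply sliced into four equal chunks, whereas the real system learns the disentanglement. The attribute ordering, the equal split, and the helper names are assumptions for illustration.

```python
import numpy as np

ATTRIBUTES = ("content", "prosody", "timbre", "details")

def quantize(x, codebook):
    """Return the index and vector of the nearest codebook entry (Euclidean)."""
    idx = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
    return idx, codebook[idx]

def factorized_vq(latent, codebooks):
    """Split a latent into one chunk per attribute and quantize each chunk
    with its own codebook (hypothetical, slice-based factorization)."""
    chunks = np.split(latent, len(ATTRIBUTES))
    codes, parts = {}, []
    for name, chunk in zip(ATTRIBUTES, chunks):
        idx, vec = quantize(chunk, codebooks[name])
        codes[name] = idx
        parts.append(vec)
    return codes, np.concatenate(parts)
```

Keeping a separate code per attribute is what allows each subspace to be generated or swapped independently, which is the divide-and-conquer property the abstract describes.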
