​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

本文首发于公众号:机器感知

​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

图片

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

图片

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work. we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is .......

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

图片

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.......

Space-time Reinforcement Network for Video Object Segmentation

图片

Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching will lead to undesired mismatching caused by the noises or distractors. To address the aforementioned issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one. Next, we learn a prototype for each video object and prototype-level matching can be implemented between the query and memory. The experiment demonstrated that our network outperforms the state-of-the-art method on the DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result 85.0% on YouTube VOS 2018. In addition, our network exhibits a high inference sp......

Structured Click Control in Transformer-based Interactive Segmentation

图片

Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrated the proposed algorithm can serve as a general structure in improving Transformer-based interactive segmenta?tion performance. The code and data will be released at......

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

图片

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing model. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. T......

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

图片

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseli......

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

图片

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model qual......

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

图片

Generating consistent multiple views for 3D reconstruction tasks is still a challenge to existing image-to-3D diffusion models. Generally, incorporating 3D representations into diffusion model decrease the model's speed as well as generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from single image or leveraging scene representation transformer and view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one image input, our model is able to generate 3D meshes surpassing baselines methods in evaluation metrics, including PSNR, SSIM and LPIPS.......

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

图片

Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly {Softmax}, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with \textit{standard ViTs}, we focus our attention towards the quantization and acceleration for \textit{efficient ViTs}, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose \emph{Trio-ViT} accordingl......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/bicheng/9164.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

js原生手写一个拖拽小功能

先上效果图 附上代码 <!DOCTYPE html> <html lang"en"><head><meta charset"UTF-8" /><meta http-equiv"X-UA-Compatible" content"IEedge" /><meta name"viewport" content"widthd…

Python自动化测试五大框架(测试员收藏夹必备)

&#x1f525; 交流讨论&#xff1a;欢迎加入我们一起学习&#xff01; &#x1f525; 资源分享&#xff1a;耗时200小时精选的「软件测试」资料包 &#x1f525; 教程推荐&#xff1a;火遍全网的《软件测试》教程 &#x1f4e2;欢迎点赞 &#x1f44d; 收藏 ⭐留言 &#x1…

Java 语法 (杂七杂八的知识)

面向对象三大特性 封装, 多态, 继承 基本数据类型 一字节 (Byte) 占八位 (bit) JDK, JRE, JVM JDK (Java Development Kit) : Java 开发工具包, 包括了 JRE, 编译器 javac, 和调试工具 Jconsole, jstack 等 JRE (Java Runtime Environment) : Java 运行时环境, 包括了 JVM , …

基于Vue3与ElementUI Plus的酷企秀场景可视化DIY设计器:前端技术引领下的数字化展示新篇章

一、引言 在当今信息化高速发展的时代&#xff0c;企业对于展示自身形象、提升用户体验以及增强品牌知名度的需求日益迫切。针对这一市场需求&#xff0c;我们推出了基于Vue3与ElementUI Plus的酷企秀场景可视化DIY设计器。该产品不仅具备电子画册、VR全景、地图秀三大核心功能…

如何在PPT中插入网页?这样操作,免费还高效!

融合课、跨学科课&#xff0c;已经是近两年来教育界的热门词。 在公开课、微课比赛中&#xff0c;不添融合一些较为先进的信息技术&#xff0c;都不好意思拿出手了。 最近&#xff0c;由不坑老师开发制作的Office插件——不坑盒子&#xff0c;实现了在PPT中插入网页&#xff…

ARM(4)缓存一致性

目录 一、缓存一致性问题 二、一致性实现方案 2.1 目录一致性协议 2.2 嗅探一致性协议 三、CHI协议 3.1 cache state 3.2 snoop维护一致性 四、其他一致性协议 4.1 MSI协议 4.2 MESI 协议 4.3 MOESI协议 本文介绍以下内容&#xff1a; 缓存一致性问题一致性实现方案…

设计模式之前端控制器模式

想象一下&#xff0c;你的Java Web应用是个交响乐团&#xff0c;每个功能模块是乐手&#xff0c;而用户请求就像是一首首待演绎的曲目。在这场音乐盛宴中&#xff0c;谁来保证演出的流畅与协调&#xff1f;答案就是——前端控制器模式&#xff01;它如同乐队的指挥&#xff0c;…

java中如何判断一个数是不是素数(质数)

相关概念 质数就是大于1的自然数字中&#xff0c;只能被1和它自己整除的数。 题目 求101~200之间的质素的个数 代码实现 判断一个数是不是质数 for (int j 2; j < i; j) {if(i % j 0){flag false;break;}}if(flag){System.out.println("当前数字是质数");…

【动态规划】:路径问题_地下城游戏

朋友们、伙计们&#xff0c;我们又见面了&#xff0c;本专栏是关于各种算法的解析&#xff0c;如果看完之后对你有一定的启发&#xff0c;那么请留下你的三连&#xff0c;祝大家心想事成&#xff01; C 语 言 专 栏&#xff1a;C语言&#xff1a;从入门到精通 数据结构专栏&…

Python的Web框架Flask+Vue生成漂亮的词云图

生成效果图 输入待生成词云图的文本&#xff0c;点击生成词云即可&#xff0c;在词云图生成之后&#xff0c;可以点击下载图片保存词云图。 运行步骤 分别用前端和后端编译器&#xff0c;打开backend和frontend文件夹。前端运行 npm install &#xff0c;安装相应的包。后端…

电脑缺失opencl.dll怎么办,轻松解决opencl.dll的多种方法分享

当我们在操作电脑过程中遇到系统提示“由于找不到opencl.dll&#xff0c;无法继续执行代码”&#xff0c;这个错误会导致软件应用无法正常运行。OpenCL.dll作为一个与Open Computing Language&#xff08;开放计算语言&#xff09;相关的动态链接库文件&#xff0c;它在执行需要…

Baidu Comate——基于AI的智能代码生成让你的编码更快、更好、更简单!

目录 Baidu Comate智能编码助手介绍 支持的编程语言 支持的 IDE 支持的操作系统 System 安装 Baidu Comate 核心场景 智能推荐 单行推荐 多行推荐 智能生成 注释生成代码 增强生成代码 生成单元测试 代码生成注释 生成文档注释 生成行间注释 代码解释 长函…

因表别名引用错误导致查询SQL执行时间长未出结果

问题描述&#xff1a; 项目组人员反馈在执行一条提取数据SQL时执行很慢&#xff0c;每次执行一段时间就报超时&#xff0c;要求帮忙提取下。 解决过程&#xff1a; 项目组人员发来SQL后&#xff0c;看了下SQL&#xff0c;没什么问题&#xff0c;就在客户端上执行了下&#xff0…

测试必备工具 —— Postman实战教程!

01、接口测试 &#xff08;1&#xff09;服务器端&#xff08;server&#xff09;&#xff1a;在使用别人的服务器上&#xff0c;例如微信APP客户端&#xff0c;服务端在腾讯的服务端上&#xff0c;微信上的账号信息&#xff0c;聊天记录均存储在服务端上&#xff1b;用户A发送…

1010: 折半查找的实现

解法&#xff1a; #include<iostream> #include<vector> using namespace std; void solve() {int n;cin >> n;vector<int> vec(n);for (int& x : vec) cin >> x;int x;cin >> x;int l 0, r n-1, cnt 0;while (l < r) {cnt;int…

C语言进阶 文件操作知识(下)

一. 文本文件和二进制文件 根据数据的组织形式&#xff0c;数据文件被称为文本文件或者二进制文件。 数据在内存中以二进制的形式存储&#xff0c;如果不加转换的输出到外存&#xff0c;就是二进制文件。 如果要求在外存上以ASCII码的形式存储&#xff0c;则需要在存储前转换。…

java爬虫代理ip(java爬虫代码示例)

java爬虫代理ip 在编写java爬虫时&#xff0c;经常会遇到需要使用代理IP来访问目标网站的情况。这时候&#xff0c;我们就需要编写代码来实现代理IP的功能。接下来&#xff0c;我们将为大家介绍如何在java爬虫中使用代理IP&#xff0c;以及给出相应的代码示例。 首先&#xff…

腾讯游戏海外扩张,增持芬兰游戏开发商股份持股比例增至14.8%

易采游戏网5月8日消息&#xff0c;近日腾讯再次出手&#xff0c;大幅增持了芬兰知名游戏开发商Remedy Entertainment的股份&#xff0c;持股比例猛增至14.8%。这一举动引起了业界和投资者的广泛关注。 据了解&#xff0c;腾讯此次增持是在2024年4月24日完成的。根据芬兰法律规…

TCP通信并发:

上次的程序只能保持&#xff0c;单线程或者进程 多进程并发服务器 进程的特点&#xff08;有血缘关系&#xff09; 创建子进程&#xff1a;fork&#xff08;&#xff09;&#xff1b; 虚拟地址空间被复制 &#xff0c;从一份变成两份&#xff08;用户区和内核区&#xff09…

JVM垃圾回收详解

一、基本概念 1、HotSpot VM &#xff1a;是由 Oracle 公司开发的一种 Java 虚拟机&#xff08;JVM&#xff09;&#xff0c;是 Java SE 平台上最广泛使用的虚拟机之一。它是 OpenJDK 的一部分&#xff0c;也是 Oracle JDK 的基础之一。使用即时编译&#xff08;Just-In-Time …