​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

本文首发于公众号:机器感知

​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

图片

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

图片

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work. we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is .......

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

图片

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.......

Space-time Reinforcement Network for Video Object Segmentation

图片

Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching will lead to undesired mismatching caused by the noises or distractors. To address the aforementioned issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one. Next, we learn a prototype for each video object and prototype-level matching can be implemented between the query and memory. The experiment demonstrated that our network outperforms the state-of-the-art method on the DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result 85.0% on YouTube VOS 2018. In addition, our network exhibits a high inference sp......

Structured Click Control in Transformer-based Interactive Segmentation

图片

Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrated the proposed algorithm can serve as a general structure in improving Transformer-based interactive segmenta?tion performance. The code and data will be released at......

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

图片

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing model. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. T......

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

图片

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseli......

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

图片

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model qual......

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

图片

Generating consistent multiple views for 3D reconstruction tasks is still a challenge to existing image-to-3D diffusion models. Generally, incorporating 3D representations into diffusion model decrease the model's speed as well as generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from single image or leveraging scene representation transformer and view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one image input, our model is able to generate 3D meshes surpassing baselines methods in evaluation metrics, including PSNR, SSIM and LPIPS.......

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

图片

Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly {Softmax}, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with \textit{standard ViTs}, we focus our attention towards the quantization and acceleration for \textit{efficient ViTs}, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose \emph{Trio-ViT} accordingl......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/bicheng/9164.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

力扣题目101:对称二叉树

作者介绍:10年大厂数据\经营分析经验,现任大厂数据部门负责人。 会一些的技术:数据分析、算法、SQL、大数据相关、python 欢迎加入社区:码上找工作 作者专栏每日更新: LeetCode解锁1000题: 打怪升级之旅 python数据分析…

js原生手写一个拖拽小功能

先上效果图 附上代码 <!DOCTYPE html> <html lang"en"><head><meta charset"UTF-8" /><meta http-equiv"X-UA-Compatible" content"IEedge" /><meta name"viewport" content"widthd…

Python自动化测试五大框架(测试员收藏夹必备)

&#x1f525; 交流讨论&#xff1a;欢迎加入我们一起学习&#xff01; &#x1f525; 资源分享&#xff1a;耗时200小时精选的「软件测试」资料包 &#x1f525; 教程推荐&#xff1a;火遍全网的《软件测试》教程 &#x1f4e2;欢迎点赞 &#x1f44d; 收藏 ⭐留言 &#x1…

Java 语法 (杂七杂八的知识)

面向对象三大特性 封装, 多态, 继承 基本数据类型 一字节 (Byte) 占八位 (bit) JDK, JRE, JVM JDK (Java Development Kit) : Java 开发工具包, 包括了 JRE, 编译器 javac, 和调试工具 Jconsole, jstack 等 JRE (Java Runtime Environment) : Java 运行时环境, 包括了 JVM , …

基于Vue3与ElementUI Plus的酷企秀场景可视化DIY设计器:前端技术引领下的数字化展示新篇章

一、引言 在当今信息化高速发展的时代&#xff0c;企业对于展示自身形象、提升用户体验以及增强品牌知名度的需求日益迫切。针对这一市场需求&#xff0c;我们推出了基于Vue3与ElementUI Plus的酷企秀场景可视化DIY设计器。该产品不仅具备电子画册、VR全景、地图秀三大核心功能…

Java小白_面向对象程序设计01顺序结构_01Java顺序结构之数学函数之根据三角形三边长求面积

练习 -Java顺序结构之数学函数之根据三角形三边长求面积 Java顺序结构之数学函数之根据三角形三边长求面积 练习 -Java顺序结构之数学函数之根据三角形三边长求面积1. 任务要求任务描述编程要求测试说明 2. 任务分析1. 输入输出分析2. 需求分析3. 所需知识1. java 库 如何获取输…

Docker-compose部署Fastapi项目

Docker-compose部署Fastapi、postgres、Redis、Nginx) 之前有写过使用容器部署的方式&#xff0c;这次尝试使用Docker-compose试一次大胆的尝试 使用容器的方式部署只是掌握这项技能的基础&#xff0c;在使用Docker-compose的过程中会有些稍许的不同。毕竟踩过的坑才算是跨过去…

如何在PPT中插入网页?这样操作,免费还高效!

融合课、跨学科课&#xff0c;已经是近两年来教育界的热门词。 在公开课、微课比赛中&#xff0c;不添融合一些较为先进的信息技术&#xff0c;都不好意思拿出手了。 最近&#xff0c;由不坑老师开发制作的Office插件——不坑盒子&#xff0c;实现了在PPT中插入网页&#xff…

ARM(4)缓存一致性

目录 一、缓存一致性问题 二、一致性实现方案 2.1 目录一致性协议 2.2 嗅探一致性协议 三、CHI协议 3.1 cache state 3.2 snoop维护一致性 四、其他一致性协议 4.1 MSI协议 4.2 MESI 协议 4.3 MOESI协议 本文介绍以下内容&#xff1a; 缓存一致性问题一致性实现方案…

从原始边列表到邻接矩阵Python实现图数据处理的完整指南

​​本文分享自华为云社区《从原始边列表到邻接矩阵Python实现图数据处理的完整指南》&#xff0c;作者&#xff1a; 柠檬味拥抱。 在图论和网络分析中&#xff0c;图是一种非常重要的数据结构&#xff0c;它由节点&#xff08;或顶点&#xff09;和连接这些节点的边组成。在Py…

设计模式之前端控制器模式

想象一下&#xff0c;你的Java Web应用是个交响乐团&#xff0c;每个功能模块是乐手&#xff0c;而用户请求就像是一首首待演绎的曲目。在这场音乐盛宴中&#xff0c;谁来保证演出的流畅与协调&#xff1f;答案就是——前端控制器模式&#xff01;它如同乐队的指挥&#xff0c;…

java中如何判断一个数是不是素数(质数)

相关概念 质数就是大于1的自然数字中&#xff0c;只能被1和它自己整除的数。 题目 求101~200之间的质素的个数 代码实现 判断一个数是不是质数 for (int j 2; j < i; j) {if(i % j 0){flag false;break;}}if(flag){System.out.println("当前数字是质数");…

【动态规划】:路径问题_地下城游戏

朋友们、伙计们&#xff0c;我们又见面了&#xff0c;本专栏是关于各种算法的解析&#xff0c;如果看完之后对你有一定的启发&#xff0c;那么请留下你的三连&#xff0c;祝大家心想事成&#xff01; C 语 言 专 栏&#xff1a;C语言&#xff1a;从入门到精通 数据结构专栏&…

Python的Web框架Flask+Vue生成漂亮的词云图

生成效果图 输入待生成词云图的文本&#xff0c;点击生成词云即可&#xff0c;在词云图生成之后&#xff0c;可以点击下载图片保存词云图。 运行步骤 分别用前端和后端编译器&#xff0c;打开backend和frontend文件夹。前端运行 npm install &#xff0c;安装相应的包。后端…

Java中常用类String的实例化详解

Java中常用类String的实例化详解 在Java编程中&#xff0c;String类是一个基础且非常重要的类&#xff0c;用于表示和操作字符序列。了解如何正确地实例化String对象&#xff0c;对于初学者来说是非常必要的。本文将详细解释如何在Java中实例化String对象&#xff0c;并提供带…

java加密生成签名

package demo;import java.util.Arrays; import java.util.Map;import com.google.common.collect.Maps; import org.apache.commons.lang3.StringUtils; import org.apache.commons.codec.digest.DigestUtils;/*** 加密生成签名*/ public class Encrypt {public static void m…

电脑缺失opencl.dll怎么办,轻松解决opencl.dll的多种方法分享

当我们在操作电脑过程中遇到系统提示“由于找不到opencl.dll&#xff0c;无法继续执行代码”&#xff0c;这个错误会导致软件应用无法正常运行。OpenCL.dll作为一个与Open Computing Language&#xff08;开放计算语言&#xff09;相关的动态链接库文件&#xff0c;它在执行需要…

Baidu Comate——基于AI的智能代码生成让你的编码更快、更好、更简单!

目录 Baidu Comate智能编码助手介绍 支持的编程语言 支持的 IDE 支持的操作系统 System 安装 Baidu Comate 核心场景 智能推荐 单行推荐 多行推荐 智能生成 注释生成代码 增强生成代码 生成单元测试 代码生成注释 生成文档注释 生成行间注释 代码解释 长函…

2024OD机试卷-分披萨 (java\python\c++)

题目:分披萨 题目描述 "吃货"和"馋嘴"两人到披萨店点了一份铁盘(圆形)披萨,并嘱咐店员将披萨按放射状切成大小相同的偶数个小块。但是粗心的 服务员 将披萨切成了每块大小都完全不同奇 数块,且肉眼能分辨出大小。 由于两人都想吃到最多的披萨,他们…

2023年全国职业院校技能大赛(高职组)“云计算应用”赛项赛卷1(容器云)

#需要资源&#xff08;软件包及镜像&#xff09;或有问题的&#xff0c;可私聊博主&#xff01;&#xff01;&#xff01; #需要资源&#xff08;软件包及镜像&#xff09;或有问题的&#xff0c;可私聊博主&#xff01;&#xff01;&#xff01; #需要资源&#xff08;软件包…