【多模态大模型】 端侧多模态模型 Qwen2-VL-2B-Instruct

【多模态大模型】 端侧多模态模型 Qwen2-VL-2B-Instruct

  • Qwen2-VL-2B-Instruct 模型介绍
  • 模型测评
  • 运行环境安装
  • 运行模型
    • Image Resolution for performance boost
    • two methods for fine-grained control over the image size input to the model:
  • 下载
  • 开源协议
  • 参考

Qwen2-VL-2B-Instruct 模型介绍

  • Key Enhancements:
    SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

  • Model Architecture Updates:

    在这里插入图片描述

模型测评

  • Image Benchmarks

在这里插入图片描述

  • Video Benchmarks

在这里插入图片描述

运行环境安装

pip install qwen-vl-utils
pip install transformers==4.45.2

运行模型

  • with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )# default processer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)messages = [{"role": "user","content": [{"type": "image","image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",},{"type": "text", "text": "Describe this image."},],}
]# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text],images=image_inputs,videos=video_inputs,padding=True,return_tensors="pt",
)
inputs = inputs.to("cuda")# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
  • Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)conversation = [{"role": "user","content": [{"type": "image",},{"type": "text", "text": "Describe this image."},],}
]# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids) :]for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
  • Multi image inference
# Messages containing multiple images and a text query
messages = [{"role": "user","content": [{"type": "image", "image": "file:///path/to/image1.jpg"},{"type": "image", "image": "file:///path/to/image2.jpg"},{"type": "text", "text": "Identify the similarities between these images."},],}
]# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text],images=image_inputs,videos=video_inputs,padding=True,return_tensors="pt",
)
inputs = inputs.to("cuda")# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
  • Video inference
# Messages containing a images list as a video and a text query
messages = [{"role": "user","content": [{"type": "video","video": ["file:///path/to/frame1.jpg","file:///path/to/frame2.jpg","file:///path/to/frame3.jpg","file:///path/to/frame4.jpg",],"fps": 1.0,},{"type": "text", "text": "Describe this video."},],}
]
# Messages containing a video and a text query
messages = [{"role": "user","content": [{"type": "video","video": "file:///path/to/video1.mp4","max_pixels": 360 * 420,"fps": 1.0,},{"type": "text", "text": "Describe this video."},],}
]# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text],images=image_inputs,videos=video_inputs,padding=True,return_tensors="pt",
)
inputs = inputs.to("cuda")# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
  • Batch inference
# Sample messages for batch inference
messages1 = [{"role": "user","content": [{"type": "image", "image": "file:///path/to/image1.jpg"},{"type": "image", "image": "file:///path/to/image2.jpg"},{"type": "text", "text": "What are the common elements in these pictures?"},],}
]
messages2 = [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages1]# Preparation for batch inference
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=texts,images=image_inputs,videos=video_inputs,padding=True,return_tensors="pt",
)
inputs = inputs.to("cuda")# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
  • More Usage Tips
    For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [{"role": "user","content": [{"type": "image", "image": "file:///path/to/your/image.jpg"},{"type": "text", "text": "Describe this image."},],}
]
## Image URL
messages = [{"role": "user","content": [{"type": "image", "image": "http://path/to/your/image.jpg"},{"type": "text", "text": "Describe this image."},],}
]
## Base64 encoded image
messages = [{"role": "user","content": [{"type": "image", "image": "data:image;base64,/9j/..."},{"type": "text", "text": "Describe this image."},],}
]

Image Resolution for performance boost

# 可设置范围 256-1280
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28 # 减少资源可设置为 512 * 28 * 28 
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

two methods for fine-grained control over the image size input to the model:

# 指定 resized_height and resized_width
messages = [{"role": "user","content": [{"type": "image","image": "file:///path/to/your/image.jpg","resized_height": 280, "resized_width": 420,},{"type": "text", "text": "Describe this image."},],}
]# 指定 min_pixels and max_pixels
messages = [{"role": "user","content": [{"type": "image","image": "file:///path/to/your/image.jpg","min_pixels": 50176,"max_pixels": 50176,},{"type": "text", "text": "Describe this image."},],}
]

下载

model_id: Qwen/Qwen2-VL-2B-Instruct
下载地址:https://hf-mirror.com/Qwen/Qwen2-VL-2B-Instruct 不需要翻墙

开源协议

License: apache-2.0

参考

  • https://qwenlm.github.io/blog/qwen2-vl/
  • https://hf-mirror.com/Qwen/Qwen2-VL-2B-Instruct
  • https://github.com/QwenLM/Qwen2-VL

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/pingmian/56702.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

专题十二_floodfill(洪水灌溉)算法_算法专题详细总结

目录 1. 图像渲染&#xff08;medium&#xff09; 解析&#xff1a; 函数头&#xff1a; 函数体&#xff1a;固定模板 设置全局变量&#xff1a; 总结&#xff1a; 2. 岛屿数量&#xff08;medium&#xff09; 解析&#xff1a; 注意&#xff1a; 总结&#xff1a; …

利用由 Search AI 提供支持的自动导入功能加速 Elastic Observability 中的日志分析

作者&#xff1a;来自 Elastic Bahubali Shetti 通过自动化自定义数据集成&#xff0c;以创纪录的速度将日志迁移到 AI 驱动的日志分析。 Elastic 正在通过自动提取自定义日志来加速采用 AI 驱动的日志分析&#xff08;AI-driven log analytics&#xff09;&#xff0c;随着基…

时间序列预测(六)——循环神经网络(RNN)

目录 一、RNN的基本原理 1、正向传播&#xff08;Forward Pass&#xff09;&#xff1a; 2、计算损失&#xff08;Loss Calculation&#xff09; 3、反向传播——反向传播通过时间&#xff08;Backpropagation Through Time&#xff0c;BPTT&#xff09; 4、梯度更新&…

Flink时间语义和时间窗口

前言 在实际的流计算业务场景中&#xff0c;我们会发现&#xff0c;数据和数据的计算往往都和时间具有相关性。 举几个例子&#xff1a; 直播间右上角通常会显示观看直播的人数&#xff0c;并且这个数字每隔一段时间就会更新一次&#xff0c;比如10秒。电商平台的商品列表&a…

MySQL-15.DQL-分页查询

一.DQL-分页查询 -- 分页查询 -- 1. 从 起始索引0 开始查询员工数据&#xff0c;每页展示5条记录 select * from tb_emp limit 0,5; -- 2.查询 第1页 员工数据&#xff0c;每页展示5条记录 select * from tb_emp limit 0,5; -- 3.查询 第2页 员工数据&#xff0c;每页展示5条记…

6.计算机网络_UDP

UDP的主要特点&#xff1a; 无连接&#xff0c;发送数据之前不需要建立连接。不保证可靠交付。面向报文。应用层给UDP报文后&#xff0c;UDP并不会抽象为一个一个的字节&#xff0c;而是整个报文一起发送。没有拥塞控制。网络拥堵时&#xff0c;发送端并不会降低发送速率。可以…

Chromium 前端window对象c++实现定义

前端中window.document window.alert()等一些列方法和对象在c对应定义如下&#xff1a; 1、window对象接口定义文件window.idl third_party\blink\renderer\core\frame\window.idl // https://html.spec.whatwg.org/C/#the-window-object// FIXME: explain all uses of [Cros…

git 报错 SSL certificate problem: certificate has expired

git小乌龟 报错 SSL certificate problem: certificate has expired 场景复现&#xff1a; 原因&#xff1a; 这个错误表明你在使用Git时尝试通过HTTPS进行通信&#xff0c;但是SSL证书已经过期。这通常发生在使用自签名证书或证书有效期已到期的情况下。 解决方法: 1.如果是…

【思维导图】C语言—常见概念

hello&#xff0c;友友们&#xff0c;今天我们进入一个新的专栏——思维导图&#xff01; 思维导图帮助我们复习知识的同时建构出一个清晰的框架&#xff0c;我往后会不断更新各个专栏的思维导图&#xff0c;关注我&#xff0c;一起加油&#xff01; 今天我们回顾C语言中的常见…

智慧社区服务平台:基于Spring Boot的实现

1系统概述 1.1 研究背景 随着计算机技术的发展以及计算机网络的逐渐普及&#xff0c;互联网成为人们查找信息的重要场所&#xff0c;二十一世纪是信息的时代&#xff0c;所以信息的管理显得特别重要。因此&#xff0c;使用计算机来管理基于web的智慧社区设计与实现的相关信息成…

意外发现!AI写作这样用,热点文章轻松超越同行90%!

做自媒体&#xff0c;写热点文章很重要。 热点自带流量&#xff0c;能很快吸引不少读者。 可很多自媒体新手很犯愁。 干货文还能勉强写出来&#xff0c;碰到热点文就不知咋办了。 为啥写热点文章这么难呢&#xff1f; 关键是得找个新颖角度切入。 要是只在网上反复复制粘贴那些…

R语言机器学习算法实战系列(九)决策树分类算法 (Decision Trees Classifier)

禁止商业或二改转载,仅供自学使用,侵权必究,如需截取部分内容请后台联系作者! 文章目录 介绍教程下载数据加载R包导入数据数据预处理数据描述数据切割调节参数构建模型模型的决策树预测测试数据评估模型模型准确性混淆矩阵模型评估指标ROC CurvePRC Curve特征的重要性保存模…

Spring Boot技术栈的电影评论网站设计与实现

6系统测试 6.1概念和意义 测试的定义&#xff1a;程序测试是为了发现错误而执行程序的过程。测试(Testing)的任务与目的可以描述为&#xff1a; 目的&#xff1a;发现程序的错误&#xff1b; 任务&#xff1a;通过在计算机上执行程序&#xff0c;暴露程序中潜在的错误。 另一个…

基于手机模拟器开发游戏辅助的技术选择

开发基于手机模拟器的游戏辅助工具是一项复杂且具有挑战性的任务。为了帮助开发人员选择适合的技术方案并提供详尽的开发指导&#xff0c;我们将从以下几个方面进行分析&#xff1a;发展背景、技术选型、实现原理、实际案例和相关的法律与道德考量。 1. 发展背景 随着智能手机…

Android--第一个android程序

写在前边 ※安卓开发工具常用模拟器汇总Android开发者必备工具-常见Android模拟器(MuMu、夜神、蓝叠、逍遥、雷电、Genymotion...)_安卓模拟器-CSDN博客 ※一般游戏模拟器运行速度相对较快&#xff0c;本文选择逍遥模拟器_以下是Android Studio连接模拟器实现(先从以上博文中…

C++初阶(五)--类和对象(中)--默认成员函数

目录 一、默认成员函数&#xff08;Default Member Functions&#xff09; 二、构造函数&#xff08; Constructor&#xff09; 1.构造函数的基本概念 2.构造函数的特征 3.构造函数的使用 无参构造函数 和 带参构造函数 注意事项&#xff1a; 4.默认构造函数 隐式生成的…

Node-RED开源项目的modbus通信(TCP)

一、Modbus 通信协议 Modbus是一种串行通信协议&#xff0c;是Modicon公司&#xff08;现在的施耐德电气 Schneider Electric&#xff09;于1979年为使用可编程逻辑控制器&#xff08;PLC&#xff09;通信而发表。Modbus已经成为工业领域通信协议的业界标准&#xff08;De fact…

重庆大学软件工程考研,难度如何?

C哥专业提供——计软考研院校选择分析专业课备考指南规划 重大软件专业可谓是最好上岸的985院校&#xff01;重庆大学24考研各大学院复试录取情况已出&#xff0c; 我们先说学硕部分&#xff1a; 招生人数&#xff1a; 重庆大学软件工程学硕近几年计划统招人数都不多&#xf…

【 截稿倒计时 | JPCS独立出版 | 检索快速稳定】第三届能源与动力工程国际学术会议(EPE 2024)

第三届能源与动力工程国际学术会议&#xff08;EPE 2024&#xff09; 2024 3rd International Conference on Energy and Power Engineering 2024年10月18日 线上会议 往届平均会后3个月完成见刊及EI检索&#xff0c;检索快速稳定~ EPE 2023 EI检索 EPE 2023 Scopus检索 …

Git_GitHub

Git_GitHub 创建远程仓库 远程仓库操作 创建远程仓库别名 基本语法 案例实操 推送本地分支到远程仓库 基本语法 案例实操 拉取代码 基本语法 案例实操 克隆远程仓库到本地 基本语法 案例实操 邀请加入团队 选择邀请合作者 填入想要合作的人 复制邀请函 接受邀…