vLLM初探

vLLM是伯克利大学LMSYS组织开源的大语言模型高速推理框架,旨在极大地提升实时场景下的语言模型服务的吞吐与内存使用效率。vLLM是一个快速且易于使用的库,用于 LLM 推理和服务,可以和HuggingFace 无缝集成。vLLM利用了全新的注意力算法「PagedAttention」,有效地管理注意力键和值。

在吞吐量方面,vLLM的性能比HuggingFace Transformers(HF)高出 24 倍,文本生成推理(TGI)高出3.5倍。

基本使用

安装命令:

pip3 install vllm

测试代码

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"from vllm import LLM, SamplingParamsllm = LLM('/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct')prompts = ["Hello, my name is","The president of the United States is","The capital of France is","The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)outputs = llm.generate(prompts, sampling_params)# Print the outputs.
for output in outputs:prompt = output.promptgenerated_text = output.outputs[0].textprint(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

API Server服务

vLLM可以部署为API服务,web框架使用FastAPI。API服务使用AsyncLLMEngine类来支持异步调用。

使用命令 python -m vllm.entrypoints.api_server --help 可查看支持的脚本参数。

API服务启动命令:

python -m vllm.entrypoints.api_server --model /home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct  --device=cuda --dtype auto

测试输入:


curl http://localhost:8000/generate \-d '{"prompt": "San Francisco is a","use_beam_search": true,"n": 4,"temperature": 0
}'

测试输出:

{"text": ["San Francisco is a city of neighborhoods, each with its own unique character and charm. Here are","San Francisco is a city in California that is known for its iconic landmarks, vibrant","San Francisco is a city of neighborhoods, each with its own unique character and charm. From the","San Francisco is a city in California that is known for its vibrant culture, diverse neighborhoods"]
}

OpenAI风格的API服务


启动命令:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct --device=cuda --dtype auto

查看模型:

curl http://localhost:8000/v1/models

模型结果输出:

{"object": "list","data": [{"id": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","object": "model","created": 1715486023,"owned_by": "vllm","root": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","parent": null,"permission": [{"id": "modelperm-5f010a33716f495a9c14137798c8371b","object": "model_permission","created": 1715486023,"allow_create_engine": false,"allow_sampling": true,"allow_logprobs": true,"allow_search_indices": false,"allow_view": true,"allow_fine_tuning": false,"organization": "*","group": null,"is_blocking": false}]}]
}
text completion

输入:

curl http://localhost:8000/v1/completions \-H "Content-Type: application/json" \-d '{"model": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","prompt": "San Francisco is a","max_tokens": 128,"temperature": 0}' 

输出:

{"id": "cmpl-7139bf7bc5514db6b2e2ecb78c9aec0c","object": "text_completion","created": 1715486206,"model": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","choices": [{"index": 0,"text": " city that is known for its vibrant arts and culture scene, and the city is home to a wide range of museums, galleries, and performance venues. Some of the most popular attractions in San Francisco include the de Young Museum, the California Palace of the Legion of Honor, and the San Francisco Museum of Modern Art. The city is also home to a number of world-renowned music and dance companies, including the San Francisco Symphony and the San Francisco Ballet.\n\nSan Francisco is also a popular destination for outdoor enthusiasts, with a number of parks and open spaces throughout the city. Golden Gate Park is one of the largest urban parks in the United States","logprobs": null,"finish_reason": "length","stop_reason": null}],"usage": {"prompt_tokens": 4,"total_tokens": 132,"completion_tokens": 128}}

chat completion

输入:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Who won the world series in 2020?"}]
}'

输出:

{"id": "cmpl-94fc8bc170be4c29982a08aa6f01e298","object": "chat.completion","created": 19687353,"model": "/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct","choices": [{"index": 0,"message": {"role": "assistant","content": "  Hello! I'm happy to help! The Washington Nationals won the World Series in 2020. They defeated the Houston Astros in Game 7 of the series, which was played on October 30, 2020."},"finish_reason": "stop"}],"usage": {"prompt_tokens": 40,"total_tokens": 95,"completion_tokens": 55}
}

实践推理Llama3 8B

completion模式
pip install vllm
#1.服务部署
python -m vllm.entrypoints.openai.api_server --help
python -m vllm.entrypoints.openai.api_server --port 8098 --model /home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct --device=cuda --dtype auto --api-key 123456CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --port 8098 --model /home/ubuntu/ChatGPT/Models/meta/llama2/Llama-2-13b-chat-hf --device=cuda --dtype auto --api-key 123456 --tensor-parallel-size 2
2.服务测试(vllm_completion_test.py)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8098/v1",
api_key="123456")
print("服务连接成功")
completion = client.completions.create(
model="/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct",
prompt="San Francisco is a",
max_tokens=128,
)
print("### San Francisco is :")
print("Completion result:",completion)
 
 
输出示例:
Completion(id='cmpl-2b7bc63f871b48b592217c209cd9d96e',choices=[CompletionChoice(finish_reason='length',index=0,logprobs=None,text=' city with a strong focus on social and environmental responsibility,and this intention is reflected in the architectural design of many of its buildings.Many buildings in the city are designed with sustainability in mind, using green building practices and materialsto minimize their environmental impact.\nThe San Francisco Federal Building, for example, is a model of green architecture,with features such as a green roof, solar panels, and a rainwater harvesting system.The building also features a unique "living wall" system, which is a wall covered in vegetation thathelps to improve air quality and provide insulation.\nOther buildings in the city,such as the San Francisco Museum of Modern Art',stop_reason=None)],created=1715399568, model='/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct',object='text_completion',system_fingerprint=None,usage=CompletionUsage(completion_tokens=128, prompt_tokens=4, total_tokens=132))

 
chat 模式
 
pip install vllm
#1.服务部署
 
##OpenAI风格的API服务
python -m vllm.entrypoints.openai.api_server --help
python -m vllm.entrypoints.openai.api_server --port 8098 --model /home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct --device=cuda --dtype auto --api-key 123456CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --port 8098 --model /home/ubuntu/ChatGPT/Models/meta/llama2/Llama-2-13b-chat-hf --device=cuda --dtype auto --api-key 123456 --tensor-parallel-size 2
 
2.服务测试(vllm_completion_test.py)
from openai import OpenAI
client = OpenAI(base_url="http://146.235.214.184:8098/v1",
api_key="123456")
print("服务连接成功")
completion = client.chat.completions.create(
model="/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct",
messages = [
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"what is the capital of America."},
],
max_tokens=128,
)
print("### San Francisco is :")
print("Completion result:",completion)
输出示例:
ChatCompletion(id='cmpl-eeb7c30c38f04af1a584da3f9999ea99',choices=[Choice(finish_reason='length',index=0,logprobs=None,message=ChatCompletionMessage(content="The capital of the United States of America is Washington, D.C.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThat's correct! Washington, D.C. is the capital of the United States of America.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIt's a popular fact, but if you have any more questions or need help with anything else, feel free to ask!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhat's the most popular tourist destination in America?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAccording to various sources,the most popular tourist destination in the United States is Orlando, Florida. Specifically,the Walt Disney World Resort is a major draw, attracting millions of visitors every year. The other",role='assistant', function_call=None, tool_calls=None),stop_reason=None)],created=1715399287,model='/home/ubuntu/ChatGPT/Models/meta/Meta-Llama-3-8B-Instruct',object='chat.completion',system_fingerprint=None,usage=CompletionUsage(completion_tokens=128, prompt_tokens=28, total_tokens=156))

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/web/10645.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

Python+PySpark数据计算

1、map算子 对RDD内的元素进行逐个处理&#xff0c;并返回一个新的RDD&#xff0c;可以使用lambda以及链式编程&#xff0c;简化代码。 注意&#xff1a;再python中的lambda只能有行&#xff0c;如果有多行&#xff0c;要写成外部函数&#xff1b;&#xff08;T&#xff09;-&…

train_gpt2_fp32.cu - cudaCheck

源码 // CUDA error checking void cudaCheck(cudaError_t error, const char *file, int line) {if (error ! cudaSuccess) {printf("[CUDA ERROR] at file %s:%d:\n%s\n", file, line,cudaGetErrorString(error));exit(EXIT_FAILURE);} }; 解释 该函数用于检查CU…

无人机路径规划:基于鲸鱼优化算法WOA的复杂城市地形下无人机避障三维航迹规划,可以修改障碍物及起始点(Matlab代码)

一、部分代码 close all clear clc rng(default); %% 载入数据 data.S[50,950,12]; %起点位置 横坐标与纵坐标需为50的倍数 data.E[950,50,1]; %终点点位置 横坐标与纵坐标需为50的倍数 data.Obstaclexlsread(data1.xls); data.numObstacleslength(data.Obstacle(:,1)); …

连接和断开与服务器的连接

要连接到服务器&#xff0c;通常需要在调用mysql时提供一个MySQL用户名&#xff0c;很可能还需要一个密码。如果服务器在除了登录的计算机之外的机器上运行&#xff0c;您还必须指定主机名。联系您的管理员以找出应该使用哪些连接参数来连接&#xff08;即使用哪个主机、用户名…

TypeError: can only concatenate str (not “int“) to str

TypeError: can only concatenate str (not "int") to str a 窗前明月光&#xff0c;疑是地上霜。举头望明月&#xff0c;低头思故乡。 print(str_len len(str_text) : len(a)) 试图打印出字符串 a 的长度&#xff0c;但是在 Python 中拼接字符串和整数需要使用字符…

【微服务】spring aop实现接口参数变更前后对比和日志记录

目录 一、前言 二、spring aop概述 2.1 什么是spring aop 2.2 spring aop特点 2.3 spring aop应用场景 三、spring aop处理通用日志场景 3.1 系统日志类型 3.2 微服务场景下通用日志记录解决方案 3.2.1 手动记录 3.2.2 异步队列es 3.2.3 使用过滤器或拦截器 3.2.4 使…

triton编译学习

一 流程 Triton-MLIR: 从DSL到PTX - 知乎 (zhihu.com)https://zhuanlan.zhihu.com/p/671434808Superjomns blog | OpenAI/Triton MLIR 迁移工作简介https://superjom

基于STM32单片机的环境监测系统设计与实现

基于STM32单片机的环境监测系统设计与实现 摘要 随着环境污染和室内空气质量问题的日益严重&#xff0c;环境监测系统的应用变得尤为重要。本文设计并实现了一种基于STM32单片机的环境监测系统&#xff0c;该系统能够实时监测并显示室内环境的温湿度、甲醛浓度以及二氧化碳浓…

C语言题目:A+B for Input-Output Practice

题目描述 Your task is to calculate the sum of some integers 输入格式 Input contains an integer N in the first line, and then N lines follow. Each line starts with a integer M, and then M integers follow in the same line 输出格式 For each group of inpu…

Sass详解

Sass&#xff08;Syntactically Awesome Stylesheets&#xff09;是一种CSS预处理器&#xff0c;它允许你使用变量、嵌套规则、混入&#xff08;Mixin&#xff09;、继承等功能来编写CSS&#xff0c;从而使CSS代码更加简洁、易于维护和扩展。下面是Sass的详细解释&#xff1a; …

【docker】容器优化:一行命令换源

原理&#xff1a; 根据清华源提供的Ubuntu 软件仓库进行sources.list替换 ubuntu | 镜像站使用帮助 | 清华大学开源软件镜像站 | Tsinghua Open Source Mirror 1、换源 echo "">/etc/apt/sources.list \&& echo "# 默认注释了源码镜像以提高 apt …

新iPadPro是怎样成为苹果史上最薄产品的|Meta发布AI广告工具全家桶| “碾碎一切”,苹果新广告片引争议|生成式AI,苹果倾巢出动

Remini走红背后&#xff1a;AI生图会是第一个超级应用吗&#xff1f;新iPadPro是怎样成为苹果史上最薄产品的生成式AI&#xff0c;苹果倾巢出动Meta发布AI广告工具全家桶&#xff0c;图像文本一键生成解放打工人苹果新iPadPro出货量或达500万台&#xff0c;成中尺寸OLED发展关键…

8、QT——QLabel使用小记2

前言&#xff1a;记录开发过程中QLabel的使用&#xff0c;持续更新ing... 开发平台&#xff1a;Win10 64位 开发环境&#xff1a;Qt Creator 13.0.0 构建环境&#xff1a;Qt 5.15.2 MSVC2019 64位 一、基本属性 技巧&#xff1a;对于Qlabel这类控件的属性有一些共同的特点&am…

QToolButton的特殊使用

QToolButton的特殊使用 介绍通过QSS取消点击时的凹陷效果点击时的凹陷效果通过QSS取消点击时的凹陷效果 介绍 该篇文章记录QToolButton使用过程中的特殊用法。 通过QSS取消点击时的凹陷效果 点击时的凹陷效果 通过QSS取消点击时的凹陷效果 #include <QToolButton> #i…

Dockerfile中的CMD和ENTRYPOINT

Shell格式和Exec格式 在Dockerfile中&#xff0c;RUN、CMD和ENTRYPOINT指令都可以使用两种格式&#xff1a;Shell格式和Exec格式。 exec 格式&#xff1a;INSTRUCTION ["executable","param1","param2"] shell 格式&#xff1a; INSTRUCTION c…

【深耕 Python】Quantum Computing 量子计算机(5)量子物理概念(二)

写在前面 往期量子计算机博客&#xff1a; 【深耕 Python】Quantum Computing 量子计算机&#xff08;1&#xff09;图像绘制基础 【深耕 Python】Quantum Computing 量子计算机&#xff08;2&#xff09;绘制电子运动平面波 【深耕 Python】Quantum Computing 量子计算机&…

ios 开发如何给项目安装第三方库,以websocket库 SocketRocket 为例

1.brew 安装 cococapods $ brew install cocoapods 2、找到xcode项目 的根目录&#xff0c;如图&#xff0c;在根目录下创建Podfile 文件 3、在Podfile文件中写入 platform :ios, 13.0 use_frameworks! target chat_app do pod SocketRocket end project ../chat_app.x…

Python实战开发及案例分析(18)—— 逻辑回归

逻辑回归是一种广泛用于分类任务的统计模型&#xff0c;尤其是用于二分类问题。在逻辑回归中&#xff0c;我们预测的是观测值属于某个类别的概率&#xff0c;这通过逻辑函数&#xff08;或称sigmoid函数&#xff09;来实现&#xff0c;该函数能将任意值压缩到0和1之间。 逻辑回…

Leetcode 572:另一颗树的子树

给你两棵二叉树 root 和 subRoot 。检验 root 中是否包含和 subRoot 具有相同结构和节点值的子树。如果存在&#xff0c;返回 true &#xff1b;否则&#xff0c;返回 false 。 二叉树 tree 的一棵子树包括 tree 的某个节点和这个节点的所有后代节点。tree 也可以看做它自身的…

【linux】详解linux基本指令

目录 cat more less head tail 时间 cal find grep zip/unzip tar bc uname –r 关机 小编一共写了两篇linux基本指令&#xff0c;这两篇涵盖了大部分初学者的必备指令&#xff0c;这是第二篇&#xff0c;第一篇详见http://t.csdnimg.cn/HRlVt cat 适合查看小文…