Deploying vLLM, an LLM Inference Acceleration Tool, on Windows 11

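vLLM is an open-source framework for fast LLM inference and serving that originated at UC Berkeley, from a team closely tied to LMSYS. It aims to substantially improve the throughput and memory efficiency of language-model serving in real-time scenarios. It is fast, easy to use, and integrates seamlessly with HuggingFace models. At its core is PagedAttention, an attention algorithm that manages the attention key and value (KV) cache efficiently.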
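To make the idea behind PagedAttention more concrete, here is a toy sketch of paged KV-cache bookkeeping. It is an illustration only, not vLLM's implementation: the class name `ToyBlockManager`, the block size of 4, and the method names are all assumptions made for this example. The point it tries to show is that each sequence's cache is split into fixed-size blocks that need not be contiguous, while a per-sequence block table maps token positions to physical blocks, so memory is handed out on demand and reclaimed when a sequence finishes.

```python
# Toy illustration of paged KV-cache bookkeeping (NOT vLLM's actual code).
# All names here (ToyBlockManager, block_size, ...) are hypothetical.


class ToyBlockManager:
    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        # Pool of free physical block ids.
        self.free_blocks = list(range(num_blocks))
        # seq_id -> list of physical block ids (the "block table").
        self.block_tables = {}
        # seq_id -> number of tokens stored so far.
        self.seq_lens = {}

    def append_token(self, seq_id: int):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # First token, or the last block is full: take a new block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=4, block_size=4)
slots_seq0 = [manager.append_token(seq_id=0) for _ in range(5)]  # spills into a 2nd block
slots_seq1 = [manager.append_token(seq_id=1) for _ in range(2)]
print(manager.block_tables)  # {0: [0, 1], 1: [2]}
manager.free(0)              # sequence 0 is done; its blocks become reusable
print(manager.free_blocks)   # [3, 0, 1]
```

Because blocks are only allocated when a sequence actually grows into them, little memory is reserved for tokens that are never generated, which is the efficiency argument usually given for this design.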

In terms of throughput, vLLM is reported to deliver up to 24x the throughput of HuggingFace Transformers (HF) and up to 3.5x that of Text Generation Inference (TGI).

Installing with Docker

Pull the CUDA image

docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Create the container

docker run --gpus=all -it --name vllm -p 8010:8000 -v D:\llm-model:/llm-model  nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Install system dependencies

apt-get update -yq --fix-missing
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends pkg-config wget cmake curl git vim

Install Miniconda3

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init
source ~/.bashrc

Create the environment

conda create -n vllm python=3.10
conda activate vllm

Install Python dependencies

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2 xformers==0.0.23.post1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install requests
pip install gradio==4.14.0
export VLLM_VERSION=0.4.0
# The cpXY tag must match the environment's Python version (3.10 here, so 310)
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
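The `cpXY` tag in the wheel filename has to match the Python version of the active environment (for example `cp310` for Python 3.10), otherwise pip rejects the wheel as unsupported on this platform. As a small sketch, the URL can be derived from the running interpreter instead of being typed by hand. The helper name `vllm_wheel_url` is hypothetical, and the URL pattern is simply the one used in the command above; whether a wheel exists for a given version and tag combination is not checked here.

```python
import sys
from typing import Optional


def vllm_wheel_url(vllm_version: str,
                   cuda_tag: str = "cu118",
                   python_tag: Optional[str] = None) -> str:
    """Build the release-wheel URL following the pattern used in this article.

    python_tag is the digits of the CPython tag, e.g. "310" for Python 3.10.
    If omitted, it is taken from the interpreter running this script.
    """
    if python_tag is None:
        python_tag = f"{sys.version_info.major}{sys.version_info.minor}"
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{vllm_version}/vllm-{vllm_version}+{cuda_tag}"
        f"-cp{python_tag}-cp{python_tag}-manylinux1_x86_64.whl"
    )


# Print the URL for the interpreter that runs this script.
print(vllm_wheel_url("0.4.0"))
```

Running this inside the activated conda environment prints a URL whose tag matches that environment, which can then be passed to `pip install`.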

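Online serving

vLLM can be deployed as an API service built on the FastAPI web framework. The API server uses the AsyncLLMEngine class to handle requests asynchronously.

Start the server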

python -m vllm.entrypoints.openai.api_server --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code
# Check the available GPUs
nvidia-smi
# Pin a specific GPU and port (use a GPU index that nvidia-smi lists; the port must also be published with -p when the container is created)
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 10086 --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code

Calling the API

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Baichuan2-7B-Chat", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Baichuan2-7B-Chat", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the world series in 2020?"}]}'
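The same chat request can be sent from Python. The following is a minimal sketch using only the standard library; the function names `build_chat_request` and `chat` and the default `max_tokens` value are assumptions made for this example. Note the base URL: `http://localhost:8000` works inside the container, while from the Windows host the `-p 8010:8000` mapping in the `docker run` command means the server is reached at `http://localhost:8010`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       system_message: str = "You are a helpful assistant.",
                       max_tokens: int = 128,
                       temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(chat("http://localhost:8010", "Baichuan2-7B-Chat",
#            "Who won the world series in 2020?"))
```

The `model` value must match the name passed to `--served-model-name` when the server was started.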

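Offline inference

from vllm import LLM, SamplingParams

MODEL_PATH = "/llm-model/Baichuan2-7B-Chat"
prompts = ["San Francisco is a"]

# temperature=0 selects greedy decoding; generate at most 100 new tokens
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(
    model=MODEL_PATH,
    tokenizer_mode="auto",
    trust_remote_code=True,      # Baichuan2 ships custom model code
    enforce_eager=True,          # skip CUDA graph capture
    enable_prefix_caching=True,  # reuse the KV cache for shared prompt prefixes
)

# generate() returns one result object per prompt
outputs = llm.generate(prompts, sampling_params=sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)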
