vLLM is a high-throughput inference framework for large language models, open-sourced by the LMSYS group at UC Berkeley. It aims to dramatically improve the throughput and memory efficiency of language-model serving in real-time scenarios. vLLM is a fast and easy-to-use library for LLM inference and serving that integrates seamlessly with HuggingFace models. It introduces a new attention algorithm, PagedAttention, which manages attention keys and values efficiently.
In terms of throughput, vLLM delivers up to 24x the performance of HuggingFace Transformers (HF) and up to 3.5x that of Text Generation Inference (TGI).
Install with Docker
Pull the CUDA image
docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
Create the container
docker run --gpus=all -it --name vllm -p 8010:8000 -v D:\llm-model:/llm-model nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
Install system dependencies
apt-get update -yq --fix-missing
export DEBIAN_FRONTEND=noninteractive
apt-get install -yq --no-install-recommends pkg-config wget cmake curl git vim
Install Miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init
source ~/.bashrc
Create the conda environment
conda create -n vllm python=3.10
conda activate vllm
Install Python packages
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2 xformers==0.0.23.post1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install requests
pip install gradio==4.14.0
export VLLM_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
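Before moving on, it can be worth checking that the wheel matches the environment and that the container can see the GPU. A minimal check from a Python prompt, assuming the steps above completed successfully:
import torch
import vllm

print(vllm.__version__)           # should report 0.4.0
print(torch.cuda.is_available())  # should be True inside the GPU container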
Online inference (API service)
vLLM can be deployed as an API service, with FastAPI as the web framework. The API service uses the AsyncLLMEngine class to support asynchronous calls.
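If you want to embed the engine in your own service rather than use the bundled entrypoint, the sketch below drives AsyncLLMEngine directly. This is a minimal sketch, assuming vLLM 0.4.x and the Baichuan2 model path used elsewhere in this guide; the helper name complete is hypothetical.
import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the async engine from the same model path used elsewhere in this guide.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="/llm-model/Baichuan2-7B-Chat", trust_remote_code=True)
)

async def complete(prompt: str, request_id: str) -> str:
    params = SamplingParams(temperature=0, max_tokens=64)
    final = None
    # generate() is an async generator that streams partial RequestOutputs;
    # the last item holds the full generation.
    async for request_output in engine.generate(prompt, params, request_id):
        final = request_output
    return final.outputs[0].text

print(asyncio.run(complete("San Francisco is a", "request-0")))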
Start the service
python -m vllm.entrypoints.openai.api_server --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code
# Check GPU status
nvidia-smi
# Specify the GPU and port
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 10086 --model /llm-model/Baichuan2-7B-Chat --served-model-name Baichuan2-7B-Chat --trust-remote-code
How to call the API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Baichuan2-7B-Chat",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Baichuan2-7B-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
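The same OpenAI-compatible endpoints can also be called from Python with the requests package installed earlier. A small sketch against the chat completions route; the URL and payload mirror the curl example above:
import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Baichuan2-7B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    "temperature": 0,
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])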
Offline inference
import torch
from vllm import LLM, SamplingParams

MODEL_PATH = "/llm-model/Baichuan2-7B-Chat"
prompts = ["San Francisco is a"]

# Greedy decoding, up to 100 new tokens per prompt.
sampling_params = SamplingParams(temperature=0, max_tokens=100)

# Load the model; trust_remote_code is required for Baichuan2's custom model code.
llm = LLM(
    model=MODEL_PATH,
    tokenizer_mode="auto",
    trust_remote_code=True,
    enforce_eager=True,
    enable_prefix_caching=True,
)

outputs = llm.generate(prompts, sampling_params=sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)