Table of Contents
- Deployment Prerequisites
- Download the vLLM Docker Image
- Download Models
- Qwen2.5-72B-Instruct-GPTQ-Int4
- Startup Command
- API docs: http://localhost:8001/docs
- Issue Log
- Llama-3.2-11B-Vision-Instruct
- Startup Command
- API docs: http://localhost:8001/docs
- Issue Log
- Qwen2-Audio-7B-Instruct
- Startup Command
- Issue Log
Deployment Prerequisites
Download the vLLM Docker Image
vLLM provides an official Docker image for deployment. It runs an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
# https://hub.docker.com/r/vllm/vllm-openai
docker pull vllm/vllm-openai
# The test machine in mainland China cannot pull this directly (none of the domestic registry mirrors we tried worked).
# Workaround: pull the image on an overseas machine, push it to a private registry reachable from China, then pull from that registry:
docker pull swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai:latest
Download Models
# Overseas machine
huggingface-cli login --token ****   # after logging in to HF, the token is under Settings -> Access Tokens (top-right menu)
huggingface-cli download --resume-download meta-llama/Llama-3.1-70B-Instruct --local-dir /data/meta-llama/Llama-3.1-70B-Instruct

# Test machine in mainland China
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
apt-get update
apt-get install -y aria2
aria2c --version
apt-get install git-lfs
./hfd.sh meta-llama/Llama-3.2-11B-Vision-Instruct --hf_username Dong-Hua --hf_token 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 aria2c -x 4
./hfd.sh hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 --hf_username bifeng --hf_token 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 aria2c -x 4

# Resume an interrupted download
aria2c --header='Authorization: Bearer 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' --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c 'https://hf-mirror.com/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/resolve/main/model-00004-of-00009.safetensors' -d '.' -o 'model-00004-of-00009.safetensors'
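As an alternative to hfd.sh, the snippet below is a minimal sketch that downloads a snapshot with the huggingface_hub Python package through the same mirror; the repo id and local path are placeholders based on the commands above, and the token is assumed to be in the HF_TOKEN environment variable.

# Minimal sketch: download a model snapshot via huggingface_hub instead of hfd.sh.
# Assumptions: huggingface_hub is installed (pip install huggingface_hub),
# HF_TOKEN holds a valid token, and hf-mirror.com is reachable from the test machine.
import os

# Point the hub client at the mirror before importing huggingface_hub.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    local_dir="/data1/data_vllm/Llama-3.2-11B-Vision-Instruct",  # hypothetical target path
    token=os.environ.get("HF_TOKEN"),
    max_workers=4,  # parallel file downloads, roughly analogous to aria2c -x 4
)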
Qwen2.5-72B-Instruct-GPTQ-Int4
Qwen's latest 72B model, with good Chinese support, quantized with GPTQ Int4; at half precision (or even lower precision) the unquantized model cannot be deployed on a single GPU.
Startup Command
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    --max-model-len 102400
# docker inspect qwen_llm_3
# python3 -m vllm.entrypoints.openai.api_server --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
API docs: http://localhost:8001/docs
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/Qwen2.5-72B-Instruct-GPTQ-Int4",
        "messages": [
            {"role": "system", "content": "你是一个喜剧人"},
            {"role": "user", "content": "给我讲个短笑话"}
        ],
        "max_tokens": 1024,
        "stop": "<|eot_id|>",
        "temperature": 0.7,
        "top_p": 1,
        "top_k": -1
    }'
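The same endpoint can also be called from Python with the openai client, since vLLM exposes an OpenAI-compatible API. A minimal sketch, assuming pip install openai and the container started as above; the api_key value is arbitrary because the server was launched without --api-key.

# Minimal sketch: call the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",  # any string works when the server is started without --api-key
)

resp = client.chat.completions.create(
    model="/data/Qwen2.5-72B-Instruct-GPTQ-Int4",  # must match the --model path
    messages=[
        {"role": "system", "content": "你是一个喜剧人"},
        {"role": "user", "content": "给我讲个短笑话"},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(resp.choices[0].message.content)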
Issue Log
- Setting a 128k context for Qwen: https://qwen.readthedocs.io/en/latest/deployment/vllm.html#extended-context-support
Note that simply passing --max-model-len 131072 at startup does not work; you also need to add the following to the model's config file (a small script to patch it is sketched below):
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}
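A minimal sketch for patching config.json with that rope_scaling block from Python, assuming the model directory mounted into the container above; keep a backup, since vLLM reads this file at startup.

# Minimal sketch: add the YaRN rope_scaling block to the model's config.json.
# Assumption: the model lives at the host path mounted into the container above.
import json, shutil

cfg_path = "/data1/data_vllm/Qwen2.5-72B-Instruct-GPTQ-Int4/config.json"
shutil.copy(cfg_path, cfg_path + ".bak")  # keep a backup of the original config

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)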
- After enabling 128k, starting with the following command fails with insufficient KV cache space:
# Startup arguments
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
As these arguments show, GPU memory utilization (--gpu-memory-utilization) defaults to 90%, and since --max-model-len is not set it defaults to the 128k read from the model config. Going by the KV cache sizing formula (see the vLLM performance benchmark article in this series), when the KV cache is too small you can raise GPU memory utilization, lower the maximum context length, quantize the KV cache, or add more GPUs. For example:
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    --max-model-len 102400 \
    --kv-cache-dtype fp8_e4m3 \
    --gpu-memory-utilization 0.95
# --max-model-len 102400        : cap the context length at 102,400 tokens
# --kv-cache-dtype fp8_e4m3     : quantize the KV cache to fp8
# --gpu-memory-utilization 0.95 : raise GPU memory utilization to 95%
PS: the test environment is a single GPU with 80 GB of VRAM, and the Qwen2.5-72B-Instruct-GPTQ-Int4 weights alone take 38.5492 GB; a rough KV cache budget estimate is sketched below.
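To see why 131,072 tokens does not fit, the following back-of-the-envelope sketch estimates the KV cache budget. The layer and head counts are assumed from Qwen2.5-72B's config.json (80 layers, 8 KV heads, head dim 128); the estimate ignores activation memory, so it comes out slightly above the 103,136 tokens the engine reports.

# Rough sketch: estimate how many tokens of fp16 KV cache fit after the weights.
# Model shape values are assumed from Qwen2.5-72B's config.json: 80 layers,
# 8 KV heads (GQA), head_dim 128. Numbers are approximate.
GIB = 1024**3

gpu_mem = 80 * GIB
gpu_mem_util = 0.90          # vLLM default --gpu-memory-utilization
weights = 38.5492 * GIB      # "Loading model weights took 38.5492 GB" in the log below

layers, kv_heads, head_dim = 80, 8, 128
dtype_bytes = 2              # fp16 KV entries (kv_cache_dtype=auto)

# 2x for the K and V tensors of every layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # 327,680 bytes

kv_budget = gpu_mem * gpu_mem_util - weights   # ignores activation/profiling overhead
max_tokens = kv_budget / kv_bytes_per_token

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, ~{max_tokens:,.0f} tokens fit")
# ~320 KiB per token and roughly 110k tokens -- in the same ballpark as the
# 103,136-token limit (6446 GPU blocks * 16 tokens/block) reported in the error,
# and well short of the requested 131,072.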
Log:
(base) [root@iv-ycl6gxrcwka8j6ujk4bc data_vllm]# docker run --runtime nvidia --gpus all \
> -v /data1/data_vllm:/data \
> -p 8001:8000 \
> --name qwen_llm_3 \
> --ipc=host \
> swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
> --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
INFO 10-11 07:16:12 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:16:12 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 10-11 07:16:12 gptq_marlin.py:87] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 10-11 07:16:12 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-11 07:16:12 config.py:806] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-11 07:16:12 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/Qwen2.5-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 10-11 07:16:13 model_runner.py:680] Starting to load model /data/Qwen2.5-72B-Instruct-GPTQ-Int4...
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:01<00:12, 1.21s/it]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:02<00:11, 1.29s/it]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:03<00:10, 1.35s/it]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:05<00:09, 1.40s/it]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:06<00:08, 1.38s/it]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:08<00:06, 1.36s/it]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:09<00:05, 1.36s/it]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:10<00:04, 1.35s/it]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:12<00:02, 1.37s/it]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:13<00:01, 1.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00, 1.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00, 1.31s/it]
INFO 10-11 07:16:28 model_runner.py:692] Loading model weights took 38.5492 GB
INFO 10-11 07:16:29 gpu_executor.py:102] # GPU blocks: 6446, # CPU blocks: 819
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]: run_server(args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in initialize_cache
[rank0]: self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
[rank0]: raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (103136). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Llama-3.2-11B-Vision-Instruct
Llama 3.2's latest vision-capable model; it does not support audio input.
Startup Command
Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16 \
    --enforce-eager
API docs: http://localhost:8001/docs
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/Llama-3.2-11B-Vision-Instruct",
        "messages": [
            {"role": "system", "content": "你是一个喜剧人"},
            {"role": "user", "content": "给我讲个短笑话"}
        ],
        "max_tokens": 1024,
        "stop": "<|eot_id|>",
        "temperature": 0.7,
        "top_p": 1,
        "top_k": -1
    }'
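Since this is a vision model, a more representative request also includes an image. Below is a minimal sketch with the openai client, assuming the server accepts OpenAI-style image_url content parts for this model; the image URL is a placeholder.

# Minimal sketch: send an image plus a question to the vision model served above.
# Assumes the vLLM OpenAI-compatible server accepts image_url content parts
# for Llama-3.2-11B-Vision-Instruct; the image URL below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/data/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(resp.choices[0].message.content)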
Issue Log
- Transformers version is too old; upgrade to the latest (pulling the latest vLLM image is enough). A quick check is sketched after the log below.
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Llama-3.2-11B-Vision-Instruct
Log:
INFO 10-11 07:52:36 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:52:36 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 989, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 691, in __getitem__
    raise KeyError(key)
KeyError: 'mllama'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
    run_server(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
    model_config = ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 152, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 59, in get_config
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 44, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 991, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `mllama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
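A quick way to confirm the cause is an outdated Transformers is to check, inside the container, whether the installed Transformers registers the mllama model type that Llama-3.2-Vision checkpoints declare; a minimal diagnostic sketch:

# Minimal diagnostic sketch: does the installed Transformers know the `mllama`
# architecture that Llama-3.2-Vision checkpoints declare in config.json?
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
try:
    CONFIG_MAPPING["mllama"]  # raises KeyError on older Transformers, as in the log above
    print("mllama is registered: this image is new enough for Llama-3.2-Vision")
except KeyError:
    print("mllama is missing: pull a newer vllm/vllm-openai image")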
- OOM error
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct
KV cache accounting for vision models is not the same as for plain language models. Try lowering the batch size (--max_num_seqs, default 256, reduced to 16 here); you can also tune --gpu-memory-utilization (or add GPUs):
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16
Log:
INFO 10-11 00:54:54 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 00:54:54 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 00:54:54 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/6dd3d033-6ec0-4c32-aefe-4665201f0154 for IPC Path.
INFO 10-11 00:54:54 api_server.py:177] Started engine process with PID 26
WARNING 10-11 00:54:54 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 00:54:58 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 00:54:58 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 00:54:59 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 00:54:59 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 00:54:59 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 00:55:00 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:06, 1.72s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:03<00:05, 1.79s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:05<00:03, 1.81s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:07<00:01, 1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.58s/it]
INFO 10-11 00:55:08 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 00:55:08 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__self._initialize_kv_caches()File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_cachesself.model_executor.determine_num_available_blocks())^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocksreturn self.driver_worker.determine_num_available_blocks()^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocksself.model_runner.profile_run()File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_runself.execute_model(model_input, kv_caches, intermediate_tensors)File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_modelhidden_or_intermediate_states = model_executable(^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forwardcross_attention_states = self.vision_model(pixel_values,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 556, in forwardoutput = self.transformer(^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in 
_call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 430, in forwardhidden_states = encoder_layer(^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 398, in forwardhidden_state = self.mlp(hidden_state)^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/clip.py", line 278, in forwardhidden_states, _ = self.fc1(hidden_states)^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 367, in forwardoutput_parallel = self.quant_method.apply(self, input_, bias)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in applyreturn F.linear(x, layer.weight, bias)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.70 GiB. GPU 0 has a total capacity of 79.35 GiB of which 334.94 MiB is free. Process 2468 has 79.02 GiB memory in use. Of the allocated memory 62.55 GiB is allocated by PyTorch, and 15.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start
- This vision model requires eager-mode PyTorch
--enforce-eager:
- Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
Adding --enforce-eager allows it to start normally:
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16 \
    --enforce-eager
Log:
INFO 10-11 01:01:19 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:01:19 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=16, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:01:19 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2a5bad77-628d-4503-884e-cd690e78f044 for IPC Path.
INFO 10-11 01:01:19 api_server.py:177] Started engine process with PID 26
WARNING 10-11 01:01:19 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 01:01:22 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 01:01:22 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:01:23 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 01:01:23 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 01:01:24 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 01:01:24 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:06, 1.72s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:03<00:05, 1.79s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:05<00:03, 1.80s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:07<00:01, 1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00, 1.58s/it]
INFO 10-11 01:01:32 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 01:01:32 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
INFO 10-11 01:01:48 gpu_executor.py:122] # GPU blocks: 10025, # CPU blocks: 1638
INFO 10-11 01:01:50 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-11 01:01:50 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1724, in captureoutput_hidden_or_intermediate_states = self.model(^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1078, in forwardskip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.During handling of the above exception, another exception occurred:Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__self._initialize_kv_caches()File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_cachesself.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cacheself.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 266, in initialize_cacheself._warm_up_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 282, in _warm_up_modelself.model_runner.capture_model(self.gpu_cache)File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1448, in capture_modelgraph_runner.capture(**capture_inputs)File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in capturewith torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 185, in __exit__self.cuda_graph.capture_end()File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 83, in capture_endsuper().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start
Qwen2-Audio-7B-Instruct
Qwen's audio large model.
Startup Command
Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Qwen2-Audio-7B-Instruct
Issue Log
- 'Qwen2AudioConfig' object has no attribute 'hidden_size'
vLLM does not support this model yet: https://github.com/vllm-project/vllm/issues/8394
Update: as of 2024/10/14, the qwen2-audio support branch is being merged (under review): https://github.com/vllm-project/vllm/pull/9248
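When a newer vLLM build lands, one quick sanity check is to print the architecture names the checkpoint declares and compare them against vLLM's supported-models page (https://docs.vllm.ai/en/stable/models/supported_models.html); a minimal sketch, with the local path assumed from the volume mount above:

# Minimal sketch: print the model type / architecture a checkpoint declares,
# to compare against vLLM's supported-models list.
import json

with open("/data1/data_vllm/Qwen2-Audio-7B-Instruct/config.json") as f:  # assumed path
    cfg = json.load(f)

print(cfg.get("model_type"))      # e.g. "qwen2_audio"
print(cfg.get("architectures"))   # e.g. ["Qwen2AudioForConditionalGeneration"]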
Log:
INFO 10-11 01:20:25 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:20:25 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Qwen2-Audio-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:20:25 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/cf23126e-3f74-4d96-be11-35016eaa9ef4 for IPC Path.
INFO 10-11 01:20:25 api_server.py:177] Started engine process with PID 26
INFO 10-11 01:20:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Qwen2-Audio-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:20:30 model_runner.py:1014] Starting to load model /data/Qwen2-Audio-7B-Instruct...
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__self.model_executor = executor_class(^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__self._init_executor()File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executorself.driver_worker.load_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_modelself.model_runner.load_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1016, in load_modelself.model = get_model(model_config=self.model_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_modelreturn loader.load_model(model_config=model_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 399, in load_modelmodel = _initialize_model(model_config, self.load_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 176, in _initialize_modelreturn build_model(^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 161, in build_modelreturn model_class(config=hf_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 876, in __init__self.transformer = QWenModel(config, cache_config, quant_config)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 564, in __init__config.hidden_size,^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 202, in __getattribute__return super().__getattribute__(key)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Qwen2AudioConfig' object has no attribute 'hidden_size'
Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start
Series articles:
1. Survey of LLM inference framework options
2. TensorRT-LLM & Triton Server deployment notes
3. vLLM inference engine research
4. vLLM inference engine performance benchmarking
5. Issue log: deploying LLMs with vLLM
6. Triton Inference Server architecture