Notes on Issues When Deploying Large Models with vLLM

Table of contents

  • Pre-deployment work
    • Download the vLLM Docker image
    • Download the models
  • Qwen2.5-72B-Instruct-GPTQ-Int4
    • Launch command
    • API docs: http://localhost:8001/docs
    • Issue log
  • Llama-3.2-11B-Vision-Instruct
    • Launch command
    • API docs: http://localhost:8001/docs
    • Issue log
  • Qwen2-Audio-7B-Instruct
    • Launch command
    • Issue log

Pre-deployment work

Download the vLLM Docker image

vLLM provides an official Docker image for deployment. It runs an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.

# https://hub.docker.com/r/vllm/vllm-openai
docker pull vllm/vllm-openai

# The domestic test machine cannot pull this image directly (several China-based registry mirrors were tried without success).
# Workaround: pull the image on an overseas machine, push it to a self-hosted registry inside China, then pull from that registry.
# Pull from the self-hosted/mirror registry:
docker pull swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai:latest

Download the models

# Overseas machine
huggingface-cli login --token ****   # after logging in to HF, copy the token from Settings -> Access Tokens
huggingface-cli download --resume-download meta-llama/Llama-3.1-70B-Instruct --local-dir /data/meta-llama/Llama-3.1-70B-Instruct

# Domestic test machine (via the hf-mirror proxy)
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
apt-get update
apt-get install -y aria2
aria2c --version
apt-get install git-lfs
./hfd.sh meta-llama/Llama-3.2-11B-Vision-Instruct --hf_username Dong-Hua --hf_token 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 aria2c -x 4
./hfd.sh hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 --hf_username bifeng --hf_token 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 aria2c -x 4

# Resume an interrupted shard download
aria2c --header='Authorization: Bearer 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' --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c 'https://hf-mirror.com/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/resolve/main/model-00004-of-00009.safetensors' -d '.' -o 'model-00004-of-00009.safetensors'
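
The same downloads can also be scripted with the huggingface_hub Python API. A minimal sketch, assuming the hf-mirror endpoint and the example repo/paths used above (replace the repo id, target directory, and token with your own):

import os

# Route Hub traffic through hf-mirror; only needed on machines that cannot reach
# huggingface.co directly, and must be set before importing huggingface_hub.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",  # example gated repo
    local_dir="/data/Llama-3.2-11B-Vision-Instruct",     # example target directory
    token="hf_****",                                     # your HF access token
    max_workers=4,                                       # parallel file downloads
)
# Re-running the call skips files that are already complete, so it doubles as a resume.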

Qwen2.5-72B-Instruct-GPTQ-Int4

Qwen's latest 72B instruct model, with strong Chinese support. The GPTQ Int4 quantized build is used here because the unquantized weights are too large to deploy on a single GPU, even at half precision.
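
A back-of-the-envelope check of why the Int4 build is required on a single 80 GB card (rough numbers, weights only; the parameter count is an approximation):

# Approximate weight footprint of a ~72B-parameter model at different precisions.
params_b = 72.7                  # Qwen2.5-72B parameter count in billions (approx.)
fp16_gb = params_b * 2           # 2 bytes per weight at half precision
int4_gb = params_b * 0.5         # ~0.5 bytes per weight with GPTQ Int4
print(f"fp16 ~{fp16_gb:.0f} GB, int4 ~{int4_gb:.0f} GB")
# ~145 GB vs ~36 GB: only the Int4 build fits an 80 GB GPU, which is consistent with
# the ~38.5 GB weight load reported by vLLM later in this post.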

Launch command

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    --max-model-len 102400

# docker inspect qwen_llm_3
# python3 -m vllm.entrypoints.openai.api_server --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4

API docs: http://localhost:8001/docs

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/Qwen2.5-72B-Instruct-GPTQ-Int4",
        "messages": [
            {"role": "system", "content": "你是一个喜剧人"},
            {"role": "user", "content": "给我讲个短笑话"}
        ],
        "max_tokens": 1024,
        "stop": "<|eot_id|>",
        "temperature": 0.7,
        "top_p": 1,
        "top_k": -1
    }'
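
The same endpoint can also be called with the official openai Python client, since vLLM exposes an OpenAI-compatible API. A minimal sketch matching the docker command above:

from openai import OpenAI

# The api_key is a placeholder; vLLM ignores it unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/data/Qwen2.5-72B-Instruct-GPTQ-Int4",  # served model name = the --model path
    messages=[
        {"role": "system", "content": "你是一个喜剧人"},
        {"role": "user", "content": "给我讲个短笑话"},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(resp.choices[0].message.content)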

Issue log

  1. Setting a 128k context for Qwen: https://qwen.readthedocs.io/en/latest/deployment/vllm.html#extended-context-support

Note that setting --max-model-len 131072 in the launch arguments alone has no effect; the following must be added to the model's config.json (a small patch script is sketched after the snippet):

"rope_scaling": {"factor": 4.0,"original_max_position_embeddings": 32768,"type": "yarn"
}
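
A minimal sketch for patching config.json in place (the model path is the example used throughout this post; back the file up first):

import json
from pathlib import Path

config_path = Path("/data/Qwen2.5-72B-Instruct-GPTQ-Int4/config.json")  # example path
config = json.loads(config_path.read_text())

# Enable YaRN rope scaling so the 32k-trained model can serve ~128k of context.
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
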
  2. With 128k enabled, launching with the command below fails because the KV cache is too small

# Launch arguments
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
With these arguments, GPU memory utilization (--gpu-memory-utilization) defaults to 90%, and the maximum context length (--max-model-len), when unset, is read from the model config, i.e. 128k. Per the KV cache sizing formula (see the vLLM performance benchmark post in this series), when the KV cache runs short you can raise GPU memory utilization, lower the maximum context length, quantize the KV cache, or add GPUs, for example:

# --max-model-len 102400          cap the context length at 102400
# --kv-cache-dtype fp8_e4m3       quantize the KV cache to fp8
# --gpu-memory-utilization 0.95   raise GPU memory utilization to 95%
docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_llm_3 \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    --max-model-len 102400 \
    --kv-cache-dtype fp8_e4m3 \
    --gpu-memory-utilization 0.95

PS: the test environment is a single 80 GB GPU, and the Qwen2.5-72B-Instruct-GPTQ-Int4 weights alone already take 38.5492 GB.
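
As a sanity check, the per-token KV cache cost can be estimated from the model config. A minimal sketch, where the layer/head counts are assumptions taken from Qwen2.5-72B's published config.json and the KV cache is fp16:

# Per-token KV cache = 2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_bytes
num_layers   = 80     # num_hidden_layers (assumed)
num_kv_heads = 8      # num_key_value_heads, GQA (assumed)
head_dim     = 128    # hidden_size / num_attention_heads (assumed)
dtype_bytes  = 2      # fp16 KV cache; 1 with --kv-cache-dtype fp8_e4m3

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes   # ~320 KiB

budget_gib = 80 * 0.9 - 38.5492   # rough: 90% of an 80 GB card minus the reported weight size
max_tokens = budget_gib * 1024**3 / kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, ~{max_tokens:,.0f} tokens of KV cache")
# Roughly 1.1e5 tokens before accounting for activation/profiling overhead -- the same
# order as the 103136 tokens vLLM reports below, and short of the requested 131072.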

Log:

(base) [root@iv-ycl6gxrcwka8j6ujk4bc data_vllm]# docker run --runtime nvidia --gpus all \
>     -v /data1/data_vllm:/data \
>     -p 8001:8000 \
>     --name qwen_llm_3 \
>     --ipc=host \
>     swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
>     --model /data/Qwen2.5-72B-Instruct-GPTQ-Int4
INFO 10-11 07:16:12 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:16:12 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 10-11 07:16:12 gptq_marlin.py:87] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
WARNING 10-11 07:16:12 arg_utils.py:762] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 10-11 07:16:12 config.py:806] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-11 07:16:12 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2.5-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/Qwen2.5-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 10-11 07:16:13 model_runner.py:680] Starting to load model /data/Qwen2.5-72B-Instruct-GPTQ-Int4...
Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   9% Completed | 1/11 [00:01<00:12,  1.21s/it]
Loading safetensors checkpoint shards:  18% Completed | 2/11 [00:02<00:11,  1.29s/it]
Loading safetensors checkpoint shards:  27% Completed | 3/11 [00:03<00:10,  1.35s/it]
Loading safetensors checkpoint shards:  36% Completed | 4/11 [00:05<00:09,  1.40s/it]
Loading safetensors checkpoint shards:  45% Completed | 5/11 [00:06<00:08,  1.38s/it]
Loading safetensors checkpoint shards:  55% Completed | 6/11 [00:08<00:06,  1.36s/it]
Loading safetensors checkpoint shards:  64% Completed | 7/11 [00:09<00:05,  1.36s/it]
Loading safetensors checkpoint shards:  73% Completed | 8/11 [00:10<00:04,  1.35s/it]
Loading safetensors checkpoint shards:  82% Completed | 9/11 [00:12<00:02,  1.37s/it]
Loading safetensors checkpoint shards:  91% Completed | 10/11 [00:13<00:01,  1.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00,  1.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:14<00:00,  1.31s/it]INFO 10-11 07:16:28 model_runner.py:692] Loading model weights took 38.5492 GB
INFO 10-11 07:16:29 gpu_executor.py:102] # GPU blocks: 6446, # CPU blocks: 819
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 105, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (103136). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Llama-3.2-11B-Vision-Instruct

Llama 3.2's newly released vision-capable instruct model; it does not support audio input.

Launch command

Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16 \
    --enforce-eager

API docs: http://localhost:8001/docs

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/Llama-3.2-11B-Vision-Instruct",
        "messages": [
            {"role": "system", "content": "你是一个喜剧人"},
            {"role": "user", "content": "给我讲个短笑话"}
        ],
        "max_tokens": 1024,
        "stop": "<|eot_id|>",
        "temperature": 0.7,
        "top_p": 1,
        "top_k": -1
    }'
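
Since this is a vision model, a typical request also carries an image. The vLLM OpenAI-compatible server accepts image_url content parts for multimodal models; a minimal sketch with the openai client (the image URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/data/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Any URL reachable from the server, or a base64 data: URI, works here.
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(resp.choices[0].message.content)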

Issue log

  1. The Transformers version is too old and must be upgraded to the latest release (pulling the latest vLLM image is enough)

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    swr.cn-east-3.myhuaweicloud.com/kubesre/docker.io/vllm/vllm-openai \
    --model /data/Llama-3.2-11B-Vision-Instruct

Log:

INFO 10-11 07:52:36 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-11 07:52:36 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 989, in from_pretrainedconfig_class = CONFIG_MAPPING[config_dict["model_type"]]File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 691, in __getitem__raise KeyError(key)
KeyError: 'mllama'During handling of the above exception, another exception occurred:Traceback (most recent call last):File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_mainreturn _run_code(code, main_globals, None,File "/usr/lib/python3.10/runpy.py", line 86, in _run_codeexec(code, run_globals)File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>run_server(args)File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_serverif llm_engine is not None else AsyncLLMEngine.from_engine_args(File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_argsengine_config = engine_args.create_engine_config()File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 699, in create_engine_configmodel_config = ModelConfig(File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 152, in __init__self.hf_config = get_config(self.model, trust_remote_code, revision,File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 59, in get_configraise eFile "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 44, in get_configconfig = AutoConfig.from_pretrained(File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 991, in from_pretrainedraise ValueError(
ValueError: The checkpoint you are trying to load has model type `mllama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
  2. OOM exception

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct

KV cache accounting for a vision model differs from a plain language model. Try lowering the batch size (--max_num_seqs, default 256, reduced to 16 here); tuning --gpu-memory-utilization or adding GPUs also helps:

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16

Log:

INFO 10-11 00:54:54 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 00:54:54 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 00:54:54 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/6dd3d033-6ec0-4c32-aefe-4665201f0154 for IPC Path.
INFO 10-11 00:54:54 api_server.py:177] Started engine process with PID 26
WARNING 10-11 00:54:54 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 00:54:58 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 00:54:58 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 00:54:59 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 00:54:59 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 00:54:59 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 00:55:00 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:06,  1.72s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:03<00:05,  1.79s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:05<00:03,  1.81s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:07<00:01,  1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00,  1.58s/it]INFO 10-11 00:55:08 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 00:55:08 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__self._initialize_kv_caches()File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_cachesself.model_executor.determine_num_available_blocks())^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocksreturn self.driver_worker.determine_num_available_blocks()^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocksself.model_runner.profile_run()File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_runself.execute_model(model_input, kv_caches, intermediate_tensors)File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_modelhidden_or_intermediate_states = model_executable(^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forwardcross_attention_states = self.vision_model(pixel_values,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 556, in forwardoutput = self.transformer(^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in 
_call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 430, in forwardhidden_states = encoder_layer(^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 398, in forwardhidden_state = self.mlp(hidden_state)^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/clip.py", line 278, in forwardhidden_states, _ = self.fc1(hidden_states)^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 367, in forwardoutput_parallel = self.quant_method.apply(self, input_, bias)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in applyreturn F.linear(x, layer.weight, bias)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.70 GiB. GPU 0 has a total capacity of 79.35 GiB of which 334.94 MiB is free. Process 2468 has 79.02 GiB memory in use. Of the allocated memory 62.55 GiB is allocated by PyTorch, and 15.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start
  3. This vision model requires eager-mode PyTorch
--enforce-eager:
- Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
Adding --enforce-eager lets the server start normally:

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name llama_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Llama-3.2-11B-Vision-Instruct \
    --max_num_seqs 16 \
    --enforce-eager

Log:

INFO 10-11 01:01:19 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:01:19 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=16, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:01:19 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/2a5bad77-628d-4503-884e-cd690e78f044 for IPC Path.
INFO 10-11 01:01:19 api_server.py:177] Started engine process with PID 26
WARNING 10-11 01:01:19 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 10-11 01:01:22 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 10-11 01:01:22 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/data/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:01:23 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-11 01:01:23 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-11 01:01:24 model_runner.py:1014] Starting to load model /data/Llama-3.2-11B-Vision-Instruct...
INFO 10-11 01:01:24 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:06,  1.72s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:03<00:05,  1.79s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:05<00:03,  1.80s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:07<00:01,  1.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00,  1.58s/it]INFO 10-11 01:01:32 model_runner.py:1025] Loading model weights took 19.9073 GB
INFO 10-11 01:01:32 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
WARNING 10-11 01:01:32 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
INFO 10-11 01:01:48 gpu_executor.py:122] # GPU blocks: 10025, # CPU blocks: 1638
INFO 10-11 01:01:50 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-11 01:01:50 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1724, in captureoutput_hidden_or_intermediate_states = self.model(^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_implreturn self._call_impl(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_implreturn forward_call(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1078, in forwardskip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.During handling of the above exception, another exception occurred:Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__self._initialize_kv_caches()File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_cachesself.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cacheself.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 266, in initialize_cacheself._warm_up_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 282, in _warm_up_modelself.model_runner.capture_model(self.gpu_cache)File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_contextreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1448, in capture_modelgraph_runner.capture(**capture_inputs)File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1723, in capturewith torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 185, in __exit__self.cuda_graph.capture_end()File "/usr/local/lib/python3.12/dist-packages/torch/cuda/graphs.py", line 83, in capture_endsuper().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start

Qwen2-Audio-7B-Instruct

Qwen's audio large model.

Launch command

Engine arguments: https://docs.vllm.ai/en/stable/models/engine_args.html

docker run --runtime nvidia --gpus all \
    -v /data1/data_vllm:/data \
    -p 8001:8000 \
    --name qwen_audio \
    --ipc=host \
    crpi-esihkuc4dzkvkjot.cn-hangzhou.personal.cr.aliyuncs.com/huadong_vllm/hd \
    --model /data/Qwen2-Audio-7B-Instruct

Issue log

  1. 'Qwen2AudioConfig' object has no attribute 'hidden_size'

vLLM does not support this model yet: https://github.com/vllm-project/vllm/issues/8394
As of 2024/10/14, the qwen2-audio support branch is still being merged (in review): https://github.com/vllm-project/vllm/pull/9248
Until it lands, the model can be run with the transformers library instead; see the sketch below.
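
A minimal transformers-based sketch for testing the model while vLLM support is pending. It assumes a recent transformers release that ships Qwen2AudioForConditionalGeneration and follows the model card's usage; the audio file path is a placeholder:

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_dir = "/data/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_dir)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_dir, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},          # placeholder audio clip
        {"type": "text", "text": "What can you hear in this clip?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the referenced audio at the sampling rate the feature extractor expects.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
generated = generated[:, inputs.input_ids.size(1):]            # strip the prompt tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])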

Log:

INFO 10-11 01:20:25 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-11 01:20:25 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/Qwen2-Audio-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-11 01:20:25 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/cf23126e-3f74-4d96-be11-35016eaa9ef4 for IPC Path.
INFO 10-11 01:20:25 api_server.py:177] Started engine process with PID 26
INFO 10-11 01:20:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='/data/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/Qwen2-Audio-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-11 01:20:30 model_runner.py:1014] Starting to load model /data/Qwen2-Audio-7B-Instruct...
Process SpawnProcess-1:
Traceback (most recent call last):File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrapself.run()File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in runself._target(*self._args, **self._kwargs)File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engineengine = MQLLMEngine.from_engine_args(engine_args=engine_args,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_argsreturn cls(^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__self.engine = LLMEngine(*args,^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__self.model_executor = executor_class(^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__self._init_executor()File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executorself.driver_worker.load_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_modelself.model_runner.load_model()File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1016, in load_modelself.model = get_model(model_config=self.model_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_modelreturn loader.load_model(model_config=model_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 399, in load_modelmodel = _initialize_model(model_config, self.load_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 176, in _initialize_modelreturn build_model(^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 161, in build_modelreturn model_class(config=hf_config,^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 876, in __init__self.transformer = QWenModel(config, cache_config, quant_config)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen.py", line 564, in __init__config.hidden_size,^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 202, in __getattribute__return super().__getattribute__(key)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Qwen2AudioConfig' object has no attribute 'hidden_size'
Traceback (most recent call last):File "<frozen runpy>", line 198, in _run_module_as_mainFile "<frozen runpy>", line 88, in _run_codeFile "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>uvloop.run(run_server(args))File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in runreturn __asyncio.run(^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 194, in runreturn runner.run(main)^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/asyncio/runners.py", line 118, in runreturn self._loop.run_until_complete(task)^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_completeFile "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapperreturn await main^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_serverasync with build_async_engine_client(args) as engine_client:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_clientasync with build_async_engine_client_from_engine_args(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__return await anext(self.gen)^^^^^^^^^^^^^^^^^^^^^File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_argsraise RuntimeError(
RuntimeError: Engine process failed to start

Other posts in this series:

1. Survey of large-model inference framework selection
2. TensorRT-LLM & Triton Server deployment notes
3. vLLM large-model inference engine survey
4. vLLM inference engine performance benchmark
5. Notes on issues when deploying large models with vLLM
6. Triton Inference Server architecture and principles
