使用llama.cpp实现LLM大模型的格式转换、量化、推理、部署

使用llama.cpp实现LLM大模型的量化、推理、部署

  • 大模型的格式转换、量化、推理、部署
    • 概述
    • 克隆和编译
    • 环境准备
    • 模型格式转换
      • GGUF格式
      • bin格式
    • 模型量化
    • 模型加载与推理
    • 模型API服务
    • 模型API服务(第三方)
    • GPU推理

大模型的格式转换、量化、推理、部署

概述

llama.cpp的主要目标是能够在各种硬件上实现LLM推理,只需最少的设置,并提供最先进的性能。提供1.5位、2位、3位、4位、5位、6位和8位整数量化,以加快推理速度并减少内存使用。

GitHub:https://github.com/ggerganov/llama.cpp

克隆和编译

克隆最新版llama.cpp仓库代码

git clone https://github.com/ggerganov/llama.cpp

对llama.cpp项目进行编译,在目录下会生成一系列可执行文件

main:使用模型进行推理quantize:量化模型server:提供模型API服务

1.编译构建CPU执行环境,安装简单,适用于没有GPU的操作系统

cd llama.cppmkdir 

2.编译构建GPU执行环境,确保安装CUDA工具包,适用于有GPU的操作系统

如果CUDA设置正确,那么执行nvidia-sminvcc --version没有错误提示,则表示一切设置正确。

make clean &&  make LLAMA_CUDA=1

3.如果编译失败或者需要重新编译,可尝试清理并重新编译,直至编译成功

make clean

环境准备

1.下载受支持的模型

要使用llamma.cpp,首先需要准备它支持的模型。在官方文档中给出了说明,这里仅仅截取其中一部分

在这里插入图片描述
2.安装依赖

llama.cpp项目下带有requirements.txt 文件,直接安装依赖即可。

pip install -r requirements.txt

模型格式转换

根据模型架构,可以使用convert.pyconvert-hf-to-gguf.py文件。

转换脚本读取模型配置、分词器、张量名称+数据,并将它们转换为GGUF元数据和张量。

GGUF格式

Llama-3相比其前两代显著扩充了词表大小,由32K扩充至128K,并且改为BPE词表。因此需要使用--vocab-type参数指定分词算法,默认值是spm,如果是bpe,需要显示指定

注意:

官方文档说convert.py不支持LLaMA 3,喊使用convert-hf-to-gguf.py,但它不支持--vocab-type,且出现异常:error: unrecognized arguments: --vocab-type bpe,因此使用convert.py且没出问题

使用llama.cpp项目中的convert.py脚本转换模型为GGUF格式

root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.gguf
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   1
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   2
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   2
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   2
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   2
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   3
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   3
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   3

转换为FP16的GGUF格式,模型体积大概15G。

root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf

bin格式

root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.bin
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.bin, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   4
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   4
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   4
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   5
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   5
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   5
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf
-rw-r--r--  1 root root  15G May 17 08:02 Llama3-FP16.bin

模型量化

模型量化使用quantize命令,其具体可用参数与允许量化的类型如下:

root@master:~/work/llama.cpp# ./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing--pure: Disable k-quant mixtures and quantize all tensors to the same type--imatrix file_name: use data in file_name as importance matrix for quant optimizations--include-weights tensor_name: use importance matrix for this/these tensor(s)--exclude-weights tensor_name: use importance matrix for this/these tensor(s)--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor--keep-split: will generate quatized model in the same shards as input  --override-kv KEY=TYPE:VALUEAdvanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used togetherAllowed quantization types:2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B19  or  IQ2_XXS :  2.06 bpw quantization20  or  IQ2_XS  :  2.31 bpw quantization28  or  IQ2_S   :  2.5  bpw quantization29  or  IQ2_M   :  2.7  bpw quantization24  or  IQ1_S   :  1.56 bpw quantization31  or  IQ1_M   :  1.75 bpw quantization10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B23  or  IQ3_XXS :  3.06 bpw quantization26  or  IQ3_S   :  3.44 bpw quantization27  or  IQ3_M   :  3.66 bpw quantization mix12  or  Q3_K    : alias for Q3_K_M22  or  IQ3_XS  :  3.3 bpw quantization11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B25  or  IQ4_NL  :  4.50 bpw non-linear quantization30  or  IQ4_XS  :  4.25 bpw non-linear quantization15  or  Q4_K    : alias for Q4_K_M14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B17  or  Q5_K    : alias for Q5_K_M16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B0  or  F32     : 26.00G              @ 7BCOPY    : only copy tensors, no quantizing

使用quantize量化模型,它提供各种量化位数的模型:Q2、Q3、Q4、Q5、Q6、Q8、F16。

量化模型的命名方法遵循: Q + 量化比特位 + 变种。量化位数越少,对硬件资源的要求越低,但是模型的精度也越低。

模型经过量化之后,可以发现模型的大小从15G降低到8G,但模型精度从16位浮点数降低到8位整数。

root@master:~/work/llama.cpp# ./quantize ./models/Llama3-FP16.gguf  ./models/Llama3-q8.gguf q8_0
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/root/work/models/Llama3-FP16.gguf' to '/root/work/models/Llama3-q8.gguf' as Q8_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/work/models/Llama3-FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q8_0 .. size =  1002.00 MiB ->   532.31 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
root@master:~/work/llama.cpp# ll -h models/
-rw-r--r--  1 root root 8.0G May 17 07:54 Llama3-q8.gguf

模型加载与推理

模型加载与推理使用main命令,其支持如下可用参数:

root@master:~/work/llama.cpp# ./main -husage: ./main [options]options:-h, --help            show this help message and exit--version             show version and build info-i, --interactive     run in interactive mode--interactive-specials allow special tokens in user text, in interactive mode--interactive-first   run in interactive mode and wait for input right away-cnv, --conversation  run in conversation mode (does not print special tokens and suffix/prefix)-ins, --instruct      run in instruction mode (use with Alpaca models)-cml, --chatml        run in chatml mode (use with ChatML-compatible models)--multiline-input     allows you to write or paste multiple lines without ending each in '\'-r PROMPT, --reverse-prompt PROMPThalt generation at PROMPT, return control in interactive mode(can be specified more than once for multiple prompts).--color               colorise output to distinguish prompt and user input from generations-s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)-t N, --threads N     number of threads to use during generation (default: 30)-tb N, --threads-batch Nnumber of threads to use during batch and prompt processing (default: same as --threads)-td N, --threads-draft N                        number of threads to use during generation (default: same as --threads)-tbd N, --threads-batch-draft Nnumber of threads to use during batch and prompt processing (default: same as --threads-draft)-p PROMPT, --prompt PROMPTprompt to start generation with (default: empty)

可以加载预训练模型或者经过量化之后的模型,这里选择加载量化后的模型进行推理。

在llama.cpp项目的根目录,执行如下命令,加载模型进行推理。

root@master:~/work/llama.cpp# ./main -m models/Llama3-q8.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1715935175
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128== Running in interactive mode. ==- Press Ctrl+C to interject at any time.- Press Return to return control to LLaMa.- To return control without starting a new line, end your input with '/'.- If you want to submit another line, end your input with '\'.<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.
> hi
Hello! How can I help you today?<|eot_id|>>

在提示符>之后输入prompt,使用ctrl+c中断输出,多行信息以\作为行尾。执行./main -h命令查看帮助和参数说明,以下是一些常用的参数:
`

命令描述
-m指定 LLaMA 模型文件的路径
-mu指定远程 http url 来下载文件
-i以交互模式运行程序,允许直接提供输入并接收实时响应。
-ins以指令模式运行程序,这在处理羊驼模型时特别有用。
-f指定prompt模板,alpaca模型请加载prompts/alpaca.txt
-n控制回复生成的最大长度(默认:128)
-c设置提示上下文的大小,值越大越能参考更长的对话历史(默认:512)
-b控制batch size(默认:8),可适当增加
-t控制线程数量(默认:4),可适当增加
--repeat_penalty控制生成回复中对重复文本的惩罚力度
--temp温度系数,值越低回复的随机性越小,反之越大
--top_p, top_k控制解码采样的相关参数
--color区分用户输入和生成的文本

模型API服务

llama.cpp提供了完全与OpenAI API兼容的API接口,使用经过编译生成的server可执行文件启动API服务。

root@master:~/work/llama.cpp# ./server -m models/Llama3-q8.gguf --host 0.0.0.0 --port 8000
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2942,"msg":"build info","build":2908,"commit":"359cbe3f"}
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2947,"msg":"system info","n_threads":30,"n_threads_batch":-1,"total_threads":30,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336

启动API服务后,可以使用curl命令进行测试

curl --request POST \--url http://localhost:8000/completion \--header "Content-Type: application/json" \--data '{"prompt": "Hi"}'

模型API服务(第三方)

在llamm.cpp项目中有提到各种语言编写的第三方工具包,可以使用这些工具包提供API服务,这里以Python为例,使用llama-cpp-python提供API服务。

安装依赖

pip install llama-cpp-pythonpip install llama-cpp-python -i https://mirrors.aliyun.com/pypi/simple/

注意:可能还需要安装以下缺失依赖,可根据启动时的异常提示分别安装。

pip install sse_starlette starlette_context pydantic_settings

启动API服务,默认运行在http://localhost:8000

python -m llama_cpp.server --model models/Llama3-q8.gguf

安装openai依赖

pip install openai

使用openai调用API服务

import os
from openai import OpenAI  # 导入OpenAI库# 设置OpenAI的BASE_URL
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"client = OpenAI()  # 创建OpenAI客户端对象# 调用模型
completion = client.chat.completions.create(model="llama3", # 任意填messages=[{"role": "system", "content": "你是一个乐于助人的助手。"},{"role": "user", "content": "你好!"}]
)# 输出模型回复
print(completion.choices[0].message)

在这里插入图片描述

GPU推理

如果编译构建了GPU执行环境,可以使用-ngl N --n-gpu-layers N参数,指定offload层数,让模型在GPU上运行推理

例如:-ngl 40表示offload 40层模型参数到GPU

未使用-ngl N --n-gpu-layers N参数,程序默认在CPU上运行

root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000

可从以下关键启动日志看出,模型并没有在GPU上执行

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512

使用-ngl N --n-gpu-layers N参数,程序默认在GPU上运行

root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000   --n-gpu-layers 1000

可从以下关键启动日志看出,模型在GPU上执行

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0

执行nvidia-smi命令,可以进一步验证模型已在GPU上运行。
在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/bicheng/14516.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【软考中级 软件设计师】数据结构

数据结构是计算机科学中一个基础且重要的概念&#xff0c;它研究数据的存储结构以及在此结构上执行的各种操作。在准备软考中级-软件设计师考试时&#xff0c;掌握好数据结构部分对于通过考试至关重要。下面是一些核心知识点概览&#xff1a; 基本概念&#xff1a; 数据结构定义…

VBA_MF系列技术资料1-615

MF系列VBA技术资料1-615 为了让广大学员在VBA编程中有切实可行的思路及有效的提高自己的编程技巧&#xff0c;我参考大量的资料&#xff0c;并结合自己的经验总结了这份MF系列VBA技术综合资料&#xff0c;而且开放源码&#xff08;MF04除外&#xff09;&#xff0c;其中MF01-0…

spring-boot集成slf4j(二)logback配置详解

一、configuration 根节点&#xff1a;configuration&#xff0c;作为顶级标签&#xff0c; 可以用来配置一些lockback的全局属性&#xff0c;常见的属性如下&#xff1a; &#xff08;1&#xff09;scan“true” &#xff1a;scan是否开启自动扫描&#xff0c;监控配置文件更…

el-table 组件实现 “合并单元格 + N行数据小计” 功能

目录 需求 - 要实现的效果初始代码代码升级&#xff08;可供多个表格使用&#xff09;CommonTable.vue 子组件 使用子组件1 - 父组件 - 图1~图3使用效果展示 使用子组件2 - 父组件 - 图4使用效果展示 注意【代码优化 - 解决bug】 需求 - 要实现的效果 父组件中 info 数据示例 …

内网安全之证书服务基础知识

PKI公钥基础设施 PKI(Public Key Infrastructure)公钥基础设施&#xff0c;是提供公钥加密和数字签名服务的系统或平台&#xff0c;是一个包括硬件、软件、人员、策略和规程的集合&#xff0c;用来实现基于公钥密码体制的密钥和证书的产生、管理、存储、分发和撤销等功能。企业…

element-plus:踩坑日记

el-table Q&#xff1a;有fixed属性时&#xff0c;无数据时&#xff0c;可能出现底部边框消失的bug 现象&#xff1a; 解决方法&#xff1a; .el-table__empty-block {border-bottom: 1px solid var(--el-table-border-color); } el-collapse 折叠面板 Q&#xff1a;标题上…

云平台的安全能力提升解决方案

提升云平台的安全能力是确保数据和服务安全的关键步骤。针对大型云平台所面临的云上安全建设问题&#xff0c;安全狗提供完整的一站式云安全解决方案&#xff0c;充分匹配云平台安全管理方的需求和云租户的安全需求。协助大型云平台建设全网安全态势感知、统一风险管理、统一资…

PCIE协议-4-物理层逻辑模块

4.1 简介 物理层将事务层和数据链路层与用于链路数据交换的信令技术隔离开来。物理层被划分为逻辑物理层和电气物理层子模块&#xff08;见图4-1&#xff09;。 4.2 逻辑物理层子模块 逻辑子模块有两个主要部分&#xff1a;一个发送部分&#xff0c;它准备从数据链路层传递过…

v-md-editor和SSE实现ChatGPT的打字机式输出

概述 不论是GPT还是文心一言&#xff0c;在回答的时候类似于打字机式的将答案呈现给我们&#xff0c;这样的交互一方面比较友好&#xff0c;另一方面&#xff0c;当答案比较多、生成比较慢的时候也能争取一些答案的生成时间。本文后端使用express和stream&#xff0c;使用SSE将…

Boosting Cache Performance by Access Time Measurements——论文泛读

TOC 2023 Paper 论文阅读笔记整理 问题 大多数现代系统利用缓存来减少平均数据访问时间并优化其性能。当缓存未命中的访问时间不同时&#xff0c;最大化缓存命中率与最小化平均访问时间不同。例如&#xff1a;系统使用多种不同存储介质时&#xff0c;不同存储介质访问时间不同…

【C++初阶】—— 类和对象 (上)

&#x1f4dd;个人主页&#x1f339;&#xff1a;EterNity_TiMe_ ⏩收录专栏⏪&#xff1a;C “ 登神长阶 ” &#x1f339;&#x1f339;期待您的关注 &#x1f339;&#x1f339; 类和对象 1. 初步认识C2. 类的引入3. 类的定义声明和定义全部放在类体中声明和定义分开存放 4.…

8个实用网站和软件,收藏起来一定不后悔~

整理了8个日常生活中经常能用得到的网站和软件&#xff0c;收藏起来一定不会后悔~ 1.ZLibrary zh.zlibrary-be.se/这个网站收录了超千万的书籍和文章资源&#xff0c;国内外的各种电子书资源都可以在这里搜索&#xff0c;98%以上都可以在网站内找到&#xff0c;并且支持免费下…

linux中ansible整理笔记

一、工作模式 1. adhoc临时命令 语法&#xff1a; ansible 主机或者组列表 -m 模块 -a “参数” 2. playbook 语法&#xff1a; ansible-playbook xxx.yml 二、模块 1. ping 2.command:默认模块&#xff08;不支持重定向&#xff0c;管道&#xff09; 3.shell:类似com…

IP地址显示“不安全”怎么办|已解决

解决IP地址显示“不安全”的问题&#xff0c;通常需要确保网站或服务使用HTTPS协议进行加密通信&#xff0c;可以通过部署SSL证书来解决&#xff0c;以下是具体的解决步骤&#xff1a; 1 申请IP地址SSL证书&#xff1a;网站管理员应向证书颁发机构&#xff08;CA&#xff09;申…

网络拓扑—WEB-IIS服务搭建

文章目录 WEB-IIS服务搭建网络拓扑配置网络IISPC 安装IIS服务配置IIS服务&#xff08;默认站点&#xff09;PC机访问网页 配置IIS服务&#xff08;新建站点&#xff09;PC机访问网页 WEB-IIS服务搭建 网络拓扑 //交换机忽略不计 IIS服务IP&#xff1a;192.168.1.1 PC机IP&…

人类交互2 听觉处理和语言中枢

人类听觉概述 人类听觉是指通过耳朵接收声音并将其转化为神经信号&#xff0c;从而使我们能够感知和理解声音信息的能力。听觉是人类五种感觉之一&#xff0c;对我们的日常生活和交流至关重要。 听觉是人类交流和沟通的重要工具。通过听觉&#xff0c;我们能够听到他人的语言…

安装错误提示Please run MaterialLibrary2018.msi first或者其他MaterialLibrary版本

打开autoremove&#xff0c;系统检查&#xff0c;点击开始检查。检查完成修复。 可以解决部分该问题&#xff0c;如果没解决的请咨询

Linux中的文件描述符

1.系统调用接口和库函数的关系 函数&#xff1a;fopen fclose fread fwrite 都是c标准库当中的函数&#xff0c;也就是用户操作接口中ibc系统调用&#xff1a;open close read write 都是系统调用提供的接口 c语言中接口底层封装的都是系统调用接口 FILE* stdin stdout stderr…

Python模块、包和异常处理

大家好&#xff0c;在当今软件开发领域&#xff0c;Python作为一种简洁、易读且功能强大的编程语言&#xff0c;被广泛应用于各种领域。作为一名测试开发工程师&#xff0c;熟练掌握Python的模块、包和异常处理是提高代码可维护性和错误处理能力的关键。本文将和大家一起探讨Py…

SAP-MRP和采购申请

1、如果采购申请是手工创建的,跑MRP会不会被覆盖? 创建一个采购申请18089476,然后运行MRP-MD03,再用MD04查看下 从上图看,手工创建的采购申请被打上*号,没有被覆盖掉。 2、如果采购申请被审批了,会不会被覆盖掉? 首先创建一个独立需求MD61 然后库存消耗掉为0,运行M…