A Plain-Language Guide: Quickly Deploying the ChatGLM3 Large Model and Running Inference

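Once you have studied the basics of large models, the best way to understand how they work is to set up an open-source model yourself.

This article uses ChatGLM3 as the example and walks through deployment and inference, as a first step into applying large models in practice.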

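ChatGLM3 Overview:

ChatGLM3 is the third-generation large pre-trained dialogue model released jointly by Zhipu AI and the KEG Lab at Tsinghua University. It understands natural-language input well and produces fluent, coherent replies across multi-turn conversations. Beyond plain text interaction, ChatGLM3 supports code execution (Code Interpreter) and tool calling (Function Call), so it can invoke external tools or APIs to handle specific tasks. This widens its range of use considerably: it can take part directly in practical workflows such as programming guidance, data analysis, and troubleshooting.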

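ChatGLM3 Models:

ChatGLM3-6B: a specific model in the ChatGLM3 series; the name indicates roughly 6 billion parameters. It performs well across a range of evaluations, and is reported to be the strongest among base models under 10B parameters.

ChatGLM3-6B-Base: the base model behind ChatGLM3-6B. It uses diverse training data, a sufficient number of training steps, and an optimized training strategy, which gives ChatGLM3-6B strong capabilities in semantic understanding, mathematics, logical reasoning, code, and knowledge.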

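Officially Recommended Hardware for ChatGLM3

To run the INT4 version of ChatGLM3-6B:

RAM: >= 8GB

GPU memory: >= 5GB (e.g. GTX 1060 6GB, RTX 2060 6GB)

# Example: loading the model quantized to INT4
from transformers import AutoModel
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()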

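To run the FP16 version of ChatGLM3-6B:

RAM: >= 16GB

GPU memory: >= 13GB (e.g. RTX 4080 16GB)

# Example: loading the model in FP16
from transformers import AutoModel
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()

ChatGLM3 keeps the low deployment threshold of the previous two generations, so it is fairly easy to set up and run in a variety of environments.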
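The GPU memory figures above can be sanity-checked with simple arithmetic. The sketch below estimates only the memory taken by the weights, assuming about 6 billion parameters (inferred from the "6B" name, not an exact count). The function name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Assumption: "about 6 billion" parameters, taken from the 6B in the model name.
N_PARAMS = 6e9

for label, bits in [("FP16", 16), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB for weights")
# FP16: ~11.2 GiB for weights
# INT4: ~2.8 GiB for weights
```

The recommended 13GB (FP16) and 5GB (INT4) sit above these numbers because activations, the attention cache, and framework overhead also need memory, which this estimate does not model.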

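Officially Recommended Software for ChatGLM3:

Python: 3.10 to 3.11 recommended

transformers: 4.36.2 recommended (note that the requirements.txt installed in step 2 of this walkthrough specifies transformers>=4.38.1, and 4.39.2 is what actually gets installed)

torch: 2.0 or later recommended, for the best inference performance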
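Before installing anything, it can help to check what the current environment already has. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `python_ok` are made up for this example, and the accepted Python range simply mirrors the 3.10 to 3.11 recommendation above.

```python
import sys
from importlib import metadata
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter is within the recommended 3.10 to 3.11 range."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 11)

status = "ok" if python_ok() else "outside recommended 3.10-3.11"
print("python", sys.version.split()[0], status)
for pkg in ("transformers", "torch"):
    print(pkg, installed_version(pkg) or "not installed")
```

Reading versions through `importlib.metadata` avoids importing torch or transformers themselves, so the check is quick and works even when they are not installed yet.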

Getting the ChatGLM3 Code:

ChatGLM3 is open source; the source code and usage guide are available from its GitHub repository.

ChatGLM3 GitHub repository: https://github.com/THUDM/ChatGLM3

Environment Used for This Deployment:

Operating system: (shown only in a screenshot, not preserved here)

CPU: 8 cores

RAM: 32GB

GPU: 1 x NVIDIA V100

Python: 3.11.5

PyTorch: 2.1+cu118

ChatGLM3 Deployment and Inference Steps:

1. Download ChatGLM3

!git clone https://github.com/THUDM/ChatGLM3

# Output
Cloning into 'ChatGLM3'...
remote: Enumerating objects: 1261, done.
remote: Counting objects: 100% (683/683), done.
remote: Compressing objects: 100% (250/250), done.
remote: Total 1261 (delta 537), reused 433 (delta 433), pack-reused 578
Receiving objects: 100% (1261/1261), 17.27 MiB | 10.77 MiB/s, done.
Resolving deltas: 100% (743/743), done.

2. Install dependencies

!cd ChatGLM3 && pip install -r requirements.txt

# Output
Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
Collecting protobuf>=4.25.3 (from -r requirements.txt (line 3))
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/2c/2a/d2741cad35fa5f06d9c59dda3274e5727ca11075dfd7de3f69c100efdcad/protobuf-5.26.1-cp37-abi3-manylinux2014_x86_64.whl (302 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.8/302.8 kB 22.0 MB/s eta 0:00:00
Collecting transformers>=4.38.1 (from -r requirements.txt (line 4))
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/e2/52/02271ef16713abea41bab736dfc2dbee75e5e3512cf7441e233976211ba5/transformers-4.39.2-py3-none-any.whl (8.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.8/8.8 MB 122.8 MB/s eta 0:00:00
Collecting tokenizers>=0.15.0 (from -r requirements.txt (line 5))
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/15/0b/c09b2c0dc688c82adadaa0d5080983de3ce920f4a5cbadb7eaa5302ad251/tokenizers-0.15.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 117.8 MB/s eta 0:00:00
# (remaining output omitted)

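3. Download the model

Downloading from the Hugging Face Hub can take a long time, so here the model is downloaded from ModelScope instead:

# Install the ModelScope client
!pip install modelscope
# Download the model; snapshot_download returns the local directory it was saved to
from modelscope import snapshot_download
model_dir = snapshot_download("ZhipuAI/chatglm3-6b", revision="v1.0.0")

# Output
# (earlier output omitted)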
Downloading: 100%|██████████| 1.29k/1.29k [00:00<00:00, 6.51MB/s]
Downloading: 100%|██████████| 40.0/40.0 [00:00<00:00, 221kB/s]
Downloading: 100%|██████████| 2.28k/2.28k [00:00<00:00, 11.6MB/s]
Downloading: 100%|██████████| 4.04k/4.04k [00:00<00:00, 14.2MB/s]
Downloading: 100%|██████████| 54.3k/54.3k [00:00<00:00, 37.4MB/s]
Downloading: 100%|█████████▉| 1.70G/1.70G [00:07<00:00, 244MB/s]
Downloading: 100%|█████████▉| 1.83G/1.83G [00:08<00:00, 234MB/s]
Downloading: 100%|█████████▉| 1.80G/1.80G [00:08<00:00, 237MB/s]
Downloading: 100%|█████████▉| 1.69G/1.69G [00:07<00:00, 244MB/s]
Downloading: 100%|█████████▉| 1.83G/1.83G [00:08<00:00, 227MB/s]
Downloading: 100%|█████████▉| 1.80G/1.80G [00:07<00:00, 241MB/s]
Downloading: 100%|█████████▉| 0.98G/0.98G [00:04<00:00, 221MB/s]
Downloading: 100%|██████████| 20.0k/20.0k [00:00<00:00, 73.6MB/s]
Downloading: 100%|██████████| 14.3k/14.3k [00:00<00:00, 52.6MB/s]
Downloading: 100%|██████████| 4.37k/4.37k [00:00<00:00, 18.3MB/s]
Downloading: 100%|██████████| 11.0k/11.0k [00:00<00:00, 44.0MB/s]
Downloading: 100%|██████████| 995k/995k [00:00<00:00, 18.3MB/s]
Downloading: 100%|██████████| 244/244 [00:00<00:00, 1.34MB/s]
# Model download complete
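A quick way to confirm the download is complete is to total the file sizes in the returned directory. Judging by the progress bars above, the seven large shards add up to roughly 11.6G. The helper below is a small sketch; `dir_size_gib` is a name made up for this example, not part of ModelScope.

```python
from pathlib import Path

def dir_size_gib(path) -> float:
    """Total size of all regular files under path, in GiB."""
    total_bytes = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total_bytes / 2**30

# Usage, assuming model_dir is the path returned by snapshot_download:
# print(f"{dir_size_gib(model_dir):.2f} GiB")
```

If the total is far below what the progress bars suggested, a shard was probably interrupted and the download is worth re-running.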

4. Run an inference test

from modelscope import AutoTokenizer, AutoModel, snapshot_download
# Already downloaded in step 3, so this just returns the local path
model_dir = snapshot_download("ZhipuAI/chatglm3-6b", revision="v1.0.0")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Load in FP16 and move the model to the GPU
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).half().cuda()
model = model.eval()
# First turn. Prompt: "Hello, who are you?"
response, history = model.chat(tokenizer, "你好，你是？", history=[])
print(response)
# Second turn reuses the history. Prompt: "Write a small clock in Python."
response, history = model.chat(tokenizer, "用Python写一个小时钟。", history=history)
print(response)
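The two calls above show the multi-turn pattern: each call returns an updated `history`, which is passed into the next call. The sketch below isolates that pattern. `run_turns` and `echo_chat` are names made up for this example, and `echo_chat` is a stand-in so the code runs without a GPU or model weights.

```python
def run_turns(chat_fn, prompts):
    """Feed prompts one by one, threading the history through every call."""
    history = []
    responses = []
    for prompt in prompts:
        response, history = chat_fn(prompt, history)
        responses.append(response)
    return responses, history

# Stand-in for a real model: replies with the turn number and the prompt.
def echo_chat(prompt, history):
    reply = f"(turn {len(history) + 1}) you said: {prompt}"
    return reply, history + [(prompt, reply)]

responses, history = run_turns(echo_chat, ["hello", "write a clock"])
for r in responses:
    print(r)
# (turn 1) you said: hello
# (turn 2) you said: write a clock
```

With the real model loaded as above, the same helper could be called as `run_turns(lambda q, h: model.chat(tokenizer, q, history=h), prompts)`.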

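Output of the two model.chat calls:

你好，我是 ChatGLM3-6B，是清华大学KEG实验室和智谱AI公司共同训练的语言模型。我的任务是针对用户的问题和要求提供适当的答复和支持。
(English: Hello, I am ChatGLM3-6B, a language model jointly trained by the KEG Lab at Tsinghua University and Zhipu AI. My task is to provide appropriate answers and support for users' questions and requests.)

好的，这里有一个使用 Python 实现的小时钟程序：
(English: Sure, here is a small clock program implemented in Python:)

from datetime import datetime

def show_time():
    now = datetime.now()
    hours = now.hour
    minutes = now.minute
    return f"现在是{hours:02d}:{minutes:02d}"

if __name__ == "__main__":
    while True:
        show_time()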
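As generated, the clock program never prints anything and spins in a tight loop, because the return value of `show_time()` is discarded. Below is one possible corrected sketch, not the model's output. The `now` parameter and the `run_clock` helper are additions made for this example so the formatting can be checked with a fixed time.

```python
import time
from datetime import datetime

def show_time(now=None):
    """Format a time as HH:MM; uses the current time when none is given."""
    if now is None:
        now = datetime.now()
    return f"{now.hour:02d}:{now.minute:02d}"

def run_clock(ticks=2, interval_s=1.0):
    """Print the time a fixed number of times, pausing between prints."""
    for _ in range(ticks):
        print(show_time())
        time.sleep(interval_s)

if __name__ == "__main__":
    run_clock()
```

Bounding the loop with `ticks` keeps the example from running forever; a real clock would loop until interrupted.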