Large Language Models (7): Deploying ChatGLM3-6B and Exposing an HTTP Server

Contents

Highlight

Deploying ChatGLM3-6B with an HTTP server

Downloading the embedding model bge-large-zh-v1.5

Example Q&A over the HTTP API

The LLM told an awkward joke~


Highlight

An LLM only becomes a foundation for your own applications once it is exposed as a service, for example through an HTTP server.

Deploying ChatGLM3-6B with an HTTP server

Downloading the embedding model bge-large-zh-v1.5

The server needs this embedding model at startup; download it from:

https://www.modelscope.cn/models/Xorbits/bge-large-zh-v1.5/files
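If you prefer not to click through the file list, the modelscope SDK can fetch the whole model in one call. A minimal sketch, assuming the package is installed (pip install modelscope); the cache_dir below is only an illustration:

from modelscope import snapshot_download

# Download bge-large-zh-v1.5 from ModelScope; cache_dir is illustrative
model_dir = snapshot_download('Xorbits/bge-large-zh-v1.5', cache_dir='D:\\github')
print(model_dir)  # use this path as EMBEDDING_PATH below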

# set LLM path (change to your own local path)
MODEL_PATH = os.environ.get('MODEL_PATH', 'D:\\github\\chatglm3-6b')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)

# embedding model path (change to your own local path)
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', "D:\\github\\bge-large-zh-v1.5")

The server below follows the official ChatGLM3 demo:

openai_api_demo/api_server.py


import os
import time
import tiktoken
import torch
import uvicorn

from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware

from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from loguru import logger
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, AutoModel
from utils import process_response, generate_chatglm3, generate_stream_chatglm3
from sentence_transformers import SentenceTransformer

from sse_starlette.sse import EventSourceResponse

# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

# set LLM path
MODEL_PATH = os.environ.get('MODEL_PATH', 'D:\\github\\chatglm3-6b')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)

# set Embedding Model path
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', "D:\\github\\bge-large-zh-v1.5")


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Release GPU memory when the server shuts down
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ModelCard(BaseModel):
    id: str
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = []


class FunctionCallResponse(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system", "function"]
    content: str = None
    name: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


## for Embedding
class EmbeddingRequest(BaseModel):
    input: List[str]
    model: str


class CompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    data: list
    model: str
    object: str
    usage: CompletionUsage


# for ChatCompletionRequest
class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    repetition_penalty: Optional[float] = 1.1


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "function_call"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "function_call"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: str
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    usage: Optional[UsageInfo] = None


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
    embeddings = [embedding_model.encode(text) for text in request.input]
    embeddings = [embedding.tolist() for embedding in embeddings]

    def num_tokens_from_string(string: str) -> int:
        """
        Returns the number of tokens in a text string.
        Uses the cl100k_base tokenizer.
        """
        encoding = tiktoken.get_encoding('cl100k_base')
        num_tokens = len(encoding.encode(string))
        return num_tokens

    response = {
        "data": [
            {
                "object": "embedding",
                "embedding": embedding,
                "index": index
            }
            for index, embedding in enumerate(embeddings)
        ],
        "model": request.model,
        "object": "list",
        "usage": CompletionUsage(
            prompt_tokens=sum(len(text.split()) for text in request.input),
            completion_tokens=0,
            total_tokens=sum(num_tokens_from_string(text) for text in request.input),
        )
    }
    return response


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(id="chatglm3-6b")
    return ModelList(data=[model_card])


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer

    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:
        # Use stream mode to read the first few characters; if it is not a
        # function call, stream the output directly
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = next(predict_stream_generator)
        if not contains_custom_function(output):
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

        # Obtain the result directly at one time and determine whether tools need to be called.
        logger.debug(f"First result output:\n{output}")

        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, use_tool=True)
            except:
                logger.warning("Failed to parse tool call")

        # Call function
        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

            """
            In this demo, we did not register any tools.
            You can use the tools implemented in our `tools_using_demo` and add
            your own streaming tool implementation here.
            Similar to the following method:
                function_args = json.loads(function_call.arguments)
                tool_response = dispatch_tool(tool_name: str, tool_params: dict)
            """
            tool_response = ""

            if not gen_params.get("messages"):
                gen_params["messages"] = []

            gen_params["messages"].append(ChatMessage(
                role="assistant",
                content=output,
            ))
            gen_params["messages"].append(ChatMessage(
                role="function",
                name=function_call.name,
                content=tool_response,
            ))

            # Streaming output of results after function calls
            generate = predict(request.model, gen_params)
            return EventSourceResponse(generate, media_type="text/event-stream")

        else:
            # Handled here to avoid exceptions in the parsing process above.
            generate = parse_output_text(request.model, output)
            return EventSourceResponse(generate, media_type="text/event-stream")

    # Here is the handling of stream = False
    response = generate_chatglm3(model, tokenizer, gen_params)

    # Remove the first newline character
    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()
    function_call, finish_reason = None, "stop"
    if request.tools:
        try:
            function_call = process_response(response["text"], use_tool=True)
        except:
            logger.warning("Failed to parse tool call: maybe the response is not a tool call or has already been answered.")

    if isinstance(function_call, dict):
        finish_reason = "function_call"
        function_call = FunctionCallResponse(**function_call)

    message = ChatMessage(
        role="assistant",
        content=response["text"],
        function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
    )

    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        id="",  # for open_source model, id is empty
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict(model_id: str, params: dict):
    global model, tokenizer

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    previous_text = ""
    for new_response in generate_stream_chatglm3(model, tokenizer, params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(previous_text):]
        previous_text = decoded_unicode

        finish_reason = new_response["finish_reason"]
        if len(delta_text) == 0 and finish_reason != "function_call":
            continue

        function_call = None
        if finish_reason == "function_call":
            try:
                function_call = process_response(decoded_unicode, use_tool=True)
            except:
                logger.warning("Failed to parse tool call: maybe the response is not a tool call or has already been answered.")

        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

        delta = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
        )
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=delta,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id="",
            choices=[choice_data],
            object="chat.completion.chunk"
        )
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(
        model=model_id,
        id="",
        choices=[choice_data],
        object="chat.completion.chunk"
    )
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def predict_stream(model_id, gen_params):
    """
    The function call is compatible with stream mode output.

    The first seven characters are inspected. If the output is not a function
    call, it is streamed directly; otherwise, the complete text of the
    function call is returned.

    :param model_id:
    :param gen_params:
    :return:
    """
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(output):]
        output = decoded_unicode

        # When it is not yet known to be a function call and the output length is > 7,
        # try to judge whether it is a function call by the special function prefix
        if not is_function_call and len(output) > 7:

            # Determine whether a function is called
            is_function_call = contains_custom_function(output)
            if is_function_call:
                continue

            # Non-function call: direct stream output
            finish_reason = new_response["finish_reason"]

            # Send an empty string first to avoid truncation by subsequent next() operations.
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id="",
                    choices=[choice_data],
                    created=int(time.time()),
                    object="chat.completion.chunk"
                )
                yield "{}".format(chunk.model_dump_json(exclude_unset=True))

            send_msg = delta_text if has_send_first_chunk else output
            has_send_first_chunk = True
            message = DeltaMessage(
                content=send_msg,
                role="assistant",
                function_call=None,
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id="",
                choices=[choice_data],
                created=int(time.time()),
                object="chat.completion.chunk"
            )
            yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    if is_function_call:
        yield output
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str):
    """
    Directly output the text content of value.

    :param model_id:
    :param value:
    :return:
    """
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant", content=value),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def contains_custom_function(value: str) -> bool:
    """
    Determine whether the output is a 'function_call' by a special function prefix.

    For example, the functions defined in "tools_using_demo/tool_register.py"
    are all named "get_xxx" and start with "get_".

    [Note] This is not a rigorous check; it is for reference only.

    :param value:
    :return:
    """
    return value and 'get_' in value


if __name__ == "__main__":
    # Load LLM
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, device_map="auto").eval()

    # Load embedding model
    embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
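Run the script directly (python api_server.py) and the server listens on port 8000. The embeddings endpoint is the easiest to sanity-check first. A minimal client sketch using requests, assuming the server above is running locally; note the demo always answers with the loaded bge model, so the model field is only echoed back:

import requests

# Query the /v1/embeddings endpoint of the server above
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"input": ["你好", "今天天气不错"], "model": "bge-large-zh-v1.5"},
)
resp.raise_for_status()
data = resp.json()
print(len(data["data"]))                   # 2: one embedding per input string
print(len(data["data"][0]["embedding"]))   # 1024: bge-large-zh-v1.5 dimension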

Example Q&A over the HTTP API

curl -H "Content-Type: application/json" -X POST -d '{"messages": [{"role":"user","content":"给我讲个笑话"}], "model":"chatglm3-6b"}' http://localhost:8000/v1/chat/completions

The response:

HTTP/1.1 200 OK
date: Sat, 16 Mar 2024 13:16:00 GMT
server: uvicorn
content-length: 611
content-type: application/json
Connection: close

{
    "model": "chatglm3-6b",
    "id": "",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "好的,给您讲一个轻松的笑话:\n\n有一天,小明在公园里捡到一个神奇的灯笼。他捧着灯笼说了:“我希望我成为世界上最聪明的人!”突然,他变成了一个女人。\n\n这个笑话是在玩弄性别刻板印象,暗示女性比男性更聪明。希望这个笑话能带给您快乐!",
                "name": null,
                "function_call": null
            },
            "finish_reason": "stop"
        }
    ],
    "created": 1710594964,
    "usage": {
        "prompt_tokens": 11,
        "total_tokens": 83,
        "completion_tokens": 72
    }
}
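Since the endpoints mirror the OpenAI API, the official openai Python SDK also works as a client, which makes the server a drop-in backend for OpenAI-style tooling. A minimal sketch, assuming openai>=1.0 is installed; the api_key is a placeholder because the demo server performs no authentication:

from openai import OpenAI

# Point the SDK at the local server; the key is a dummy value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming request, equivalent to the curl call above
resp = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "给我讲个笑话"}],
)
print(resp.choices[0].message.content)

# Streaming request: deltas arrive as server-sent events
stream = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "给我讲个笑话"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)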

The LLM told an awkward joke~ (as its own reply admits, the punchline trades on a gender stereotype).

If the HTTP server fails while handling a request with a failed to open nvrtc-builtins64_121.dll error, see reference 1 below for the fix.

References

  1. ChatGLM3-6B独立部署提供HTTP服务failed to open nvrtc-builtins64_121.dll - CSDN博客
  2. LLM大语言模型(四):在ChatGLM3-6B中使用langchain - CSDN博客
  3. LLM大语言模型(一):ChatGLM3-6B本地部署 - CSDN博客
