I. Install xinference
pip install xinference
II. Start xinference
./xinference-local --host=0.0.0.0 --port=5544
III. Register local models
1. Register the embedding model
curl -X POST "http://localhost:5544/v1/models" \
-H "Content-Type: application/json" \
-d '{"model_type": "embedding", "model_name": "bce-embedding-base_v1", "model_uid": "bce-embedding-base_v1", "model_path": "/root/embed_rerank/bce-embedding-base_v1/"}'
Verify:
curl -X POST "http://localhost:5544/v1/embeddings" \
-H "Content-Type: application/json" \
-d '{"model": "bce-embedding-base_v1", "input": ["Text to embed 1", "This is the second sentence"]}'
2. Register the rerank model
curl -X POST "http://localhost:5544/v1/models" \
-H "Content-Type: application/json" \
-d '{"model_type": "rerank", "model_name": "bce-reranker-base_v1", "model_uid": "bce-reranker-base_v1", "model_path": "/root/embed_rerank/bce-reranker-base_v1"}'
Verify:
curl -X POST "http://localhost:5544/v1/rerank" \
-H "Content-Type: application/json" \
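-d '{"model": "bce-reranker-base_v1", "query": "What is Python?", "documents": ["Python is a programming language.", "Java is another language.", "Python is used for web development."]}'
3. Run ./xinference list to view the running models (add --endpoint http://localhost:5544 if the CLI does not find the server on its default port)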
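The two verification requests above can also be sent from Python. The following is a minimal sketch using only the standard library, not an official client. It assumes the server is reachable at http://localhost:5544, that /v1/embeddings answers with an OpenAI-style body (a "data" list whose entries carry "index" and "embedding"), and that /v1/rerank answers with a "results" list whose entries carry "index" and "relevance_score". The helper names (build_post, post_json, extract_embeddings, rank_documents) are illustrative and not part of xinference.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5544"  # assumption: host/port used when starting the server


def build_post(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_embeddings(response):
    """Return the vectors ordered by 'index' (assumed OpenAI-style layout)."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def rank_documents(response, documents):
    """Pair each document with its score, best first (assumed 'results' layout)."""
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def main():
    """Run both checks against a live server (not called automatically)."""
    texts = ["Text to embed 1", "This is the second sentence"]
    emb = post_json(
        "/v1/embeddings", {"model": "bce-embedding-base_v1", "input": texts}
    )
    print([len(vector) for vector in extract_embeddings(emb)])

    docs = [
        "Python is a programming language.",
        "Java is another language.",
        "Python is used for web development.",
    ]
    rr = post_json(
        "/v1/rerank",
        {"model": "bce-reranker-base_v1", "query": "What is Python?", "documents": docs},
    )
    for doc, score in rank_documents(rr, docs):
        print(f"{score:.4f}  {doc}")
```

Call main() once both models are running. The "model" field must match the model_uid given at registration, otherwise the server cannot find the model.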
IV. Delete a model
curl -X DELETE "http://localhost:5544/v1/models/bce-reranker-base_v1"
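The same deletion can be scripted. This is a small sketch with the Python standard library, assuming the last path segment of the URL is the model_uid chosen at registration and that the server listens on http://localhost:5544. The function names build_delete and delete_model are hypothetical helpers, not xinference APIs.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5544"  # assumption: host/port used when starting the server


def build_delete(model_uid, base_url=BASE_URL):
    """Build (but do not send) the DELETE request for one model_uid."""
    # Quote the uid so that unusual characters cannot change the URL path.
    path = "/v1/models/" + urllib.parse.quote(model_uid, safe="")
    return urllib.request.Request(base_url + path, method="DELETE")


def delete_model(model_uid, base_url=BASE_URL):
    """Send the request; urlopen raises urllib.error.HTTPError on a 4xx/5xx reply."""
    with urllib.request.urlopen(build_delete(model_uid, base_url)) as resp:
        return resp.status
```

For example, delete_model("bce-reranker-base_v1") removes the rerank model registered earlier, provided the server is running.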
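V. Notes
1. Running on CPU
- The server has a GPU, but the models should be loaded on the CPU
Set this before starting xinference:
export CUDA_VISIBLE_DEVICES=""
- A server without a GPU loads the models on the CPU automatically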
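When the server is started from a script rather than an interactive shell, the variable can be set only for the child process instead of being exported globally. The sketch below illustrates this under two assumptions: xinference-local is on PATH, and the host and port are the ones used in the start command earlier. The helper names are made up for this example.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which CUDA sees no devices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty string hides every GPU from CUDA
    return env


def start_command(host="0.0.0.0", port=5544):
    """Start command as an argument list; assumes xinference-local is on PATH."""
    return ["xinference-local", "--host", host, "--port", str(port)]


def start_on_cpu():
    """Start the server as a child process with the CPU-only environment."""
    return subprocess.Popen(start_command(), env=cpu_only_env())
```

Because cpu_only_env works on a copy, the environment of the calling script is left unchanged, so other processes started from it can still use the GPU.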
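2. Running on GPU
Before starting the server, set the environment variable to the GPUs xinference may use (an empty value would hide all of them), for example to expose GPUs 0 and 1:
export CUDA_VISIBLE_DEVICES="0,1"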
curl -X POST "http://localhost:5544/v1/models" \
-H "Content-Type: application/json" \
-d '{"model_type": "embedding", "model_name": "bce-embedding-base_v1", "model_uid": "bce-embedding-base_v1", "model_path": "/root/zml/embed_rerank/bce-embedding-base_v1/", "gpu_idx": 1, "n_gpu": 1}'
Notes:
gpu_idx: index of the GPU to use
n_gpu: total number of GPUs to use
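Hand-written JSON bodies are easy to get wrong, since a single missing comma makes the whole request invalid. One option is to build the body in Python and let json.dumps serialize it. The sketch below uses only the field names that appear in the requests above; launch_payload is a hypothetical helper, and defaulting model_uid to model_name is a convention of this example rather than documented server behavior.

```python
import json


def launch_payload(model_type, model_name, model_path,
                   model_uid=None, gpu_idx=None, n_gpu=None):
    """Build the registration body; GPU fields are added only when given."""
    payload = {
        "model_type": model_type,
        "model_name": model_name,
        "model_uid": model_uid or model_name,  # example convention: uid = name
        "model_path": model_path,
    }
    if gpu_idx is not None:
        payload["gpu_idx"] = gpu_idx  # index of the GPU to use
    if n_gpu is not None:
        payload["n_gpu"] = n_gpu  # total number of GPUs to use
    return payload


body = json.dumps(
    launch_payload(
        "embedding",
        "bce-embedding-base_v1",
        "/root/zml/embed_rerank/bce-embedding-base_v1/",
        gpu_idx=1,
        n_gpu=1,
    )
)
```

The resulting string in body can be passed to curl with -d, or sent as the request data from Python. Leaving out gpu_idx and n_gpu produces the CPU-style body used in section III.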