Build an AI Inference Platform in 30 Minutes: An Illustrated Guide to Local Xinference Deployment and Model Testing


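Xinference simplifies running and integrating a wide range of AI models. It is used to deploy open-source LLMs, embedding models, and multimodal models in a local environment and to run inference on them.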

I. Environment setup

1. Conda environment

conda create -n xinference python=3.11
conda activate xinference

2. Install torch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

3. Environment variables

# Environment variable configuration
export HF_ENDPOINT="https://hf-mirror.com"
export USE_MODELSCOPE_HUB=1
export XINFERENCE_HOME=/home/jovyan/dev/xinference
export XINFERENCE_MODEL_SRC=modelscope
# export XINFERENCE_ENDPOINT=http://0.0.0.0:9997

II. Installation and deployment

1. Install from source

git clone https://github.com/xorbitsai/inference.git
cd inference
pip install -e .

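2. Install with pip

# Full installation: run inference with all supported models

pip install "xinference[all]"

# Required by some models (rerank)

pip install sentence-transformers
# pip install flash_attn

# Required by rerank models

pip install sentencepiece
pip install protobuf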

III. Inference engines

1. Transformers engine (PyTorch)

# Supports almost all of the latest models; the default engine for PyTorch-format models
pip install "xinference[transformers]"

2. vLLM engine

# Supports high concurrency; the vLLM engine delivers higher throughput
pip install "xinference[vllm]"
# FlashInfer is optional but required for specific functionalities such as sliding window attention with Gemma 2.
# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

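# Note: vLLM is automatically selected as the engine when all of the following conditions hold:

- The model format is pytorch, gptq, or awq.
- When the model format is pytorch, the quantization option must be none.
- When the model format is awq, the quantization option must be Int4.
- When the model format is gptq, the quantization option must be Int3, Int4, or Int8.
- The operating system is Linux and at least one CUDA-capable device is available.
- The model_family field of a custom model, or the model_name field of a built-in model, is in vLLM's list of supported models.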
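
The conditions above can be read as a single predicate. The following is a minimal sketch of that reading, not Xinference's actual selection code: the `LaunchContext` class, its field names, and the `vllm_auto_selected` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Quantization options allowed for each model format, as listed above.
ALLOWED_QUANTIZATIONS = {
    "pytorch": {"none"},
    "awq": {"int4"},
    "gptq": {"int3", "int4", "int8"},
}


@dataclass
class LaunchContext:
    """Hypothetical container for the facts the rules depend on."""
    model_format: str         # "pytorch", "gptq" or "awq"
    quantization: str         # e.g. "none", "Int4"
    os_name: str              # e.g. the value of platform.system()
    cuda_device_count: int    # number of CUDA-capable devices
    model_in_vllm_list: bool  # model_family / model_name is supported by vLLM


def vllm_auto_selected(ctx: LaunchContext) -> bool:
    """Return True only when every listed condition holds."""
    allowed = ALLOWED_QUANTIZATIONS.get(ctx.model_format.lower())
    if allowed is None or ctx.quantization.lower() not in allowed:
        return False
    if ctx.os_name != "Linux" or ctx.cuda_device_count < 1:
        return False
    return ctx.model_in_vllm_list
```

For example, `vllm_auto_selected(LaunchContext("gptq", "Int4", "Linux", 1, True))` satisfies every rule, while the same model on macOS or with zero CUDA devices does not.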

3. Llama.cpp engine

Note: since v1.5.0, xllamacpp is the default llama.cpp backend. To use llama-cpp-python instead, set the environment variable USE_XLLAMACPP=0.

pip install xinference

# Installing xllamacpp:

# CPU or Mac Metal:
pip install -U xllamacpp
# CUDA:
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
# HIP:
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.0.2

# Installing llama-cpp-python on different hardware:

# Apple M series
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# NVIDIA GPUs:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# AMD GPUs:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

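# Notes:

Xinference supports gguf-format models through xllamacpp or llama-cpp-python. xllamacpp is developed by the Xinference team and will become the only llama.cpp backend in the future. Before v1.5.0, llama-cpp-python was the default llama.cpp backend and xllamacpp had to be enabled with the environment variable USE_XLLAMACPP=1. From Xinference v1.5.0 on, xllamacpp is the default and llama-cpp-python is deprecated. In Xinference v1.6.0, llama-cpp-python will be removed.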
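
The version rules in this section (llama-cpp-python as the default before v1.5.0, xllamacpp as the default from v1.5.0, llama-cpp-python removed in v1.6.0) can be summarized as a small decision function. This is only a sketch of the rules as stated here, not code from Xinference; `parse_version` and `llamacpp_backend` are hypothetical helpers.

```python
def parse_version(version: str) -> tuple:
    """Turn 'v1.5.0' or '1.5' into a comparable (major, minor, patch) tuple."""
    parts = [int(p) for p in version.lstrip("v").split(".")[:3]]
    return tuple((parts + [0, 0, 0])[:3])


def llamacpp_backend(xinference_version: str, env: dict) -> str:
    """Pick the llama.cpp backend implied by the version and USE_XLLAMACPP."""
    version = parse_version(xinference_version)
    flag = env.get("USE_XLLAMACPP")
    if version >= (1, 6, 0):
        # llama-cpp-python is removed, so only xllamacpp remains.
        return "xllamacpp"
    if version >= (1, 5, 0):
        # xllamacpp is the default; USE_XLLAMACPP=0 opts back out.
        return "llama-cpp-python" if flag == "0" else "xllamacpp"
    # Before v1.5.0 llama-cpp-python is the default; USE_XLLAMACPP=1 opts in.
    return "xllamacpp" if flag == "1" else "llama-cpp-python"
```

Passing `os.environ` as `env` would apply the same rules to the current shell environment.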

4. SGLang engine

pip install "xinference[sglang]"
# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

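# Notes:

SGLang provides a high-performance inference runtime based on RadixAttention. By automatically reusing the KV cache across multiple calls, it significantly speeds up the execution of complex LLM programs. SGLang also supports other common inference techniques, such as continuous batching and tensor parallelism.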

5. MLX engine

pip install "xinference[mlx]"

# Notes:

MLX-lm provides efficient LLM inference on Apple silicon chips.

IV. Startup and access

1. Start the service (note: models are pulled from Hugging Face by default)

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

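# Notes

Note 1: By default, xinference-local starts a worker locally at http://127.0.0.1:9997 (the default port is 9997), reachable only from the local machine. Passing --host 0.0.0.0, as in the command above, makes it reachable from other machines.

Note 2: By default, <HOME>/.xinference is used as the home directory for storing necessary files (log files, model files, and so on). Set the XINFERENCE_HOME environment variable to change the home directory:

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997

Note 3: Models are pulled from Hugging Face by default; set XINFERENCE_MODEL_SRC=modelscope to choose the hub that models are pulled from.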

2. Help

xinference-local --help

3. Web UI

http://127.0.0.1:9997/ui

4. API documentation

http://127.0.0.1:9997/docs

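V. Model testing

# Start xinference-local

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997
# Start in the background (the variable assignment must come before nohup, otherwise nohup tries to run it as a command)
XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 0.0.0.0 --port 9997 > xinference-local.log 2>&1 &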
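
When the service is started in the background, it can take a while before it accepts requests. Below is a minimal sketch of a readiness check that polls the API documentation page mentioned earlier (http://127.0.0.1:9997/docs). The helper names `http_probe` and `wait_until_ready`, and the retry counts, are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(url: str, attempts: int = 30, delay: float = 2.0,
                     probe=http_probe, sleep=time.sleep) -> bool:
    """Poll `url` up to `attempts` times, waiting `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

A typical call would be `wait_until_ready("http://127.0.0.1:9997/docs")`, run right after the nohup command and before sending the first test request.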

1. Language models

# Select a language model

# Configure the launch parameters

# Download log

# Running models

# Test qwen2.5

# curl test

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2.5-instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "你是谁?"}]}'
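
The same chat request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call; `build_chat_request` and `send` are hypothetical helper names, and the endpoint and model name are the ones used in this guide.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9997"  # endpoint used throughout this guide


def build_chat_request(model: str, user_message: str,
                       system_prompt: str = "You are a helpful assistant.",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the model running, `send(build_chat_request("qwen2.5-instruct", "你是谁?"))` returns the parsed response. Assuming the OpenAI-style response layout, the reply text would sit under `["choices"][0]["message"]["content"]`.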

# Test the qwen2.5-vl model (download and launch it the same way as qwen2.5)

# Run log

# Running models

# Test in the web page

2. Embedding models

# Download an embedding model

# Configure the launch parameters

# Download log

# Running models

# curl test

curl http://127.0.0.1:9997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "北京景点推荐", "model": "jina-embeddings-v3"}'
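
Embedding vectors are usually compared with cosine similarity. The sketch below assumes the response follows the OpenAI-style layout (`{"data": [{"index": ..., "embedding": [...]}]}`); that layout and the helper names `extract_vectors` and `cosine_similarity` are assumptions, so check the actual response before relying on them.

```python
import math


def extract_vectors(response: dict) -> list:
    """Return the embedding vectors ordered by their 'index' field.

    Assumes an OpenAI-style body: {"data": [{"index": 0, "embedding": [...]}]}.
    """
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Sending a list of strings as `input` and comparing the returned vectors pairwise gives a quick sanity check that semantically close sentences score higher than unrelated ones.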

3. Rerank models

# Select a model and click to open it

# Configure the launch parameters

# Run log

# Running models

# curl test

curl -X 'POST' 'http://127.0.0.1:9997/v1/rerank' \
  -H 'Content-Type: application/json' \
  -d '{"model": "bge-reranker-v2-m3", "query": "一个男人正在吃意大利面。", "documents": ["一个男人在吃东西。", "一个男人正在吃一块面包。", "这个女孩怀着一个婴儿。", "一个人在骑马。", "一个女人在拉小提琴。"]}'
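
A rerank response is most useful once the scores are mapped back onto the original documents. The sketch below assumes the response carries a `results` list whose entries have `index` and `relevance_score` fields; this shape is an assumption to verify against the real response, and `rank_documents` is a hypothetical helper.

```python
def rank_documents(documents: list, response: dict) -> list:
    """Pair each document with its score and sort from most to least relevant.

    Assumes: {"results": [{"index": int, "relevance_score": float}, ...]},
    where "index" points into the `documents` list sent in the request.
    """
    scored = [
        (documents[result["index"]], result["relevance_score"])
        for result in response["results"]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_documents(documents, response)` with the `documents` list from the request returns `(document, score)` pairs, so the best match for the query is the first element.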

4. Image models

# Select a model

# Configure the launch parameters

# Backend log

# Running models

# Test in the web UI

# Test prompt
A digital illustration of a movie poster titled ['Sad Sax: Fury Toad'], [Mad Max] parody poster, featuring [a saxophone-playing toad in a post-apocalyptic desert, with a customized car made of musical instruments], in the background, [a wasteland with other musical vehicle chases], movie title in [a gritty, bold font, dusty and intense color palette].

# Test prompt
A digital illustration of a movie poster titled ['Mulan'], featuring [a fierce warrior woman with long flowing black hair, dressed in traditional Chinese armor with red accents, holding a sword ready for battle]. She is posed in a dynamic action stance against [a backdrop of rugged snow-covered mountains with dark stormy skies]. The movie title ['Mulan'] is written in [bold red calligraphy-style text, prominently displayed at the bottom], along with [a release date in smaller font]. The scene conveys [intensity, bravery, and an epic adventure].

5. Audio models

# Select an audio model

# Download and launch

# Backend log

# Launched successfully

# curl test

curl -X 'POST' \
  'http://localhost:9997/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "CosyVoice2-0.5B", "input": "hello", "voice": "中文女"}' -o hello1.mp3

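6. Video models

# Select a video model

# Configure the launch parameters

# Run log

# Launched successfully

# curl test

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/video/generations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "CogVideoX-5b", "prompt": "an apple"}' -o apple.json

# The response is JSON rather than a playable file: take its b64_json field and base64-decode it to produce the video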
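
The decoding step can be done in a few lines of Python. This sketch assumes the saved response is JSON in which the encoded video sits at `data[0].b64_json`; the exact nesting is an assumption to check against the real response, and `decode_video` and `save_video` are hypothetical helper names.

```python
import base64
import json
from pathlib import Path


def decode_video(response_text: str) -> bytes:
    """Extract and base64-decode the video from the JSON response text.

    Assumes the shape {"data": [{"b64_json": "<base64 video bytes>"}]}.
    """
    payload = json.loads(response_text)
    return base64.b64decode(payload["data"][0]["b64_json"])


def save_video(response_path: str, output_path: str) -> int:
    """Read a saved response file, write the decoded video, return its size."""
    video_bytes = decode_video(Path(response_path).read_text(encoding="utf-8"))
    Path(output_path).write_bytes(video_bytes)
    return len(video_bytes)
```

For instance, `save_video("response.json", "apple.mp4")` would turn a response saved by curl (here under the placeholder name `response.json`) into a playable file.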

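7. Custom models

# Register the model

# Registered successfully

# Edit the model

# Launch and test

# Run log

# Running models

# Test in the web UI

Note: for models in pytorch format, you can choose the vLLM engine at launch, which is much faster.

# curl test

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-Qwen2.5-32B-Instruct-GPTQ-Int4", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "你是谁?"}]}'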

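VI. Troubleshooting summary

1. Problem 1:

Solution: install torch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

2. Problem 2:

Solution: install the missing package

pip install sentence-transformers

3. Problem 3:

Solution: launch inference with the SGLang engine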
