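Beginner's Guide: Deploying Large Language Models Locally on a Mac with vLLM

In this article, I explore vLLM, a widely used tool for running large language models (LLMs) on your own computer. I will walk you through installing vLLM on a Mac and show how to run an LLM through a REST API.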

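What is vLLM?

vLLM is an open-source library created by researchers at UC Berkeley to make large language model (LLM) inference fast, efficient, and easy to use. The name is said to stand for "Virtual Large Language Model". Its main goal is to improve LLM serving and deployment, particularly where performance matters, such as real-time applications, APIs, and research experiments.

Although vLLM is optimized for GPUs (using CUDA), where features such as PagedAttention deliver the best performance, it also supports CPU-based inference and serving.

At the time of writing, vLLM does not support Metal Performance Shaders (MPS) on macOS, which means that on a Mac it can only run on the CPU.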

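Installation

I assume you already have Anaconda installed on your Mac. First, create a virtual environment named usingvllm (the trailing jupyter installs Jupyter into the new environment):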

$ conda create -n usingvllm jupyter

You will be prompted to install some packages. Type y and press Enter.

Once the virtual environment has been created, activate it:

$ conda activate usingvllm

Inside the virtual environment, clone the vLLM Git repository:

$ git clone https://github.com/vllm-project/vllm.git

You also need to install two packages:

$ pip install torch torchvision

With the repository cloned, change into the vllm directory and install vLLM by running the following commands:

$ cd vllm
$ pip install -e .
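Before moving on, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch, not part of the original walkthrough: the helper name installed_version is my own, and it relies only on Python's standard importlib.metadata module.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The three packages installed in the steps above.
    for name in ("vllm", "torch", "torchvision"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

If any line reports "not installed", check that the usingvllm environment is still active in the terminal where you ran pip.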

Testing vLLM

After the installation finishes, start Jupyter Notebook:

$ jupyter notebook

You can now use vLLM to load a model, for example tiiuae/falcon-7b-instruct:

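from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

# Downloads the model on first use, then loads it
llm = LLM(model="tiiuae/falcon-7b-instruct")

# temperature controls randomness; max_tokens caps the length of the reply
sampling_params = SamplingParams(temperature=0.9,
                                 max_tokens=200)

prompt = "What is quantum computing?"
output = llm.generate(prompt, sampling_params)

print(output)                     # the full list of RequestOutput objects
print(output[0].outputs[0].text)  # only the generated text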

On Apple Silicon Macs, vLLM uses the float16 data type.

The tiiuae/falcon-7b-instruct model is downloaded to your local machine.

vLLM downloads Hugging Face models to the default ~/.cache/huggingface/hub folder.
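To find where a given model ends up on disk, the sketch below rebuilds the cache path. It assumes the usual Hugging Face Hub conventions, which the article itself does not spell out: the HF_HUB_CACHE and HF_HOME environment variables override the default location, and each model is stored in a folder named models--<org>--<name>. The helper names are my own, and the XDG_CACHE_HOME override is deliberately not handled.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hf_hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the Hugging Face hub cache folder from environment settings."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        # An explicit cache location wins over everything else.
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache lives in "<HF_HOME>/hub".
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def model_cache_dir(model_id: str,
                    env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the folder in which a model such as 'org/name' is cached."""
    return hf_hub_cache_dir(env) / ("models--" + model_id.replace("/", "--"))


if __name__ == "__main__":
    print(model_cache_dir("tiiuae/falcon-7b-instruct"))
```

Deleting that folder frees the disk space, and the model is downloaded again the next time it is loaded.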

Once the model has been downloaded and loaded, you will see output similar to the following:

[RequestOutput(request_id=0, prompt='What is quantum computing?', prompt_token_ids=[1562, 304, 17235, 15260, 42], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\nQuantum computing is the practice of computing using quantum-mechanical phenomena, such as superposition and entanglement. Unlike classical computing, which relies on binary numbers and bits, quantum computing uses quantum-mechanical phenomena to generate information. This can offer incredible performance improvements in certain types of operations that were not possible before, such as cryptography.', token_ids=(193, 26847, 381, 15260, 304, 248, 3100, 275, 15260, 1241, 17235, 24, 1275, 38804, 25849, 23, 963, 345, 2014, 9073, 273, 833, 642, 1977, 25, 15752, 15613, 15260, 23, 585, 21408, 313, 16529, 4169, 273, 12344, 23, 17235, 15260, 4004, 17235, 24, 1275, 38804, 25849, 271, 7420, 1150, 25, 735, 418, 1880, 7309, 2644, 10421, 272, 1714, 3059, 275, 5342, 325, 646, 416, 1777, 996, 23, 963, 345, 6295, 3842, 25, 11), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1742731247.0472178, last_token_time=1742731264.39699, first_scheduled_time=1742731247.054006, first_token_time=1742731251.790441, time_in_queue=0.0067882537841796875, finished_time=1742731264.3972821, scheduler_time=0.005645580822601914, model_forward_time=None, model_execute_time=None, spec_token_acceptance_counts=[0]), lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})]

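Quantum computing is the practice of computing using quantum-mechanical phenomena, such as superposition and entanglement. Unlike classical computing, which relies on binary numbers and bits, quantum computing uses quantum-mechanical phenomena to generate information. This can offer incredible performance improvements in certain types of operations that were not possible before, such as cryptography.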
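The metrics field in the printed RequestOutput gives a rough idea of how CPU-only inference performs. The sketch below is only an illustration: the function name and the returned keys are my own, and the numbers are copied from the sample output above (72 is the number of entries in its token_ids tuple).

```python
def summarize_timing(arrival_time: float,
                     first_token_time: float,
                     finished_time: float,
                     num_output_tokens: int) -> dict:
    """Derive simple latency figures from RequestMetrics-style timestamps."""
    total = finished_time - arrival_time
    return {
        # Delay between submitting the request and the first generated token.
        "time_to_first_token_s": first_token_time - arrival_time,
        # End-to-end time for the whole request.
        "total_time_s": total,
        # Average generation speed over the whole request.
        "tokens_per_second": num_output_tokens / total if total > 0 else 0.0,
    }


if __name__ == "__main__":
    # Timestamps taken from the sample RequestOutput shown above.
    stats = summarize_timing(
        arrival_time=1742731247.0472178,
        first_token_time=1742731251.790441,
        finished_time=1742731264.3972821,
        num_output_tokens=72,
    )
    for key, value in stats.items():
        print(f"{key}: {value:.2f}")
```

With a live result you would pass output[0].metrics.arrival_time and its sibling fields, plus len(output[0].outputs[0].token_ids), instead of the hard-coded values.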

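Running vLLM as a Server

You can also run vLLM as a server that accepts client connections through a REST API. To do so, change into the vllm folder and serve a specific model with the following command:

$ cd vllm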
$ vllm serve tiiuae/falcon-7b-instruct

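Note that to serve multiple models you need to run the command above once per model, giving each instance its own port. The server listens on port 8000 by default. To use a different port, pass the --port option; for example, vllm serve tiiuae/falcon-7b-instruct --port 5002 makes the server listen on port 5002.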
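When serving several models, writing out one command per model with a distinct port quickly gets repetitive. The sketch below generates those command lines using only the vllm serve and --port syntax shown above. The function name is my own, and example-org/another-model is a hypothetical placeholder, not a real model.

```python
import shlex
from typing import List, Sequence


def serve_commands(models: Sequence[str], first_port: int = 8000) -> List[str]:
    """Build one `vllm serve` command per model, each on its own port."""
    commands = []
    for offset, model in enumerate(models):
        argv = ["vllm", "serve", model, "--port", str(first_port + offset)]
        # shlex.join quotes arguments safely for pasting into a shell.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    models = [
        "tiiuae/falcon-7b-instruct",
        "example-org/another-model",  # hypothetical second model
    ]
    for command in serve_commands(models):
        print(command)
```

Each printed command would be run in its own terminal, since every server instance keeps running in the foreground.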

If you encounter the error "RuntimeError: Failed to infer device type", add the --device cpu option to the command:

$ vllm serve tiiuae/falcon-7b-instruct --device cpu

Once the vLLM server is running, you can test it with curl from another terminal:

$ curl http://localhost:8000/docs

The command above returns the server's documentation page as HTML:

<!DOCTYPE html>
<html>
<head>
<link type="text/css" rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
<link rel="shortcut icon" href="https://fastapi.tiangolo.com/img/favicon.png">
<title>FastAPI - Swagger UI</title>
</head>
<body>
<div id="swagger-ui">
</div>
<script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
<!-- `SwaggerUIBundle` is now available on the page -->
<script>
const ui = SwaggerUIBundle({
  url:'/openapi.json',
 "dom_id":"#swagger-ui",
 "layout":"BaseLayout",
 "deepLinking":true,
 "showExtensions":true,
 "showCommonExtensions":true,
  oauth2RedirectUrl: window.location.origin +'/docs/oauth2-redirect',
  presets: [
    SwaggerUIBundle.presets.apis,
    SwaggerUIBundle.SwaggerUIStandalonePreset
  ],
})
</script>
</body>
</html>

You can also view the documentation in a browser at http://localhost:8000/docs.

To get details about the model served through the REST API, use the following endpoint:

$ curl http://localhost:8000/v1/models

You will see the following details:

{
 "object":"list",
 "data": [
    {
     "id":"tiiuae/falcon-7b-instruct",
     "object":"model",
     "created": 1742723998,
     "owned_by":"vllm",
     "root":"tiiuae/falcon-7b-instruct",
     "parent": null,
     "max_model_len": 2048,
     "permission": [
        {
         "id":"modelperm-dc669ee015d54b5497dd0ae2cb64fdad",
         "object":"model_permission",
         "created": 1742723998,
         "allow_create_engine":false,
         "allow_sampling":true,
         "allow_logprobs":true,
         "allow_search_indices":false,
         "allow_view":true,
         "allow_fine_tuning":false,
         "organization":"*",
         "group": null,
         "is_blocking":false
        }
      ]
    }
  ]
}
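A client usually needs only two fields from this response: the model id to put in later requests and the maximum context length. The following sketch pulls them out with the standard json module. The function name is my own, and the sample is a trimmed copy of the response above.

```python
import json
from typing import List, Optional, Tuple


def list_models(models_json: str) -> List[Tuple[str, Optional[int]]]:
    """Return (model id, max_model_len) pairs from a /v1/models response."""
    payload = json.loads(models_json)
    return [(m["id"], m.get("max_model_len")) for m in payload.get("data", [])]


# Trimmed copy of the response shown above.
SAMPLE = """
{
  "object": "list",
  "data": [
    {
      "id": "tiiuae/falcon-7b-instruct",
      "object": "model",
      "owned_by": "vllm",
      "max_model_len": 2048
    }
  ]
}
"""

if __name__ == "__main__":
    for model_id, max_len in list_models(SAMPLE):
        print(f"{model_id} (max_model_len={max_len})")
```

Here max_model_len is 2048, so the prompt and the generated tokens together have to fit within that many tokens.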

To generate a response through the REST API, use the following curl command:

$ curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tiiuae/falcon-7b-instruct",
    "prompt": "Where is Singapore located?",
    "max_tokens": 500,
    "temperature": 0.7
   }'
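The same request can be sent from Python without any third-party packages. The sketch below mirrors the curl call using the standard urllib.request module. The function names are my own, and the default URL assumes the server is listening on port 8000 as described earlier.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "tiiuae/falcon-7b-instruct",
                             base_url: str = "http://localhost:8000/v1",
                             max_tokens: int = 500,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running vLLM server; see the serve command above.
    result = complete("Where is Singapore located?")
    print(result["choices"][0]["text"].strip())
```

Building the request separately from sending it makes it possible to inspect the URL, headers, and body before any network traffic happens.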

You should receive a response similar to the following:

{
 "id":"cmpl-ea189bc39565488386a48459b5695577",
 "object":"text_completion",
 "created": 1742724348,
 "model":"tiiuae/falcon-7b-instruct",
 "choices": [
    {
     "index": 0,
     "text":"\nSingapore is located in Southeast Asia, on the tip of the Malay Peninsula.",
     "logprobs": null,
     "finish_reason":"stop",
     "stop_reason": null,
     "prompt_logprobs": null
    }
  ],
 "usage": {
   "prompt_tokens": 5,
   "total_tokens": 22,
   "completion_tokens": 17,
   "prompt_tokens_details": null
  }
}
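In this response, the generated text sits under choices[0].text, and the usage block reports token counts (here 5 prompt tokens plus 17 completion tokens, 22 in total). The sketch below extracts both and checks that the counts add up. The function name and the returned keys are my own, and the sample is a trimmed copy of the response above.

```python
import json


def summarize_completion(response_json: str) -> dict:
    """Extract the generated text and token usage from a completion response."""
    payload = json.loads(response_json)
    choice = payload["choices"][0]
    usage = payload["usage"]
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage["total_tokens"],
        # Sanity check: prompt and completion tokens should sum to the total.
        "usage_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"]
            == usage["total_tokens"]
        ),
    }


# Trimmed copy of the response shown above (raw string keeps the JSON \n escape).
SAMPLE_RESPONSE = r"""
{
  "object": "text_completion",
  "model": "tiiuae/falcon-7b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nSingapore is located in Southeast Asia, on the tip of the Malay Peninsula.",
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 5, "total_tokens": 22, "completion_tokens": 17}
}
"""

if __name__ == "__main__":
    print(summarize_completion(SAMPLE_RESPONSE))
```

A finish_reason of "stop" means the model ended the answer on its own rather than being cut off at max_tokens.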

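Accessing the REST API with the OpenAI Class

To access the REST API exposed by vLLM, you can use the OpenAI class from the openai Python library, which provides a convenient interface to vLLM's OpenAI-compatible API endpoints. Because vLLM's server mimics the structure of the OpenAI API, you can use this class to send requests for text generation, embeddings, or other supported features.

The following example shows how to ask the tiiuae/falcon-7b-instruct model a question: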

from openai import OpenAI

openai_api_key = "anything here"  # you can put anything here
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="tiiuae/falcon-7b-instruct",
                                       prompt="Where is Singapore located?",
                                       max_tokens=500)

# Print only the generated text
print(completion.choices[0].text.strip())

You should receive a response similar to the following:

Singapore is located in Southeast Asia on the island of Singapore.

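Summary

In this article, I showed how to set up vLLM on a Mac and use it to host a large language model locally. Although vLLM does not currently support MPS, it can still run effectively on the CPU. You also saw how to run vLLM as a REST API server, which developers can interact with through the OpenAI class. What has your own experience with vLLM been?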