[LLM Application Frameworks] Building RAG with DSPy


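I. Introduction

In this article we use DSPy to build a simple technical Q&A system about Linux applications, and compare its behavior with and without retrieval-augmented generation (RAG).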

II. Basic DSPy Modules

(1) Local environment setup

Install DSPy: Python 3.9 or later is required. Here we install the version we need from the Git repository:

conda create --name py39 python=3.9
conda activate py39
pip install git+https://github.com/stanfordnlp/dspy.git@2.5.29

Deploy a local LM: we use llama3.2 here. Install Ollama and run the LM service:

curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3.2

Check that the environment works:

import dspy

llama32 = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=llama32)

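(2) Basic DSPy modules

In DSPy you can prompt a language model directly with lm(prompt="prompt") or lm(messages=[...]). However, DSPy offers modules as a better way to define functions backed by a language model.

The simplest module is dspy.Predict. It takes a DSPy signature, that is, a structured input/output schema, and returns a callable for the behavior you specified. The example below uses the "inline" signature notation to declare a module that takes a question (of type str) as input and produces a response as output.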

qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")
print(response.response)

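In this example, to run the qa module DSPy passes your signature, the language model, and the inputs to an adapter, a layer that structures the inputs and parses the structured outputs to fit your signature. You can easily inspect the last n prompts DSPy sent: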

# inspect_history prints the history itself, so no outer print() is needed
dspy.inspect_history(n=1)

The output looks like this:

[2025-01-10T11:06:24.275829]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Given the fields `question`, produce the fields `response`.

User message:

[[ ## question ## ]]
what are high memory and low memory on linux?

Respond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## response ## ]]
High Memory and Low Memory on Linux refer to two different conditions that affect system performance. Here's a brief explanation of each:
(omitted)

[[ ## completed ## ]]
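
The structured format in the log above can be illustrated with a small stand-alone sketch. This is not DSPy's adapter code: `parse_signature`, `render_user_message`, and `parse_response` are hypothetical helpers, written only to show how an inline signature could map to `[[ ## field ## ]]` headers and how a reply using those headers could be split back into fields.

```python
import re


def parse_signature(signature: str):
    """Split an inline signature such as 'question: str -> response: str' into field names."""
    inputs, outputs = signature.split("->")

    def names(part):
        return [field.split(":")[0].strip() for field in part.split(",") if field.strip()]

    return names(inputs), names(outputs)


def render_user_message(input_names, values):
    """Render each input value under its [[ ## name ## ]] header."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{values[name]}" for name in input_names)


def parse_response(text, output_names):
    """Extract the requested output fields from a reply that uses the same headers."""
    # With one capture group, re.split yields [prefix, name1, body1, name2, body2, ...]
    parts = re.split(r"\[\[ ## (\w+) ## \]\]", text)
    found = {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
    return {name: found[name] for name in output_names if name in found}


input_names, output_names = parse_signature("question: str -> response: str")
prompt = render_user_message(
    input_names, {"question": "what are high memory and low memory on linux?"}
)
# A made-up reply, used only to exercise the parser
reply = "[[ ## response ## ]]\nHigh memory and low memory are ...\n\n[[ ## completed ## ]]"
parsed = parse_response(reply, output_names)
print(prompt)
print(parsed)
```

The `completed` marker is parsed like any other header and then dropped, because it is not one of the signature's output fields.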

DSPy ships with several built-in modules, such as dspy.ChainOfThought, dspy.ProgramOfThought, and dspy.ReAct. These can be swapped in for the basic dspy.Predict.

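III. Building and Optimizing RAG

DSPy's basic features are already enough to get many things working quickly. But if you want to build a high-quality system and keep improving it over time, you need to measure its quality and iterate quickly with DSPy's more powerful tools, such as optimizers. Measuring the quality of a DSPy system usually requires:

Input samples: for example, the questions from a set of question-answer pairs. This means loading a dataset that contains questions and their gold answers.
An output-quality metric: metrics vary widely. Some need ground-truth labels for the ideal output, for example in classification or question answering; others are self-supervised, for example checks for faithfulness or the absence of hallucination. For question answering, answer quality can often be judged by how well the system response covers all the key facts in the gold answer and, conversely, by how well it avoids stating things that are not in the gold answer. This is essentially a "semantic F1", so you can load the SemanticF1 metric from DSPy and use dspy.Evaluate to compute the average score.

The example below uses DSPy on Colab to build a RAG system that answers technical questions. The input samples are StackExchange-based questions and their correct answers, taken from the RAG-QA Arena dataset, and SemanticF1 is the evaluation metric: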
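
The coverage-versus-extra-content trade-off described above is the usual recall/precision pair combined into an F1 score. The sketch below is a toy version of that idea: it matches hand-listed key facts by exact string equality, an assumption made purely for illustration, whereas a real semantic metric would need a language model to judge whether facts match. `toy_semantic_f1` and the example facts are hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def toy_semantic_f1(gold_facts: set, response_facts: set) -> float:
    """Toy stand-in: facts are compared by exact match rather than by an LM judge."""
    if not gold_facts or not response_facts:
        return 0.0
    overlap = gold_facts & response_facts
    recall = len(overlap) / len(gold_facts)         # share of gold facts the response covers
    precision = len(overlap) / len(response_facts)  # share of response facts found in the gold answer
    return f1_from_precision_recall(precision, recall)


gold = {
    "low memory is directly mapped by the kernel",
    "high memory needs temporary mappings",
}
response = {
    "low memory is directly mapped by the kernel",
    "high memory is faster",
}
score = toy_semantic_f1(gold, response)
print(score)
```

Here the response covers one of two gold facts and one of its two statements is supported, so recall and precision are both one half.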

(1) Colab environment setup

!apt-get install -y pciutils lshw
!curl -fsSL https://ollama.ai/install.sh | sh
!pip install dspy
!pip install faiss-cpu

(2) Start the Ollama server

import os
import threading
import subprocess
import requests
import json
import time

def ollama():
    # Start the Ollama server, then give it a few seconds to come up
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])
    time.sleep(10)

ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()
ollama_thread.join()  # wait out the startup delay before pulling the model

def llama_run():
    # Download the llama3.2 model
    subprocess.Popen(["ollama", "pull", "llama3.2"])

llama_run_thread = threading.Thread(target=llama_run)
llama_run_thread.start()

Check whether the server started successfully:

!curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

A reply like the following shows that the Ollama service is running in Colab:

{"id": "chatcmpl-676", "object": "chat.completion", "created": 1745216425, "model": "llama3.2", "system_fingerprint": "fp_ollama", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello! How can I assist you today?"}, "finish_reason": "stop"}], "usage": {"prompt_tokens": 26, "completion_tokens": 10, "total_tokens": 36}}
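
The startup code relies on a fixed time.sleep(10) and a manual curl check. As an alternative, one could poll the server until it answers. The sketch below is one possible way to do that, not part of the original notebook: `wait_until_ready` and `ollama_is_up` are hypothetical helpers, and the assumption that a plain GET on http://localhost:11434 returns HTTP 200 once Ollama is up should be verified against your installation. The demo at the end uses a fake check so that it runs without a server.

```python
import time
import urllib.request


def wait_until_ready(check, timeout=30.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # e.g. connection refused: the server is not up yet
        if clock() >= deadline:
            return False
        sleep(interval)


def ollama_is_up(url="http://localhost:11434"):
    """Assumed health check: a GET on the server root answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.status == 200


# Demo with a fake check that succeeds on the third attempt (no network needed)
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout=5, interval=0,
                         sleep=lambda seconds: None)
print(ready)
```

In the notebook, `wait_until_ready(ollama_is_up)` could be called after starting the server thread and before pulling the model.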

(3) Build and iterate on the RAG system

import dspy
import ujson
from dspy.utils import download
import random
from dspy.evaluate import SemanticF1
from sentence_transformers import SentenceTransformer

# Use the locally deployed Llama 3.2 model (served by Ollama) through a custom API endpoint
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

# Download the RAG-QA Arena tech Q&A dataset from HuggingFace and convert it to
# DSPy Example objects, marking `question` as the input field
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")
with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]
data = [dspy.Example(**d).with_inputs('question') for d in data]

# Shuffle, then split into a training set (20 examples), a dev set (20 examples),
# and a test set (500 examples) for later optimization and evaluation
random.Random(0).shuffle(data)
trainset, devset, testset = data[:20], data[200:220], data[500:1000]
print(f'{len(trainset)}, {len(devset)}, {len(testset)}')

# Set up the evaluator with the SemanticF1 metric
metric = SemanticF1(decompositional=True)
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=12, display_progress=True, display_table=2)

# Download the reduced RAG-QA Arena tech corpus from HuggingFace
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

# Load the corpus: cap each document at 6000 characters, keep the first line
# of what remains, and append an ellipsis
max_characters = 6000  # truncates documents above the 99th percentile of length
with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters].split('\n')[0] + '...' for line in f]
print(f"Loaded {len(corpus)} documents. Will encode them below.")

# Generate text vectors with the all-MiniLM-L6-v2 sentence-embedding model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def embedder(texts):
    return embedding_model.encode(texts)

topk_docs_to_retrieve = 5  # number of documents to retrieve per search query
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

# Subclass dspy.Module; the program combines the retriever (search) with a generator (ChainOfThought)
class RAG(dspy.Module):
    def __init__(self):
        # Chain of thought: the signature 'context, question -> response' declares the inputs and outputs
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # Forward pass: retrieve documents -> pass them as context -> generate the final response
        context = search(question).passages
        return self.respond(context=context, question=question)

rag = RAG()
rag(question="what are high memory and low memory on linux?")
print(evaluate(RAG()))

# Use the MIPROv2 optimizer to tune the instructions and few-shot demonstrations automatically
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=12)
optimized_rag = tp.compile(rag, trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2, requires_permission_to_run=False)

# Compare behavior before and after optimization
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)
print(evaluate(optimized_rag))
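
The retrieve-then-respond flow of the RAG module can be sketched without any model downloads. The version below is a deliberately simplified stand-in, not what dspy.retrievers.Embeddings or SentenceTransformer do internally: `embed` is a toy word-count embedding, and `search_top_k`, `build_prompt`, and `toy_corpus` are hypothetical names used only to show top-k retrieval by cosine similarity followed by context assembly.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (stand-in for a sentence-embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_top_k(query, corpus, k=2):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Number the retrieved passages and place them before the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}"


toy_corpus = [
    "high memory is the part of physical memory not permanently mapped by the kernel",
    "use chmod to change file permissions",
    "low memory is directly mapped into kernel address space",
]
question = "what are high memory and low memory on linux"
passages = search_top_k(question, toy_corpus, k=2)
print(build_prompt(question, passages))
```

The unrelated chmod document shares no words with the question, so it scores zero and is left out of the context.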

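Part of the output is shown below:

Evaluation score of the RAG before optimization

Working steps of the dspy.MIPROv2 optimizer

STEP 1: Bootstrap a small number of few-shot examples to guide the model on the task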

==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/04/21 08:56:47 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.
2025/04/21 08:56:47 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...
Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19
 75%|███████▌  | 3/4 [00:00<00:00, 9.09it/s]
Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
...
100%|██████████| 4/4 [00:00<00:00, 9.12it/s]
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2:

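STEP 2: Propose instructions using the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip.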

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Proposing instructions...
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Check the documentation for the specific operating system version or use command-line tools like dscl and fs_usage to find the location.
...
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: 17: Given the context provided, respond with a step-by-step guide on how to recursively delete empty directories in your home directory.

STEP 3: Find the best combination of prompt parameters with Bayesian optimization

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 1 / 25 - Full Evaluation of Default Program ==
Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 9.65 / 16 (60.3%): 100%|██████████| 16/16 [00:00<00:00, 65.16it/s]
2025/04/21 08:56:52 INFO dspy.evaluate.evaluate: Average Metric: 9.654891774891775 / 16 (60.3%)
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 60.34
/usr/local/lib/python3.11/dist-packages/optuna/_experimental.py:31: ExperimentalWarning: Argument ``multivariate`` is an experimental feature. The interface can change in the future.
  warnings.warn(
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 25 =====
Average Metric: 9.90 / 16 (61.8%): 100%|██████████| 16/16 [00:00<00:00, 60.96it/s]
2025/04/21 08:56:52 INFO dspy.evaluate.evaluate: Average Metric: 9.895873015873017 / 16 (61.8%)
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 61.85
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.85 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [60.34, 61.85]
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 61.85
2025/04/21 08:56:52 INFO dspy.teleprompt.mipro_optimizer_v2: ========================
....
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 26 / 25 =====
Average Metric: 9.80 / 16 (61.2%): 100%|██████████| 16/16 [00:00<00:00, 61.79it/s]
2025/04/21 08:56:59 INFO dspy.evaluate.evaluate: Average Metric: 9.799439775910363 / 16 (61.2%)
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.25 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 16'].
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [60.34, 61.85, 58.79, 55.82, 50.58, 53.71, 63.6, 67.79, 62.04, 65.41, 67.79, 64.42, 67.79, 57.3, 58.93, 62.62, 63.39, 63.53, 60.06, 56.46, 58.87, 63.79, 67.79, 54.2, 67.79, 61.25]
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 67.79
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: =========================
2025/04/21 08:56:59 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 67.79!
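
The "Scores so far" and "Best score so far" lines in the log can be read as a running maximum over trial scores. The snippet below only replays the logged numbers to show that bookkeeping. It does not reproduce the Bayesian search that chooses which combination to try next, and `running_best` is a hypothetical helper.

```python
# Trial scores copied from the final "Scores so far" line of the log above
scores = [60.34, 61.85, 58.79, 55.82, 50.58, 53.71, 63.6, 67.79, 62.04, 65.41,
          67.79, 64.42, 67.79, 57.3, 58.93, 62.62, 63.39, 63.53, 60.06, 56.46,
          58.87, 63.79, 67.79, 54.2, 67.79, 61.25]


def running_best(values):
    """Return the best score seen after each trial."""
    best = float("-inf")
    history = []
    for value in values:
        best = max(best, value)
        history.append(best)
    return history


best_history = running_best(scores)
best_score = best_history[-1]
first_best_trial = scores.index(best_score) + 1  # 1-based trial number
gain_over_default = round(best_score - scores[0], 2)

print(f"best score: {best_score}, first reached at trial {first_best_trial}")
print(f"gain over the default program: {gain_over_default}")
```

The first entry is the default program's score, so the gain is measured against the unoptimized prompt on the same evaluation examples the optimizer used.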

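Evaluation score of the RAG after optimization

IV. Summary

This article introduced the basic usage of DSPy and walked through building, optimizing, and iterating on a technical Q&A RAG system in Colab.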