Instructions: Answer the question based on the provided context. If the context doesn't contain enough information to answer the question, say so explicitly. Include citations to specific context sections.
```python
# Create embeddings and FAISS vector store
print("Creating embeddings and FAISS vector database...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

# Save FAISS index
vectorstore.save_local("faiss_index")
```
```python
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # retrieve top 4 similar chunks
)
```
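Under the hood, `search_type="similarity"` ranks chunks by embedding similarity to the query and keeps the top `k`. A minimal sketch of that ranking with toy 2-D vectors (the vectors and helper names are illustrative, not the FAISS internals):

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=4):
    # rank chunk ids by similarity to the query, keep the best k
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.1]]
print(top_k([1.0, 0.0], chunk_vecs, k=4))  # → [0, 4, 1, 3]
```

Real vector stores use approximate-nearest-neighbor indexes rather than this exhaustive scan, but the ranking semantics are the same.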
Instantiate the LLM: I usually use Gemini in tutorials because it has a free tier and the small models are fast.
```python
# Set your Google API key
os.environ["GOOGLE_API_KEY"] = "your_api_key"

# Create LLM and QA chain
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-lite",
    temperature=0
)
```
Build and run the RAG pipeline: write a function that ties the whole pipeline together and prints the answer with citations.
```python
# Create RAG pipeline
def ask_with_sources(question):
    # Retrieve docs first
    docs = retriever.invoke(question)

    # Format context
    context = "\n\n".join(
        f"Source: {doc.metadata.get('source', 'Unknown')} "
        f"(Page {doc.metadata.get('page', 'N/A')})\n"
        f"Content: {doc.page_content}"
        for doc in docs
    )

    # Generate answer
    prompt_text = f"""Answer the question based only on the following retrieved context,
and include the source used at the end as reference:

{context}

Question: {question}"""
    # The original snippet was truncated here; a minimal completion:
    response = llm.invoke(prompt_text)
    return response.content
```
```python
# Extract citation numbers like [1], [4], etc.
citations = re.findall(r'\[(\d+)\]', response.response)
cited_indices = {int(cid) for cid in citations}  # Use set for fast lookup
```
```python
# Create index
index = VectorStoreIndex.from_documents(documents)
```
```python
# Set up reranker (post-processor)
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3  # number of final nodes to keep after reranking
)

# Create query engine with reranker
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=10,  # retrieve more candidates for reranking
    citation_chunk_size=128,
    node_postprocessors=[rerank],  # apply reranking after retrieval
)
```
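The retrieve-then-rerank pattern above (fetch 10 candidates cheaply, keep the best 3 after a more expensive scoring pass) can be sketched without the model dependencies. Here `cheap_score` and `expensive_score` are toy stand-ins for the bi-encoder and cross-encoder:

```python
def rerank_pipeline(query, docs, cheap_score, expensive_score,
                    similarity_top_k=10, top_n=3):
    # Stage 1: cheap retrieval over the whole corpus
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:similarity_top_k]
    # Stage 2: expensive rerank over the small candidate set only
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:top_n]

# Toy scorers: word overlap for retrieval, length-penalized overlap for rerank
docs = ["noise levels at work", "parental leave policy",
        "ambient noise under 45 dB", "office temperature standards"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
result = rerank_pipeline("ambient noise", docs, overlap,
                         lambda q, d: overlap(q, d) / len(d.split()),
                         similarity_top_k=3, top_n=2)
print(result)  # → ['ambient noise under 45 dB', 'noise levels at work']
```

The point of the two stages is cost: the expensive scorer only ever sees `similarity_top_k` documents, never the whole corpus.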
Test:
```python
# Test
response = run_rag("what are the Ambient noise levels required")
```
Output:
```
Answer: Ambient noise must be under 45 dB during work hours [4].

==================================================
[4] Metadata (CITED):
File: remote_policy.pdf
Page: 1
Score: -6.811233043670654
Text snippet: Source 4: ft. dedicated work area - Professional background for video calls
3.2 Technical Specifications: - Internet: Minimum 100 Mbps download / 20 Mbps upload (fiber preferred) - Backup connection: Cellular hotspot with 15GB monthly data plan - Hardware: Dual monitors (24"+), ergonomic chair, noise-canceling headset
3.3 Environmental Standards: - Ambient noise under 45 dB during work hours - Temperature-controlled environment (68-75°F) - Adequate lighting meeting ISO 9241-6 standards
```
### Role
You are an expert Information Retrieval Judge. Your task is to evaluate the relevance of the following documents to a specific user query.

### Evaluation Criteria
Assign a relevance score from 0.0 to 1.0 for each document based on these rules:
1. Accuracy: Does the document directly answer the query?
2. Specificity: Does it contain technical details or specific data points rather than generalities?
3. Constraints: Prioritize documents that mention [INSERT SPECIFIC BUSINESS CONSTRAINT, e.g., "2024 Policy Updates"].
### Inputs Query: {{user_query}}
Documents: {{retrieved_context_list}}
### Output Format
Return ONLY a JSON object where the keys are the document IDs and the values are the numerical scores.
Example: {"doc_1": 0.95, "doc_2": 0.40}
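To use this judge prompt programmatically, you substitute the query and documents into the template and parse the JSON the model returns. A hedged sketch: the template is abbreviated, `call_llm` is a placeholder for whatever client you use, and the 0.5 threshold is an arbitrary example:

```python
import json

JUDGE_PROMPT = """### Role
You are an expert Information Retrieval Judge. (abbreviated)

### Inputs
Query: {user_query}

Documents: {retrieved_context_list}

### Output Format
Return ONLY a JSON object where the keys are the document IDs
and the values are the numerical scores."""

def judge_relevance(user_query, docs, call_llm, threshold=0.5):
    # Fill the template, call the model, parse its JSON verdict
    prompt = JUDGE_PROMPT.format(
        user_query=user_query,
        retrieved_context_list="\n".join(f"{k}: {v}" for k, v in docs.items()),
    )
    scores = json.loads(call_llm(prompt))
    # Keep only documents the judge scored at or above the threshold
    return {doc_id: s for doc_id, s in scores.items() if s >= threshold}

# Stub LLM so the sketch runs offline
fake_llm = lambda prompt: '{"doc_1": 0.95, "doc_2": 0.40}'
kept = judge_relevance("paid parental leave", {"doc_1": "...", "doc_2": "..."}, fake_llm)
print(kept)  # → {'doc_1': 0.95}
```

In production you would also guard the `json.loads` call, since models occasionally return malformed JSON despite the instruction.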
Optimized: "What is the eligibility criteria, total duration, and salary percentage for paid parental leave under the company's Family and Medical Leave policy?"
```python
# Simple demo documents
texts = [
    "Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.",
    "Microsoft was founded by Bill Gates and Paul Allen. Microsoft is based in Redmond, Washington.",
    "Google is a subsidiary of Alphabet Inc. Sundar Pichai is the CEO of Google.",
    "Apple and Microsoft are competitors in the tech industry."
]

documents = [Document(text=t) for t in texts]
```
Build the RAG engine with a graph index:
```python
# Create an in-memory graph store
graph_store = SimplePropertyGraphStore()

# Set up LLM and embedding model settings
Settings.llm = GoogleGenAI(model="models/gemini-2.0-flash-lite")
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Build index - LlamaIndex auto-extracts entities/relations using LLM
index = PropertyGraphIndex.from_documents(
    documents,
    graph_store=graph_store,
    show_progress=True,
    use_async=False,
    llm=Settings.llm,
)

# Create query engine that uses both vector + graph context
query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    similarity_top_k=2,
)
```
Test:
```python
# Ask a question that benefits from graph reasoning
question = "Who is the CEO of Apple, and where is it headquartered?"
response = query_engine.query(question)

print("Answer:\n", response.response)

# Show graph context used
print("\nGraph context used:")
for node in response.source_nodes:
    print("-", node.text)
```
Output:
```
Answer: Tim Cook is the CEO of Apple, and the company is headquartered in Cupertino, California.

Graph context used:
- Here are some facts extracted from the provided text:
Tim cook -> Is -> Ceo
Tim cook -> Is -> Apple

Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.
- Here are some facts extracted from the provided text:
Apple inc. -> Headquartered in -> California
Apple inc. -> Headquartered in -> Cupertino
Tim cook -> Is -> Ceo

Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.
- Here are some facts extracted from the provided text:
Sundar pichai -> Is -> Ceo

Google is a subsidiary of Alphabet Inc. Sundar Pichai is the CEO of Google.
```
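The `entity -> relation -> entity` lines in that output are the triples the property graph stores; answering the question then amounts to looking up edges around the relevant entities. A toy adjacency-list version (not LlamaIndex's internal representation):

```python
from collections import defaultdict

# Triples as extracted by the LLM: (subject, relation, object)
triples = [
    ("Tim cook", "Is", "Ceo"),
    ("Apple inc.", "Headquartered in", "Cupertino"),
    ("Apple inc.", "Headquartered in", "California"),
    ("Sundar pichai", "Is", "Ceo"),
]

# Build a simple adjacency list: subject -> [(relation, object), ...]
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def facts_about(entity):
    # Return all outgoing edges for an entity
    return graph[entity]

print(facts_about("Apple inc."))
# → [('Headquartered in', 'Cupertino'), ('Headquartered in', 'California')]
```

This is why graph RAG helps with multi-hop questions: once "Apple" is identified, every fact attached to it is one edge lookup away, regardless of which chunk it came from.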
```python
# Wrap in hybrid retriever
hybrid_retriever = HybridRRFRetriever(vector_retriever, bm25_retriever, rrf_k=60)
```
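`HybridRRFRetriever` is the custom class built earlier in the tutorial; its core idea, Reciprocal Rank Fusion, merges the two ranked lists by giving each document a score of 1/(rrf_k + rank) in every list it appears in. A self-contained sketch of the fusion step:

```python
def rrf_fuse(ranked_lists, rrf_k=60):
    # Each document scores 1/(rrf_k + rank) per list; ranks start at 1
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d2"]   # dense retriever ranking
bm25_hits = ["d1", "d4", "d3"]     # keyword retriever ranking
print(rrf_fuse([vector_hits, bm25_hits], rrf_k=60))
# → ['d1', 'd3', 'd4', 'd2']
```

Note that d1 wins despite never being ranked first in either list: appearing near the top of both lists beats topping only one, which is exactly the behavior you want from a hybrid retriever. The constant 60 is the conventional default from the original RRF paper.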
```python
# Set up reranker (post-processor)
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3  # number of final nodes to keep after reranking
)

# Use with CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    retriever=hybrid_retriever,  # override default retriever
    citation_chunk_size=128,
    node_postprocessors=[rerank],  # reranker still applied!
)
```
```python
# Extract citation numbers like [1], [4], etc.
citations = re.findall(r'\[(\d+)\]', response.response)
cited_indices = {int(cid) for cid in citations}  # Use set for fast lookup
```
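The extraction can be checked on a plain string; a quick standalone demo of the same regex (the sample answer text is made up):

```python
import re

answer = "Ambient noise must be under 45 dB during work hours [4], per the policy [1][4]."
citations = re.findall(r'\[(\d+)\]', answer)       # captures only the digits
cited_indices = {int(cid) for cid in citations}    # set removes the duplicate [4]
print(cited_indices)  # → {1, 4}
```

The set is what you iterate over to decide which source nodes to display, so repeated citations of the same source are shown only once.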