Instructions: Answer the question based on the provided context. If the context doesn't contain enough information to answer the question, say so explicitly. Include citations to specific context sections.
```python
# Create embeddings and FAISS vector store
print("Creating embeddings and FAISS vector database...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

# Save FAISS index
vectorstore.save_local("faiss_index")
```
```python
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # retrieve top 4 similar chunks
)
```
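Under the hood, `search_type="similarity"` ranks chunks by embedding similarity to the query and keeps the top `k`. A minimal sketch of that ranking with toy 2-D vectors (the vectors and helper names are illustrative, not the FAISS internals):

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=4):
    # rank chunk ids by similarity to the query, keep the best k
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.1]]
print(top_k([1.0, 0.0], chunk_vecs, k=4))  # → [0, 4, 1, 3]
```

Real vector stores use approximate-nearest-neighbor indexes rather than this exhaustive scan, but the ranking semantics are the same.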
Instantiate the LLM: I usually use Gemini in tutorials because it has a free tier and the small models are fast.
```python
# Set your Google API key
os.environ["GOOGLE_API_KEY"] = "your_api_key"

# Create LLM and QA chain
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-lite",
    temperature=0
)
```
Build and run the RAG pipeline: write a function that ties the whole pipeline together and prints the answer with citations.
```python
# Create RAG pipeline
def ask_with_sources(question):
    # Retrieve docs first
    docs = retriever.invoke(question)

    # Format context
    context = "\n\n".join(
        f"Source: {doc.metadata.get('source', 'Unknown')} "
        f"(Page {doc.metadata.get('page', 'N/A')})\n"
        f"Content: {doc.page_content}"
        for doc in docs
    )

    # Generate answer
    prompt_text = f"""Answer the question based only on the following retrieved context,
and include the source used at the end as reference:

{context}

Question: {question}"""
    # The original snippet was truncated here; a minimal completion:
    response = llm.invoke(prompt_text)
    return response.content
```
```python
# Extract citation numbers like [1], [4], etc.
citations = re.findall(r'\[(\d+)\]', response.response)
cited_indices = {int(cid) for cid in citations}  # Use set for fast lookup
```
```python
# Create index
index = VectorStoreIndex.from_documents(documents)
```
```python
# Set up reranker (post-processor)
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3  # number of final nodes to keep after reranking
)

# Create query engine with reranker
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=10,  # retrieve more candidates for reranking
    citation_chunk_size=128,
    node_postprocessors=[rerank],  # apply reranking after retrieval
)
```
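The retrieve-then-rerank pattern above (fetch 10 candidates cheaply, keep the best 3 after a more expensive scoring pass) can be sketched without the model dependencies. Here `cheap_score` and `expensive_score` are toy stand-ins for the bi-encoder and cross-encoder:

```python
def rerank_pipeline(query, docs, cheap_score, expensive_score,
                    similarity_top_k=10, top_n=3):
    # Stage 1: cheap retrieval over the whole corpus
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:similarity_top_k]
    # Stage 2: expensive rerank over the small candidate set only
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:top_n]

# Toy scorers: word overlap for retrieval, length-penalized overlap for rerank
docs = ["noise levels at work", "parental leave policy",
        "ambient noise under 45 dB", "office temperature standards"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
result = rerank_pipeline("ambient noise", docs, overlap,
                         lambda q, d: overlap(q, d) / len(d.split()),
                         similarity_top_k=3, top_n=2)
print(result)  # → ['ambient noise under 45 dB', 'noise levels at work']
```

The point of the two stages is cost: the expensive scorer only ever sees `similarity_top_k` documents, never the whole corpus.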
Test:
```python
# Test
response = run_rag("what are the Ambient noise levels required")
```
Output:
```
Answer: Ambient noise must be under 45 dB during work hours [4].

==================================================
[4] Metadata (CITED):
File: remote_policy.pdf
Page: 1
Score: -6.811233043670654
Text snippet: Source 4: ft. dedicated work area - Professional background for video calls
3.2 Technical Specifications: - Internet: Minimum 100 Mbps download / 20 Mbps upload (fiber preferred) - Backup connection: Cellular hotspot with 15GB monthly data plan - Hardware: Dual monitors (24"+), ergonomic chair, noise-canceling headset
3.3 Environmental Standards: - Ambient noise under 45 dB during work hours - Temperature-controlled environment (68-75°F) - Adequate lighting meeting ISO 9241-6 standards
```
### Role
You are an expert Information Retrieval Judge. Your task is to evaluate the relevance of the following documents to a specific user query.

### Evaluation Criteria
Assign a relevance score from 0.0 to 1.0 for each document based on these rules:
1. Accuracy: Does the document directly answer the query?
2. Specificity: Does it contain technical details or specific data points rather than generalities?
3. Constraints: Prioritize documents that mention [INSERT SPECIFIC BUSINESS CONSTRAINT, e.g., "2024 Policy Updates"].
### Inputs Query: {{user_query}}
Documents: {{retrieved_context_list}}
### Output Format
Return ONLY a JSON object where the keys are the document IDs and the values are the numerical scores.
Example: {"doc_1": 0.95, "doc_2": 0.40}
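To use this judge prompt programmatically, you substitute the query and documents into the template and parse the JSON the model returns. A hedged sketch: the template is abbreviated, `call_llm` is a placeholder for whatever client you use, and the 0.5 threshold is an arbitrary example:

```python
import json

JUDGE_PROMPT = """### Role
You are an expert Information Retrieval Judge. (abbreviated)

### Inputs
Query: {user_query}

Documents: {retrieved_context_list}

### Output Format
Return ONLY a JSON object where the keys are the document IDs
and the values are the numerical scores."""

def judge_relevance(user_query, docs, call_llm, threshold=0.5):
    # Fill the template, call the model, parse its JSON verdict
    prompt = JUDGE_PROMPT.format(
        user_query=user_query,
        retrieved_context_list="\n".join(f"{k}: {v}" for k, v in docs.items()),
    )
    scores = json.loads(call_llm(prompt))
    # Keep only documents the judge scored at or above the threshold
    return {doc_id: s for doc_id, s in scores.items() if s >= threshold}

# Stub LLM so the sketch runs offline
fake_llm = lambda prompt: '{"doc_1": 0.95, "doc_2": 0.40}'
kept = judge_relevance("paid parental leave", {"doc_1": "...", "doc_2": "..."}, fake_llm)
print(kept)  # → {'doc_1': 0.95}
```

In production you would also guard the `json.loads` call, since models occasionally return malformed JSON despite the instruction.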
Optimized: "What is the eligibility criteria, total duration, and salary percentage for paid parental leave under the company's Family and Medical Leave policy?"
```python
# Simple demo documents
texts = [
    "Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.",
    "Microsoft was founded by Bill Gates and Paul Allen. Microsoft is based in Redmond, Washington.",
    "Google is a subsidiary of Alphabet Inc. Sundar Pichai is the CEO of Google.",
    "Apple and Microsoft are competitors in the tech industry."
]

documents = [Document(text=t) for t in texts]
```
Build the RAG engine with a graph index:
```python
# Create an in-memory graph store
graph_store = SimplePropertyGraphStore()

# Set up LLM and embedding model settings
Settings.llm = GoogleGenAI(model="models/gemini-2.0-flash-lite")
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Build index - LlamaIndex auto-extracts entities/relations using LLM
index = PropertyGraphIndex.from_documents(
    documents,
    graph_store=graph_store,
    show_progress=True,
    use_async=False,
    llm=Settings.llm,
)

# Create query engine that uses both vector + graph context
query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    similarity_top_k=2,
)
```
Test:
```python
# Ask a question that benefits from graph reasoning
question = "Who is the CEO of Apple, and where is it headquartered?"
response = query_engine.query(question)

print("Answer:\n", response.response)

# Show graph context used
print("\nGraph context used:")
for node in response.source_nodes:
    print("-", node.text)
```
Output:
```
Answer: Tim Cook is the CEO of Apple, and the company is headquartered in Cupertino, California.

Graph context used:
- Here are some facts extracted from the provided text:
Tim cook -> Is -> Ceo
Tim cook -> Is -> Apple

Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.
- Here are some facts extracted from the provided text:
Apple inc. -> Headquartered in -> California
Apple inc. -> Headquartered in -> Cupertino
Tim cook -> Is -> Ceo

Apple Inc. is headquartered in Cupertino, California. Tim Cook is the CEO of Apple.
- Here are some facts extracted from the provided text:
Sundar pichai -> Is -> Ceo

Google is a subsidiary of Alphabet Inc. Sundar Pichai is the CEO of Google.
```
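The `entity -> relation -> entity` lines in that output are the triples the property graph stores; answering the question then amounts to looking up edges around the relevant entities. A toy adjacency-list version (not LlamaIndex's internal representation):

```python
from collections import defaultdict

# Triples as extracted by the LLM: (subject, relation, object)
triples = [
    ("Tim cook", "Is", "Ceo"),
    ("Apple inc.", "Headquartered in", "Cupertino"),
    ("Apple inc.", "Headquartered in", "California"),
    ("Sundar pichai", "Is", "Ceo"),
]

# Build a simple adjacency list: subject -> [(relation, object), ...]
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def facts_about(entity):
    # Return all outgoing edges for an entity
    return graph[entity]

print(facts_about("Apple inc."))
# → [('Headquartered in', 'Cupertino'), ('Headquartered in', 'California')]
```

This is why graph RAG helps with multi-hop questions: once "Apple" is identified, every fact attached to it is one edge lookup away, regardless of which chunk it came from.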
```python
# Wrap in hybrid retriever
hybrid_retriever = HybridRRFRetriever(vector_retriever, bm25_retriever, rrf_k=60)
```
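`HybridRRFRetriever` is the custom class built earlier in the tutorial; its core idea, Reciprocal Rank Fusion, merges the two ranked lists by giving each document a score of 1/(rrf_k + rank) in every list it appears in. A self-contained sketch of the fusion step:

```python
def rrf_fuse(ranked_lists, rrf_k=60):
    # Each document scores 1/(rrf_k + rank) per list; ranks start at 1
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d2"]   # dense retriever ranking
bm25_hits = ["d1", "d4", "d3"]     # keyword retriever ranking
print(rrf_fuse([vector_hits, bm25_hits], rrf_k=60))
# → ['d1', 'd3', 'd4', 'd2']
```

Note that d1 wins despite never being ranked first in either list: appearing near the top of both lists beats topping only one, which is exactly the behavior you want from a hybrid retriever. The constant 60 is the conventional default from the original RRF paper.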
```python
# Set up reranker (post-processor)
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3  # number of final nodes to keep after reranking
)

# Use with CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index,
    retriever=hybrid_retriever,  # override default retriever
    citation_chunk_size=128,
    node_postprocessors=[rerank],  # reranker still applied!
)
```
```python
# Extract citation numbers like [1], [4], etc.
citations = re.findall(r'\[(\d+)\]', response.response)
cited_indices = {int(cid) for cid in citations}  # Use set for fast lookup
```
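The extraction can be checked on a plain string; a quick standalone demo of the same regex (the sample answer text is made up):

```python
import re

answer = "Ambient noise must be under 45 dB during work hours [4], per the policy [1][4]."
citations = re.findall(r'\[(\d+)\]', answer)       # captures only the digits
cited_indices = {int(cid) for cid in citations}    # set removes the duplicate [4]
print(cited_indices)  # → {1, 4}
```

The set is what you iterate over to decide which source nodes to display, so repeated citations of the same source are shown only once.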