|
The article covers:

• A review of the RAG pipeline
• The concept of sentence window retrieval (SWR)
• A detailed SWR implementation
• How to tune and evaluate SWR

Basic RAG concepts

First, a review of the basic RAG architecture, which works better with smaller content chunks.

[Figure: basic RAG architecture]

As noted above, RAG works better with small chunks, so the first step is still to split the document into small chunks. Once a relevant chunk has been retrieved, we expand a context window around the matched sentence and send that sentence together with its surrounding context to the LLM. This is sentence window retrieval.

To make sentence window retrieval easier to understand, I drew an architecture diagram:

[Figure: sentence window retrieval architecture]

If you only want the concept, you can stop reading here; what follows is the implementation and an evaluation demo.
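For readers who want to see the idea before the library code, here is a minimal plain-Python sketch. It is only an illustration under stated assumptions: `split_sentences`, `score`, and `retrieve_with_window` are hypothetical names, the sample text is made up, and simple word overlap stands in for the embedding similarity a real pipeline would use.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _words(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def score(query, sentence):
    """Toy relevance score: number of words shared with the query."""
    return len(_words(query) & _words(sentence))


def retrieve_with_window(text, query, window_size=1):
    """Match on single sentences, then return the best match plus its neighbours."""
    sentences = split_sentences(text)
    best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))
    start = max(0, best - window_size)
    end = min(len(sentences), best + window_size + 1)
    return sentences[best], " ".join(sentences[start:end])


text = (
    "Learn the foundations first. Projects deepen your skills. "
    "A portfolio shows progress. Then look for a job."
)
matched, context = retrieve_with_window(
    text, "How do projects help skills?", window_size=1
)
print(matched)   # the small unit used for matching
print(context)   # the wider window that would be sent to the LLM
```

Matching happens on the single sentence, while the text handed to the LLM is the wider window around it.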
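The following shows how to use and evaluate sentence window retrieval.

Loading the document

Fetch and parse the document, using the same steps as before:

    import warnings
    warnings.filterwarnings('ignore')

    import utils  # local helper module, not shown in this article
    import os
    import openai
    openai.api_key = utils.get_openai_api_key()

    from llama_index import SimpleDirectoryReader

    documents = SimpleDirectoryReader(
        input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
    ).load_data()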
Merging the documents

Merge the loaded pages into a single Document object so they are easier to process:

    from llama_index import Document

    document = Document(text="\n\n".join([doc.text for doc in documents]))
Node parsing

Create a NodeParser that supports sentence windows (we use a window size of 3):

    from llama_index.node_parser import SentenceWindowNodeParser

    # create the sentence window node parser w/ default settings
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
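To make the three parameters concrete, the sketch below imitates in plain Python what such a parser stores for each sentence. It is not llama_index code: `make_window_nodes` is a hypothetical helper, nodes are modelled as plain dicts, and the window is assumed to cover `window_size` sentences on each side of the current one.

```python
def make_window_nodes(
    sentences,
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
):
    """Build one dict per sentence, storing its surrounding window as metadata."""
    nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window to the document boundaries.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # the small unit that gets embedded
                "metadata": {
                    window_metadata_key: " ".join(sentences[start:end]),
                    original_text_metadata_key: sentence,
                },
            }
        )
    return nodes


sentences = ["hello.", "foo bar.", "cat dog.", "mouse."]
nodes = make_window_nodes(sentences, window_size=1)
print(nodes[1]["metadata"])
```

Each node keeps its own sentence as the text to embed, while the wider window sits in metadata until it is needed at query time.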
Building the context

Build the context with the standard ServiceContext.from_defaults method, passing in the node_parser created in the previous step.

    from llama_index.llms import OpenAI

    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

    from llama_index import ServiceContext

    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        # embed_model="local:BAAI/bge-large-en-v1.5"
        node_parser=node_parser,
    )
Building the sentence index

Build the index with VectorStoreIndex:

    from llama_index import VectorStoreIndex

    sentence_index = VectorStoreIndex.from_documents(
        [document], service_context=sentence_context
    )
Persist the index to disk. Here we specify a directory relative to the current one, so the index can later be restored from it without repeating the steps above.

    sentence_index.storage_context.persist(persist_dir="./sentence_index")

Building the postprocessor

    from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

    postproc = MetadataReplacementPostProcessor(
        target_metadata_key="window"
    )
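Conceptually, metadata replacement swaps each retrieved node's text for the value stored under the target metadata key. The plain-Python sketch below illustrates that idea only; nodes are modelled as dicts (an assumption for illustration, not the llama_index node class), `replace_with_metadata` is a hypothetical name, and unlike a processor that edits nodes in place, this version returns copies.

```python
from copy import deepcopy


def replace_with_metadata(nodes, target_metadata_key="window"):
    """Return copies of nodes whose text is replaced by metadata[target_metadata_key]."""
    replaced = []
    for node in nodes:
        new_node = deepcopy(node)  # leave the caller's nodes untouched
        # Fall back to the original text when the key is missing.
        new_node["text"] = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append(new_node)
    return replaced


retrieved = [
    {
        "text": "foo bar.",
        "metadata": {
            "window": "hello. foo bar. cat dog.",
            "original_text": "foo bar.",
        },
    },
    {"text": "no window here.", "metadata": {}},
]
replaced = replace_with_metadata(retrieved)
print(replaced[0]["text"])
```

The single sentence was what matched the query, but after replacement the LLM receives the whole window.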
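    from llama_index.schema import NodeWithScore
    from copy import deepcopy

    # `nodes` is not defined in the original snippet; parsing the merged
    # document with node_parser is an assumption made so the code runs.
    nodes = node_parser.get_nodes_from_documents([document])

    scored_nodes = [NodeWithScore(node=x, score=1.0) for x in nodes]
    nodes_old = [deepcopy(n) for n in nodes]  # keep untouched copies for comparison

Run the postprocessor on the original nodes.

    replaced_nodes = postproc.postprocess_nodes(scored_nodes)

Reranking

    from llama_index.indices.postprocessor import SentenceTransformerRerank

    rerank = SentenceTransformerRerank(
        top_n=2, model="BAAI/bge-reranker-base"
    )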
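The reranking step can be pictured as re-scoring the retrieved candidates against the query and keeping only the `top_n` best. The sketch below is a rough plain-Python illustration: `overlap_score` is a hypothetical word-overlap scorer standing in for the BAAI/bge-reranker-base model, and the candidate sentences are made up.

```python
import re


def overlap_score(query, text):
    """Stand-in scorer: fraction of query words that appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, top_n=2, scorer=overlap_score):
    """Re-score every candidate against the query and keep the top_n best."""
    scored = [(scorer(query, text), text) for text in candidates]
    # The sort is stable, so ties keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


candidates = [
    "Networking helps you find a job.",
    "Learn foundational technical skills.",
    "Projects build technical skills.",
]
top = rerank("which technical skills to learn", candidates, top_n=2)
print(top)
```

Retrieving a wider set first and then trimming it with a stronger scorer is the pattern the query engine below follows with `similarity_top_k=6` and `top_n=2`.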
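Running the query engine

    sentence_window_engine = sentence_index.as_query_engine(
        similarity_top_k=6, node_postprocessors=[postproc, rerank]
    )

    window_response = sentence_window_engine.query(
        "What are the keys to building a career in AI?"
    )

Final response: The keys to building a career in AI include learning foundational technical skills, working on projects, finding a job, and being part of a supportive community.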
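Evaluation

We evaluate with the same approach as before, in two steps: build the question list, then run the evaluation.

    eval_questions = []
    with open('generated_questions.text', 'r') as file:
        for line in file:
            # Remove the trailing newline and surrounding whitespace
            item = line.strip()
            eval_questions.append(item)

    from trulens_eval import Tru

    def run_evals(eval_questions, tru_recorder, query_engine):
        for question in eval_questions:
            # Each query made inside the recorder context is logged for evaluation
            with tru_recorder as recording:
                response = query_engine.query(question)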
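The question-loading and query loop can be dry-run without an LLM or TruLens. The sketch below is an illustration only: `load_questions`, `collect_answers`, and `EchoEngine` are hypothetical names, the questions are made up, and the stub engine merely echoes its input. One small difference from the loop above is that blank lines in the file are skipped.

```python
import os
import tempfile


def load_questions(path):
    """Read one question per line, dropping blank lines and surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def collect_answers(questions, query_engine):
    """Run every question through the engine and pair it with the response."""
    return [(q, query_engine.query(q)) for q in questions]


class EchoEngine:
    """Hypothetical stand-in for a real query engine."""

    def query(self, question):
        return f"answer to: {question}"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "questions.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("What is SWR?\n\nWhy use small chunks?\n")
    questions = load_questions(path)

results = collect_answers(questions, EchoEngine())
print(results)
```

Swapping the stub for a real query engine keeps the same loop shape, which makes it easy to check the question file before spending API calls.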
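Comparing window sizes

Next, compare how SWR performs with different parameters.

Window size = 1

Create an index with a window size of 1:

    # build_sentence_window_index, get_sentence_window_query_engine and
    # get_prebuilt_trulens_recorder are helpers not defined in this article;
    # they are assumed to wrap the steps shown above.
    sentence_index_1 = build_sentence_window_index(
        documents,
        llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
        embed_model="local:BAAI/bge-small-en-v1.5",
        sentence_window_size=1,
        save_dir="sentence_index_1",
    )
    sentence_window_engine_1 = get_sentence_window_query_engine(
        sentence_index_1
    )
    tru_recorder_1 = get_prebuilt_trulens_recorder(
        sentence_window_engine_1,
        app_id='sentencewindowengine1'
    )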
Window size = 3

Create an index with a window size of 3:

    sentence_index_3 = build_sentence_window_index(
        documents,
        llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
        embed_model="local:BAAI/bge-small-en-v1.5",
        sentence_window_size=3,
        save_dir="sentence_index_3",
    )
    sentence_window_engine_3 = get_sentence_window_query_engine(
        sentence_index_3
    )

    tru_recorder_3 = get_prebuilt_trulens_recorder(
        sentence_window_engine_3,
        app_id='sentencewindowengine3'
    )
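Running the dashboard

Open the dashboard to compare the two runs. With a window size of 3, all three evaluation metrics score well.

In real development we likewise have to adjust the parameters again and again, evaluating and comparing each time, to find the best RAG method and parameters.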
|