|
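RAG 2.0, introduced by contextual.ai, pretrains, fine-tunes, and aligns all components as a single integrated system, backpropagating through both the language model and the retriever to maximize performance. It targets a known weakness of RAG: each component works well on its own, but the assembled system is far from optimal.
Google DeepMind proposes a new approach, RICHES (Retrieval Interlaced with Sequence Generation), which natively interleaves text generation with document retrieval in a single LLM and a single decoding pass. It needs no separate retriever and generator; the model directly decodes document content or related natural-language retrieval keys. It adapts to diverse new tasks through prompting alone, with no additional training.
Figure: example RICHES output for a multi-hop query, using a single large language model (LLM) and one decoding pass. The green quoted text is "retrieved", that is, generated verbatim from the retrieval corpus. RICHES generation natively interleaves thoughts with multiple pieces of retrieved evidence.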

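The RICHES workflow, TL;DR:
Initialize the model: choose a suitable pretrained large language model (LLM).
Define retrieval keys: decide which document identifiers are used for retrieval, such as titles, paragraphs, sentences, or propositions.
Build the index: index the corpus with a technique such as FM-Index so that constraint lookups are efficient.
Receive input: take the user's question or query as input.
Alternate generation: the LLM alternates between free-form text generation and constrained retrieval-key generation.
Apply constraints: during generation, use the index to constrain retrieval keys so that each one corresponds to a valid document in the corpus.
Retrieve documents: use the generated retrieval keys to fetch the relevant documents or passages from the corpus.
Integrate and output: combine the retrieved content with the generated text to form the complete answer or solution.
Evaluate: score the output with appropriate metrics, such as F1 or AutoAIS.
Iterate: refine the model and the pipeline based on the evaluation results.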
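The alternation between free generation and constrained retrieval-key generation in the steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: `propose` is a hypothetical stand-in for the LLM that replays a fixed script of ranked candidate words, the corpus is two sentences borrowed from the appendix prompt examples, and a key is accepted only if it is a word-level prefix of a corpus passage (a real index would accept any substring).

```python
from typing import List, Tuple

# Toy corpus (assumption): two sentences taken from the appendix prompt examples.
CORPUS = [
    "Alex Ferguson recruited Cristiano Ronaldo",
    "Sir Alex Ferguson managed Manchester United from 1986 to 2013.",
]
OPEN, CLOSE = "«", "»"


def allowed(key: List[str], word: str) -> bool:
    """Can `word` legally extend the retrieval key built so far?"""
    if word == CLOSE:
        # The key may only be closed once it equals a full corpus passage.
        return " ".join(key) in CORPUS
    cand = key + [word]
    return any(p.split()[: len(cand)] == cand for p in CORPUS)


# Hypothetical stand-in for the LLM: ranked candidate words for each step.
SCRIPT = [
    ["keyword:"], ["Ronaldo's"], ["manager"], [OPEN],
    ["Jose", "Alex"],            # "Jose" is ranked first but is not in the corpus
    ["Ferguson"], ["signed", "recruited"], ["Cristiano"], ["Ronaldo"], [CLOSE],
    ["answer:"], ["Alex"], ["Ferguson"],
]


def propose(generated: List[str]) -> List[str]:
    step = len(generated)
    return SCRIPT[step] if step < len(SCRIPT) else []


def interleaved_decode(max_steps: int = 50) -> Tuple[str, List[str]]:
    out: List[str] = []        # everything generated so far
    key: List[str] = []        # retrieval key currently being decoded
    retrieved: List[str] = []  # completed retrieval keys
    constrained = False
    for _ in range(max_steps):
        candidates = propose(out)
        if constrained:
            # Inside « ... » only corpus-consistent words survive.
            candidates = [w for w in candidates if allowed(key, w)]
        if not candidates:
            break
        word = candidates[0]
        out.append(word)
        if constrained and word == CLOSE:
            retrieved.append(" ".join(key))
            constrained = False
        elif constrained:
            key.append(word)
        elif word == OPEN:
            constrained, key = True, []
    return " ".join(out), retrieved


text, retrieved = interleaved_decode()
print(text)
print(retrieved)
```

The free-text parts ("keyword:", "answer:") pass through untouched, while the words between « and » are forced onto a passage that exists in the corpus, which is what lets the same decoding pass both "think" and "retrieve".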
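Interleaving retrieval and generation:
Definition of retrieval keys:
Updating the probability model:
Constrained beam decoding:
- Figure: visualization of the constrained beam for the query "When was Marathon renamed Snickers?". The final RICHES output is "Marathon was renamed Snickers in 1990". Bold boxes trace the progress of the top beam sequence. Grey crossed-out boxes are sequences that the LLM preferred but that the corpus constraints blocked.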
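To make the beam picture concrete, here is a minimal sketch of beam search in which every expansion must stay on a corpus passage. All specifics are assumptions for illustration: the three-passage corpus is made up, `TOY_LOGPROB` is a hypothetical table standing in for LLM scores, tokens are whole words, and passages are matched from their first word rather than from any substring.

```python
from typing import List, Tuple

# Toy corpus (assumption): made-up passages for illustration only.
CORPUS = [
    "Marathon was renamed Snickers in 1990",
    "Marathon was first sold in 1930",
    "Snickers is made by Mars",
]
TOKENIZED = [p.split() for p in CORPUS]
EOS = "<eos>"

# Hypothetical per-word log-probabilities standing in for LLM scores.
TOY_LOGPROB = {
    "Marathon": -0.1, "was": -0.1, "rebranded": -0.2, "renamed": -0.5,
    "Snickers": -0.3, "in": -0.2, "1990": -0.4, "first": -1.0,
    "sold": -1.0, "1930": -1.5,
}
DEFAULT_LOGPROB = -3.0


def toy_logprob(word: str) -> float:
    return TOY_LOGPROB.get(word, DEFAULT_LOGPROB)


def continuations(prefix: List[str]) -> List[str]:
    """Words that keep `prefix` on some corpus passage (EOS if it is complete)."""
    n = len(prefix)
    nxt = set()
    for toks in TOKENIZED:
        if toks[:n] == prefix:
            nxt.add(toks[n] if n < len(toks) else EOS)
    return sorted(nxt)


def constrained_beam_search(beam_size: int = 2, max_len: int = 10):
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        expanded = []
        for score, toks in beams:
            for word in continuations(toks):
                if word == EOS:
                    finished.append((score, toks))
                else:
                    expanded.append((score + toy_logprob(word), toks + [word]))
        if not expanded:
            break
        # Keep only the best `beam_size` partial sequences.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_size]
    return sorted(finished, key=lambda b: b[0], reverse=True)


results = constrained_beam_search()
best_score, best_tokens = results[0]
print(" ".join(best_tokens))
```

In this toy table the model scores "rebranded" higher than "renamed", but "rebranded" never enters a beam because no corpus passage continues that way; this corresponds to the crossed-out boxes in the figure.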

Efficient constraints via FM-Index:
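As a rough illustration of why an FM-Index suits this job, the sketch below builds a tiny character-level FM-Index and uses backward search to count substring occurrences and to list valid next characters. It is a teaching toy under several assumptions: the suffix array is built by naive sorting, rank queries are answered with `str.count` rather than the sampled rank structures a real FM-Index uses, and RICHES itself would work over LLM token IDs rather than characters. The two corpus sentences come from the appendix prompt examples.

```python
from typing import Dict, List, Tuple

SENTINEL = "\0"   # sorts before every other character
SEPARATOR = "\n"  # joins passages into one indexed text


def build_fm_index(text: str) -> Tuple[str, Dict[str, int]]:
    text += SENTINEL
    # Naive suffix array: fine for a toy, far too slow for a real corpus.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first_row[c] = number of characters in the text strictly smaller than c.
    first_row: Dict[str, int] = {}
    total = 0
    for c in sorted(set(text)):
        first_row[c] = total
        total += text.count(c)
    return bwt, first_row


def count(bwt: str, first_row: Dict[str, int], pattern: str) -> int:
    """Number of occurrences of `pattern`, found by backward search."""
    lo, hi = 0, len(bwt)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + bwt[:lo].count(c)
        hi = first_row[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


def next_chars(bwt: str, first_row: Dict[str, int], prefix: str) -> List[str]:
    """Characters that can follow `prefix` somewhere in the corpus."""
    return [
        c for c in sorted(first_row)
        if c not in (SENTINEL, SEPARATOR) and count(bwt, first_row, prefix + c) > 0
    ]


passages = [
    "La Marseillaise became the national anthem of France.",
    "La Marseillaise was adopted by France in 1795.",
]
bwt, first_row = build_fm_index(SEPARATOR.join(passages))

print(count(bwt, first_row, "La Marseillaise"))
print(count(bwt, first_row, "France in 1800"))
print(next_chars(bwt, first_row, "La Marseillaise "))
```

The property that matters for constrained decoding is that any substring of the corpus can be checked, and its possible continuations enumerated, without scanning the passages themselves.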

Adaptive beam size:
Indexing strategies:
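RICHES performs strongly on open-domain question answering tasks (attributed QA, multi-hop QA, and retrieval with thoughts). Compared with traditional retrieval-augmented generation, it does especially well on multi-hop QA (Hotpot), producing more accurate answers within a single decoding pass.
Overall performance comparison for RICHES. For the dense retrievers, the top-k documents are retrieved and fed to a few-shot Answerer, with k=1 for GTR passages and k=2 for GTR propositions. For iterative retrieval, up to 4 documents are retrieved, with k=1 at each step.
Example comparison of RICHES and dense retrieval on single-hop question answering (QA). Only the retrieved text is shown, for illustration.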

Example of iterative retrieval output from RICHES. Remarks are annotated in the form (# comment).

The paper also compares RICHES across different indexing strategies and beam search sizes:
Proposition retrieval keys perform best.


Appendix:
Few-shot prompt template for RICHES on multi-hop QA
For given input query, write 1-3 passages to answer the query. Write a hint keyword and a passage contained within « and ». A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc. Do not repeat any passages. Aim for new keywords.
question: The football manager who recruited Cristiano Ronaldo managed Manchester United during what timeframe?
passage: keyword: Cristiano Ronaldo’s recruiting manager « Alex Ferguson recruited Cristiano Ronaldo » keyword: Sir Alex Ferguson’s tenure at Manchester United « Sir Alex Ferguson managed Manchester United from 1986 to 2013. »
answer: 1986 to 2013
question: Were Eatza Pizza and Your Pie founded in the same state?
passage: keyword: Eatza Pizza founded in state « Eatza Pizza was founded in Arizona » keyword: Your Pie founded in state « Your Pie was founded in Athens, Georgia »
answer: no
question: In which stadium do the teams owned by Myra Kraft’s husband play?
passage: keyword: Myra Kraft’s husband « Robert Kraft’s wife is Myra Kraft. » keyword: Robert Kraft’s team « Robert Kraft is the owner of the New England Patriots. » keyword: New England Patriots stadium « Gillette Stadium is the home of the New England Patriots. »
answer: Gillette Stadium
question: <question>
passage:
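Since the prompt asks the model to emit `keyword: ... « passage »` pairs followed by `answer:`, downstream code has to split a completion back into those parts. The helper below is one possible way to do that with a regular expression; `parse_riches_output` is a hypothetical name, and the sample completion is copied from the second few-shot example in the template above.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_riches_output(text: str) -> Tuple[List[Dict[str, str]], Optional[str]]:
    """Split a completion into (keyword, passage) pairs and the final answer."""
    # Lazy groups stop at the first « and the first » that follow each keyword.
    pairs = re.findall(r"keyword:\s*(.*?)\s*«\s*(.*?)\s*»", text, flags=re.S)
    match = re.search(r"answer:\s*(.*)", text)
    answer = match.group(1).strip() if match else None
    evidence = [{"keyword": k, "passage": p} for k, p in pairs]
    return evidence, answer


completion = (
    "keyword: Eatza Pizza founded in state « Eatza Pizza was founded in Arizona » "
    "keyword: Your Pie founded in state « Your Pie was founded in Athens, Georgia »\n"
    "answer: no"
)
evidence, answer = parse_riches_output(completion)
for item in evidence:
    print(item["keyword"], "->", item["passage"])
print("answer:", answer)
```

Keeping the keyword separate from the quoted passage makes it easy to check each passage against the corpus, or to discard the hint text before attribution scoring.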
Few-shot prompt template for RICHES on single-hop QA
For given input query, write 1-3 passages to answer the query. Write a hint keyword and a passage contained within « and ». A passage must be a complete sentence and not a phrase. It must contain complete context for answering the query and should not begin with it, he, they etc. Do not repeat any passages. Aim for new keywords.
question: who is the owner of phoenix mall pune?
passage: keyword: Phoenix Market City owner « Phoenix Market City is developed by Phoenix Mills Limited. »
answer: Phoenix Mills Limited
question: what brings in more money nba or nfl?
passage: keyword: NFL revenues « NFL revenues are well over $10 billion per season. » keyword: NBA revenue « NBA amasses about $6 billion annually. »
answer: NFL
question: when was the french national anthem adopted?
passage: keyword: French national anthem « La Marseillaise became the national anthem of France. » keyword: La Marseillaise adoption « La Marseillaise was adopted by France in 1795. »
answer: 1795
question: <question>
passage:
Few-shot prompt template for extracting answers from propositions
Answer the ’question’ only based on the given ’passage’. If the ’passage’ lacks context or is not relevant, say ’Cannot answer’ else say generate a short answer. Do not answer the query from outside the scope of the passage.
question: what brings in more money nba or nfl?
passage: NFL revenues are well over $10 billion per season. NBA amasses about $6 billion annually.
answer: NFL
question: when did they put warnings on cigarette packs
passage: Tobacco packaging 1978’s warning was not removed, so now every cigarette pack contains both warnings (one on each lateral).
answer: Cannot Answer
question: when was the french national anthem adopted?
passage: La Marseillaise became the national anthem of France. La Marseillaise was adopted by France in 1795.
answer: 1795
question: <question>
passage: <passage>
answer:
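Figure: illustration of the constrained decoding process. Given the prefix "Joker is played by", the continuation "Nolan" is not found in the corpus and is therefore masked out.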
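The masking step described in this caption can be written down directly. The snippet below is a sketch under assumptions: the three corpus sentences and the candidate scores are invented for illustration, candidates are whole words, and membership is tested with a plain substring check where a real system would query an index.

```python
import math
from typing import Dict

# Toy corpus (assumption): invented for this illustration.
CORPUS = [
    "Joker is played by Heath Ledger in The Dark Knight.",
    "Joker is played by Joaquin Phoenix in the 2019 film.",
    "The Dark Knight is directed by Christopher Nolan.",
]


def in_corpus(text: str) -> bool:
    """True if `text` appears verbatim inside some corpus passage."""
    return any(text in passage for passage in CORPUS)


def mask_scores(prefix: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Set the score of every continuation that leaves the corpus to -inf."""
    return {
        word: (score if in_corpus(f"{prefix} {word}") else -math.inf)
        for word, score in scores.items()
    }


prefix = "Joker is played by"
# Hypothetical model scores: "Nolan" is the unconstrained favourite.
scores = {"Nolan": -0.4, "Heath": -1.1, "Joaquin": -1.6, "Batman": -2.3}

masked = mask_scores(prefix, scores)
best = max(masked, key=masked.get)
print(masked)
print(best)
```

"Nolan" does occur in the corpus, but not directly after "Joker is played by", so it is masked and the highest-scoring surviving word is chosen instead.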

https://arxiv.org/pdf/2407.00361
From RAG to RICHES: Retrieval Interlaced with Sequence Generation
Google DeepMind
|