Live demo: https://search.lepton.run/ GitHub repository: https://github.com/leptonai/search_with_lepton
The interface is quite clean.
It shows the final answer, the cited source links, and a few related questions the user might want to ask next.
We won't discuss the front-end implementation here, only how the back end works. The core back-end code is only about 400 lines, in the file https://github.com/leptonai/search_with_lepton/blob/main/search_with_lepton.py.
The conclusion first: the implementation principle is much the same as what I described in my earlier articles. It first uses a search API to retrieve relevant web pages and text, then feeds those texts to the large language model as RAG reference material, together with the original question; the model produces the final answer based on the references.
Put plainly, it is a RAG application; only the data source differs.
The project supports different search data sources, such as Google and Bing, with ready-to-use code. Of course, you need to apply for the corresponding API keys yourself.
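To illustrate what such a search function produces, here is a hedged sketch of flattening a Bing Web Search API response into a list of contexts for RAG. The field names `webPages` / `value` / `snippet` follow Microsoft's documented response shape; the function name and the `max_results` parameter are illustrative, not from the project:

```python
def extract_contexts(bing_json: dict, max_results: int = 8) -> list:
    """Flatten a Bing Web Search API response into RAG contexts.

    Bing returns matched pages under bing_json["webPages"]["value"],
    each carrying "name", "url", and "snippet" fields.
    """
    pages = bing_json.get("webPages", {}).get("value", [])
    return [
        {"name": p["name"], "url": p["url"], "snippet": p["snippet"]}
        for p in pages[:max_results]
    ]

# A trimmed-down example shaped like Bing's JSON response:
sample = {
    "webPages": {
        "value": [
            {"name": "Lepton AI", "url": "https://lepton.ai",
             "snippet": "Lepton AI platform."},
            {"name": "RAG intro", "url": "https://example.com/rag",
             "snippet": "What is RAG?"},
        ]
    }
}
contexts = extract_contexts(sample)
```

The snippets in `contexts` are what later gets concatenated into the RAG prompt.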
The ready-to-use functions for the different search APIs are defined as follows:
def search_with_bing(query: str, subscription_key: str):
def search_with_google(query: str, subscription_key: str, cx: str):
def search_with_serper(query: str, subscription_key: str):
def search_with_searchapi(query: str, subscription_key: str):
query_function is the project's search entry point. Its main code is as follows:
def query_function(
    self,
    query: str,
    search_uuid: str,
    generate_related_questions: Optional[bool] = True,
) -> StreamingResponse:
    if self.backend == "LEPTON":
        # delegate to the lepton search api.
        result = self.leptonsearch_client.query(
            query=query,
            search_uuid=search_uuid,
            generate_related_questions=generate_related_questions,
        )
        return StreamingResponse(content=result, media_type="text/html")
    # First, do a search query.
    query = query or _default_query
    ......
    contexts = self.search_function(query)
    system_prompt = _rag_query_text.format(
        context="\n\n".join(
            [f"[[citation:{i+1}]] {c['snippet']}" for i, c in enumerate(contexts)]
        )
    )
    try:
        client = self.local_client()
        llm_response = client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
            max_tokens=1024,
            stop=stop_words,
            stream=True,
            temperature=0.9,
        )
        if self.should_do_related_questions and generate_related_questions:
            # While the answer is being generated, we can start generating
            # related questions as a future.
            related_questions_future = self.executor.submit(
                self.get_related_questions, query, contexts
            )
        else:
            related_questions_future = None
    except Exception as e:
        ......
The code above mainly does the following things, which are also the standard steps of an AI search engine:
(1) contexts = self.search_function(query) retrieves the relevant texts
(2) system_prompt assembles the RAG prompt; the prompt template is as follows:
_rag_query_text = """
You are a large language AI assistant built by Lepton AI. You are given a user question, and please write clean, concise and accurate answer to the question. You will be given a set of related contexts to the question, each starting with a reference number like [[citation:x]], where x is a number. Please use the context and cite the context at the end of each sentence if applicable.
Your answer must be correct, accurate and written by an expert using an unbiased and professional tone. Please limit to 1024 tokens. Do not give any information that is not related to the question, and do not repeat. Say "information is missing on" followed by the related topic, if the given context do not provide sufficient information.
Please cite the contexts with the reference numbers, in the format [citation:x]. If a sentence comes from multiple contexts, please list all applicable citations, like [citation:3][citation:5]. Other than code and specific names and citations, your answer must be written in the same language as the question.
Here are the set of contexts:
{context}
Remember, don't blindly repeat the contexts verbatim. And here is the user question:
"""
(3) client.chat.completions.create calls the large language model to get the answer
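Steps (2) and (3) can be sketched end to end with a small helper. This is a hedged reconstruction: the template below abbreviates `_rag_query_text`, `build_rag_messages` is a hypothetical helper name, and the contexts are hard-coded stand-ins for real search results; the returned list is what would be passed to `client.chat.completions.create(...)`:

```python
# Abbreviated stand-in for the project's _rag_query_text template.
_rag_query_text = """You are a large language AI assistant. ...
Here are the set of contexts:

{context}

Remember, don't blindly repeat the contexts verbatim. And here is the user question:
"""

def build_rag_messages(query: str, contexts: list) -> list:
    # Number each snippet so the model can emit [citation:x] references.
    context_block = "\n\n".join(
        f"[[citation:{i + 1}]] {c['snippet']}" for i, c in enumerate(contexts)
    )
    system_prompt = _rag_query_text.format(context=context_block)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]

messages = build_rag_messages(
    "What is Lepton AI?",
    [{"snippet": "Lepton AI is an AI infra company."},
     {"snippet": "Lepton open-sourced an AI search demo."}],
)
```

Because the references are numbered in the system prompt, the model's answer can cite them inline and the front end can render each `[citation:x]` as a link back to the source.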
These three steps are the basic workflow. The project adds one extra step: generating related questions.
Showing related questions to the user is useful and meaningful in some situations; it gives the user hints and inspiration when they don't know what to ask next.
It is implemented as follows:
def get_related_questions(self, query, contexts):
    ......
    try:
        response = self.local_client().chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": _more_questions_prompt.format(
                        context="\n\n".join([c["snippet"] for c in contexts])
                    ),
                },
                {
                    "role": "user",
                    "content": query,
                },
            ],
            tools=[{
                "type": "function",
                "function": tool.get_tools_spec(ask_related_questions),
            }],
            max_tokens=512,
        )
    ......
This, too, is done with the large model: based on the original question and the retrieved contexts, it generates a few related questions. The approach is easy to see from its prompt:
_more_questions_prompt = """
You are a helpful assistant that helps the user to ask related questions, based on user's original question and the related contexts. Please identify worthwhile topics that can be follow-ups, and write questions no longer than 20 words each. Please make sure that specifics, like events, names, locations, are included in follow up questions so they can be asked standalone. For example, if the original question asks about "the Manhattan project", in the follow up question, do not just say "the project", but use the full name "the Manhattan project". Your related questions must be in the same language as the original question.
Here are the contexts of the question:
{context}
Remember, based on the original question and related contexts, suggest three such further questions. Do NOT repeat the original question. Each related question should be no longer than 20 words. Here is the original question:
"""
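Note the `tools=[...]` argument in `get_related_questions`: it uses OpenAI-style function calling so the model returns the related questions as structured arguments rather than free text that would need fragile parsing. A hedged sketch of what such a tool spec and the decoding of the model's tool call might look like (the schema and names below are illustrative, not the project's exact `get_tools_spec` output):

```python
import json

# An OpenAI-style function-calling spec asking the model to return
# its follow-up questions as a JSON array of strings.
ask_related_questions_spec = {
    "type": "function",
    "function": {
        "name": "ask_related_questions",
        "description": "Return three related follow-up questions.",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                }
            },
            "required": ["questions"],
        },
    },
}

def parse_related_questions(tool_call_arguments: str) -> list:
    """Decode the JSON arguments string the model places in its tool call."""
    return json.loads(tool_call_arguments).get("questions", [])

# What the arguments of the model's tool call might look like:
raw_arguments = (
    '{"questions": ["What is RAG?", '
    '"How does the Bing search API work?", '
    '"Which models does Lepton support?"]}'
)
questions = parse_related_questions(raw_arguments)
```

Forcing a structured tool call keeps the related-questions output machine-readable, so the front end can render the three questions directly without scraping them out of prose.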