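Multimodal agents are attracting growing attention in artificial intelligence (AI). These agents process and integrate information from different modalities (text, images, speech, and so on) to carry out complex tasks. This article walks through building a multimodal AI agent with the CrewAI framework, Groq's hardware-accelerated inference, and models hosted on Replicate AI. The agent can perform text-to-speech, text-to-image generation, image captioning, and web search.
Replicate AI offers a wide range of pretrained models, including text-to-speech, image generation, and image captioning models, that can be used directly to build a multimodal agent. It also simplifies model deployment and scaling, which makes it easier to take such an agent into real-world use.
Tavily-Python is the open-source Python client for the Tavily search API, used for web search and information retrieval. In the multimodal agent it performs the web search task, fetching information relevant to the user's query.
First, the agent needs a set of tools that let it interact with its data sources safely and efficiently. This means installing the required Python libraries, such as the client libraries for CrewAI, Groq, and Replicate AI, then setting the API keys and configuring environment variables so that the agent can reach these services.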
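The key setup described above can be checked before any agent runs. The following is a minimal sketch, assuming the four environment variable names that this article's setup code assigns; `missing_keys` is a hypothetical helper written for illustration and is not part of CrewAI, Groq, or Replicate.

```python
import os

# Variable names assumed from this article's setup code; adjust to your services.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "REPLICATE_API_TOKEN",
    "TAVILY_API_KEY",
    "GROQ_API_KEY",
)


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Report problems early instead of failing in the middle of a crew run.
missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.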
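The architecture of a multimodal agent should allow data from different modalities to be processed and integrated effectively. A multimodal agent usually contains several sub-agents, each responsible for one or more data types. For example, you can create the following kinds of agents:
Text agent: handles text data, including text-to-speech conversion and text analysis.
Image agent: handles image data, including image generation and image captioning.
Audio agent: handles audio data, including speech recognition and speech synthesis.
Web search agent: retrieves relevant information from the web to answer queries.
These agents work together under CrewAI's coordination to achieve complex goals. For example, when the user enters a text description, the text agent can convert it to speech while the image agent generates a matching image; the audio agent can handle the user's spoken input to enable voice interaction.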
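The division of labour among sub-agents can be pictured as a lookup from modality to handler. This is only an illustrative sketch: the handler functions and the `AGENTS` table are hypothetical stand-ins, not CrewAI objects, and the fallback to web search is an assumption made for the example.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the real sub-agents; each takes a request string.
def handle_text(request: str) -> str:
    return f"[text agent] {request}"


def handle_image(request: str) -> str:
    return f"[image agent] {request}"


def handle_audio(request: str) -> str:
    return f"[audio agent] {request}"


def handle_web_search(request: str) -> str:
    return f"[web search agent] {request}"


# One entry per modality described in the list above.
AGENTS: Dict[str, Callable[[str], str]] = {
    "text": handle_text,
    "image": handle_image,
    "audio": handle_audio,
    "web_search": handle_web_search,
}


def dispatch(modality: str, request: str) -> str:
    """Send the request to the matching handler, defaulting to web search."""
    handler = AGENTS.get(modality, handle_web_search)
    return handler(request)


print(dispatch("image", "a red cube on a table"))
```

Keeping the mapping in one table means a new modality is added by registering one more handler rather than by editing branching logic.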
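Code implementation
In the implementation, we first install the required dependencies, including CrewAI, Groq, Replicate AI, and Tavily-Python. We then set the API keys and create the tool functions.
Next, we define the agents' roles and tasks. Each agent is given a specific role and a matching set of tools. For example, the text-to-speech path uses a text-to-speech model on Replicate AI, while the image generation path uses an image generation model on Replicate AI.
Finally, we set up the Router Agent, which analyzes the user's query and decides what to do next based on its content. The Router Agent hands the task to the appropriate agent, collects the output, and produces the response the user asked for.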
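A router that relies on an LLM reply has to cope with replies that are not exactly the expected token. Below is a minimal sketch, assuming the router answers with a digit from 1 to 4 as in the prompt used in this article; `parse_route` is a hypothetical helper, and defaulting to web search on an unreadable reply is an assumption of this sketch rather than the article's behaviour.

```python
import re

# Digit-to-route mapping assumed from the router prompt in this article.
ROUTES = {
    "1": "text2image",
    "2": "image2text",
    "3": "text2speech",
    "4": "web_search",
}


def parse_route(raw: str, default: str = "web_search") -> str:
    """Map a raw LLM reply to a route name, tolerating whitespace and punctuation."""
    match = re.search(r"[1-4]", raw)
    if match is None:
        # Nothing usable in the reply: fall back instead of guessing a modality.
        return default
    return ROUTES[match.group()]


print(parse_route(" 3.\n"))
```

The first digit between 1 and 4 wins, so replies such as `"1"`, `" 1\n"`, and `"Answer: 1"` all resolve to the same route.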
!pip install -qU langchain langchain_community tavily-python langchain-groq groq replicate
!pip install -qU crewai crewai[tools]
Set the API keys
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['REPLICATE_API_TOKEN'] = userdata.get('REPLICATE_API_TOKEN')
os.environ['TAVILY_API_KEY'] = userdata.get('TAVILY_API_KEY')
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')
Create the tools, agents, tasks, and helper functions (see the code comments for what each function does):
## Create the web search tool
from langchain_community.tools.tavily_search import TavilySearchResults

def web_search_tool(question: str) -> str:
    """This tool is useful when we want web search for current events."""
    # Step 1: Instantiate the Tavily client with your API key
    websearch = TavilySearchResults()
    # Step 2: Perform a search query
    response = websearch.invoke({"query": question})
    return response
## Create the text-to-speech tool
import replicate

def text2speech(text: str) -> str:
    """This tool is useful when we want to convert text to speech."""
    output = replicate.run(
        "cjwbw/seamless_communication:668a4fec05a887143e5fe8d45df25ec4c794dd43169b9a11562309b2d45873b0",
        input={
            "task_name": "T2ST (Text to Speech translation)",
            "input_text": text,
            "input_text_language": "English",
            "max_input_audio_length": 60,
            "target_language_text_only": "English",
            "target_language_with_speech": "English",
        },
    )
    return output["audio_output"]

## Create the text-to-image tool
def text2image(text: str) -> str:
    """This tool is useful when we want to generate images from textual descriptions."""
    output = replicate.run(
        "xlabs-ai/flux-dev-controlnet:f2c31c31d81278a91b2447a304dae654c64a5d5a70340fba811bb1cbd41019a2",
        input={
            "steps": 28,
            "prompt": text,
            "lora_url": "",
            "control_type": "depth",
            "control_image": "https://replicate.delivery/pbxt/LUSNInCegT0XwStCCJjXOojSBhPjpk2Pzj5VNjksiP9cER8A/ComfyUI_02172_.png",
            "lora_strength": 1,
            "output_format": "webp",
            "guidance_scale": 2.5,
            "output_quality": 100,
            "negative_prompt": "low quality, ugly, distorted, artefacts",
            "control_strength": 0.45,
            "depth_preprocessor": "DepthAnything",
            "soft_edge_preprocessor": "HED",
            "image_to_image_strength": 0,
            "return_preprocessed_image": False,
        },
    )
    print(output)
    return output[0]

## Create the image-to-text tool
def image2text(image_url: str, prompt: str) -> str:
    """This tool is useful when we want to generate textual descriptions from images."""
    output = replicate.run(
        "yorickvp/llava-13b:80537f9eead1a5bfa72d5ac6ea6414379be41d4d4f6679fd776e9535d1eb58bb",
        input={
            "image": image_url,
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.2,
        },
    )
    # The model streams its answer in chunks; join them into one string
    return "".join(output)

from crewai_tools import tool

## Router Tool
@tool("router tool")
def router_tool(question: str) -> str:
    """Router Function"""
    prompt = f"""Based on the Question provide below determine the following:
1. Is the question directed at generating image ?
2. Is the question directed at describing the image ?
3. Is the question directed at converting text to speech?.
4. Is the question a generic one and needs to be answered searching the web?
Question: {question}
RESPONSE INSTRUCTIONS:
- Answer either 1 or 2 or 3 or 4.
- Answer should strictly be a string.
- Do not provide any preamble or explanations except for 1 or 2 or 3 or 4.
OUTPUT FORMAT:
1
"""
    # `llm` is defined further down; it only has to exist when the tool is called.
    # Strip whitespace so a reply such as "1\n" still matches.
    response = llm.invoke(prompt).content.strip()
    if response == "1":
        return 'text2image'
    elif response == "3":
        return 'text2speech'
    elif response == "4":
        return 'web_search'
    else:
        return 'image2text'
@tool("retriver tool")
def retriver_tool(router_response: str, question: str, image_url: str) -> str:
    """Retriver Function"""
    # Call the tool that matches the route chosen by the router
    if router_response == 'text2image':
        return text2image(question)
    elif router_response == 'text2speech':
        return text2speech(question)
    elif router_response == 'image2text':
        return image2text(image_url, question)
    else:
        return web_search_tool(question)

## Set up the LLM
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name="llama-3.1-70b-versatile",
    temperature=0.1,
    max_tokens=1000,
)
## Create the Router agent
from crewai import Agent

Router_Agent = Agent(
    role='Router',
    goal='Route user question to a text to image or text to speech or web search',
    backstory=(
        "You are an expert at routing a user question to a text to image or text to speech or web search."
        "Use the text to image to generate images from textual descriptions."
        "Use the text to speech to convert text to speech."
        "Use the image to text to generate text describing the image based on the textual description."
        "Use the web search to search for current events."
        "You do not need to be stringent with the keywords in the question related to these topics. Otherwise, use web-search."
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[router_tool],
)

## Retriever Agent
Retriever_Agent = Agent(
    role="Retriever",
    goal="Use the information retrieved from the Router to answer the question and image url provided.",
    backstory=(
        "You are an assistant for directing tasks to respective agents based on the response from the Router."
        "Use the information from the Router to perform the respective task."
        "Do not provide any other explanation"
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[retriver_tool],
)

## Create the router task
# Prompt strings are kept exactly as they were run (typos included),
# because the run log echoes them word for word.
from crewai import Task

router_task = Task(
    description=(
        "Analyse the keywords in the question {question}"
        "If the question {question} instructs to describe a image then use the image url {image_url} to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question {question}."
        "Based on the keywords decide whether it is eligible for a text to image or text to speech or web search."
        "Return a single word 'text2image' if it is eligible for generating images from textual description."
        "Return a single word 'text2speech' if it is eligible for converting text to speech."
        "Return a single word 'image2text' if it is eligible for describing the image based on the question {question} and iamge url {image_url}."
        "Return a single word 'web_search' if it is eligible for web search."
        "Do not provide any other premable or explaination."
    ),
    expected_output=(
        "Give a choice 'web_search' or 'text2image' or 'text2speech' or 'image2text' based on the question {question} and image url {image_url}"
        "Do not provide any preamble or explanations except for 'text2image' or 'text2speech' or 'web_search' or 'image2text'."
    ),
    agent=Router_Agent,
)

## Create the retriever task
retriever_task = Task(
    description=(
        "Based on the response from the 'router_task' generate response for the question {question} with the help of the respective tool."
        "Use the web_serach_tool to retrieve information from the web in case the router task output is 'web_search'."
        "Use the text2speech tool to convert the test to speech in english in case the router task output is 'text2speech'."
        "Use the text2image tool to convert the test to speech in english in case the router task output is 'text2image'."
        "Use the image2text tool to describe the image provide in the image url in case the router task output is 'image2text'."
    ),
    expected_output=(
        "You should analyse the output of the 'router_task'"
        "If the response is 'web_search' then use the web_search_tool to retrieve information from the web."
        "If the response is 'text2image' then use the text2image tool to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question {question}."
        "If the response is 'text2speech' then use the text2speech tool to convert the text provided in the question {question} to speech"
        "If the response is 'image2text' then use the 'image2text' tool to describe the image based on the question {question} and {image_url}."
    ),
    agent=Retriever_Agent,
    # The retriever task receives the router task's output as context
    context=[router_task],
)

## Set up the crew
from crewai import Crew, Process

crew = Crew(
    agents=[Router_Agent, Retriever_Agent],
    tasks=[router_task, retriever_task],
    verbose=True,
)
## Launch
inputs = {"question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": " "}
result = crew.kickoff(inputs=inputs)

###################### Response #############################
[2024-08-25 04:14:22][DEBUG]: == Working Agent: Router
[2024-08-25 04:14:22][INFO]: == Starting Task: Analyse the keywords in the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visorIf the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor instructs to describe a image then use the image url to generate a detailed and high quality images covering all the nuances secribed in the textual descriptions provided in the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor.Based on the keywords decide whether it is eligible for a text to image or text to speech or web search.Return a single word 'text2image' if it is eligible for generating images from textual description.Return a single word 'text2speech' if it is eligible for converting text to speech.Return a single word 'image2text' if it is eligible for describing the image based on the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor and iamge url .Return a single word 'web_search' if it is eligible for web search.Do not provide any other premable or explaination.

> Entering new CrewAgentExecutor chain...
Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
text2image
Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.
Thought: The question contains keywords like "Generate an image based upon this text" and a detailed description of the image, so it seems like the user wants to generate an image from the given text.
Action: router tool
Action Input: {"question": "a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor"}
text2image
Thought: I now know the final answer
Final Answer: text2image
> Finished chain.
[2024-08-25 04:14:26][DEBUG]: == [Router] Task output: text2image
[2024-08-25 04:14:26][DEBUG]: == Working Agent: Retriever
[2024-08-25 04:14:26][INFO]: == Starting Task: Based on the response from the 'router_task' generate response for the question Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor with the help of the respective tool.Use the web_serach_tool to retrieve information from the web in case the router task output is 'web_search'.Use the text2speech tool to convert the test to speech in english in case the router task output is 'text2speech'.Use the text2image tool to convert the test to speech in english in case the router task output is 'text2image'.Use the image2text tool to describe the image provide in the image url in case the router task output is 'image2text'.

> Entering new CrewAgentExecutor chain...
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
['https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp']
https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.
Thought: I need to use the information from the Router to determine the task to perform.
Action: retriver tool
Action Input: {"router_response": "text2image", "question": "Generate an image based upon this text: a close up portfolio photo of a beautiful Indian Model woman, perfect eyes, bright studio lights, bokeh, 50mm photo, neon pink visor", "image_url": ""}
I tried reusing the same input, I must stop using this action input. I'll try something else instead.
Thought: I now know the final answer
Final Answer: https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
> Finished chain.
[2024-08-25 04:15:07][DEBUG]: == [Retriever] Task output: https://replicate.delivery/yhqm/XjBShO4PSexSSaThOCnZoDl4rYeq1pNAZNaKIuvi3mvFHGWTA/R8_FLUX_XLABS_00001_.webp
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

# URL of the image
image_url = result.raw

# Fetch the image
response = requests.get(image_url)

# Check if the request was successful
if response.status_code == 200:
    # Open the image using PIL
    img = Image.open(BytesIO(response.content))
    # Display the image using matplotlib
    plt.imshow(img)
    plt.axis('off')  # Hide the axis
    plt.show()
else:
    print("Failed to retrieve image. Status code:", response.status_code)
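The display code assumes the crew's final answer is an image URL, which only holds when the router chose image generation. One way to guard against other outcomes is sketched below; `looks_like_image_url` and the suffix list are hypothetical, chosen for illustration, and the check inspects only the URL's shape, not the content behind it.

```python
from urllib.parse import urlparse

# File suffixes treated as images in this sketch; extend as needed.
IMAGE_SUFFIXES = (".webp", ".png", ".jpg", ".jpeg", ".gif")


def looks_like_image_url(value: str) -> bool:
    """Return True if value is an http(s) URL whose path ends in an image suffix."""
    parsed = urlparse(value.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    return parsed.path.lower().endswith(IMAGE_SUFFIXES)


# A search answer or a spoken-audio link would be rejected here.
print(looks_like_image_url("https://example.com/picture.webp"))
```

Calling this on the crew's raw result before downloading lets the script print text answers directly and reserve the image viewer for actual image links.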
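In autonomous driving, a multimodal agent can process the different data types produced by vehicle sensors (such as cameras, radar, and lidar) to achieve fuller environment perception and decision-making. For example, the agent can process image and audio data together to identify pedestrians, vehicles, and obstacles on the road, and use that information to decide when to avoid an obstacle or change lanes.
In virtual assistants, a multimodal agent can deliver a more natural and intelligent interaction. The agent can handle the user's text and voice input together, understand the user's intent and needs, and respond with suitable answers and suggestions. It can also use image data such as facial expressions and gestures to better understand the user's mood and needs and offer more personalized service.
By combining the CrewAI framework, Groq's hardware-accelerated inference, and models from Replicate AI, we built a multimodal AI agent that can carry out several complex tasks: text-to-speech, text-to-image generation, image captioning, and web search. This design makes the AI system more flexible and practical, and it opens up a wide range of possibilities for future AI applications.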