The demand for automated document processing keeps growing, and handling PDF documents effectively has become a key task. With the rapid advance of artificial intelligence, large language models (LLMs) such as ChatGPT have achieved remarkable results in natural language processing, and automated document processing is among the areas that benefit most from this shift. Traditional text-processing approaches, however, struggle with PDFs, especially with non-text elements such as images and tables. This article walks through how to build an AI pipeline for PDF documents with Gemini, enabling efficient and accurate document processing and information extraction.
The pdf2image library extracts each page of the PDF as a PIL image, which is then encoded as a Base64 JPEG so it can be attached to the LLM request. This step ensures that every page reaches the model in a format it can consume and lays the groundwork for the segmentation and summarization that follow. For example, when processing a financial-report PDF full of charts, this step converts each page into an image while preserving the completeness and clarity of those charts.

```python
from pathlib import Path

from document_ai_agents.document_utils import extract_images_from_pdf
from document_ai_agents.image_utils import pil_image_to_base64_jpeg


class DocumentParsingAgent:
    @classmethod
    def get_images(cls, state):
        """Extract pages of a PDF as Base64-encoded JPEG images."""
        assert Path(state.document_path).is_file(), "File does not exist"
        # Extract images from PDF
        images = extract_images_from_pdf(state.document_path)
        assert images, "No images extracted"
        # Convert images to Base64-encoded JPEG
        pages_as_base64_jpeg_images = [pil_image_to_base64_jpeg(x) for x in images]
        return {"pages_as_base64_jpeg_images": pages_as_base64_jpeg_images}
```
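The two helpers imported above come from the article's companion package and their implementations are not shown. Below is a minimal sketch of what they might look like, assuming pdf2image (which requires the poppler system package) for page rendering and Pillow for JPEG encoding; only the function names are taken from the snippet above, everything else is an assumption.

```python
import base64
import io

from pdf2image import convert_from_path  # assumes poppler is installed on the system
from PIL import Image


def extract_images_from_pdf(document_path: str) -> list[Image.Image]:
    """Render each page of the PDF as a PIL image (sketch)."""
    return convert_from_path(document_path)


def pil_image_to_base64_jpeg(image: Image.Image) -> str:
    """Encode a PIL image as a Base64 JPEG string (sketch)."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```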
Custom classes such as DetectedLayoutItem and LayoutElements define the output structure, specifying the type of each layout element (table, figure, image, or text block) together with its summary.

```python
import json
from typing import Literal

import google.generativeai as genai
from langchain_core.documents import Document
from pydantic import BaseModel, Field


class DetectedLayoutItem(BaseModel):
    """Schema for each detected layout element on a page."""
    element_type: Literal["Table", "Figure", "Image", "Text-block"] = Field(
        ...,
        description="Type of detected item. Examples: Table, Figure, Image, Text-block.",
    )
    summary: str = Field(..., description="A detailed description of the layout item.")


class LayoutElements(BaseModel):
    """Schema for the list of layout elements on a page."""
    layout_items: list[DetectedLayoutItem] = []


class FindLayoutItemsInput(BaseModel):
    """Input schema for processing a single page."""
    document_path: str
    base64_jpeg: str
    page_number: int


class DocumentParsingAgent:
    def __init__(self, model_name="gemini-1.5-flash-002"):
        """Initialize the LLM with the appropriate schema."""
        # prepare_schema_for_gemini is a project helper that converts the Pydantic
        # model into a Gemini-compatible response schema (a sketch follows below).
        layout_elements_schema = prepare_schema_for_gemini(LayoutElements)
        self.model_name = model_name
        self.model = genai.GenerativeModel(
            self.model_name,
            generation_config={
                "response_mime_type": "application/json",
                "response_schema": layout_elements_schema,
            },
        )

    def find_layout_items(self, state: FindLayoutItemsInput):
        """Send a page image to the LLM for segmentation and summarization."""
        messages = [
            f"Find and summarize all the relevant layout elements in this PDF page "
            f"in the following format: {LayoutElements.schema_json()}. "
            f"Tables should have at least two columns and at least two rows. "
            f"The coordinates should overlap with each layout item.",
            {"mime_type": "image/jpeg", "data": state.base64_jpeg},
        ]
        # Send the prompt to the LLM
        result = self.model.generate_content(messages)
        data = json.loads(result.text)
        # Convert the JSON output into documents
        documents = [
            Document(
                page_content=item["summary"],
                metadata={
                    "page_number": state.page_number,
                    "element_type": item["element_type"],
                    "document_path": state.document_path,
                },
            )
            for item in data["layout_items"]
        ]
        return {"documents": documents}
```
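The prepare_schema_for_gemini helper is referenced above but not defined. As a rough, assumption-laden sketch, such a helper usually takes the model's JSON schema, inlines the $defs references, and drops keys (such as title and default) that Gemini's response_schema does not accept; the project's actual implementation may differ.

```python
from pydantic import BaseModel


def prepare_schema_for_gemini(model: type[BaseModel]) -> dict:
    """Convert a Pydantic model's JSON schema into a Gemini-friendly dict (sketch)."""
    schema = model.model_json_schema()
    defs = schema.pop("$defs", {})

    def clean(node):
        if isinstance(node, dict):
            if "$ref" in node:
                # Inline {"$ref": "#/$defs/Name"} with the cleaned definition
                return clean(defs[node["$ref"].split("/")[-1]])
            return {
                key: clean(value)
                for key, value in node.items()
                if key not in {"title", "default"}
            }
        if isinstance(node, list):
            return [clean(item) for item in node]
        return node

    return clean(schema)
```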
Each page is then dispatched, via LangGraph's Send mechanism, to the find_layout_items function as its own task. This matters for large, multi-page PDFs: parallel processing significantly shortens wall-clock time and improves overall throughput. For example, when parsing an academic book PDF several hundred pages long, the parallel fan-out makes full use of the available compute to finish page segmentation and summarization quickly.

```python
from langgraph.types import Send


class DocumentParsingAgent:
    @classmethod
    def continue_to_find_layout_items(cls, state):
        """Generate tasks to process each page in parallel."""
        return [
            Send(
                "find_layout_items",
                FindLayoutItemsInput(
                    base64_jpeg=base64_jpeg,
                    page_number=i,
                    document_path=state.document_path,
                ),
            )
            for i, base64_jpeg in enumerate(state.pages_as_base64_jpeg_images)
        ]
```
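Because every parallel find_layout_items task writes to the same documents key, the graph state must declare a reducer so LangGraph merges the partial results instead of rejecting concurrent updates. The sketch below shows one plausible DocumentLayoutParsingState; the field names are taken from the snippets above, while the Annotated/operator.add reducer is the assumption being illustrated.

```python
import operator
from typing import Annotated

from langchain_core.documents import Document
from pydantic import BaseModel


class DocumentLayoutParsingState(BaseModel):
    document_path: str
    pages_as_base64_jpeg_images: list[str] = []
    # All parallel "find_layout_items" nodes write to this key, so LangGraph needs a
    # reducer (operator.add) to concatenate their outputs rather than overwrite them.
    documents: Annotated[list[Document], operator.add] = []
```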
```python
from langgraph.graph import StateGraph, START, END


class DocumentParsingAgent:
    def build_agent(self):
        """Build the agent workflow using a state graph."""
        builder = StateGraph(DocumentLayoutParsingState)
        # Add nodes for image extraction and layout item detection
        builder.add_node("get_images", self.get_images)
        builder.add_node("find_layout_items", self.find_layout_items)
        # Define the flow of the graph
        builder.add_edge(START, "get_images")
        builder.add_conditional_edges("get_images", self.continue_to_find_layout_items)
        builder.add_edge("find_layout_items", END)
        self.graph = builder.compile()
```
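Once compiled, the whole pipeline runs in a single call. A short usage sketch, assuming the state class sketched above and a real PDF path; invoking the compiled graph exercises the parallel fan-out, whereas the step-by-step example that follows calls the nodes directly for illustration.

```python
if __name__ == "__main__":
    agent = DocumentParsingAgent()
    agent.build_agent()
    # START -> get_images -> parallel find_layout_items -> END
    result = agent.graph.invoke({"document_path": "path/to/document.pdf"})
    print(f"Parsed {len(result['documents'])} layout elements")
```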
```python
if __name__ == "__main__":
    _state = DocumentLayoutParsingState(document_path="path/to/document.pdf")
    agent = DocumentParsingAgent()

    # Step 1: Extract images from PDF
    result_images = agent.get_images(_state)
    _state.pages_as_base64_jpeg_images = result_images["pages_as_base64_jpeg_images"]

    # Step 2: Process the first page (as an example)
    result_layout = agent.find_layout_items(
        FindLayoutItemsInput(
            base64_jpeg=_state.pages_as_base64_jpeg_images[0],
            page_number=0,
            document_path=_state.document_path,
        )
    )

    # Display the results
    for item in result_layout["documents"]:
        print(item.page_content)
        print(item.metadata["element_type"])
```
A vector database such as ChromaDB indexes the document summaries produced by Agent 1. The index stores not only the summary text but also key metadata such as the document path and page number, so that later retrieval can cite its sources. When a user asks for specific information, this metadata makes it possible to locate the relevant pages quickly and supply accurate context. Before indexing, the agent checks whether the document has already been indexed, avoiding duplicate work and improving efficiency.

```python
class DocumentRAGAgent:
    def index_documents(self, state: DocumentRAGState):
        """Index the parsed documents into the vector store."""
        assert state.documents, "Documents should have at least one element"
        # Check if the document is already indexed
        if self.vector_store.get(where={"document_path": state.document_path})["ids"]:
            logger.info(
                "Documents for this file are already indexed, exiting this node"
            )
            return  # Skip indexing if already done
        # Add parsed documents to the vector store
        self.vector_store.add_documents(state.documents)
        logger.info(f"Indexed {len(state.documents)} documents for {state.document_path}")
```
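The agent's vector_store and retriever are created in its constructor, which is not shown here. A hedged sketch follows, assuming langchain-chroma for the store and a Google embedding model; the collection name, embedding model, and k value are placeholders rather than the project's actual choices.

```python
import logging

import google.generativeai as genai
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

logger = logging.getLogger(__name__)


class DocumentRAGAgent:
    def __init__(self, model_name="gemini-1.5-flash-002", k=3):
        self.model_name = model_name
        self.model = genai.GenerativeModel(self.model_name)
        # Summaries (with their metadata) are embedded into a local Chroma collection
        self.vector_store = Chroma(
            collection_name="document_summaries",
            embedding_function=GoogleGenerativeAIEmbeddings(
                model="models/text-embedding-004"
            ),
        )
        # Retriever that returns the top-k most similar summaries for a query
        self.retriever = self.vector_store.as_retriever(search_kwargs={"k": k})
```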
```python
class DocumentRAGAgent:
    def answer_question(self, state: DocumentRAGState):
        """Retrieve relevant chunks and generate a response to the user's question."""
        # Retrieve the top-k relevant documents based on the query
        relevant_documents: list[Document] = self.retriever.invoke(state.question)
        # Retrieve corresponding page images (avoid duplicates)
        images = list(
            set(
                [
                    state.pages_as_base64_jpeg_images[doc.metadata["page_number"]]
                    for doc in relevant_documents
                ]
            )
        )
        logger.info(f"Responding to question: {state.question}")
        # Construct the prompt: combine images, relevant summaries, and the question
        messages = (
            [{"mime_type": "image/jpeg", "data": base64_jpeg} for base64_jpeg in images]
            + [doc.page_content for doc in relevant_documents]
            + [
                f"Answer this question using the context images and text elements only: "
                f"{state.question}"
            ]
        )
        # Generate the response using the LLM
        response = self.model.generate_content(messages)
        return {"response": response.text, "relevant_documents": relevant_documents}
```
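To tie the two agents together, here is a sketch of a DocumentRAGState matching the fields referenced above, plus an end-to-end call; the field defaults and the sample question are assumptions.

```python
from langchain_core.documents import Document
from pydantic import BaseModel


class DocumentRAGState(BaseModel):
    question: str
    document_path: str
    pages_as_base64_jpeg_images: list[str] = []
    documents: list[Document] = []


if __name__ == "__main__":
    # Run Agent 1 to get page images and layout summaries
    parsing_agent = DocumentParsingAgent()
    parsing_agent.build_agent()
    parsed = parsing_agent.graph.invoke({"document_path": "path/to/document.pdf"})

    # Index the summaries, then answer a question grounded in the retrieved pages
    rag_agent = DocumentRAGAgent()
    state = DocumentRAGState(
        question="What does the revenue table show?",
        document_path="path/to/document.pdf",
        pages_as_base64_jpeg_images=parsed["pages_as_base64_jpeg_images"],
        documents=parsed["documents"],
    )
    rag_agent.index_documents(state)
    print(rag_agent.answer_question(state)["response"])
```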