The whole parsing process is relatively simple: the PDF is converted into images page by page, image2text is then called to extract the text from each page image, and the result finally goes through tokenization and other post-processing (a rough sketch of this flow appears at the end of this section). Unlike deepdoc, it does not do any deeper parsing of layout, tables, or similar structure. The prompt used to extract text from the page images is worth noting:

> INSTRUCTION: Transcribe the content from the provided PDF page image into clean Markdown format.
> - Only output the content transcribed from the image.
> - Do NOT output this instruction or any other explanation.
> - If the content is missing or you do not understand the input, return an empty string.
>
> RULES:
> 1. Do NOT generate examples, demonstrations, or templates.
> 2. Do NOT output any extra text such as 'Example', 'Example Output', or similar.
> 3. Do NOT generate any tables, headings, or content that is not explicitly present in the image.
> 4. Transcribe content word-for-word. Do NOT modify, translate, or omit any content.
> 5. Do NOT explain Markdown or mention that you are using Markdown.
> 6. Do NOT wrap the output in ```markdown or ``` blocks.
> 7. Only apply Markdown structure to headings, paragraphs, lists, and tables, strictly based on the layout of the image. Do NOT create tables unless an actual table exists in the image.
> 8. Preserve the original language, information, and order exactly as shown in the image.

As you can see, this prompt essentially makes the model act as an OCR engine: the LLM is explicitly forbidden from refining or reworking the information and is asked only for a word-for-word transcription. In my view there is still considerable room for improvement here. The current prompt does not tap into a multimodal LLM's reasoning and summarization abilities to consolidate and filter what is in the image; it simply uses the model as a high-end OCR engine, so the practical gain is fairly limited. Then again, this is still an experimental feature, so we can only look forward to future updates and optimizations.

Another point: when an image2text model is used to parse the PDF, the stored chunks do not keep the image each chunk came from, so retrieval can only return the text with no corresponding picture. Since the PDF has already been converted into images at this stage, the flow could easily be kept consistent with deepdoc's and store those images as well; this is another obvious place to optimize.

### Extracting chart and figure information from documents

Charts and figures in documents have been hard to extract and retrieve accurately with earlier parsing pipelines, even though the data they carry is often very important, which has made them a real pain point for RAG applications. The new RagFlow release finally fills this gap by using a multimodal LLM to pull out the information and data hidden in charts. By default no explicit configuration is required: as long as an image2text model is available and the PDF parser is deepdoc, any chart recognized during parsing is automatically sent to the image2text model, which tries to extract its content.
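The logic of that hook can be sketched roughly as follows. This is a simplified, hypothetical illustration rather than RagFlow's actual code: the `vision_model.describe()` interface, the `figure_images` input, and the chunk field name are all assumptions made for the example.

```python
# Hypothetical sketch: route figures detected during deepdoc parsing to an
# image2text model. Illustrative only -- not RagFlow's real code or API.

FIGURE_ANALYSIS_PROMPT = "You are an expert visual data analyst. ..."  # full prompt below


def describe_figures(figure_images, vision_model):
    """Ask the vision model to describe each detected figure; drop empty answers."""
    descriptions = []
    for image_bytes in figure_images:  # each item: one cropped figure as PNG bytes
        answer = vision_model.describe(image_bytes, prompt=FIGURE_ANALYSIS_PROMPT)
        if answer and answer.strip():
            descriptions.append(answer.strip())
    return descriptions


def enrich_chunks_with_figures(chunks, figure_images, vision_model):
    """Append each figure description to a chunk's text so chart data becomes retrievable.

    Pairing figures with chunks by index is a simplification for this sketch.
    """
    for chunk, description in zip(chunks, describe_figures(figure_images, vision_model)):
        chunk["text"] = chunk["text"] + "\n\n" + description  # field name is illustrative
    return chunks
```

The prompt that is sent to the vision model for this figure analysis is reproduced in full below.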
> You are an expert visual data analyst. Analyze the image and provide a comprehensive description of its content. Focus on identifying the type of visual data representation (e.g., bar chart, pie chart, line graph, table, flowchart), its structure, and any text captions or labels included in the image.
>
> Tasks:
> 1. Describe the overall structure of the visual representation. Specify if it is a chart, graph, table, or diagram.
> 2. Identify and extract any axes, legends, titles, or labels present in the image. Provide the exact text where available.
> 3. Extract the data points from the visual elements (e.g., bar heights, line graph coordinates, pie chart segments, table rows and columns).
> 4. Analyze and explain any trends, comparisons, or patterns shown in the data.
> 5. Capture any annotations, captions, or footnotes, and explain their relevance to the image.
> 6. Only include details that are explicitly present in the image. If an element (e.g., axis, legend, or caption) does not exist or is not visible, do not mention it.
>
> Output format (include only sections relevant to the image content):
> - Visual Type: [Type]
> - Title: [Title text, if available]
> - Axes / Legends / Labels: [Details, if available]
> - Data Points: [Extracted data]
> - Trends / Insights: [Analysis and interpretation]
> - Captions / Annotations: [Text and relevance, if available]
>
> Ensure high accuracy, clarity, and completeness in your analysis, and includes only the information present in the image. Avoid unnecessary statements about missing elements.
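For completeness, the page-level image2text parsing flow described at the start of this section can be sketched in the same spirit. The sketch below is a minimal illustration under stated assumptions, not RagFlow's implementation: it assumes pdf2image for page rendering and an OpenAI-compatible chat endpoint for the vision call, and the model name is a placeholder. Keeping the rendered page image around at this point is also what would make it easy to store the image alongside each chunk, as suggested earlier.

```python
# Minimal sketch (assumptions: pdf2image + an OpenAI-compatible vision endpoint);
# not RagFlow's actual parsing code.
import base64
from io import BytesIO

from openai import OpenAI
from pdf2image import convert_from_path

TRANSCRIBE_PROMPT = "INSTRUCTION: Transcribe the content ..."  # use the full prompt quoted earlier

client = OpenAI()  # assumes an API key / compatible gateway is configured via environment


def transcribe_pdf(path: str, model: str = "gpt-4o-mini") -> list[str]:
    """Render every PDF page to an image and ask a vision model to transcribe it."""
    pages = convert_from_path(path, dpi=200)  # one PIL image per page (requires poppler)
    markdown_pages = []
    for page in pages:
        buf = BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": TRANSCRIBE_PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        markdown_pages.append(resp.choices[0].message.content or "")
    return markdown_pages  # downstream: chunking, tokenization, indexing as usual
```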