The whole parsing process is relatively simple: the PDF is converted into images page by page, image2text is then called to extract the text from each page image, and the result finally goes through tokenization and other post-processing (a rough sketch of this flow appears at the end of this section). Unlike deepdoc, it does not do any deeper parsing of layout, tables, or similar structure. The prompt used to extract text from the page images is worth noting:

> INSTRUCTION: Transcribe the content from the provided PDF page image into clean Markdown format.
> - Only output the content transcribed from the image.
> - Do NOT output this instruction or any other explanation.
> - If the content is missing or you do not understand the input, return an empty string.
>
> RULES:
> 1. Do NOT generate examples, demonstrations, or templates.
> 2. Do NOT output any extra text such as 'Example', 'Example Output', or similar.
> 3. Do NOT generate any tables, headings, or content that is not explicitly present in the image.
> 4. Transcribe content word-for-word. Do NOT modify, translate, or omit any content.
> 5. Do NOT explain Markdown or mention that you are using Markdown.
> 6. Do NOT wrap the output in ```markdown or ``` blocks.
> 7. Only apply Markdown structure to headings, paragraphs, lists, and tables, strictly based on the layout of the image. Do NOT create tables unless an actual table exists in the image.
> 8. Preserve the original language, information, and order exactly as shown in the image.

As you can see, this prompt essentially makes the model act as an OCR engine: the LLM is explicitly forbidden from refining or reworking the information and is asked only for a word-for-word transcription. In my view there is still considerable room for improvement here. The current prompt does not tap into a multimodal LLM's reasoning and summarization abilities to consolidate and filter what is in the image; it simply uses the model as a high-end OCR engine, so the practical gain is fairly limited. Then again, this is still an experimental feature, so we can only look forward to future updates and optimizations.

Another point: when an image2text model is used to parse the PDF, the stored chunks do not keep the image each chunk came from, so retrieval can only return the text with no corresponding picture. Since the PDF has already been converted into images at this stage, the flow could easily be kept consistent with deepdoc's and store those images as well; this is another obvious place to optimize.

### Extracting chart and figure information from documents

Charts and figures in documents have been hard to extract and retrieve accurately with earlier parsing pipelines, even though the data they carry is often very important, which has made them a real pain point for RAG applications. The new RagFlow release finally fills this gap by using a multimodal LLM to pull out the information and data hidden in charts. By default no explicit configuration is required: as long as an image2text model is available and the PDF parser is deepdoc, any chart recognized during parsing is automatically sent to the image2text model, which tries to extract its content.
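The logic of that hook can be sketched roughly as follows. This is a simplified, hypothetical illustration rather than RagFlow's actual code: the `vision_model.describe()` interface, the `figure_images` input, and the chunk field name are all assumptions made for the example.

```python
# Hypothetical sketch: route figures detected during deepdoc parsing to an
# image2text model. Illustrative only -- not RagFlow's real code or API.

FIGURE_ANALYSIS_PROMPT = "You are an expert visual data analyst. ..."  # full prompt below


def describe_figures(figure_images, vision_model):
    """Ask the vision model to describe each detected figure; drop empty answers."""
    descriptions = []
    for image_bytes in figure_images:  # each item: one cropped figure as PNG bytes
        answer = vision_model.describe(image_bytes, prompt=FIGURE_ANALYSIS_PROMPT)
        if answer and answer.strip():
            descriptions.append(answer.strip())
    return descriptions


def enrich_chunks_with_figures(chunks, figure_images, vision_model):
    """Append each figure description to a chunk's text so chart data becomes retrievable.

    Pairing figures with chunks by index is a simplification for this sketch.
    """
    for chunk, description in zip(chunks, describe_figures(figure_images, vision_model)):
        chunk["text"] = chunk["text"] + "\n\n" + description  # field name is illustrative
    return chunks
```

The prompt that is sent to the vision model for this figure analysis is reproduced in full below.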
> You are an expert visual data analyst. Analyze the image and provide a comprehensive description of its content. Focus on identifying the type of visual data representation (e.g., bar chart, pie chart, line graph, table, flowchart), its structure, and any text captions or labels included in the image.
>
> Tasks:
> 1. Describe the overall structure of the visual representation. Specify if it is a chart, graph, table, or diagram.
> 2. Identify and extract any axes, legends, titles, or labels present in the image. Provide the exact text where available.
> 3. Extract the data points from the visual elements (e.g., bar heights, line graph coordinates, pie chart segments, table rows and columns).
> 4. Analyze and explain any trends, comparisons, or patterns shown in the data.
> 5. Capture any annotations, captions, or footnotes, and explain their relevance to the image.
> 6. Only include details that are explicitly present in the image. If an element (e.g., axis, legend, or caption) does not exist or is not visible, do not mention it.
>
> Output format (include only sections relevant to the image content):
> - Visual Type: [Type]
> - Title: [Title text, if available]
> - Axes / Legends / Labels: [Details, if available]
> - Data Points: [Extracted data]
> - Trends / Insights: [Analysis and interpretation]
> - Captions / Annotations: [Text and relevance, if available]
>
> Ensure high accuracy, clarity, and completeness in your analysis, and includes only the information present in the image. Avoid unnecessary statements about missing elements.
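For completeness, the page-level image2text parsing flow described at the start of this section can be sketched in the same spirit. The sketch below is a minimal illustration under stated assumptions, not RagFlow's implementation: it assumes pdf2image for page rendering and an OpenAI-compatible chat endpoint for the vision call, and the model name is a placeholder. Keeping the rendered page image around at this point is also what would make it easy to store the image alongside each chunk, as suggested earlier.

```python
# Minimal sketch (assumptions: pdf2image + an OpenAI-compatible vision endpoint);
# not RagFlow's actual parsing code.
import base64
from io import BytesIO

from openai import OpenAI
from pdf2image import convert_from_path

TRANSCRIBE_PROMPT = "INSTRUCTION: Transcribe the content ..."  # use the full prompt quoted earlier

client = OpenAI()  # assumes an API key / compatible gateway is configured via environment


def transcribe_pdf(path: str, model: str = "gpt-4o-mini") -> list[str]:
    """Render every PDF page to an image and ask a vision model to transcribe it."""
    pages = convert_from_path(path, dpi=200)  # one PIL image per page (requires poppler)
    markdown_pages = []
    for page in pages:
        buf = BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": TRANSCRIBE_PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        markdown_pages.append(resp.choices[0].message.content or "")
    return markdown_pages  # downstream: chunking, tokenization, indexing as usual
```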