LangExtract: a text extraction tool built on large language models


Overview


What is LangExtract

LangExtract is a Python library that uses large language models (LLMs) to extract structured information from unstructured text documents. It can process material such as clinical notes, literary texts, or reports, identifying and organizing key details while keeping a precise mapping between each extracted item and its position in the source text.

Core capabilities

Capability	Description
Source grounding	Maps every extraction to its exact character position in the source text
Structured output	Enforces a consistent output schema derived from a few examples
Long-document handling	Handles large texts through chunking and parallel processing
Interactive visualization	Generates an HTML file for reviewing extractions in context
Multi-provider support	Works with cloud LLMs (Gemini, OpenAI) and local models (Ollama)
Domain adaptability	Any extraction task can be configured with examples
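
To make "source grounding" concrete, here is a minimal sketch of what mapping an extraction to a character position means. It does not use LangExtract itself: `Span` and `locate` are hypothetical names for illustration, and the exact-match search is a simplifying assumption (the library's own aligner is more involved).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a character interval: [start, end)."""
    start: int
    end: int


def locate(source: str, extraction_text: str, search_from: int = 0) -> Optional[Span]:
    """Return the first exact occurrence of extraction_text at or after search_from."""
    start = source.find(extraction_text, search_from)
    if start == -1:
        return None  # no exact match; a real aligner might fall back to fuzzy matching
    return Span(start, start + len(extraction_text))


source = "Patient took 400 mg PO Ibuprofen q4h for two days."
span = locate(source, "Ibuprofen")
print(span, repr(source[span.start:span.end]))
```

Slicing the source with the returned span gives back the extracted text, which is what makes each extraction checkable against the original document.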

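Practice

Installation

LangExtract can be installed from PyPI or built from source. The library requires Python 3.10 or later and offers optional dependencies for specific providers.

Standard installation

pip install langextract

Development installation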

git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"

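Basic workflow

1. Define the input samples: write a prompt (prompt_description) and examples (examples) to guide the model.

2. Call lx.extract() to process the input text.

Internally, the call goes through these steps:

• Input handling: if fetch_urls=True and the input is a URL, the text is downloaded automatically
• Prompt template creation: PromptTemplateStructured organizes the prompt and the examples
• Model configuration: the language model is created according to parameter priority (model > config > model_id)
• Text processing: Annotator coordinates text chunking, parallel processing, and result parsing
• Result alignment: Resolver aligns the extractions to positions in the source text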
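
The chunking, parallel processing, and alignment steps above can be sketched roughly as follows. This is not LangExtract's implementation: `chunk_text`, `fake_extract`, and `extract_parallel` are hypothetical names, the model call is replaced by a plain string search, and the fixed-width split is a simplification (it can cut through a token, which a real chunker would avoid).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    return [(i, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]


def fake_extract(chunk: str) -> list[tuple[int, int]]:
    """Placeholder for a model call: returns chunk-local spans of the unit 'mg'."""
    spans, pos = [], chunk.find("mg")
    while pos != -1:
        spans.append((pos, pos + 2))
        pos = chunk.find("mg", pos + 2)
    return spans


def extract_parallel(text: str, max_chars: int = 20, max_workers: int = 4) -> list[tuple[int, int]]:
    """Process chunks concurrently, then shift local spans to document offsets."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(fake_extract, [chunk for _, chunk in chunks]))
    return [
        (offset + start, offset + end)
        for (offset, _), spans in zip(chunks, per_chunk)
        for start, end in spans
    ]


text = "Take 400 mg now. Then 250 mg at night."
spans = extract_parallel(text)
print(spans, [text[s:e] for s, e in spans])
```

Keeping the chunk offset next to each chunk is the design point: results from any chunk can be shifted back to positions in the full document regardless of the order in which the workers finish.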

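3. Visualize the results

Save the results and generate an interactive HTML visualization.

The returned AnnotatedDocument contains:

• The original text and a document_id
• A list of Extraction objects, each carrying char_interval position information
• An AlignmentStatus for each extraction, indicating match quality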
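
One way to use these fields is to keep only the extractions that are verifiably grounded in the source. The sketch below uses stand-in classes rather than LangExtract's own: `Status`, `Item`, and `grounded` are hypothetical, and the two status values are assumptions, not the library's actual enum members.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    """Hypothetical stand-in for AlignmentStatus."""
    EXACT = "exact"
    FUZZY = "fuzzy"


@dataclass
class Item:
    """Hypothetical stand-in for an Extraction with its character interval."""
    extraction_class: str
    extraction_text: str
    start: Optional[int] = None
    end: Optional[int] = None
    status: Optional[Status] = None


def grounded(items: list[Item], text: str) -> list[Item]:
    """Keep exact matches whose span reproduces the extraction text verbatim."""
    return [
        it for it in items
        if it.status is Status.EXACT
        and it.start is not None
        and text[it.start:it.end] == it.extraction_text
    ]


text = "Patient took 400 mg PO Ibuprofen q4h for two days."
items = [
    Item("dosage", "400 mg", 13, 19, Status.EXACT),
    Item("medication", "Ibuprofen", 23, 32, Status.EXACT),
    Item("frequency", "q4h", 33, 36, Status.FUZZY),
    Item("duration", "two days"),  # not aligned: no interval, no status
]
kept = grounded(items, text)
print([it.extraction_class for it in kept])
```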

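Note: for long documents, you can pass a URL directly and enable parallel processing and multiple extraction passes to improve performance and accuracy. The system supports several model providers (Gemini, OpenAI, Ollama, and others) and picks the appropriate provider automatically through a factory pattern.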
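
Multiple extraction passes only help if their results are combined sensibly. The following is a rough sketch of one possible merge rule, assuming that earlier passes win when spans overlap; `merge_passes` is a hypothetical helper and the rule is an assumption, not a description of what the library does.

```python
def merge_passes(passes: list[list[tuple[int, int, str]]]) -> list[tuple[int, int, str]]:
    """Merge (start, end, label) spans from several passes; earlier passes win on overlap."""
    merged: list[tuple[int, int, str]] = []
    for spans in passes:
        for start, end, label in spans:
            # Two half-open intervals overlap when each starts before the other ends.
            overlaps = any(start < e and s < end for s, e, _ in merged)
            if not overlaps:
                merged.append((start, end, label))
    return sorted(merged)


pass_1 = [(13, 19, "dosage"), (23, 32, "medication")]
pass_2 = [(13, 19, "dosage"), (20, 25, "route"), (33, 36, "frequency")]
result_spans = merge_passes([pass_1, pass_2])
print(result_spans)
```

Here the second pass contributes the frequency it found, while its duplicate dosage and the route span that collides with the medication span are dropped.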

Example: calling Qwen

import langextract as lx
from langextract import factory

from langextract.providers.openai import OpenAILanguageModel

# Text with a medication mention
input_text = "Patient took 400 mg PO Ibuprofen q4h for two days."

# Define extraction prompt
prompt_description = "Extract medication information including medication name, dosage, route, frequency, and duration in the order they appear in the text."

# Define example data with entities in order of appearance
examples = [
  lx.data.ExampleData(
    text="Patient was given 250 mg IV Cefazolin TID for one week.",
    extractions=[
      lx.data.Extraction(extraction_class="dosage", extraction_text="250 mg"),
      lx.data.Extraction(extraction_class="route", extraction_text="IV"),
      lx.data.Extraction(extraction_class="medication", extraction_text="Cefazolin"),
      lx.data.Extraction(extraction_class="frequency", extraction_text="TID"), # TID = three times a day
      lx.data.Extraction(extraction_class="duration", extraction_text="for one week")
    ]
  )
]

result = lx.extract(
  text_or_documents=input_text,
  prompt_description=prompt_description,
  examples=examples,
  fence_output=True,
  use_schema_constraints=False,
  model=OpenAILanguageModel(
    model_id='qwen-plus',
    base_url='',  # left blank here; set to your service endpoint
    api_key='',   # left blank here; set to your API key
    provider_kwargs={
      'connect_timeout': 60,  # allow 60 seconds for the SSL handshake
      'timeout': 120          # keep an overall request timeout of 120 seconds
    }
  )
)

# Display entities with positions
print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
  position_info = ""
  if entity.char_interval:
    start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
    position_info = f" (pos: {start}-{end})"
  print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

# Save and visualize the results
lx.io.save_annotated_documents([result], output_name="medical_ner_extraction.jsonl", output_dir=".")

# Generate the interactive visualization
html_content = lx.visualize("medical_ner_extraction.jsonl")
with open("medical_ner_visualization.html", "w") as f:
  if hasattr(html_content, 'data'):
    f.write(html_content.data)  # For Jupyter/Colab
  else:
    f.write(html_content)

print("Interactive visualization saved to medical_ner_visualization.html")

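The goal of this code is to connect the langextract library to a large language model (Qwen), automatically extract structured medication information (dosage, route, name, and so on) from medical text, and present the results by printing them, saving them to a file, and rendering an HTML visualization. It suits scenarios such as medical text analysis and medication information extraction.

The resulting page looks like this:

At first the page showed garbled characters; declaring UTF-8 in the generated HTML fixed it.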

<!DOCTYPE html>
<html>
<head>
 <meta charset="UTF-8">
 <title>Medical entity extraction visualization</title>
 <style>
   /* original CSS stays unchanged */
 </style>
</head>
<body>
 <!-- original HTML content stays unchanged -->
 <div class="lx-animated-wrapper lx-gif-optimized">
   <!-- ... original content ... -->
 </div>

 <script>
   // original JavaScript stays unchanged
 </script>
</body>
</html>
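
Rather than editing the generated file by hand each time, the charset declaration could be added when the file is written. The sketch below is one way to do that with the standard library only; `ensure_utf8_html` and `save_html` are hypothetical helpers, not part of langextract, and the sketch assumes the visualization is available as a plain HTML string.

```python
import re

META = '<meta charset="UTF-8">'


def ensure_utf8_html(html: str, title: str = "Extraction visualization") -> str:
    """Return HTML that declares UTF-8, wrapping bare fragments in a full page."""
    if re.search(r"<meta[^>]+charset", html, flags=re.IGNORECASE):
        return html  # a charset is already declared
    head = re.search(r"<head(?:\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head:
        # Insert the declaration right after the opening <head> tag.
        return html[:head.end()] + "\n" + META + html[head.end():]
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"{META}\n<title>{title}</title>\n"
        f"</head>\n<body>\n{html}\n</body>\n</html>\n"
    )


def save_html(html: str, path: str) -> None:
    """Write the page as UTF-8 so the bytes match the declared charset."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(ensure_utf8_html(html))


fragment = '<div class="lx-animated-wrapper">布洛芬</div>'
page = ensure_utf8_html(fragment)
print(page)
```

Passing `encoding="utf-8"` to `open` matters as much as the meta tag: on systems whose default encoding is not UTF-8, the file would otherwise be written in a different encoding than the one the page declares.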

I also found that the library does not support Chinese text!

Adding Chinese tokenization

Go to the tokenizer module:

# ✅ Updated to support Chinese characters (CJK Unified Ideographs, Extension A, Compatibility Ideographs)
#  and fullwidth digits

_LETTERS_PATTERN = (
 r"[A-Za-z\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
)
"""Matches a run of Chinese or English letters (CJK basic block, Extension A, compatibility block)."""

_DIGITS_PATTERN = (
 r"[0-9\uff10-\uff19]+"
)
"""Matches ASCII digits and fullwidth digits."""

_SYMBOLS_PATTERN = (
 r"[^A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\s]+"
)
"""Matches symbols: anything other than Chinese, English, digits, and whitespace (includes fullwidth symbols)."""

_END_OF_SENTENCE_PATTERN = re.compile(r"[.?!。？！]$")
"""Matches sentence-ending punctuation (English and Chinese)."""

_SLASH_ABBREV_PATTERN = (
 r"[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+"
 r"(?:/[A-Za-z0-9\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)+"
)
"""Matches slash-joined abbreviations or compounds in Chinese, English, or a mix of both."""

_TOKEN_PATTERN = re.compile(
 rf"{_SLASH_ABBREV_PATTERN}|{_LETTERS_PATTERN}|{_DIGITS_PATTERN}|{_SYMBOLS_PATTERN}"
)
"""General token pattern: supports Chinese, English, digits, and symbols."""

_WORD_PATTERN = re.compile(
 rf"(?:{_LETTERS_PATTERN}|{_DIGITS_PATTERN})\Z"
)
"""Matches a complete word (ends in letters or digits)."""

With the changes above, Chinese is supported.
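
To check what the modified patterns actually produce, they can be copied into a standalone script and run on a mixed Chinese/English sentence. The patterns below are the ones shown above; the `tokenize` wrapper and the sample sentence are additions for illustration and are not part of the library.

```python
import re

# Character class shared by the patterns: ASCII letters plus three CJK ranges.
_CJK = r"\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff"

_LETTERS_PATTERN = rf"[A-Za-z{_CJK}]+"
_DIGITS_PATTERN = r"[0-9\uff10-\uff19]+"
_SYMBOLS_PATTERN = rf"[^A-Za-z0-9{_CJK}\s]+"
_SLASH_ABBREV_PATTERN = rf"[A-Za-z0-9{_CJK}]+(?:/[A-Za-z0-9{_CJK}]+)+"

_TOKEN_PATTERN = re.compile(
    rf"{_SLASH_ABBREV_PATTERN}|{_LETTERS_PATTERN}|{_DIGITS_PATTERN}|{_SYMBOLS_PATTERN}"
)


def tokenize(text: str) -> list[str]:
    """Return the token strings that _TOKEN_PATTERN finds, in order."""
    return _TOKEN_PATTERN.findall(text)


tokens = tokenize("患者服用布洛芬400mg，每日3次。")
print(tokens)
# ['患者服用布洛芬', '400', 'mg', '，', '每日', '3', '次', '。']
```

Note that an uninterrupted run of Chinese characters comes out as a single token, since Chinese has no spaces between words. That is enough for the tokenizer to stop treating Chinese as symbols, but the granularity is coarse; finer alignment would likely need a real word segmenter.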

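PDF support

LangExtract currently handles only raw text strings. In real workflows, source files are usually PDF, DOCX, or PPTX. Today users have to:

1. Manually convert the file to text (losing layout and provenance).
2. Feed the plain text into LangExtract.
3. Manually map the extractions back to the original document for verification.

A single-step flow would make LangExtract much easier to adopt.

Proposed solution

Integrate the Docling library as an optional front end:

• Docling can convert many document formats into a unified DoclingDocument.
• It preserves provenance (page, bounding box, reading order).
• The extracted text chunks are fed into LangExtract the same way as today.
• Extractions are mapped back to the original document through the provenance metadata.
• The integration would be optional (pip install langextract[docling]), so the core package stays free of the extra dependency.

Proof of concept: this is only a validation so far and has not been merged into the codebase.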

import langextract as lx
import textwrap
# pdf_extract is the proof-of-concept module, not part of langextract
from pdf_extract import extract_with_file_support

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
  Extract characters, emotions, and relationships in order of appearance.
  Use exact text for extractions. Do not paraphrase or overlap entities.
  Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
  lx.data.ExampleData(
    text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
    extractions=[
      lx.data.Extraction(
        extraction_class="character",
        extraction_text="ROMEO",
        attributes={"emotional_state":"wonder"}
      ),
      lx.data.Extraction(
        extraction_class="emotion",
        extraction_text="But soft!",
        attributes={"feeling":"gentle awe"}
      ),
      lx.data.Extraction(
        extraction_class="relationship",
        extraction_text="Juliet is the sun",
        attributes={"type":"metaphor"}
      ),
    ]
  )
]

source = "<sample pdf file>.pdf"
result = extract_with_file_support(
  source=source,
  prompt_description=prompt,
  examples=examples,
  model_id="gemini-2.5-flash",
)

# result.extractions[0].extraction_text
# result.extractions[0].provenance
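
The "map extractions back through provenance metadata" step can be illustrated without any document library. The sketch below assumes a converter has already produced text blocks tagged with page numbers; `Block`, `build_index`, and `page_of` are hypothetical names and do not reflect Docling's or LangExtract's APIs.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block produced by a document converter."""
    page: int
    text: str


def build_index(blocks: list[Block], sep: str = "\n") -> tuple[str, list[int]]:
    """Join block texts and record the start offset of each block in the result."""
    starts, pos = [], 0
    for block in blocks:
        starts.append(pos)
        pos += len(block.text) + len(sep)
    return sep.join(block.text for block in blocks), starts


def page_of(char_pos: int, blocks: list[Block], starts: list[int]) -> int:
    """Return the page of the block that contains the given character position."""
    return blocks[bisect_right(starts, char_pos) - 1].page


blocks = [Block(1, "ROMEO. But soft!"), Block(2, "Juliet is the sun.")]
full_text, starts = build_index(blocks)
hit = full_text.find("Juliet is the sun")
print(hit, page_of(hit, blocks, starts))
```

Since the extraction step already reports character offsets, a sorted list of block start offsets is all that is needed to get from an extraction back to its page; a bounding box could be carried on the block in the same way.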

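Summary:

As it stands, langextract feels like a wrapper that integrates various large models and packages the input and output handling. It also does not actually support PDF yet.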

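Dimension	LangExtract strengths / weaknesses	Possible strengths / weaknesses of a (hypothetical) Qwen-driven extraction system	Overall assessment / recommendation
PDF / complex document support	Weakness: cannot ingest PDF, DOC, and similar formats directly; users must convert to text first	With a complete engineering design, it could accept PDF / DOC / OCR input directly	If your workflow involves many PDF or Word documents, a Qwen system that covers this step would have a clear advantage
Chinese / multilingual support	Weakness: known issues with non-Latin characters; alignment and character positions can be wrong	The advantage is likely more pronounced (especially in Chinese-dominant contexts)	In Chinese-dominant scenarios a Qwen system may do better, but it still needs adaptation and testing
Application scope / domain adaptability	Strength: a few examples are enough to adapt to a domain extraction task (legal, medical, statements, reports, and so on)	If the design allows flexible schema / prompt input, it can offer similar adaptability	For multi-domain extraction and rapid prototyping, both can do the job if well designed; what matters is extraction quality and stability
Accuracy / stability / consistency	Strength: emphasizes traceability, structurally constrained output, and multi-pass extraction to improve recall	With a well-designed model, prompt, and engineering strategy, it can in theory come close	In demanding or audit-heavy scenarios, LangExtract's traceability and visualization are a big plus; a Qwen system needs extra design work to match
Engineering / extensibility / integration	Weakness: relies on external tools for upstream processing such as PDF / OCR; the framework is relatively narrow in focus	Potentially a large advantage: upstream components (OCR, layout parsing, error correction, caching, extensions) can be integrated as needed	If you have a team with engineering capacity, a Qwen-driven system may be more flexible than LangExtract
Cost / efficiency	Weakness: multi-pass and chunked extraction can drive up the number of model calls and the cost	With good design (batching, caching, model quantization / acceleration), cost can be optimized	For large-scale or online services, cost control is a key consideration
Explainability / traceability	Strength: built-in source grounding and visual review support	With proper design it can match or come close	If your application cares about "why did the model extract this field, and where is it in the source text", LangExtract currently has a natural advantage here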
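
The cost row can be made concrete with a back-of-the-envelope count. The sketch below assumes one model call per chunk per pass and ignores retries and batching; `estimated_calls` is a hypothetical helper and the figures are illustrative, not measurements.

```python
import math


def estimated_calls(doc_chars: int, chunk_chars: int, passes: int) -> int:
    """Rough model-call count: one call per chunk, repeated for every pass."""
    return math.ceil(doc_chars / chunk_chars) * passes


# Illustrative numbers only: a 50,000-character document in 1,000-character chunks.
single = estimated_calls(50_000, 1_000, passes=1)
triple = estimated_calls(50_000, 1_000, passes=3)
print(single, triple)
```

Under these assumptions the call count grows linearly with both document length and the number of passes, so tripling the passes triples the bill.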

Apart from source grounding (which is itself of marginal use), I don't see much of an advantage.