Incrementally Building a Knowledge Graph of Journey to the West with iText2KG


iText2KG is a Python package that uses large language models to extract entities and relations from text documents and incrementally build a consistent knowledge graph in which entities and relations are resolved. It works zero-shot, extracting knowledge across domains without task-specific training. The package includes modules for document distillation, entity extraction, and relation extraction, which keep entities and relations resolved and unique. It keeps updating the knowledge graph as new documents arrive and can integrate it into tools such as Neo4j for visualization.

Overall architecture

The iText2KG package consists of four main modules that work together to build and visualize a knowledge graph from unstructured text. An overview of the architecture:

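1. Document Distiller: processes the raw documents and reorganizes them into semantic blocks according to a user-defined schema. By focusing on relevant information and structuring it in a predefined format, it improves the signal-to-noise ratio.
2. Incremental Entity Extractor: extracts unique entities from the semantic blocks and resolves ambiguities so that each entity is clearly defined. It matches local entities against global entities using a cosine similarity measure.
3. Incremental Relation Extractor: identifies the relations between the extracted entities. It can run in two modes: using global entities to enrich the graph with latent information, or using local entities to establish more precise relations.
4. Graph Integrator and visualization: integrates the extracted entities and relations into a Neo4j database and provides a visual representation of the knowledge graph, which allows interactive exploration and analysis of the structured data.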
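To make the matching step of module 2 concrete, here is a minimal sketch of how a local entity could be matched against a set of global entities by cosine similarity. It is not iText2KG's actual implementation: the helper names, the toy three-dimensional vectors, and the default threshold of 0.6 are assumptions for illustration, and real embeddings would come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_entity(local_vec, global_entities, threshold=0.6):
    """Return the name of the most similar global entity, or None when
    no global entity reaches the threshold (hypothetical helper)."""
    best_name, best_score = None, threshold
    for name, vec in global_entities.items():
        score = cosine_similarity(local_vec, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Toy embeddings, made up for illustration
global_entities = {
    "Sun Wukong": [0.9, 0.1, 0.0],
    "Flower-Fruit Mountain": [0.0, 0.2, 0.9],
}
print(match_entity([0.8, 0.2, 0.1], global_entities))  # similar to "Sun Wukong"
print(match_entity([0.0, 1.0, 0.0], global_entities))  # below the threshold
```

In this sketch, a local entity that finds a match would be merged into the existing global entity, and one that finds none would be added as a new global entity. Raising the threshold makes merging stricter.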

[Figure: overall architecture of iText2KG]

The LLM is prompted to extract entities that each represent a single unique concept, which avoids entities that blend several meanings. The figure below shows the entity and relation extraction prompts, which use the Langchain JSON parser. The prompt text is categorized as follows:

• Blue: prompts formatted automatically by Langchain;
• Regular: prompts designed by the iText2KG authors;
• Italic: prompts designed specifically for entity and relation extraction.

• (a) relation extraction prompt, and
• (b) entity extraction prompt.

[Figure: (a) relation extraction prompt; (b) entity extraction prompt]

Installation

To install iText2KG, make sure Python is installed, then install the package with pip:

pip install itext2kg

Or install it with poetry:

poetry add itext2kg

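Preparing the documents

For this demonstration I use a vernacular (modern Chinese) edition of Journey to the West.

Get the text (a Baidu search turns up plenty of copies) and put it in the datasets directory.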

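Loading the models

iText2KG uses two models for knowledge graph extraction, a chat model and an embedding model. Both can be local Ollama models:

from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model used for extraction; temperature 0 keeps the output as deterministic as possible
llm = ChatOllama(
    model="glm4",
    temperature=0,
)
# Embedding model used for similarity matching
embeddings = OllamaEmbeddings(
    model="glm4",
)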

Remember to install the dependencies:

pip install langchain-community langchain-ollama

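Defining the instructions

iText2KG makes it easy to define the extraction instructions. Here is mine:

IE_query = '''
# DIRECTIVES:
- Act like an experienced information extractor.
- The information to extract covers characters, places, events, items, quests, and skills.
- You have read a large number of stories.
- If you do not find the right information, keep its place empty.
'''

You can customize this lightly for the kinds of data you want to extract.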

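Processing the file

As with RAG, the file needs to be preprocessed first. Here is the function; feel free to reuse it:

from langchain_community.document_loaders import PythonLoader

def build_sections(file_path):
    # `document_distiller` (an iText2KG distiller instance) and `CV` (its output schema)
    # are assumed to be set up beforehand; the original post does not show that step.
    loader = PythonLoader(file_path)
    pages = loader.load_and_split()

    # Replace curly braces with square brackets to avoid errors in the query
    distilled_cv = document_distiller.distill(
        documents=[page.page_content.replace("{", "[").replace("}", "]") for page in pages],
        IE_query=IE_query,
        output_data_structure=CV,
    )

    # Keep only the non-empty fields, formatted as "key - value"
    sections = [
        f"{key} - {value}".replace("{", "[").replace("}", "]")
        for key, value in distilled_cv.items()
        if value != [] and value != "" and value is not None
    ]
    return sections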
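The last step of the function can be tried without an LLM. The following sketch isolates the filtering and formatting logic and applies it to a hand-written dictionary; the helper name to_sections and the sample field names and values are made up for illustration and are not real distiller output. The brace replacement is presumably needed because curly braces act as placeholders in prompt templates.

```python
def to_sections(distilled):
    """Turn a distilled dict into "key - value" strings, skipping empty fields."""
    return [
        f"{key} - {value}".replace("{", "[").replace("}", "]")
        for key, value in distilled.items()
        if value != [] and value != "" and value is not None
    ]

# Hypothetical distilled output; field names and values are invented
distilled = {
    "characters": ["Stone Monkey", "Subhuti"],
    "places": ["Flower-Fruit Mountain"],
    "items": [],          # empty list: dropped
    "notes": {"source": "chapter 1"},  # braces become square brackets
    "skills": None,       # None: dropped
}
for section in to_sections(distilled):
    print(section)
```

Each surviving string becomes one section that is later passed to the graph builder, so empty fields never reach the LLM.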

To process the file xiyou01.txt, call it like this:

sections = build_sections('./datasets/xiyou01.txt')

The contents of sections look roughly like this:

Building the graph

Next comes building the graph. Here is the function I wrote; feel free to reuse it:

def build_graph(sections, existing_global_entities=None, existing_global_relationships=None,
                ent_threshold=0.6, rel_threshold=0.6):
    # `itext2kg` is assumed to be an iText2KG instance created earlier;
    # the original post does not show that step.
    global_ent, global_rel = itext2kg.build_graph(
        sections=sections,
        ent_threshold=ent_threshold,
        rel_threshold=rel_threshold,
        existing_global_relationships=existing_global_relationships,
        existing_global_entities=existing_global_entities,
    )
    print(global_rel)
    print(global_ent)
    return global_ent, global_rel

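Pass in the sections from above to get the node and relation data:

global_ent, global_rel = build_graph(sections)

The node data looks roughly like this:

The relation data looks roughly like this:

The debug output on the console shows that iText2KG performs steps such as relation cleanup and node deduplication: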

Visualization

We use pyvis to display the graph.

First, install it:

pip install pyvis

The program is fairly simple:

from pyvis.network import Network

net = Network(height="100vh", width="100%")
# One node per extracted entity
for x in global_ent:
    net.add_node(x['name'])
# One edge per extracted relation
for x in global_rel:
    net.add_edge(x['startNode'], x['endNode'], weight=1)
net.show('mygraph.html', notebook=False)

Then open the generated mygraph.html file to see the relation data:

Adjust the parameters and you may get more nodes and relations:

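Incremental construction

A highlight of iText2KG's graph building is incremental construction.

For example, having just built the graph for chapter 1 of Journey to the West, we can build chapter 2 on top of it instead of building chapters 1 and 2 together from scratch.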

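# text02 is the path to the chapter 2 text file
sections2 = build_sections(text02)
# Reuse the chapter 1 results (global_ent, global_rel) as the starting point
global_ent2, global_rel2 = build_graph(
    sections2,
    existing_global_entities=global_ent,
    existing_global_relationships=global_rel,
)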
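The same pattern extends to any number of chapters. The sketch below chains the calls in a loop; build_incrementally is a hypothetical helper, and it takes the two functions as parameters so that the chaining logic can be run here with stand-in functions instead of a real LLM. The file names are placeholders.

```python
def build_incrementally(paths, build_sections, build_graph):
    """Process files one by one, feeding each result into the next call."""
    global_ent, global_rel = None, None
    for path in paths:
        sections = build_sections(path)
        global_ent, global_rel = build_graph(
            sections,
            existing_global_entities=global_ent,
            existing_global_relationships=global_rel,
        )
    return global_ent, global_rel

# Stand-in functions so the sketch runs without an LLM; swap in the real ones.
def fake_build_sections(path):
    return [f"section from {path}"]

def fake_build_graph(sections, existing_global_entities=None,
                     existing_global_relationships=None):
    entities = list(existing_global_entities or []) + sections
    relationships = list(existing_global_relationships or [])
    return entities, relationships

ents, rels = build_incrementally(
    ["ch01.txt", "ch02.txt"], fake_build_sections, fake_build_graph
)
print(ents)
```

With the real build_sections and build_graph passed in, each chapter only costs one extraction pass, and earlier chapters are never reprocessed.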

Let's plot it again and look at the structure:

This time the result is not great: some isolated nodes appeared.
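Isolated nodes can also be listed programmatically rather than spotted by eye. The following is a minimal sketch, assuming entities and relations are dictionaries with the name, startNode, and endNode keys used in the pyvis code above, and that startNode and endNode hold entity names. The helper name and the sample data are made up.

```python
def find_isolated_nodes(entities, relationships):
    """Return the names of entities that appear in no relationship."""
    connected = set()
    for rel in relationships:
        connected.add(rel["startNode"])
        connected.add(rel["endNode"])
    return [ent["name"] for ent in entities if ent["name"] not in connected]

# Made-up sample data in the assumed shape
entities = [{"name": "Sun Wukong"}, {"name": "Subhuti"}, {"name": "Dragon King"}]
relationships = [
    {"startNode": "Sun Wukong", "endNode": "Subhuti", "name": "disciple_of"},
]
print(find_isolated_nodes(entities, relationships))  # ['Dragon King']
```

A long list of isolated names after an incremental run may suggest trying different ent_threshold and rel_threshold values.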

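Specifying the structure

Another great feature of iText2KG is that you can specify the attribute structure of the nodes.

First declare a node class, for example the job offer schema from the official examples: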


from typing import List, Optional
from pydantic import BaseModel, Field

# JobResponsibility, JobQualification, and JobCertification are nested models
# from the official example; their definitions are not shown here.
class JobOffer(BaseModel):
    job_offer_title: str = Field(..., description="The job title")
    company: str = Field(..., description="The name of the company offering the job")
    location: str = Field(..., description="The job location (can specify if remote/hybrid)")
    job_type: str = Field(..., description="Type of job (e.g., full-time, part-time, contract)")
    responsibilities: List[JobResponsibility] = Field(..., description="List of key responsibilities")
    qualifications: List[JobQualification] = Field(..., description="List of required or preferred qualifications")
    certifications: Optional[List[JobCertification]] = Field(None, description="Required or preferred certifications")
    benefits: Optional[List[str]] = Field(None, description="List of job benefits")
    experience_required: str = Field(..., description="Required years of experience")
    salary_range: Optional[str] = Field(None, description="Salary range for the position")
    apply_url: Optional[str] = Field(None, description="URL to apply for the job")

Then, when distilling, we can pass the class as the output_data_structure parameter:

distilled_Job_Offer = document_distiller.distill(
    documents=[job_offer],
    IE_query=IE_query,
    output_data_structure=JobOffer,
)

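Final thoughts

The performance of AI applications, LLM applications in particular, depends largely on the underlying model.

iText2KG is no exception: a model with few parameters may well fail to extract the expected results.

My personal impression is that the framework is usable, but it is slow, its accuracy is limited, and its token cost is fairly high. On the bright side, the project is under active development.

Project code: https://github.com/AuvaLab/itext2kg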