Building Knowledge Graphs from Documents with Docling - 链载Ai

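A knowledge graph is a structured way of representing information, made up of nodes and edges. Nodes represent entities (such as people, places, or concepts), and edges represent the relationships between those entities. Organizing information this way makes data exploration more intuitive, makes complex queries easier to answer, and supports advanced analytics. Knowledge graphs are widely used in search engines, recommendation systems, and data integration to provide deeper insight and support decision-making.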

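Using Docling for document extraction can considerably simplify building a knowledge graph. Docling parses many document formats, including complex PDF files, and provides a structured representation of the document's content, which makes key entities and relationships easier to identify. Compared with raw text that needs heavy preprocessing, Docling's better-organized output makes it easier to extract the specific information needed to populate a knowledge graph, for example the entities in the sample document ("Paris", "Eiffel Tower") and their relationships ("located in", "designed by"). This structured approach reduces the effort involved in information extraction and can improve the accuracy of the resulting knowledge graph.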
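As a rough illustration of the extraction step described above, the sketch below converts one document with Docling and returns its Markdown export. `DocumentConverter`, `convert`, and `export_to_markdown` are Docling calls. The helper name `pdf_to_markdown` and any file path passed to it are hypothetical and only serve the example. The article's own listing uses `export_to_text()` instead. Treat this as a minimal sketch assuming default pipeline options, not a tuned setup.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: Path) -> str:
    """Convert one document with Docling and return its Markdown export.

    Hypothetical helper for illustration; default converter options assumed.
    """
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)  # accepts a path or URL
    return result.document.export_to_markdown()
```

A call such as `pdf_to_markdown(Path("sample.pdf"))` (with `sample.pdf` standing in for a real file) would return text that can then be passed to an entity extractor.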

Code Implementation

With the idea introduced, I decided to write a sample program that builds a knowledge graph from a PDF.

import json
import logging
import time
from pathlib import Path

import spacy
import networkx as nx
import matplotlib.pyplot as plt

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Load a spaCy language model
nlp = spacy.load("en_core_web_sm")


def extract_text_from_docling_document(docling_document):
    """Extracts text content from a Docling Document object."""
    text = docling_document.export_to_text()
    return text


def build_knowledge_graph(text):
    doc = nlp(text)
    graph = nx.Graph()

    # Extract entities
    for ent in doc.ents:
        graph.add_node(ent.text, label=ent.label_)

    # Simple relationship extraction (can be improved)
    for sent in doc.sents:
        for i, token in enumerate(sent):
            if token.dep_ in ["nsubj", "dobj"]:
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                object_ = [w for w in token.head.rights if w.dep_ == "dobj"]
                if subject and object_:
                    graph.add_edge(subject[0].text, object_[0].text, relation=token.head.lemma_)
                elif subject and token.head.lemma_ in ["be", "have"]:
                    right_children = [child for child in token.head.rights if child.dep_ in ["attr", "acomp"]]
                    if right_children:
                        graph.add_edge(subject[0].text, right_children[0].text, relation=token.head.lemma_)
    return graph


def visualize_knowledge_graph(graph):
    """Visualizes the knowledge graph."""
    pos = nx.spring_layout(graph)
    nx.draw(graph, pos, with_labels=True, node_size=3000, node_color="skyblue", font_size=10, font_weight="bold")
    edge_labels = nx.get_edge_attributes(graph, 'relation')
    nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels)
    plt.title("Knowledge Graph from Document")
    plt.show()


def main():
    logging.basicConfig(level=logging.INFO)
    _log = logging.getLogger(__name__)  # Initialize the logger here
    # nlp = spacy.load("en_core_web_sm") # Load spacy # Removed from here
    # input_doc_path = Path("./input/2503.11576v1.pdf")
    input_doc_path = Path("./inp

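The city of Paris is located in France and is renowned for its iconic Eiffel Tower. It is a popular tourist destination. The tower was designed by Gustave Eiffel. Marie Curie, a famous scientist, was born in Paris and made significant contributions to the field of radiology. She worked at the Radium Institute. The Seine River flows through Paris.
Explanation of why this is suitable: the text contains several entities and relationships that can be easily extracted and represented in a knowledge graph:
・Entities: Paris, France, Eiffel Tower, Gustave Eiffel, Marie Curie, Radium Institute, Seine River
・Relationships: Paris is located in France. Paris is renowned for the Eiffel Tower. The Eiffel Tower was designed by Gustave Eiffel. Marie Curie was born in Paris. Marie Curie is a scientist. Marie Curie made contributions to the field of radiology. Marie Curie worked at the Radium Institute. The Seine River flows through Paris.
A knowledge graph built from this text represents these entities as nodes and the relationships as edges, giving a structured view of the information.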
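For comparison with whatever an automatic extractor produces, the relationships listed above can be written by hand as (subject, relation, object) triples and loaded into NetworkX. This is a minimal sketch: the relation names such as `located_in` are labels chosen for this example, not output of the article's code. It also uses a directed `nx.DiGraph` (the listing above uses an undirected `nx.Graph`) so that "Paris located_in France" keeps its direction.

```python
import networkx as nx

# Hand-written triples taken from the relationship list in the sample text.
# Relation names are illustrative labels, not produced by any extractor.
TRIPLES = [
    ("Paris", "located_in", "France"),
    ("Paris", "known_for", "Eiffel Tower"),
    ("Eiffel Tower", "designed_by", "Gustave Eiffel"),
    ("Marie Curie", "born_in", "Paris"),
    ("Marie Curie", "is_a", "scientist"),
    ("Marie Curie", "contributed_to", "radiology"),
    ("Marie Curie", "worked_at", "Radium Institute"),
    ("Seine River", "flows_through", "Paris"),
]


def graph_from_triples(triples):
    """Build a directed graph with one edge per triple, storing the relation as an edge attribute."""
    graph = nx.DiGraph()
    for subject, relation, obj in triples:
        graph.add_edge(subject, obj, relation=relation)
    return graph


graph = graph_from_triples(TRIPLES)

# List everything the graph states about Marie Curie.
for _, target, relation in graph.out_edges("Marie Curie", data="relation"):
    print(f"Marie Curie --{relation}--> {target}")
```

A reference graph like this gives a simple target to compare extracted nodes and edges against.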

INFO:__main__:Document converted in 8.63 seconds.
WARNING:docling_core.types.doc.document:Parameter `strict_text` has been deprecated and will be ignored.
Number of nodes: 23
Number of edges: 7
2025-04-23 21:33:52.828 python3[73966:691115] The class 'NSSavePanel' overrides the method identifier. This method is implemented by class 'NSWindow'
Nodes: [('Paris', {'label': 'GPE'}), ('France', {'label': 'GPE'}), ('Eiffel Tower', {'label': 'FAC'}), ('Gustave Eiffel', {'label': 'PERSON'}), ('Marie Curie', {'label': 'PERSON'}), ('the Radium Institute', {'label': 'FAC'}), ('Seine River', {'label': 'LOC'}), ('## Explanation', {'label': 'MONEY'}), ('Radium Institute', {'label': 'ORG'}), ('the Eiffel Tower', {'label': 'LOC'}), ('The Eiffel Tower', {'label': 'LOC'}), ('city', {}), ('renowned', {}), ('It', {}), ('destination', {}), ('Explanation', {}), ('entities', {}), ('this', {}), ('suitable', {}), ('Curie', {}), ('scientist', {}), ('contributions', {}), ('graph', {})]
Edges: [('city', 'renowned', {'relation': 'be'}), ('It', 'destination', {'relation': 'be'}), ('Explanation', 'entities', {'relation': 'contain'}), ('entities', 'graph', {'relation': 'represent'}), ('this', 'suitable', {'relation': 'be'}), ('Curie', 'scientist', {'relation': 'be'}), ('Curie', 'contributions', {'relation': 'make'})]
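The node list above contains near-duplicates ('Eiffel Tower', 'the Eiffel Tower', 'The Eiffel Tower') and one piece of Markdown residue ('## Explanation'). One possible clean-up pass, which is not part of the original program, is sketched below. It strips a leading article, drops names that start with `#`, and merges nodes that end up with the same name. The helper names `canonical_name` and `merge_nodes` are hypothetical, and keeping the first label seen is an arbitrary choice made for the example.

```python
import re
from typing import Optional

# Leading English articles to strip from entity names.
ARTICLES = re.compile(r"^(the|a|an)\s+", flags=re.IGNORECASE)


def canonical_name(name: str) -> Optional[str]:
    """Return a normalized entity name, or None if the string looks like markup noise."""
    cleaned = name.strip()
    if cleaned.startswith("#"):  # Markdown heading residue such as '## Explanation'
        return None
    cleaned = ARTICLES.sub("", cleaned)
    return cleaned or None


def merge_nodes(nodes):
    """Collapse (name, attrs) pairs that share a canonical name, keeping the first value seen per attribute."""
    merged = {}
    for name, attrs in nodes:
        key = canonical_name(name)
        if key is None:
            continue
        entry = merged.setdefault(key, {})
        for attr, value in attrs.items():
            entry.setdefault(attr, value)
    return merged


# A few of the nodes from the output above.
sample_nodes = [
    ("Eiffel Tower", {"label": "FAC"}),
    ("the Eiffel Tower", {"label": "LOC"}),
    ("The Eiffel Tower", {"label": "LOC"}),
    ("## Explanation", {"label": "MONEY"}),
    ("the Radium Institute", {"label": "FAC"}),
    ("Radium Institute", {"label": "ORG"}),
]
merged_nodes = merge_nodes(sample_nodes)
print(merged_nodes)
```

On these six sample nodes the pass leaves two entries, one for the Eiffel Tower and one for the Radium Institute. The conflicting labels (FAC vs. LOC vs. ORG) show that label disagreement would still need a real resolution rule.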

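Conclusion

In short, using Docling in the document extraction pipeline simplifies identifying key entities and relationships in complex documents, which improves the accuracy and efficiency of knowledge graph creation.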
