Image created by DALL-E on February 6, 2024

Introduction

A research graph is an information graph that represents research objects in structured form. It captures entities (researchers, organizations, publications, grants, and research data) and the relationships between them. Publications are currently available mostly as PDF files, and because their content is free text, extracting structured information from them is difficult. In this article, we try to build a research graph by extracting the relevant information from PDF publications and using OpenAI to organize it into a graph structure.

Figure: Workflow for creating a graph from PDFs

OpenAI

In this work, we use the OpenAI API and the new Assistants feature of GPT (currently in Beta) to convert PDF documents into a set of structured JSON files based on the research graph schema.

Assistants API

The Assistants API lets you build artificial intelligence (AI) assistants within your applications. An assistant answers user queries by drawing on models, tools, and information. The API is in Beta and under active development. Through it, we can use OpenAI-hosted tools such as Code Interpreter and Knowledge Retrieval. This article focuses on Knowledge Retrieval.

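Knowledge Retrieval

Sometimes we need the AI model to answer queries based on knowledge it has not seen, such as user-provided documents or sensitive information. The Knowledge Retrieval tool of the Assistants API augments the model with this information. We upload files to the assistant, which automatically chunks the documents, then creates and stores embeddings so that the data can be searched by vector similarity.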
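To make the chunk, embed, and search idea concrete, here is a minimal sketch. It is not how OpenAI implements retrieval: the word-count vector below is a toy stand-in for a learned embedding, and `chunk_text`, `embed`, `cosine`, and `search` are hypothetical helper names introduced only for illustration.

```python
import math
from collections import Counter


def chunk_text(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(word.strip(".,;:()?").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks):
    """Return the chunk whose vector is most similar to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


document = (
    "Dodes are organized hierarchically into Dode trees. "
    "The DOI of this paper is listed on the first page."
)
chunks = chunk_text(document, size=8)
best = search("Where is the DOI listed?", chunks)
print(best)
```

A real retrieval tool would replace `embed` with a model-generated vector and store the vectors in an index, but the flow (split, vectorize, rank by similarity) is the same.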

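Example

In our example, we upload a publication's PDF file to the OpenAI assistant with the Knowledge Retrieval tool and obtain JSON output that follows the graph schema for that publication. The publication we use can be accessed at the following link^[1].

Step 1

Read the input path where the publication PDFs are stored and the output path where the JSON output will be stored.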

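import configparser
import os

# Assumption: config.ini sits in the working directory
# (the original does not show how current_path is set).
current_path = os.getcwd()

config = configparser.ConfigParser()
config.read('{}/config.ini'.format(current_path))
input_path = config['DEFAULT']['Input-Path']
output_path = config['DEFAULT']['Output-Path']
debug = config['DEFAULT']['Debug']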
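The code above expects a `config.ini` with three keys in its `DEFAULT` section. The sketch below writes such a file to a temporary directory and reads it back. The values `./pdfs` and `./json` are placeholders rather than the author's paths, and `getboolean` is shown as one possible way to turn the `Debug` string into a real boolean.

```python
import configparser
import os
import tempfile

# Placeholder contents; the real paths are up to you.
SAMPLE_CONFIG = """\
[DEFAULT]
Input-Path = ./pdfs
Output-Path = ./json
Debug = false
"""

with tempfile.TemporaryDirectory() as current_path:
    config_file = os.path.join(current_path, "config.ini")
    with open(config_file, "w", encoding="utf-8") as fh:
        fh.write(SAMPLE_CONFIG)

    config = configparser.ConfigParser()
    config.read(config_file, encoding="utf-8")

    input_path = config["DEFAULT"]["Input-Path"]
    output_path = config["DEFAULT"]["Output-Path"]
    # config["DEFAULT"]["Debug"] is the string "false", which is truthy in Python;
    # getboolean converts it to an actual bool.
    debug = config["DEFAULT"].getboolean("Debug")
```

Reading `Debug` as a plain string means `if debug:` is true even when the file says `false`, which is why the boolean conversion may be worth the extra call.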

Step 2

Get all PDF files from the input path.

import os

onlyfiles = [f for f in os.listdir(input_path) if os.path.isfile(os.path.join(input_path, f))]
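The list comprehension above returns every regular file in the input folder, so a stray non-PDF file would be uploaded as well. A minimal sketch of a stricter filter follows; `list_pdfs` is a hypothetical helper, not part of the original code.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return sorted names of regular files in `folder` ending in .pdf (any case)."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Small demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("paper.pdf", "SCAN.PDF", "notes.txt"):
        (Path(folder) / name).write_bytes(b"")
    (Path(folder) / "archive.pdf").mkdir()  # a directory, so it is skipped
    found = list_pdfs(folder)
```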

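Step 3

Next, we initialize the assistant so that it can use the Knowledge Retrieval tool. To do this, we specify a tool of type "retrieval" in the API call. We also specify the assistant's instructions and the OpenAI model to use.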

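from openai import OpenAI

# The original does not show how the client is created; by default
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

my_file_ids = []
if client.files.list().data == []:
    for f in onlyfiles:
        file = client.files.create(
            file=open(os.path.join(input_path, f), "rb"),
            purpose='assistants'
        )
        my_file_ids.append(file.id)

# Create the assistant
assistant = client.beta.assistants.create(
    instructions="You are a publication database support chatbot. Use the uploaded PDF files to best respond to user queries, with output in JSON format.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    # Do not attach all files to the assistant; otherwise answers get mismatched
    # even when the file ID is specified in the query message.
    # We attach a file to each message separately instead.
)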

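Step 4

We then specify the information to extract from the publication files and pass each item to the assistant as a user query. In our experiments, asking for JSON-formatted output in every user message produced the most consistent results.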

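user_msgs = [
    "Print the title of this paper in JSON format",
    "Print the authors of this paper in JSON format",
    "Print the abstract section of this paper in JSON format",
    "Print the keywords of this paper in JSON format",
    "Print the DOI number of this paper in JSON format",
    "Print the author affiliations of this paper in JSON format",
    "Print the references section of this paper in JSON format",
]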
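Because the model chooses its own key names, one possible refinement is to name the expected top-level key in each prompt. This is a sketch under assumptions: `FIELDS` and `build_prompts` are hypothetical, the wording is illustrative, and there is no guarantee the model will follow the requested key.

```python
# Hypothetical mapping from the JSON key we want to the wording used in the prompt.
FIELDS = {
    "title": "title",
    "authors": "authors",
    "abstract": "abstract section",
    "keywords": "keywords",
    "doi": "DOI number",
    "affiliations": "author affiliations",
    "references": "references section",
}


def build_prompts(fields):
    """Build one prompt per field, naming the top-level JSON key explicitly."""
    return [
        f'Print the {wording} of this paper in JSON format, '
        f'using "{key}" as the top-level key'
        for key, wording in fields.items()
    ]


user_msgs_with_keys = build_prompts(FIELDS)
```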

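Step 5

The next step is to pass the queries to the assistant and generate the output. For each user query we create a separate thread object that contains the query as a user message. We then run the thread and retrieve the assistant's answer.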

import json
import logging
import time

all_results = []
for i in my_file_ids:
    print('\n#####')
    # JSON result for this file; it can be extracted and parsed, hopefully
    file_result = {}
    for q in user_msgs:
        # Create a thread, a user message, and a run object for each query
        thread = client.beta.threads.create()
        msg = client.beta.threads.messages.create(
            thread_id=thread.id,
            role="user",
            content=q,
            file_ids=[i]  # the file/publication to extract from
        )
        print('\n', q)
        run = client.beta.threads.runs.create(
            thread_id=thread.id,
            assistant_id=assistant.id,
            additional_instructions="If the answer cannot be found, print 'False'"  # not very useful at the time of this demo
        )
        # Check the run status by retrieving the updated object each time
        while run.status in ["queued", 'in_progress']:
            print(run.status)
            time.sleep(5)
            run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
        # Usually a rate limit error
        if run.status == 'failed':
            logging.info("Run failed: %s", run)
        if run.status == 'completed':
            print("<Complete>")
            # Fetch the updated message list; this includes the user message
            messages = client.beta.threads.messages.list(thread_id=thread.id)
            for m in messages:
                if m.role == 'assistant':
                    value = m.content[0].text.value  # get the text response
                    if "json" not in value:
                        if value == 'False':
                            logging.info("No answer found for: %s", q)
                        else:
                            logging.info("Not JSON output; the answer may be missing from the file or the model may be outdated: %s", value)
                    else:
                        # Clean the response and try to parse it as JSON
                        value = value.split("```")[1].split('json')[-1].strip()
                        try:
                            d = json.loads(value)
                            file_result.update(d)
                            print(d)
                        except Exception as e:
                            logging.info("Query %s \nFailed to parse string to JSON: %s", q, e)
                            print(f"Query {q} \nFailed to parse string to JSON: ", str(e))
    all_results.append(file_result)
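The cleaning step above splits on the word `json`, so it breaks whenever that word also appears inside the answer itself. A more tolerant variant might look like the sketch below. `extract_json` is a hypothetical helper, and the assumption that the assistant wraps its answer in a fenced block is taken from the code above rather than from any API guarantee.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick marker the assistant wraps its answer in

# Matches a fenced block, with or without a "json" language tag.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json(text):
    """Return the first parsable JSON value found in `text`, or None."""
    # Try each fenced block first, then fall back to the whole reply.
    candidates = FENCE_RE.findall(text) + [text]
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None


reply = (
    "Here is the result:\n"
    + FENCE + 'json\n{"title": "Parsing json replies"}\n' + FENCE
)
parsed = extract_json(reply)
```

Returning `None` instead of raising lets the caller log the failed query and move on, which matches how the loop above treats unparsable answers.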

The generated JSON output is as follows:

[
  {
    "title": "Dodes (diagnostic nodes) for Guideline Manipulation",
    "authors": [
      {"name": "PM Putora", "affiliation": "Department of Radiation-Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland"},
      {"name": "M Blattner", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
      {"name": "A Papachristofilou", "affiliation": "Department of Radiation Oncology, University Hospital Basel, Basel, Switzerland"},
      {"name": "F Mariotti", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
      {"name": "B Paoli", "affiliation": "Laboratory for Web Science, Zürich, Switzerland"},
      {"name": "L Plasswilma", "affiliation": "Department of Radiation-Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland"}
    ],
    "Abstract": {
      "Background": "Treatment recommendations (guidelines) are commonly represented in text form. Based on parameters (questions) recommendations are defined (answers).",
      "Objectives": "To improve handling, alternative forms of representation are required.",
      "Methods": "The concept of Dodes (diagnostic nodes) has been developed. Dodes contain answers and questions. Dodes are based on linked nodes and additionally contain descriptive information and recommendations. Dodes are organized hierarchically into Dode trees. Dode categories must be defined to prevent redundancy.",
      "Results": "A centralized and neutral Dode database can provide standardization which is a requirement for the comparison of recommendations. Centralized administration of Dode categories can provide information about diagnostic criteria (Dode categories) underutilized in existing recommendations (Dode trees).",
      "Conclusions": "Representing clinical recommendations in Dode trees improves their manageability handling and updateability."
    },
    "Keywords": ["dodes", "ontology", "semantic web", "guidelines", "recommendations", "linked nodes"],
    "DOI": "10.5166/jroi-2-1-6",
    "references": [
      {"ref_number": "[1]", "authors": "Mohler J Bahnson RR Boston B et al.", "title": "NCCN clinical practice guidelines in oncology: prostate cancer.", "source": "J Natl Compr Canc Netw.", "year": "2010 Feb", "volume_issue_pages": "8(2):162-200"},
      {"ref_number": "[2]", "authors": "Heidenreich A Aus G Bolla M et al.", "title": "EAU guidelines on prostate cancer.", "source": "Eur Urol.", "year": "2008 Jan", "volume_issue_pages": "53(1):68-80", "notes": "Epub 2007 Sep 19. Review."},
      {"ref_number": "[3]", "authors": "Fairchild A Barnes E Ghosh S et al.", "title": "International patterns of practice in palliative radiotherapy for painful bone metastases: evidence-based practice?", "source": "Int J Radiat Oncol Biol Phys.", "year": "2009 Dec 1", "volume_issue_pages": "75(5):1501-10", "notes": "Epub 2009 May 21."},
      {"ref_number": "[4]", "authors": "Lawrentschuk N Daljeet N Ma C et al.", "title": "Prostate-specific antigen test result interpretation when combined with risk factors for recommendation of biopsy: a survey of urologist's practice patterns.", "source": "Int Urol Nephrol.", "year": "2010 Jun 12", "notes": "Epub ahead of print"},
      {"ref_number": "[5]", "authors": "Parmelli E Papini D Moja L et al.", "title": "Updating clinical recommendations for breast colorectal and lung cancer treatments: an opportunity to improve methodology and clinical relevance.", "source": "Ann Oncol.", "year": "2010 Jul 19", "notes": "Epub ahead of print"},
      {"ref_number": "[6]", "authors": "Ahn HS Lee HJ Hahn S et al.", "title": "Evaluation of the Seventh American Joint Committee on Cancer/International Union Against Cancer Classification of gastric adenocarcinoma in comparison with the sixth classification.", "source": "Cancer.", "year": "2010 Aug 24", "notes": "Epub ahead of print"},
      {"ref_number": "[7]", "authors": "Rami-Porta R Goldstraw P.", "title": "Strength and weakness of the new TNM classification for lung cancer.", "source": "Eur Respir J.", "year": "2010 Aug", "volume_issue_pages": "36(2):237-9"},
      {"ref_number": "[8]", "authors": "Sinn HP Helmchen B Wittekind CH.", "title": "TNM classification of breast cancer: Changes and comments on the 7th edition.", "source": "Pathologe.", "year": "2010 Aug 15", "notes": "Epub ahead of print"},
      {"ref_number": "[9]", "authors": "Paleri V Mehanna H Wight RG.", "title": "TNM classification of malignant tumours 7th edition: what's new for head and neck?", "source": "Clin Otolaryngol.", "year": "2010 Aug", "volume_issue_pages": "35(4):270-2"},
      {"ref_number": "[10]", "authors": "Guarino N.", "title": "Formal Ontology and Information Systems", "source": "1998 IOS Press"},
      {"ref_number": "[11]", "authors": "Uschold M Gruniger M.", "title": "Ontologies: Principles Methods and Applications.", "source": "Knowledge Engineering Review", "year": "1996", "volume_issue_pages": "11(2)"},
      {"ref_number": "[12]", "authors": "Aho A Garey M Ullman J.", "title": "The Transitive Reduction of a Directed Graph.", "source": "SIAM Journal on Computing", "year": "1972", "volume_issue_pages": "1(2):131–137"},
      {"ref_number": "[13]", "authors": "Tai K", "title": "The tree-to-tree correction problem.", "source": "Journal of the Association for Computing Machinery (JACM)", "year": "1979", "volume_issue_pages": "26(3):422-433"}
    ]
  }
]
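Step 1 reads an output path, but the listing in Step 5 only prints the results. One way to persist them could look like the sketch below. `save_results` is a hypothetical helper, and it assumes the results are in the same order as the PDF file names, which holds when the file IDs were created by iterating over the file list in order.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, pdf_names, output_dir):
    """Write each result dict to <output_dir>/<pdf stem>.json and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, result in zip(pdf_names, results):
        target = out / (Path(name).stem + ".json")
        # ensure_ascii=False keeps characters such as "ü" readable in the file.
        target.write_text(
            json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(target)
    return written


# Demonstration with a stand-in result and a temporary output folder.
with tempfile.TemporaryDirectory() as output_path:
    paths = save_results([{"title": "Zürich paper"}], ["dodes.pdf"], output_path)
    reloaded = json.loads(paths[0].read_text(encoding="utf-8"))
    saved_name = paths[0].name
```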

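Step 6

The file objects and the assistant object need to be cleaned up, because they incur costs while in "retrieval" mode. Cleaning up is also good coding practice.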

for f in client.files.list().data:
    client.files.delete(f.id)

# Retrieve and delete the running assistants
my_assistants = client.beta.assistants.list(order="desc")
for a in my_assistants.data:
    response = client.beta.assistants.delete(a.id)
    print(response)
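The loops above delete every file and every assistant in the account, which may be more than intended if the account is shared. A narrower variant, sketched here under the assumption that the file IDs and assistant ID from Step 3 are still available, deletes only what this run created and does so even when the extraction raises. `cleanup` and `run_with_cleanup` are hypothetical helpers, and the stand-in client exists only so the sketch runs without an API key.

```python
from types import SimpleNamespace


def cleanup(client, file_ids, assistant_id):
    """Delete only the given files and assistant, using the same calls as above."""
    for file_id in file_ids:
        client.files.delete(file_id)
    client.beta.assistants.delete(assistant_id)


def run_with_cleanup(client, file_ids, assistant_id, work):
    """Run `work()` and clean up afterwards, even if `work` raises."""
    try:
        return work()
    finally:
        cleanup(client, file_ids, assistant_id)


# Stand-in client that records deletions instead of calling the API.
deleted = []
fake_client = SimpleNamespace(
    files=SimpleNamespace(delete=lambda file_id: deleted.append(("file", file_id))),
    beta=SimpleNamespace(
        assistants=SimpleNamespace(
            delete=lambda a_id: deleted.append(("assistant", a_id))
        )
    ),
)


def failing_work():
    raise RuntimeError("simulated rate limit")


try:
    run_with_cleanup(fake_client, ["file-1", "file-2"], "asst-1", failing_work)
except RuntimeError:
    pass
```

With the real client, the call would take `my_file_ids` and `assistant.id` from Step 3 and a function wrapping the Step 5 loop.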

Step 7

The next step is to generate the graph visualization using the Python Networkx^[2] package.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
node_colors = []

# Publication node
key = "jroi/" + all_results[0]['title']
G.add_nodes_from([(all_results[0]['title'], {'doi': all_results[0]['DOI'], 'title': all_results[0]['title'], 'source': 'jroi', 'key': key})])
node_colors.append('#4ba9dc')

# Author nodes, each linked to the publication
for author in all_results[0]['authors']:
    key = "jroi/" + author['name']
    G.add_nodes_from([(author['name'], {'key': key, 'local_id': author['name'], 'full_name': author['name'], 'source': 'jroi'})])
    G.add_edge(all_results[0]['title'], author['name'])
    node_colors.append('#63cc9e')

# Reference nodes, labelled with a shortened title
for reference in all_results[0]['references']:
    key = "jroi/" + reference['title']
    G.add_nodes_from([(reference['title'].split('.')[0][:25] + '...', {'title': reference['title'], 'source': 'jroi', 'key': key})])
    G.add_edge(all_results[0]['title'], reference['title'].split('.')[0][:25] + '...')
    node_colors.append('#4ba9dc')

pos = nx.spring_layout(G)
labels = nx.get_edge_attributes(G, 'label')
nx.draw(G, pos, with_labels=True, node_size=1000, node_color=node_colors, font_size=7, font_color='black')
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
plt.savefig("graph_image.png")
plt.show()
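One fragile point in the code above is that `node_colors` grows by one entry per loop iteration, while two references whose titles share the same first 25 characters collapse into a single node. The colour list can then be longer than the node list, which drawing may reject. The sketch below prepares the node, edge, and colour lists in plain Python so that they stay aligned. `short_label` and `build_graph_data` are hypothetical helpers, and the sample titles are made up to force a collision.

```python
def short_label(title, limit=25):
    """Shorten a title as the code above does: first sentence, cut at `limit`."""
    return title.split(".")[0][:limit] + "..."


def build_graph_data(result):
    """Return (nodes, edges, colors) with exactly one colour per distinct node."""
    nodes, edges, colors = {}, [], {}

    def add_node(label, attrs, color):
        # A repeated label adds no new node, so keep the first attrs and colour.
        if label not in nodes:
            nodes[label] = attrs
            colors[label] = color

    def add_edge(source, target):
        if (source, target) not in edges:
            edges.append((source, target))

    paper = result["title"]
    add_node(paper, {"title": paper, "source": "jroi"}, "#4ba9dc")
    for author in result.get("authors", []):
        add_node(author["name"], {"full_name": author["name"], "source": "jroi"}, "#63cc9e")
        add_edge(paper, author["name"])
    for ref in result.get("references", []):
        label = short_label(ref["title"])
        add_node(label, {"title": ref["title"], "source": "jroi"}, "#4ba9dc")
        add_edge(paper, label)
    return list(nodes.items()), edges, [colors[n] for n in nodes]


sample = {
    "title": "Dodes",
    "authors": [{"name": "PM Putora"}],
    "references": [
        {"title": "TNM classification of malignant tumours 7th edition: what's new?"},
        {"title": "TNM classification of malignant tumours 8th edition."},
    ],
}
nodes, edges, colors = build_graph_data(sample)
```

The returned lists can be handed to `G.add_nodes_from(nodes)` and `G.add_edges_from(edges)`, with `colors` passed as `node_color`.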

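The generated graph visualization is shown below:

Figure: Graph visualization generated with Networkx

Note: the structure of the output generated by OpenAI may vary between runs, so you may need to update the code above to match it.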
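One way to absorb some of that variation is to normalize the top-level keys before building the graph. The sample output mixes `title`, `Abstract`, `Keywords`, and `DOI`, for instance. The sketch below is illustrative only: `ALIASES` lists spellings the model might plausibly produce, not spellings it has been observed to produce, and `normalize_keys` is a hypothetical helper.

```python
# Hypothetical aliases: possible spellings mapped to one canonical key.
ALIASES = {
    "author": "authors",
    "keyword": "keywords",
    "doi_number": "doi",
    "reference": "references",
}


def normalize_keys(result):
    """Return a copy of `result` with top-level keys lower-cased and aliased."""
    normalized = {}
    for key, value in result.items():
        canonical = key.strip().lower().replace(" ", "_")
        normalized[ALIASES.get(canonical, canonical)] = value
    return normalized


raw = {"Title": "Dodes", "DOI": "10.5166/jroi-2-1-6", "Keywords": ["dodes"], "Author": []}
clean = normalize_keys(raw)
```

After normalization, the Step 7 code would look up `'doi'` rather than `'DOI'`.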

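Conclusion

In conclusion, using the GPT API to extract a research graph from PDF publications gives researchers and data analysts a practical way to turn PDF publications into structured, accessible research graphs with little manual work. However, the responses generated by large language models (LLMs) are not consistent from run to run, and any workflow built on them has to account for that. Over time, regularly updating and improving the extraction models can further improve accuracy and relevance.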