Extracting PDF Document Metadata with Large Language Models

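Motivation

For years, regular expressions have been my go-to tool for parsing documents, and I believe the same is true for many technical people and industries. Although regular expressions are very powerful in some situations, they often lack the flexibility to cope with the complexity and variety of real-world documents.

Large language models, on the other hand, offer a more powerful and flexible way to handle many kinds of document structures and content types.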

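Document Processing Workflow with Large Language Models

Below is a commonly used document parsing workflow. To keep things simple, we use research paper processing as the example scenario.

The workflow has three main components: input, processing, and output.
First, the documents, here research papers in PDF format, are submitted for processing.
The first module of the processing component extracts the raw data from each PDF and combines it with a prompt containing the instructions for the large language model.
The large language model then uses the prompt to extract all the metadata.
For each PDF, the final result is saved in JSON format and can be used for further analysis.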

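Advantages of Large Language Models over Regular Expressions

Regular expressions (regex) have significant limitations when dealing with the structural complexity of research papers. The two approaches are compared in more depth below:

1. Flexibility in document structure

Regex needs a specific pattern for each document structure and fails when a given document deviates from the expected format.
LLMs can understand and adapt to a wide variety of document structures, and they can identify the relevant information wherever it appears in the document.

2. Contextual understanding

Regex matches patterns without any understanding of context or meaning.
LLMs have a more nuanced understanding of what each document means, which lets them extract the relevant information more accurately.

3. Maintenance and scalability

Regex needs constant updates as document formats change. Supporting a new type of information means writing an entirely new regular expression.
LLMs can be adapted to new document types with only minimal changes to the initial prompt, which makes them more scalable.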
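To make the first point concrete, here is a minimal sketch of how a pattern written for one layout silently misses the same fact in another layout. The pattern and both sample strings are hypothetical and chosen only for illustration; they do not come from the papers used later.

```python
import re
from typing import Optional

# Hypothetical pattern written for a single header layout: "Published: 2017"
YEAR_PATTERN = re.compile(r"Published:\s*(\d{4})")


def extract_year(text: str) -> Optional[str]:
    """Return the four-digit year if the text follows the expected layout."""
    match = YEAR_PATTERN.search(text)
    return match.group(1) if match else None


# The same fact written in two different layouts (both strings are made up).
samples = [
    "Published: 2017",
    "Year of publication - 2017",
]

for sample in samples:
    # The second layout yields None until someone writes a new pattern for it.
    print(repr(sample), "->", extract_year(sample))
```

Covering the second layout means adding or widening a pattern, and the same work repeats for every new layout that shows up.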

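Building the Document Parsing Workflow

The reasons above make a strong case for using LLMs to parse complex documents such as research papers.

The experiment uses the following documents:

The paper "Attention Is All You Need" from the arXiv website
The BERT paper from the arXiv website

This section walks through all the steps needed to build a real-world document parsing system with large language models, which you can run yourself.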

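Code Structure

project
   |
   |---Extract_Metadata_With_Large_Language_Models.ipynb
   |
  data
   |
   |----extracted_metadata/
   |----1706.03762v7.pdf
   |----1810.04805.pdf
   |----prompts
           |
           |------scientific_papers_prompt.txt

The project folder is the root folder and contains the data folder and the notebook.
The data folder contains two folders, extracted_metadata and prompts, as well as the two papers.
extracted_metadata is currently empty and will hold the JSON files.
The prompts folder holds the prompt in text format.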

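Metadata to Extract

We first need a clear goal for the properties we want to extract. For simplicity, let's focus on six properties for our scenario.

Paper Title
Publication Year
Authors
Author Contact
Abstract
Summary Abstract

These properties are then used to define the prompt. Successful parsing depends on a prompt that clearly explains what each property means and in which format the final result should be returned.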

Scientific research paper:
---
{document}
---

You are an expert in analyzing scientific research papers. Please carefully read the provided research paper above and extract the following key information:

Extract these six (6) properties from the research paper:
- Paper Title: The full title of the research paper
- Publication Year: The year the paper was published
- Authors: The full names of all authors of the paper
- Author Contact: A list of dictionaries, where each dictionary contains the following keys for each author:
  - Name: The full name of the author
  - Institution: The institutional affiliation of the author
  - Email: The email address of the author (if provided)
- Abstract: The full text of the paper's abstract
- Summary Abstract: A concise summary of the abstract in 2-3 sentences, highlighting the key points

Guidelines:
- The extracted information should be factual and accurate to the document.
- Be extremely concise, except for the Abstract which should be copied in full.
- The extracted entities should be self-contained and easily understood without the rest of the paper.
- If any property is missing from the paper, please leave the field empty rather than guessing.
- For the Summary Abstract, focus on the main objectives, methods, and key findings of the research.
- For Author Contact, create an entry for each author, even if some information is missing. If an email or institution is not provided for an author, leave that field empty in the dictionary.

Answer in JSON format. The JSON should contain 6 keys: "PaperTitle", "PublicationYear", "Authors", "AuthorContact", "Abstract", and "SummaryAbstract". The "AuthorContact" should be a list of dictionaries as described above.

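The prompt has six parts, each of which is explained below.

1. Document placeholder

Scientific research paper:
---
{document}
---

Defined with the {} notation, it marks where the full text of the document will be placed for analysis.

2. Role assignment

The model is assigned a role so that it performs the task better. This is defined in the following line, which sets the context and instructs the AI to act as an expert in analyzing scientific research papers.

You are an expert in analyzing scientific research papers.

3. Extraction instruction

This part specifies which pieces of information should be extracted from the document.

Extract these six (6) properties from the research paper:

4. Property definitions

Each of the properties above is defined here, with specific details about the information to include and how to format it. For example, Author Contact is a list of dictionaries holding additional details.

5. Guidelines

The guidelines tell the AI which rules to follow during extraction, such as staying accurate and how to handle missing information.

6. Expected output format

This is the final part. It specifies the exact format to use for the answer, namely JSON.

Answer in JSON format. The JSON should contain 6 keys: ...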
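As an illustration of the document placeholder, the sketch below shows one way the {document} marker could be filled with a paper's text. This is an assumption about how such a template might be used, not the article's method: the workflow code later in the article sends the prompt and the paper text as two separate messages. The template here is a shortened stand-in, and fill_prompt is a hypothetical helper.

```python
# A shortened stand-in for the prompt file; the real prompt is much longer.
PROMPT_TEMPLATE = (
    "Scientific research paper:\n"
    "---\n"
    "{document}\n"
    "---\n"
    "You are an expert in analyzing scientific research papers."
)


def fill_prompt(template: str, document_text: str) -> str:
    """Insert the paper text at the {document} placeholder.

    str.replace is used instead of str.format so that any other curly
    braces in the template or in the paper text are left untouched.
    """
    return template.replace("{document}", document_text)


filled = fill_prompt(PROMPT_TEMPLATE, "Attention Is All You Need ...")
print(filled)
```

Using replace rather than format is a deliberate choice here: paper text often contains braces (for example in formulas), which would make format raise an error.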

Installing the Required Libraries

Let's start by installing the required libraries. Our document parsing system is built from several libraries; the main ones for each component are listed below:

PDF processing: pdfminer.six, PyPDF2, and poppler-utils handle a variety of PDF formats and structures.
Text extraction: unstructured and its dependencies (unstructured-inference, unstructured-pytesseract) extract content from documents intelligently.
OCR: tesseract-ocr recognizes text in images or scanned documents.
Image processing: pillow-heif is used for image processing tasks.
AI integration: the openai library gives access to GPT models during information extraction.

%%bash
pip -qqq install pdfminer.six
pip -qqq install pillow-heif==0.3.2
pip -qqq install matplotlib
pip -qqq install unstructured-inference
pip -qqq install unstructured-pytesseract
pip -qqq install tesseract-ocr
pip -qqq install unstructured
pip -qqq install openai
pip -qqq install PyPDF2

apt install -V tesseract-ocr
apt install -V libtesseract-dev

sudo apt-get update
apt-get install -V poppler-utils

Once installation succeeds, import the libraries as follows:

import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from google.colab import userdata
from unstructured.partition.pdf import partition_pdf
from tenacity import retry, wait_random_exponential, stop_after_attempt

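Setting Up Credentials

Before diving into the core functionality, we need to set up the environment with the necessary API credentials.

OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
model_ID = userdata.get('GPT_MODEL')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

client = OpenAI(api_key = OPENAI_API_KEY)

Here we use the userdata.get() function to access the credentials securely in Google Colab.
We also retrieve the ID of the specific GPT model to use, which is gpt-4o in our use case.

Setting up credentials this way, through stored secrets and environment variables rather than hard-coded values, keeps access to the model credentials secure while leaving the choice of model flexible. It is also a better way to manage API keys and models, especially when working across different environments or multiple projects.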
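Outside Google Colab, userdata is not available. A minimal sketch of the same idea using plain environment variables is shown below. The helper name load_credentials, the variable name GPT_MODEL as an environment variable, and the gpt-4o fallback are assumptions made for illustration.

```python
import os
from typing import Tuple


def load_credentials(
    key_var: str = "OPENAI_API_KEY",
    model_var: str = "GPT_MODEL",
    default_model: str = "gpt-4o",
) -> Tuple[str, str]:
    """Read the API key and model ID from environment variables.

    Raises a RuntimeError with a clear message if the key is not set,
    instead of failing later inside an API call.
    """
    api_key = os.environ.get(key_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {key_var} is not set")
    # Fall back to a default model when no model variable is defined.
    model_id = os.environ.get(model_var, default_model)
    return api_key, model_id


# Typical use (assumes the variable was exported in the shell beforehand):
# api_key, model_ID = load_credentials()
```

Failing early with an explicit message tends to be easier to debug than an authentication error raised in the middle of a batch run.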

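Implementing the Workflow

We now have everything needed to build the end-to-end workflow. It is time to implement each workflow component, starting with the data processing helper functions.

1. Data processing

The first step of the workflow is to preprocess the PDF file and extract its text content. This is done by the extract_text_from_pdf function.

It takes a PDF file as input and returns its content as raw text.

def extract_text_from_pdf(pdf_path: str):
    """Extract text content from a PDF file using the unstructured library."""
    elements = partition_pdf(pdf_path, strategy="hi_res")
    return "\n".join([str(element) for element in elements])

2. Reading the prompt

The prompt is stored in a separate .txt file and loaded with the following function.

def read_prompt(prompt_path: str):
    """Read the prompt for research paper parsing from a text file."""
    with open(prompt_path, "r") as f:
        return f.read()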

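3. Metadata extraction

This function is the core of the workflow. It uses the OpenAI API to process the content of a given PDF file.

Without the @retry decorator, we may run into the error "Error Code 429 - Rate limit reached for requests". This mainly happens when we hit the rate limit during processing. Rather than failing on the first error, we want the function to keep retrying until it succeeds, here for up to 10 attempts as set by stop_after_attempt(10).

@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

By using completion_with_backoff inside the extract_metadata function:

It waits between 1 and 120 seconds before rerunning a failed API call.
The wait time tends to grow with each retry but always stays within the 1 to 120 second range.
This process is called exponential backoff and is useful for managing API rate limits as well as temporary issues.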
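The waiting behavior described above can be sketched in plain Python. This is a simplified model of capped exponential backoff with random jitter, written for illustration; it is not tenacity's actual implementation, and backoff_window and sample_wait are hypothetical names.

```python
import random


def backoff_window(attempt: int, min_wait: float = 1.0,
                   max_wait: float = 120.0, base: float = 2.0):
    """Return the (low, high) bounds of the wait before retry `attempt` (1-based).

    The upper bound doubles with every attempt and is capped at max_wait.
    """
    high = min(max_wait, base ** (attempt - 1))
    high = max(min_wait, high)
    return min_wait, high


def sample_wait(attempt: int) -> float:
    """Draw a random wait time (in seconds) from the window for this attempt."""
    low, high = backoff_window(attempt)
    return random.uniform(low, high)


for attempt in range(1, 10):
    low, high = backoff_window(attempt)
    print(f"attempt {attempt}: wait between {low:.0f}s and {high:.0f}s")
```

The randomness matters as much as the growth: if many clients hit the rate limit at once, jitter spreads their retries out instead of having them all retry at the same instant.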

def extract_metadata(content: str, prompt_path: str, model_id: str):
    """Use GPT model to extract metadata from the research paper content based on the given prompt."""
    prompt_data = read_prompt(prompt_path)

    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content},
            ],
            temperature=0.2,
        )

        response_content = response.choices[0].message.content
        # Process and return the extracted metadata
        # ...
    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}

By sending the paper content together with the prompt, the gpt-4o model extracts the structured information specified in the prompt.
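The processing step inside extract_metadata is elided in the code above. One possible way to turn the model's reply into a dictionary is sketched below. This is an assumption for illustration, not the original implementation: parse_metadata is a hypothetical helper, and the key names assume the prompt asks for "PaperTitle", "PublicationYear", and so on.

```python
import json
import re

# Key names assumed from the prompt's output-format instruction.
EXPECTED_KEYS = {
    "PaperTitle", "PublicationYear", "Authors",
    "AuthorContact", "Abstract", "SummaryAbstract",
}

# Three backticks, built this way to keep this example easy to embed.
FENCE = "`" * 3


def parse_metadata(response_content: str) -> dict:
    """Parse the model's reply into a dict; return {} if it is not a JSON object."""
    text = response_content.strip()
    # Models sometimes wrap JSON in a Markdown code fence; unwrap it if present.
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Report missing keys instead of failing, so partial results are kept.
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        print(f"Warning: missing keys {sorted(missing)}")
    return data
```

Returning an empty dictionary on failure mirrors the error branch of extract_metadata, so downstream code only has to handle one "nothing extracted" case.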

Full Code

Putting all the logic together, the process_research_paper function runs the pipeline end to end on a single PDF file, from extracting the expected metadata to saving the final result in .json format.

def process_research_paper(pdf_path: str, prompt: str, output_folder: str, model_id: str):
    """Process a single research paper through the entire pipeline."""
    # Note: `prompt` is the path to the prompt file, not the prompt text
    print(f"Processing research paper: {pdf_path}")

    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)

        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt, model_id)

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

Here is an example of applying the logic to a single document:

# Example for a single document
pdf_path = "./data/1706.03762v7.pdf"
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
output_folder = "./data/extracted_metadata"

process_research_paper(pdf_path, prompt_path, output_folder, model_ID)

After the run, the generated .json file is saved in the ./data/extracted_metadata/ folder under the name 1706.03762v7.json, which is exactly the name of the PDF with a different extension.
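The naming rule can be checked in isolation. The short sketch below repeats the path logic from process_research_paper in a hypothetical helper, output_path_for, to show that Path.stem drops only the final .pdf suffix, so the dots inside the arXiv identifier are preserved.

```python
import os
from pathlib import Path


def output_path_for(pdf_path: str, output_folder: str) -> str:
    """Build the JSON output path for a PDF, keeping its base name."""
    # Path.stem removes only the last suffix (".pdf"), not earlier dots.
    output_filename = Path(pdf_path).stem + ".json"
    return os.path.join(output_folder, output_filename)


result = output_path_for("./data/1706.03762v7.pdf", "./data/extracted_metadata")
print(Path(result).name)
```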

The content of the JSON file is given below, showing the target properties extracted from the research paper:

The JSON data shows that all properties were extracted successfully. Even better, the paper does not give an institution for Illia Polosukhin, and the AI left that field empty.

{
  "PaperTitle": "Attention Is All You Need",
  "PublicationYear": "2017",
  "Authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "AuthorContact": [
    {"Name": "Ashish Vaswani", "Institution": "Google Brain", "Email": "avaswani@google.com"},
    {"Name": "Noam Shazeer", "Institution": "Google Brain", "Email": "noam@google.com"},
    {"Name": "Niki Parmar", "Institution": "Google Research", "Email": "nikip@google.com"},
    {"Name": "Jakob Uszkoreit", "Institution": "Google Research", "Email": "usz@google.com"},
    {"Name": "Llion Jones", "Institution": "Google Research", "Email": "llion@google.com"},
    {"Name": "Aidan N. Gomez", "Institution": "University of Toronto", "Email": "aidan@cs.toronto.edu"},
    {"Name": "Lukasz Kaiser", "Institution": "Google Brain", "Email": "lukaszkaiser@google.com"},
    {"Name": "Illia Polosukhin", "Institution": "", "Email": "illia.polosukhin@gmail.com"}
  ],
  "Abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "SummaryAbstract": "The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing."
}

In addition, the value of the extra Summary Abstract property is shown below. It summarizes the original abstract well while staying within the two to three sentence limit given in the prompt.

The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing.

Now that the pipeline works for a single document, we can add the logic to run it on every document in a given folder. This is done by the process_directory function, which processes each file and saves the result to the same extracted_metadata folder.

# Parse documents from a folder
def process_directory(prompt_path: str, directory_path: str, output_folder: str, model_id: str):
    """Process all PDF files in the given directory."""

    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory_path, filename)
            process_research_paper(pdf_path, prompt_path, output_folder, model_id)

Here is how to call the function with the right arguments.

# Define paths
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
directory_path = "./data"
output_folder = "./data/extracted_metadata"

process_directory(prompt_path, directory_path, output_folder, model_ID)

When processing succeeds, a message is printed for each file, showing that every research paper has been processed.
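Once the folder has been processed, the saved files can be loaded back for a quick check. The sketch below is one possible way to do this; load_extracted_metadata and summarize are hypothetical helpers, and the "PaperTitle" key assumes the output format requested in the prompt.

```python
import json
from pathlib import Path


def load_extracted_metadata(output_folder: str) -> dict:
    """Map each JSON file's base name to its parsed content."""
    results = {}
    for json_path in sorted(Path(output_folder).glob("*.json")):
        with open(json_path, "r", encoding="utf-8") as f:
            results[json_path.stem] = json.load(f)
    return results


def summarize(results: dict) -> list:
    """Build one line per paper, flagging files where no title was extracted."""
    lines = []
    for stem, metadata in results.items():
        # An empty dict is what the pipeline saves when the API call failed.
        title = metadata.get("PaperTitle") or "<no title extracted>"
        lines.append(f"{stem}: {title}")
    return lines


# Typical use after process_directory has run:
# for line in summarize(load_extracted_metadata("./data/extracted_metadata")):
#     print(line)
```

A check like this makes failed extractions visible, since the pipeline writes an empty JSON object rather than stopping when an API call fails.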

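Conclusion

This article gave a brief overview of using LLMs to extract metadata from complex documents. The extracted JSON data can be stored in a non-relational database for further analysis. LLMs and regular expressions each have strengths and weaknesses for content extraction, and each should be applied judiciously depending on the use case.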