A PDF Document Analysis Tool Combining RAG with LLaMA-3.1-8B!


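01.

Overview

A personal assistant tool built on retrieval-augmented generation (RAG) and the LLaMA-3.1-8B-Instant large language model (LLM). It aims to transform PDF document analysis by combining machine learning with a retrieval-based system.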

02.

Origins of the RAG Architecture

Retrieval-augmented generation (RAG) is a natural language processing (NLP) technique that combines retrieval-based methods with generative models to produce more accurate, contextually relevant output. It was first proposed by Facebook AI Research (FAIR) in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".

For a deeper look at RAG and related background, see the original FAIR paper:

https://arxiv.org/pdf/2005.11401

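03.

RAG Architecture Overview

A RAG model has three main components:

Indexer: builds an index of the corpus so that relevant documents can be retrieved efficiently.
Retriever: given an input query, retrieves relevant documents from the indexed corpus.
Generator: produces a response based on the retrieved documents.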
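To make the three roles concrete, here is a minimal, self-contained sketch. It is a toy, not the tool's implementation: the class names `Indexer`, `Retriever`, and `Generator`, the word-count "embedding", and the sample corpus are all assumptions made for illustration, and the generator only assembles the prompt that a real LLM would receive.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class Indexer:
    """Builds a term-frequency vector for each document in the corpus."""

    def build(self, corpus):
        return [(doc, Counter(tokenize(doc))) for doc in corpus]


class Retriever:
    """Ranks indexed documents by cosine similarity to the query."""

    def __init__(self, index):
        self.index = index

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(self, query, top_k=2):
        q = Counter(tokenize(query))
        ranked = sorted(self.index, key=lambda item: self._cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:top_k]]


class Generator:
    """Stand-in for an LLM: it only builds the prompt an LLM would be given."""

    def generate(self, query, documents):
        context = "\n".join(f"- {d}" for d in documents)
        return f"Context:\n{context}\nQuestion: {query}"


# Hypothetical three-document corpus.
corpus = [
    "Tesseract is an OCR engine",
    "Gradio builds web interfaces for models",
    "RAG combines retrieval with generation",
]
index = Indexer().build(corpus)
top_docs = Retriever(index).retrieve("what is an OCR engine", top_k=1)
prompt = Generator().generate("what is an OCR engine", top_docs)
print(prompt)
```

In a real system the word counts would be replaced by dense embeddings and the prompt would be sent to the LLM, but the division of labor between the three components stays the same.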

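04.

Implementation Details

A RAG model is trained in three stages:

Indexer training: the indexer is trained to create an efficient and accurate mapping between queries and documents.
Retriever training: the retriever is trained to maximize the relevance scores of relevant documents.
Generator training: the generator is trained to maximize the probability of the ground-truth response.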
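For reference, the quantities being optimized can be written compactly. The notation below follows the original RAG paper and is an assumption about notation only, not a description of how this tool was trained: x is the input query, y the target response, z a retrieved document, p_η the retriever, and p_θ the generator. The first expression is the response probability with the retrieved documents summed out; the second is the negative log-likelihood minimized over training pairs (x_j, y_j).

```latex
p(y \mid x) \;\approx\; \sum_{z \,\in\, \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\, p_\theta(y \mid x, z),
\qquad
\mathcal{L} \;=\; -\sum_{j} \log p(y_j \mid x_j)
```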

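At inference time, a RAG model follows these steps:

Index: the corpus is indexed for efficient retrieval.
Retrieve: the top-ranked documents are retrieved according to their relevance scores for the given query.
Generate: a response is generated from the input query and the retrieved documents. The final response is obtained by marginalizing over the retrieved documents, that is, by weighting each document's generation probability by its retrieval probability and summing.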
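The marginalization step is easy to check with a small worked example. The sketch below uses made-up numbers: the retrieval scores, the candidate answers, and the per-document answer probabilities are all hypothetical, and the helper names `softmax` and `marginalize` are introduced here only for illustration.

```python
import math


def softmax(scores):
    """Turn raw retrieval scores into a probability distribution p(z | x)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(doc_scores, answer_probs_per_doc):
    """Compute p(y | x) = sum over z of p(z | x) * p(y | x, z)."""
    doc_probs = softmax(doc_scores)
    marginal = {}
    for p_z, answer_probs in zip(doc_probs, answer_probs_per_doc):
        for answer, p_y in answer_probs.items():
            marginal[answer] = marginal.get(answer, 0.0) + p_z * p_y
    return marginal


# Hypothetical inputs: two retrieved documents with equal scores, so each
# gets retrieval probability 0.5.
doc_scores = [0.0, 0.0]
answer_probs_per_doc = [
    {"Paris": 0.9, "Lyon": 0.1},  # p(y | x, z1)
    {"Paris": 0.5, "Lyon": 0.5},  # p(y | x, z2)
]

marginal = marginalize(doc_scores, answer_probs_per_doc)
best = max(marginal, key=marginal.get)
print(marginal, best)
```

Here "Paris" receives 0.5 * 0.9 + 0.5 * 0.5 = 0.7 and "Lyon" receives 0.3, so the answer supported across documents wins even though the second document alone is undecided.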

05.

Installation

Install Packages

!conda install -n pa \
    pytorch \
    torchvision \
    torchaudio \
    cpuonly \
    -c pytorch \
    -c conda-forge \
    --yes
%pip install -U ipywidgets
%pip install -U requests
%pip install -U llama-index
%pip install -U llama-index-embeddings-huggingface
%pip install -U llama-index-llms-groq
%pip install -U groq
%pip install -U gradio
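After the installs finish, a quick check can confirm that the packages are importable from the active kernel. This is an optional sketch and not part of the original notebook: the helper `missing_modules` is hypothetical, and the list holds the import names (for example `torch` for the `pytorch` package), not the install names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for broken parent packages or modules without a spec.
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names corresponding to the packages installed above.
required = ["torch", "ipywidgets", "requests", "llama_index", "groq", "gradio"]
print("Missing modules:", missing_modules(required))
```

An empty list means every package resolved. If anything is reported missing, the usual cause is that the notebook kernel is running in a different environment from the one that was installed into.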

Install Tesseract

import os
import platform
import subprocess
import requests


def install_tesseract():
    """
    Installs Tesseract OCR based on the operating system.
    """
    os_name = platform.system()
    if os_name == "Linux":
        print("Detected Linux. Installing Tesseract using apt-get...")
        subprocess.run(["sudo", "apt-get", "update"], check=True)
        subprocess.run(["sudo", "apt-get", "install", "-y", "tesseract-ocr"], check=True)
    elif os_name == "Darwin":
        print("Detected macOS. Installing Tesseract using Homebrew...")
        subprocess.run(["brew", "install", "tesseract"], check=True)
    elif os_name == "Windows":
        tesseract_installer_url = "https://github.com/UB-Mannheim/tesseract/releases/download/v5.4.0.20240606/tesseract-ocr-w64-setup-5.4.0.20240606.exe"
        installer_path = "tesseract-ocr-w64-setup-5.4.0.20240606.exe"
        # Download the installer. It is only saved to disk here, so run it
        # manually before relying on the version check below.
        response = requests.get(tesseract_installer_url)
        response.raise_for_status()
        with open(installer_path, "wb") as file:
            file.write(response.content)
        # Add the default install location to PATH for this process only.
        tesseract_path = r"C:\Program Files\Tesseract-OCR"
        os.environ["PATH"] += os.pathsep + tesseract_path
        try:
            result = subprocess.run(["tesseract", "--version"], check=True, capture_output=True, text=True)
            print(result.stdout)
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            # FileNotFoundError is raised when tesseract is not on PATH yet.
            print(f"Error running Tesseract: {e}")
    else:
        print(f"Unsupported OS: {os_name}")


install_tesseract()

Convert PDF to OCR

import webbrowser

# Open the iLovePDF OCR page in the default browser.
url = "https://www.ilovepdf.com/ocr-pdf"
webbrowser.open_new(url)

Import Libraries

import os
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.groq import Groq
import gradio as gr