LlamaParse converts PDFs to Markdown by default and extracts document content accurately. However, the LlamaCloud web UI offers no setting for the parsing language, so by default it only recognizes English documents; to parse and recognize Chinese documents, the language must be specified in Python code.
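A minimal sketch of setting the language in code, assuming the llama_parse package and a hypothetical local file path ("ch_sim" is LlamaParse's code for Simplified Chinese):

from llama_parse import LlamaParse

# Request Markdown output and Simplified Chinese recognition;
# the LlamaCloud UI exposes no equivalent language setting
parser = LlamaParse(
    result_type="markdown",
    language="ch_sim",
)

# Hypothetical file path for illustration
documents = parser.load_data("./data/report.pdf")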
2. PDF Document Processing
We need API keys for LlamaParse and a model provider to run this project.

We will demonstrate LlamaParse with Python code. Before starting, you will need a LlamaParse API key; it is free. The link for setting up the key is shown in the figure below, so click it now and create your API key. I originally used OpenAI for the LLM and embeddings, which would also require an OpenAI API key; the code below switches to Mistral AI instead, so a Mistral API key is needed.
[Figure: llamaParse_API_key — the LlamaCloud API key settings page]
# llama-parse is async-first; running its async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()
import os

# API access to LlamaCloud (key redacted; substitute your own)
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-<YOUR_LLAMA_CLOUD_API_KEY>"

# Using the Mistral AI API for embeddings/LLMs; the OpenAI key is kept
# as a commented-out alternative (both keys redacted)
# os.environ["OPENAI_API_KEY"] = "sk-<YOUR_OPENAI_API_KEY>"
os.environ["MISTRAL_API_KEY"] = "<YOUR_MISTRAL_API_KEY>"
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.core import VectorStoreIndex, Settings
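The original cells never show how these imports are wired up, nor where `base_nodes` and `objects` (iterated below) come from. A minimal sketch under those assumptions: register the Mistral models globally, then split the parsed Markdown into plain-text nodes and table objects with MarkdownElementNodeParser (the model names and num_workers value are assumptions):

from llama_index.core.node_parser import MarkdownElementNodeParser

# Register Mistral AI as the global LLM and embedding model (model names assumed)
Settings.llm = MistralAI(model="mistral-large-latest")
Settings.embed_model = MistralAIEmbedding(model_name="mistral-embed")

# Split the parsed Markdown into plain-text nodes (base_nodes)
# and table/index objects (objects)
node_parser = MarkdownElementNodeParser(llm=Settings.llm, num_workers=4)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

print(f"Number of objects: {len(objects)}")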
import json

# Suffix that marks table-reference node ids
# (assumed value; not defined in the original cells)
TABLE_REF_SUFFIX = "_table_ref"

for node in objects:
    print(f"id:{node.node_id}")
    print(f"hash:{node.hash}")
    print(f"parent:{node.parent_node}")
    print(f"prev:{node.prev_node}")
    print(f"next:{node.next_node}")

    # Object is a Table reference if its id ends with the table-ref suffix
    if node.node_id[-1 * len(TABLE_REF_SUFFIX):] == TABLE_REF_SUFFIX:
        if node.next_node is not None:
            # the actual table object is the next node
            next_node = node.next_node
            node_json = json.loads(next_node.json())

            if 'table_summary' in node_json:
                print(f"summary:{node_json['table_summary']}")

    print("=====================================")
Number of objects: 3
id:1ace071a-4177-4395-ae42-d520095421ff
hash:e593799944034ed3ff7d2361c0f597cd67f0ee6c43234151c7d80f84407d4c5f
parent:None
prev:node_id='9694e60e-6988-493b-89c3-2533ff1adcd2' node_type=<ObjectType.TEXT: '1'> metadata={} hash='898f4b8eb4de2dbfafa9062f479df61668c6a7604f2bea5b9937b70e234ba746'
next:node_id='9c71f897-f510-4b8c-a876-d0d8ab55e004' node_type=<ObjectType.TEXT: '1'> metadata={'table_df': "{'一、数据要素再认识': {0: '(一)国家战略全方位布局数据要素发展', 1: '(二)人工智能发展对数据供给提出更高要求', 2: '(三)数据要素概念聚焦于数据价值释放', 3: '二、资源:分类推进数据要素探索已成为共识', 4: '(一)不同类别数据资源面临不同关键问题', 5: '(二)授权运营促进公共数据供给提质增效', 6: '(三)会计入表推动企业数据价值“显性化”', 7: '(四)权益保护仍是个人数据开发利用主线', 8: '三、主体:企业政府双向发力推进可持续探索', 9: '(一)企业侧:数据管理与应用能力是前提', 10: '(二)政府侧:建立公平高效的机制是关键', 11: '四、市场:场内外结合推动数据资源最优配置', 12: '(一)数据流通存在多层次多样化形态', 13: '(二)场外交易活跃,场内交易多点突破', 14: '(三)多措并举破除数据流通障碍', 15: '五、技术:基于业务需求加速创新与体系重构', 16: '(一)数据技术随业务要求不断演进', 17: '(二)数据要素时代新技术不断涌现', 18: '(三)数据要素技术体系重构加速', 19: '六、趋势与展望', 20: '参考文献'}, '1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 7, 5: 11, 6: 15, 7: 18, 8: 21, 9: 21, 10: 26, 11: 29, 12: 30, 13: 33, 14: 35, 15: 37, 16: 37, 17: 38, 18: 42, 19: 42, 20: 46}}", 'table_summary': 'Title: Data Element Development and Utilization in National Strategic Perspective\n\nSummary: This table discusses various aspects of data element development and utilization, including strategic layout, resource classification, subject involvement, market dynamics, technological advancements, and trends. It also highlights key issues in different categories of data resources, the role of authorization and operation in improving public data supply, the importance of data management and application capabilities for enterprises, the need for a fair and efficient mechanism on the government side, and the significance of equity protection in personal data development and utilization. The table concludes with a section on market and technological trends.\n\nTable ID: Not provided in the context.\n\nThe table should be kept as it provides valuable insights into the development and utilization of data elements from a national strategic perspective.,\nwith the following columns:\n'} hash='48055467b0febd41fcce52d02d72730f8ea97c7a7905749afb557b0dcecef7c2'
type:3 class:IndexNode
content:Title: Data Element Development and Utilization in National Strategic Perspective
Summary: This table discusses various aspects of data element development and utilization, including strategic layout
metadata:{'col_schema': ''} extra:{'col_schema': ''}
start_idx:816 end_idx:1319
=====================================
def initialiseNeo4jSchema():
    cypher_schema = [
        "CREATE CONSTRAINT sectionKey IF NOT EXISTS FOR (c:Section) REQUIRE (c.key) IS UNIQUE;",
        "CREATE CONSTRAINT chunkKey IF NOT EXISTS FOR (c:Chunk) REQUIRE (c.key) IS UNIQUE;",
        "CREATE CONSTRAINT documentKey IF NOT EXISTS FOR (c:Document) REQUIRE (c.url_hash) IS UNIQUE;",
        # vector.dimensions must match the embedding model: 1536 fits OpenAI's
        # text-embedding-ada-002, while mistral-embed (used below) produces
        # 1024-dimensional vectors, so adjust accordingly
        "CREATE VECTOR INDEX `chunkVectorIndex` IF NOT EXISTS FOR (e:Embedding) ON (e.value) OPTIONS { indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};",
    ]
    with driver.session() as session:
        for cypher in cypher_schema:
            session.run(cypher)
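The `driver` used throughout is never created in the original cells; a minimal sketch, assuming a local Neo4j 5.x instance and hypothetical credentials:

from neo4j import GraphDatabase

# Hypothetical connection details; replace with your own instance and credentials
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "<password>"))

initialiseNeo4jSchema()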
# ================================================
# 1) Save documents

print("Start saving documents to Neo4j...")

i = 0
with driver.session() as session:
    for doc in documents:
        # MERGE the Document node by its url_hash; the original query was
        # missing the `d` variable and the :Document label
        cypher = "MERGE (d:Document {url_hash: $doc_id}) ON CREATE SET d.url=$url;"
        session.run(cypher, doc_id=doc.doc_id, url=doc.doc_id)
        i = i + 1

print(f"{i} documents saved.")
# ================================================
# 2) Save nodes

print("Start saving nodes to Neo4j...")

i = 0
with driver.session() as session:
    for node in base_nodes:
        # >>1 Create the Section node and attach it to its Document
        cypher = "MERGE (c:Section {key: $node_id})\n"
        cypher += " FOREACH (ignoreMe IN CASE WHEN c.type IS NULL THEN [1] ELSE [] END |\n"
        cypher += "   SET c.hash = $hash, c.text=$content, c.type=$type, c.class=$class_name, c.start_idx=$start_idx, c.end_idx=$end_idx )\n"
        cypher += " WITH c\n"
        cypher += " MATCH (d:Document {url_hash: $doc_id})\n"
        cypher += " MERGE (d)<-[:HAS_DOCUMENT]-(c);"
        # parameter values inferred from the node fields printed earlier
        session.run(cypher,
                    node_id=node.node_id, hash=node.hash,
                    content=node.get_content(), type=node.get_type(),
                    class_name=type(node).__name__,
                    start_idx=node.start_char_idx, end_idx=node.end_char_idx,
                    doc_id=node.ref_doc_id)

        # >>2 Link to the next node
        if node.next_node is not None:  # and node.next_node.node_id[-1*len(TABLE_REF_SUFFIX):] != TABLE_REF_SUFFIX:
            cypher = "MATCH (c:Section {key: $node_id})\n"   # current node should exist
            cypher += "MERGE (p:Section {key: $next_id})\n"  # next node may not exist yet
            cypher += "MERGE (p)<-[:NEXT]-(c);"
            session.run(cypher, node_id=node.node_id, next_id=node.next_node.node_id)

        # >>3 Because tables are in the objects list, the link is created
        # from the opposite direction
        if node.prev_node is not None:
            cypher = "MATCH (c:Section {key: $node_id})\n"   # current node should exist
            cypher += "MERGE (p:Section {key: $prev_id})\n"  # previous node may not exist yet
            cypher += "MERGE (p)-[:NEXT]->(c);"
            session.run(cypher, node_id=node.node_id, prev_id=node.prev_node.node_id)

        i = i + 1

print(f"{i} nodes saved.")
# ================================================
# 3) Save objects

print("Start saving objects to Neo4j...")

i = 0
with driver.session() as session:
    for node in objects:
        node_json = json.loads(node.json())

        # If the object is a table, the *_table_ref object is created as a
        # Section and the actual table object becomes a Chunk under it
        if node.node_id[-1 * len(TABLE_REF_SUFFIX):] == TABLE_REF_SUFFIX:
            if node.next_node is not None:
                # this is where the actual table object is loaded
                next_node = node.next_node
                obj_metadata = json.loads(str(next_node.json()))

                cypher = "MERGE (s:Section {key: $node_id})\n"
                cypher += "WITH s MERGE (c:Chunk {key: $table_id})\n"
                cypher += " FOREACH (ignoreMe IN CASE WHEN c.type IS NULL THEN [1] ELSE [] END |\n"
                cypher += "   SET c.hash = $hash, c.definition=$content, c.text=$table_summary, c.type=$type, c.start_idx=$start_idx, c.end_idx=$end_idx )\n"
                cypher += " WITH s, c\n"
                cypher += " MERGE (s) <-[:UNDER_SECTION]- (c)\n"
                cypher += " WITH s MATCH (d:Document {url_hash: $doc_id})\n"
                cypher += " MERGE (d)<-[:HAS_DOCUMENT]-(s);"
                # parameter values inferred from the object metadata shown earlier
                session.run(cypher,
                            node_id=node.node_id, table_id=next_node.node_id,
                            hash=next_node.hash, content=next_node.get_content(),
                            table_summary=obj_metadata['metadata'].get('table_summary', ''),
                            type=next_node.get_type(),
                            start_idx=node_json.get('start_char_idx'),
                            end_idx=node_json.get('end_char_idx'),
                            doc_id=node.ref_doc_id)

        # Because tables are in the objects list, the NEXT link is created
        # from the previous node's side
        if node.prev_node is not None:
            cypher = "MATCH (c:Section {key: $node_id})\n"   # current node should exist
            cypher += "MERGE (p:Section {key: $prev_id})\n"  # previous node may not exist yet
            cypher += "MERGE (p)-[:NEXT]->(c);"
            session.run(cypher, node_id=node.node_id, prev_id=node.prev_node.node_id)

        i = i + 1

print(f"{i} objects saved.")
# ================================================
# 4) Create Chunks for each Section object of type TEXT
# If the content of a TEXT Section changes, its Section node needs to be recreated

print("Start creating chunks for each TEXT Section...")

with driver.session() as session:
    # Split each TEXT Section into paragraph chunks; note the escaped '\\n'
    # so that Cypher, not Python, receives the newline separator
    cypher = "MATCH (s:Section) WHERE s.type='TEXT' \n"
    cypher += "WITH s CALL {\n"
    cypher += "WITH s WITH s, split(s.text, '\\n') AS para\n"
    cypher += "WITH s, para, range(0, size(para)-1) AS iterator\n"
    cypher += "UNWIND iterator AS i WITH s, trim(para[i]) AS chunk, i WHERE size(chunk) > 0\n"
    cypher += "CREATE (c:Chunk {key: s.key + '_' + i}) SET c.type='TEXT', c.text = chunk, c.seq = i \n"
    cypher += "CREATE (s) <-[:UNDER_SECTION]-(c) } IN TRANSACTIONS OF 500 ROWS;"
    session.run(cypher)
# ================================================
# 5) Generate and store embeddings

from mistralai.client import MistralClient

# Client and constants below are not defined in the original cells; values assumed
EMBEDDING_MODEL = "mistral-embed"
mistralai_client = MistralClient(api_key=os.environ["MISTRAL_API_KEY"])
label = "Chunk"          # node label to embed
text_property = "text"   # property holding the text (named `property` originally,
                         # renamed to avoid shadowing the Python built-in)

with driver.session() as session:
    # get the chunks to embed
    result = session.run(f"MATCH (ch:{label}) RETURN id(ch) AS id, ch.{text_property} AS text")

    # call the Mistral AI embedding API to generate an embedding for each
    # node's text property, then attach it as an Embedding node
    count = 0
    for record in result:
        id = record["id"]
        text = record["text"]

        # For better performance, texts can be batched
        embedding_batch_response = mistralai_client.embeddings(
            model=EMBEDDING_MODEL,
            input=text,
        )

        # the key property differentiates embeddings of different node properties
        cypher = "CREATE (e:Embedding) SET e.key=$key, e.value=$embedding, e.model=$model"
        cypher += " WITH e MATCH (n) WHERE id(n) = $id CREATE (n)-[:HAS_EMBEDDING]->(e)"
        session.run(cypher,
                    key=text_property,
                    embedding=embedding_batch_response.data[0].embedding,
                    id=id,
                    model=EMBEDDING_MODEL)
        count = count + 1

print(f"{count} embeddings created.")
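With the embeddings stored, the chunkVectorIndex created earlier supports similarity search. A minimal retrieval sketch, assuming Neo4j 5.11+ (which provides the db.index.vector.queryNodes procedure) and a hypothetical question string; note the index dimensionality must match mistral-embed's 1024 dimensions for this to work:

question = "数据要素的流通有哪些形态?"  # hypothetical query

# Embed the question with the same model used for the chunks
query_embedding = mistralai_client.embeddings(
    model=EMBEDDING_MODEL,
    input=question,
).data[0].embedding

# Fetch the 5 most similar chunk embeddings, then walk back to the owning chunks
cypher = "CALL db.index.vector.queryNodes('chunkVectorIndex', 5, $embedding)\n"
cypher += "YIELD node AS e, score\n"
cypher += "MATCH (c:Chunk)-[:HAS_EMBEDDING]->(e)\n"
cypher += "RETURN c.text AS text, score ORDER BY score DESC;"

with driver.session() as session:
    for record in session.run(cypher, embedding=query_embedding):
        print(record["score"], record["text"])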