
Spoon-Feeding Tutorial! First on the Web: Visualizing a GraphRAG Index with Neo4j


GraphRAG adds global retrieval capability to RAG by incorporating a knowledge graph. In this post I'll walk through how to visualize GraphRAG index results in Neo4j so they can be processed and analyzed further. As before, the entities extracted from the novel 《仙逆》 serve as the example; a picture is worth a thousand words. The post has four sections: installing Neo4j, importing the GraphRAG index files, visual analysis in Neo4j, and a summary. All the pitfalls have already been cleared for you, so dig in with confidence.

Neo4j[1] is a graph database management system developed by Neo4j Inc. and a leader in the graph database space: powerful native graph storage, data science and analytics, enterprise-grade security, and the ability to scale transactional and analytical workloads without constraints. It has been downloaded more than 160 million times. The data elements Neo4j stores are nodes, the edges connecting them, and properties on both nodes and edges.
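As a tiny illustration of that data model (the Character label and its properties below are made up for this example, not something GraphRAG produces), two nodes, a typed relationship, and their properties look like this in Cypher:

// two labeled nodes with properties, joined by a typed relationship that also carries a property
CREATE (a:Character {name: '王林'})-[:KNOWS {since: 'chapter 1'}]->(b:Character {name: '铁柱'})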

1. Installing Neo4j

Neo4j can be used as a managed cloud service or as the locally hosted open-source Community edition. Start a Neo4j instance with the following Docker command:

docker run \
    -p 7474:7474 -p 7687:7687 \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4J_PLUGINS=\[\"apoc\"\] \
    neo4j:5.21.2

Open http://localhost:7474/ in a browser and log in with the default username neo4j and default password neo4j; you will be prompted to set a new password after the first login.
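If you would rather skip the reset prompt entirely, the official neo4j Docker image also accepts the NEO4J_AUTH environment variable to set the initial credentials; add one more flag to the docker run command above (the password here is a placeholder):

    -e NEO4J_AUTH=neo4j/your-strong-password \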

Next, install the Python dependencies for Neo4j:

pip install --quiet pandas neo4j-rust-ext

2. Importing the GraphRAG Index Results

To better support Chinese extraction, this run uses the deepseek-chat model from deepseeker[2] (why not qwen2? Because my free quota ran out). Signing up gives you 5 million free tokens, and the index built successfully in one pass. The model supports a 128K context window with a maximum output of 4096 tokens, so when configuring the LLM be sure to set max_tokens to 4096. TPM and RPM limits are not explicitly documented; the platform adjusts them automatically according to load.
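For reference, the LLM section of GraphRAG's settings.yaml might look roughly like the sketch below; the api_base, model name, and key names follow DeepSeek's OpenAI-compatible API and the GraphRAG default template, so treat them as assumptions and adjust for your GraphRAG version:

llm:
  api_key: ${GRAPHRAG_API_KEY}   # your DeepSeek API key, read from an environment variable
  type: openai_chat              # DeepSeek exposes an OpenAI-compatible chat endpoint
  model: deepseek-chat
  api_base: https://api.deepseek.com/v1
  max_tokens: 4096               # DeepSeek's maximum output tokens
  # tokens_per_minute / requests_per_minute are not documented; keep the defaults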

import pandas as pd
from neo4j import GraphDatabase
import time

NEO4J_URI = "neo4j://localhost"  # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "password"      # your own password
NEO4J_DATABASE = "neo4j"

# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

GRAPHRAG_FOLDER = "./output/20240716-192226/artifacts"

In Neo4j, indexes are only used to find the starting points of a graph query, for example to quickly locate the two nodes to be connected. Constraints are used to avoid duplicates and are created mainly on the id of each node type. We use labels wrapped in double underscores as markers to distinguish them from the actual entity types.

# constraint names must be unique per database, so each node type gets its own name
statements = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint document_id if not exists for (d:__Document__) require d.id is unique;
create constraint community_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
create constraint entity_title if not exists for (e:__Entity__) require e.name is unique;
create constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique;
create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique;
""".split(";")

for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.
    Parameters: statement is the Cypher query to execute, df is the dataframe to import,
    and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start:min(start + batch_size, total)]
        result = driver.execute_query("UNWIND $rows AS value " + statement,
                                      rows=batch.to_dict('records'),
                                      database_=NEO4J_DATABASE)
        print(result.summary.counters)
    print(f'{total} rows in {time.time() - start_s} s.')
    return total
doc_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_documents.parquet', columns=["id", "title"])
doc_df.head(2)

# import documents
statement = """
MERGE (d:__Document__ {id: value.id})
SET d += value {.title}
"""

batched_import(statement, doc_df)
text_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_text_units.parquet',
                          columns=["id", "text", "n_tokens", "document_ids"])
text_df.head(2)

statement = """
MERGE (c:__Chunk__ {id: value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id: document})
MERGE (c)-[:PART_OF]->(d)
"""

batched_import(statement, text_df)
entity_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_entities.parquet',
                            columns=["name", "type", "description", "human_readable_id", "id",
                                     "description_embedding", "text_unit_ids"])
entity_df.head(2)

entity_statement = """
MERGE (e:__Entity__ {id: value.id})
SET e += value {.human_readable_id, .description, name: replace(value.name, '"', '')}
WITH e, value
CALL db.create.setNodeVectorProperty(e, "description_embedding", value.description_embedding)
CALL apoc.create.addLabels(e, case when coalesce(value.type, "") = "" then [] else [apoc.text.upperCamelCase(replace(value.type, '"', ''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id: text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""

batched_import(entity_statement, entity_df)
rel_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_relationships.parquet',
                         columns=["source", "target", "id", "rank", "weight", "human_readable_id",
                                  "description", "text_unit_ids"])
rel_df.head(2)

rel_statement = """
MATCH (source:__Entity__ {name: replace(value.source, '"', '')})
MATCH (target:__Entity__ {name: replace(value.target, '"', '')})
// not necessary to merge on id as there is only one relationship per pair
MERGE (source)-[rel:RELATED {id: value.id}]->(target)
SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}
RETURN count(*) as createdRels
"""

batched_import(rel_statement, rel_df)
community_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_communities.parquet',
                               columns=["id", "level", "title", "text_unit_ids", "relationship_ids"])

community_df.head(2)

statement = """
MERGE (c:__Community__ {community: value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id: text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id: rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""

batched_import(statement, community_df)
community_report_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                                      columns=["id", "community", "level", "title", "summary", "findings",
                                               "rank", "rank_explanation", "full_content"])
community_report_df.head(2)

# import community reports
community_statement = """
MATCH (c:__Community__ {community: value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id: finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)
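With everything loaded, it is good practice to close the driver so its connection pool is released:

# all imports are done; release the driver's connections
driver.close()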

With documents, text units, entities, relationships, communities, and community reports all imported, we can now open the browser and visually analyze how these entities, relationships, and communities connect. Here we go!

3. Visual Analysis

Open your browser and go to http://localhost:7474/browser/.
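As a quick sanity check before exploring, you can count the imported nodes per label directly in the query bar; this is plain Cypher, nothing GraphRAG-specific:

// how many nodes of each label did the import create?
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes
ORDER BY nodes DESC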

Every entity can be clicked open to explore its further connections; the relationship between 王林 and 铁柱, for example, is clear at a glance.
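The same picture can be reproduced as a query; the sketch below assumes entities named 王林 and 铁柱 exist in your graph after the import, so substitute names from your own data:

// paths of up to two RELATED hops between the two characters
MATCH p = (a:__Entity__ {name: '王林'})-[:RELATED*1..2]-(b:__Entity__ {name: '铁柱'})
RETURN p
LIMIT 25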

There are many communities, and each one essentially consolidates a specific event, for example which characters and which tests are tied to the testing event.
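To see which entities each community groups together, you can query the IN_COMMUNITY relationships created during the import:

// largest communities with a sample of their member entities
MATCH (e:__Entity__)-[:IN_COMMUNITY]->(c:__Community__)
RETURN c.title AS community, count(e) AS entity_count, collect(e.name)[..10] AS sample_entities
ORDER BY entity_count DESC
LIMIT 20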

Clicking into the 洞穴 (cave) node lets you drill down into the entities, characters, and text units associated with that cave.
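The text units behind any entity can be pulled the same way; the entity name below is only an illustration, so use the name of whichever node you clicked on:

// chunks that mention a given entity, with a short text preview
MATCH (c:__Chunk__)-[:HAS_ENTITY]->(e:__Entity__ {name: '洞穴'})
RETURN e.name AS entity, c.id AS chunk_id, left(c.text, 100) AS preview
LIMIT 10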

4. Summary

By using Neo4j to visualize and analyze the GraphRAG index results, this post gives a much more intuitive view of what the whole index contains. If you need the complete script, send the message "neo4j" to receive it.

[2] deepseeker: https://platform.deepseek.com/






