链载Ai

标题: 从CSV到Neo4j:如何用LLM实现自动化数据建模? [打印本页]

作者: 链载Ai    时间: 1 小时前
标题: 从CSV到Neo4j:如何用LLM实现自动化数据建模?

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.1em;visibility: visible;">

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;color: rgb(63, 63, 63);">

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">当大语言模型 (LLM) 尝试从平面 CSV 文件创建图时表现如何?

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">我的工作中很大一部分是改善用户使用 Neo4j 的体验。通常,如何将数据导入 Neo4j 并有效建模是用户面临的主要挑战,尤其是在初期阶段。尽管初始的数据模型很重要并且需要深思熟虑,但随着数据规模或用户数量的增长,它可以轻松地重构以提高性能。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">因此,作为对自己的挑战,我想看看 LLM 是否能够帮助进行初始数据模型的创建。即使不能,它至少可以展示事物之间的连接,并为用户提供一些可以快速展示给他人的结果。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">直觉上,我知道数据建模是一个迭代的过程,某些 LLM 在处理大量数据时容易分心,因此这是一个使用 LangGraph 通过数据循环工作的好机会。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">让我们深入了解使其实现的提示词。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: 15.4px;font-weight: bold;display: table;margin: 4em auto 2em;padding-right: 0.2em;padding-left: 0.2em;background: rgb(1, 155, 252);color: rgb(255, 255, 255);">图建模基础

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">GraphAcademy 上的图数据建模基础课程引导你了解如何在图中建模数据的基础知识,但作为初步尝试,我使用了以下经验法则:

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;margin: 1.5em 8px;letter-spacing: 0.1em;color: rgb(63, 63, 63);">动词也可以是节点;你可能很高兴知道某个人订购了某个产品,但这个基本模型并不能让你知道产品是在哪里和什么时候订购的。在这种情况下,订单成为模型中的一个新节点。

我确信这可以被提炼成一个提示词,以创建一种零样本学习的图数据建模方法。

迭代方法

几个月前我简要尝试过这个方法,发现我使用的模型在处理较大模式时容易分心,提示词很快达到了 LLM 的 Token 限制。

这次我决定尝试一种迭代方法,一次只处理一个键。这应该有助于避免分心,因为 LLM 只需要一次考虑一项内容。

最终的方法使用了以下步骤:

  1. 1.将 CSV 文件加载到 Pandas 数据框中。

  2. 2.分析 CSV 中的每一列,并将其附加到一个基于 JSON Schema 的数据模型中。

  3. 3.为每个实体识别并添加缺失的唯一 ID。

  4. 4.审查数据模型的准确性。

  5. 5.生成导入节点和关系的 Cypher 语句。

  6. 6.生成支持导入语句的唯一约束。

  7. 7.创建约束并运行导入。

数据

我在 Kaggle 上快速浏览了一下,寻找一个有趣的数据集。脱颖而出的是 Spotify 最多播放的歌曲数据集。

importpandasaspd

csv_file='/Users/adam/projects/datamodeller/data/spotify/spotify-most-streamed-songs.csv'

df=pd.read_csv(csv_file)
df.head()


track_nameartist(s)_nameartist_countreleased_yearreleased_monthreleased_dayin_spotify_playlistsin_spotify_chartsstreamsin_apple_playlists…keymodedanceability_%valence_%energy_%acousticness_%instrumentalness_%liveness_%speechiness_%cover_url
0Seven(feat.Latto)(ExplicitVer.)Latto,JungKook2202371455314714138170343…BMajor80898331084NotFound
1LALAMykeTowers1202332314744813371628648…C#Major71617470104https://i.scdn.co/image/ab67616d0000b2730656d5…
2vampireOliviaRodrigo12023630139711314000397494…FMajor513253170316https://i.scdn.co/image/ab67616d0000b273e85259…
3CruelSummerTaylorSwift120198237858100800840817116…AMajor5558721101115https://i.scdn.co/image/ab67616d0000b273e787cf…
4WHERESHEGOESBadBunny1202351831335030323632284…AMinor6523801463116https://i.scdn.co/image/ab67616d0000b273ab5c9c…

这个数据集相对简单,但我马上就能看出应该存在曲目和艺术家之间的关系。

同时,还需要克服数据清理方面的挑战,尤其是列名和艺术家名称列中的艺术家是以逗号分隔的值。

选择 LLM

我真的很想使用本地 LLM,但我很快发现 Llama 3 无法胜任。如果有疑问,退回到 OpenAI:

fromlangchain_core.promptsimportPromptTemplate
fromlangchain_core.pydantic_v1importBaseModel,Field
fromtypingimportList
fromlangchain_core.output_parsersimportJsonOutputParser

fromlangchain_openaiimportChatOpenAI
llm=ChatOpenAI(model="gpt-4o")

创建数据模型

我使用了一组简化的建模说明来创建数据建模提示词。我不得不多次调整提示词以获得一致的输出。

零样本学习的例子效果相对较好,但我发现输出不一致。定义一个结构化的输出来保存 JSON 输出确实有帮助:

classJSONSchemaSpecification(BaseModel):
notes:str=Field(description="Anynotesorcommentsabouttheschema")
jsonschema:str=Field(description="AJSONarrayofJSONschemaspecificationsthatdescribetheentitiesinthedatamodel")

少样本学习示例输出

JSON 本身也不一致,所以我最终基于电影推荐数据集定义了一个模式。

示例输出:

example_output=[
dict(
title="Person",
type="object",
description="Node",
properties=[
dict(name="name",column_name="person_name",type="string",description="Thenameoftheperson",examples=["TomHanks"]),
dict(name="date_of_birth",column_name="person_dob",type="date",description="Thedateofbirthfortheperson",examples=["1987-06-05"]),
dict(name="id",column_name="person_name,date_of_birth",type="string",description="TheIDisacombinationofnameanddateofbirthtoensureuniqueness",examples=["tom-hanks-1987-06-05"]),
],
),
dict(
title="Director",
type="object",
description="Node",
properties=[
dict(name="name",column_name="director_names",type="string",description="Thenameofthedirectors.Splitvaluesincolumnbyacomma",examples=["FrancisFordCoppola"]),
],
),
dict(
title="Movie",
type="object",
description="Node",
properties=[
dict(name="title",column_name="title",type="string",description="Thetitleofthemovie",examples=["ToyStory"]),
dict(name="released",column_name="released",type="integer",description="Theyearthemoviewasreleased",examples=["1990"]),
],
),
dict(
title="ACTED_IN",
type="object",
description="Relationship",
properties=[
dict(name="_from",column_name="od",type="string",description="Personfoundbythe`id`.TheIDisacombinationofnameanddateofbirthtoensureuniqueness",examples=["Person"]),
dict(name="_to",column_name="title",type="string",description="Themovietitle",examples=["Movie"]),
dict(name="roles",type="string",column_name="person_roles",description="Therolesthepersonplayedinthemovie",examples=["Woody"]),
],
),
dict(
title="DIRECTED",
type="object",
description="Relationship",
properties=[
dict(name="_from",type="string",column_name="director_names",description="Directornamesarecommaseparated",examples=["Director"]),
dict(name="_to",type="string",column_name="title",description="Thelabelofthenodethisrelationshipendsat",examples=["Movie"]),
],
),
]

我不得不偏离严格的 JSON Schema,并在输出中添加 column_name 字段,以帮助 LLM 生成导入脚本。提供描述的示例在这方面也有帮助,否则在 MATCH 子句中使用的属性会不一致。

链式操作

这是最终的提示词:

model_prompt=PromptTemplate.from_template("""
YouareanexpertGraphDatabaseadministrator.
Yourtaskistodesignadatamodelbasedontheinformationprovidedfromanexistingdatasource.

Youmustdecidewherethefollowingcolumnfitsinwiththeexistingdatamodel.Consider:
*Doesthecolumnrepresentanentity,forexampleaPerson,Place,orMovie?Ifso,thisshouldbeanodeinitsownright.
*Doesthecolumnrepresentarelationshipbetweentwoentities?Ifso,thisshouldbearelationshipbetweentwonodes.
*Doesthecolumnrepresentanattributeofanentityorrelationship?Ifso,thisshouldbeapropertyofanodeorrelationship.
*Doesthecolumnrepresentasharedattributethatcouldbeinterestingtoquerythroughtofindsimilarnodes,forexampleaGenre?Ifso,thisshouldbeanodeinitsownright.


##InstructionsforNodes

*Nodelabelsaregenerallynouns,forexamplePerson,Place,orMovie
*NodetitlesshouldbeinUpperCamelCase

##InstructionsforRelationships

*Relationshopsaregenerallyverbs,forexampleACTED_IN,DIRECTED,orPURCHASED
*Examplesofgoodrelationshipsare(erson)-[:ACTED_IN]->(:Movie)or(erson)-[URCHASED]->(roduct)
*RelationshipsshouldbeinUPPER_SNAKE_CASE
*Provideanyspecificinstructionsforthefieldinthedescription.Forexample,doesthefieldcontainalistofcommaseparatedvaluesorasinglevalue?

##InstructionsforProperties

*RelationshipsshouldbeinlowerPascalCase
*Prefertheshorternamewherepossible,forexample"person_id"and"personId"shouldsimplybe"id"
*Ifyouarechangingthepropertynamefromtheoriginalfieldname,mentionthecolumnnameinthedescription
*Donotincludeexamplesforintegerordatefields
*Alwaysincludeinstructionsondatapreparationforthefield.Doesitneedtobecastasastringorsplitintomultiplefieldsonadelimitingvalue?
*Propertykeysshouldbelettersonly,nonumbersorspecialcharacters.

##Important!

Considertheexamplesprovided.Doesanydatapreparationneedtobedonetoensurethedataisinthecorrectformat?
Youmustincludeanyinformationaboutdatapreparationinthedescription.

##ExampleOutput

Hereisanexampleofagoodoutput:
'''
{example_output}
'''

##NewData:

Key:{key}
DataType:{type}
ExampleValues:{examples}

##ExistingDataModel

Hereistheexistingdatamodel:
'''
{existing_model}
'''

##KeepExistingDataModel

Applyyourchangestotheexistingdatamodelbutneverremoveanyexistingdefinitions.
""",partial_variables=dict(example_output=dumps(example_output)))

model_chain=model_prompt|llm.with_structured_output(JSONSchemaSpecification)

执行链式操作

为了迭代更新模型,我遍历数据框中的键,并将每个键、其数据类型及前五个唯一值传递给提示词:

fromjson_repairimportdumps,loads

existing_model={}

fori,keyinenumerate(df):
print("\n",i,key)
print("----------------")
try:
res=try_chain(model_chain,dict(
existing_model=dumps(existing_model),
key=key,
type=df[key].dtype,
examples=dumps(df[key].unique()[:5].tolist())
))
print(res.notes)
existing_model=loads(res.jsonschema)

print([n['title']forninexisting_model])
exceptExceptionase:
print(e)
pass

existing_model

控制台输出:

0track_name
----------------
Adding'track_name'toanexistingdatamodel.Thisrepresentsamusictrackentity.
['Track']

1artist(s)_name
----------------
Addinganewcolumn'artist(s)_name'totheexistingdatamodel.Thiscolumnrepresentsmultipleartistsassociatedwithtracksandshouldbemodeledasanewnode'Artist'andarelationship'PERFORMED_BY'from'Track'to'Artist'.
['Track','Artist','PERFORMED_BY']

2artist_count
----------------
Addedartist_countasapropertyofTracknode.Thispropertyindicatesthenumberofartistsperforminginthetrack.
['Track','Artist','PERFORMED_BY']

3released_year
----------------
Addthereleased_yearcolumntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']

4released_month
----------------
Addingthe'released_month'columntotheexistingdatamodel,consideringitasanattributeoftheTracknode.
['Track','Artist','PERFORMED_BY']

5released_day
----------------
Addedanewproperty'released_day'tothe'Track'nodetocapturethedayofthemonthatrackwasreleased.
['Track','Artist','PERFORMED_BY']

6in_spotify_playlists
----------------
Addingthenewcolumn'in_spotify_playlists'totheexistingdatamodelasapropertyofthe'Track'node.
['Track','Artist','PERFORMED_BY']

7in_spotify_charts
----------------
Addingthe'in_spotify_charts'columntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']

8streams
----------------
Addinganewcolumn'streams'totheexistingdatamodel,representingthenumberofstreamsforatrack.
['Track','Artist','PERFORMED_BY']

9in_apple_playlists
----------------
Addingnewcolumn'in_apple_playlists'totheexistingdatamodel
['Track','Artist','PERFORMED_BY']

10in_apple_charts
----------------
Adding'in_apple_charts'asapropertytothe'Track'node,representingthenumberoftimesthetrackappearedintheApplecharts.
['Track','Artist','PERFORMED_BY']

11in_deezer_playlists
----------------
Add'in_deezer_playlists'totheexistingdatamodelforamusictrackdatabase.
['Track','Artist','PERFORMED_BY']

12in_deezer_charts
----------------
Addinganewproperty'inDeezerCharts'totheexisting'Track'nodetorepresentthenumberoftimesthetrackappearedinDeezercharts.
['Track','Artist','PERFORMED_BY']

13in_shazam_charts
----------------
Addingnewdata'in_shazam_charts'totheexistingdatamodel.Thisappearstobeanattributeofthe'Track'node,indicatingthenumberoftimesatrackappearedintheShazamcharts.
['Track','Artist','PERFORMED_BY']

14bpm
----------------
AddedbpmcolumnasapropertytotheTracknodeasitrepresentsacharacteristicofthetrack.
['Track','Artist','PERFORMED_BY']

15key
----------------
Addingthe'key'columntotheexistingdatamodel.The'key'representsthemusicalkeyofatrack,whichisasharedattributethatcanbeinterestingtoquerythroughtofindsimilartracks.
['Track','Artist','PERFORMED_BY']

16mode
----------------
Adding'mode'totheexistingdatamodel.Itrepresentsamusicalcharacteristicofatrack,whichisbestcapturedasanattributeoftheTracknode.
['Track','Artist','PERFORMED_BY']

17danceability_%
----------------
Added'danceability_%'totheexistingdatamodelasapropertyoftheTracknode.Thefieldrepresentsthedanceabilitypercentageofthetrack.
['Track','Artist','PERFORMED_BY']

18valence_%
----------------
AddingthevalencepercentagecolumntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']

19energy_%
----------------
Integrationofthenewcolumn'energy_%'intotheexistingdatamodel.ThiscolumnrepresentsanattributeoftheTrackentityandshouldbeaddedasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']

20acousticness_%
----------------
Addingacousticness_%totheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']

21instrumentalness_%
----------------
Addingthenewcolumn'instrumentalness_%'totheexistingTracknodeasitrepresentsanattributeoftheTrackentity.
['Track','Artist','PERFORMED_BY']

22liveness_%
----------------
Addingthenewcolumn'liveness_%'totheexistingdatamodelasanattributeoftheTracknode
['Track','Artist','PERFORMED_BY']

23speechiness_%
----------------
Addingthenewcolumn'speechiness_%'totheexistingdatamodelasapropertyofthe'Track'node.
['Track','Artist','PERFORMED_BY']

24cover_url
----------------
Addinganewproperty'cover_url'totheexisting'Track'node.ThispropertyrepresentstheURLofthetrack'scoverimage.
['Track','Artist','PERFORMED_BY']

在对提示词进行了一些调整以处理用例后,我得到了一个我相当满意的模型。LLM 成功确定了数据集由 Track、Artist 以及连接两者的 PERFORMED_BY 关系组成:

[
{
"title":"Track",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"track_name",
"type":"string",
"description":"Thenameofthetrack",
"examples":[
"Seven(feat.Latto)(ExplicitVer.)",
"LALA",
"vampire",
"CruelSummer",
"WHERESHEGOES",
],
},
{
"name":"artist_count",
"column_name":"artist_count",
"type":"integer",
"description":"Thenumberofartistsperforminginthetrack",
"examples":[2,1,3,8,4],
},
{
"name":"released_year",
"column_name":"released_year",
"type":"integer",
"description":"Theyearthetrackwasreleased",
"examples":[2023,2019,2022,2013,2014],
},
{
"name":"released_month",
"column_name":"released_month",
"type":"integer",
"description":"Themonththetrackwasreleased",
"examples":[7,3,6,8,5],
},
{
"name":"released_day",
"column_name":"released_day",
"type":"integer",
"description":"Thedayofthemonththetrackwasreleased",
"examples":[14,23,30,18,1],
},
{
"name":"inSpotifyPlaylists",
"column_name":"in_spotify_playlists",
"type":"integer",
"description":"ThenumberofSpotifyplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[553,1474,1397,7858,3133],
},
{
"name":"inSpotifyCharts",
"column_name":"in_spotify_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheSpotifycharts.Castthevalueasaninteger.",
"examples":[147,48,113,100,50],
},
{
"name":"streams",
"column_name":"streams",
"type":"array",
"description":"ThelistofstreamIDsforthetrack.Maintainthearrayformat.",
"examples":[
"141381703",
"133716286",
"140003974",
"800840817",
"303236322",
],
},
{
"name":"inApplePlaylists",
"column_name":"in_apple_playlists",
"type":"integer",
"description":"ThenumberofAppleplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[43,48,94,116,84],
},
{
"name":"inAppleCharts",
"column_name":"in_apple_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheApplecharts.Castthevalueasaninteger.",
"examples":[263,126,207,133,213],
},
{
"name":"inDeezerPlaylists",
"column_name":"in_deezer_playlists",
"type":"array",
"description":"ThelistofDeezerplaylistIDsthetrackisin.Maintainthearrayformat.",
"examples":["45","58","91","125","87"],
},
{
"name":"inDeezerCharts",
"column_name":"in_deezer_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheDeezercharts.Castthevalueasaninteger.",
"examples":[10,14,12,15,17],
},
{
"name":"inShazamCharts",
"column_name":"in_shazam_charts",
"type":"array",
"description":"ThelistofShazamchartIDsthetrackisin.Maintainthearrayformat.",
"examples":["826","382","949","548","425"],
},
{
"name":"bpm",
"column_name":"bpm",
"type":"integer",
"description":"Thebeatsperminuteofthetrack.Castthevalueasaninteger.",
"examples":[125,92,138,170,144],
},
{
"name":"key",
"column_name":"key",
"type":"string",
"description":"Themusicalkeyofthetrack.Castthevalueasastring.",
"examples":["B","C#","F","A","D"],
},
{
"name":"mode",
"column_name":"mode",
"type":"string",
"description":"Themodeofthetrack(e.g.,Major,Minor).Castthevalueasastring.",
"examples":["Major","Minor"],
},
{
"name":"danceability",
"column_name":"danceability_%",
"type":"integer",
"description":"Thedanceabilitypercentageofthetrack.Castthevalueasaninteger.",
"examples":[80,71,51,55,65],
},
{
"name":"valence",
"column_name":"valence_%",
"type":"integer",
"description":"Thevalencepercentageofthetrack.Castthevalueasaninteger.",
"examples":[89,61,32,58,23],
},
{
"name":"energy",
"column_name":"energy_%",
"type":"integer",
"description":"Theenergypercentageofthetrack.Castthevalueasaninteger.",
"examples":[83,74,53,72,80],
},
{
"name":"acousticness",
"column_name":"acousticness_%",
"type":"integer",
"description":"Theacousticnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[31,7,17,11,14],
},
{
"name":"instrumentalness",
"column_name":"instrumentalness_%",
"type":"integer",
"description":"Theinstrumentalnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[0,63,17,2,19],
},
{
"name":"liveness",
"column_name":"liveness_%",
"type":"integer",
"description":"Thelivenesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[8,10,31,11,28],
},
{
"name":"speechiness",
"column_name":"speechiness_%",
"type":"integer",
"description":"Thespeechinesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[4,6,15,24,3],
},
{
"name":"coverUrl",
"column_name":"cover_url",
"type":"string",
"description":"TheURLofthetrack'scoverimage.Ifthevalueis'NotFound',itshouldbecastasanemptystring.",
"examples":[
"https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05",
"https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d",
],
},
],
},
{
"title":"Artist",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"artist(s)_name",
"type":"string",
"description":"Thenameoftheartist.Splitvaluesincolumnbyacomma",
"examples":[
"Latto",
"JungKook",
"MykeTowers",
"OliviaRodrigo",
"TaylorSwift",
"BadBunny",
],
}
],
},
{
"title":"PERFORMED_BY",
"type":"object",
"description":"Relationship",
"properties":[
{
"name":"_from",
"type":"string",
"description":"Thelabelofthenodethisrelationshipstartsat",
"examples":["Track"],
},
{
"name":"_to",
"type":"string",
"description":"Thelabelofthenodethisrelationshipendsat",
"examples":["Artist"],
},
],
},
]

[
{
"title":"Track",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"track_name",
"type":"string",
"description":"Thenameofthetrack",
"examples":[
"Seven(feat.Latto)(ExplicitVer.)",
"LALA",
"vampire",
"CruelSummer",
"WHERESHEGOES",
],
},
{
"name":"artist_count",
"column_name":"artist_count",
"type":"integer",
"description":"Thenumberofartistsperforminginthetrack",
"examples":[2,1,3,8,4],
},
{
"name":"released_year",
"column_name":"released_year",
"type":"integer",
"description":"Theyearthetrackwasreleased",
"examples":[2023,2019,2022,2013,2014],
},
{
"name":"released_month",
"column_name":"released_month",
"type":"integer",
"description":"Themonththetrackwasreleased",
"examples":[7,3,6,8,5],
},
{
"name":"released_day",
"column_name":"released_day",
"type":"integer",
"description":"Thedayofthemonththetrackwasreleased",
"examples":[14,23,30,18,1],
},
{
"name":"inSpotifyPlaylists",
"column_name":"in_spotify_playlists",
"type":"integer",
"description":"ThenumberofSpotifyplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[553,1474,1397,7858,3133],
},
{
"name":"inSpotifyCharts",
"column_name":"in_spotify_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheSpotifycharts.Castthevalueasaninteger.",
"examples":[147,48,113,100,50],
},
{
"name":"streams",
"column_name":"streams",
"type":"array",
"description":"ThelistofstreamIDsforthetrack.Maintainthearrayformat.",
"examples":[
"141381703",
"133716286",
"140003974",
"800840817",
"303236322",
],
},
{
"name":"inApplePlaylists",
"column_name":"in_apple_playlists",
"type":"integer",
"description":"ThenumberofAppleplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[43,48,94,116,84],
},
{
"name":"inAppleCharts",
"column_name":"in_apple_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheApplecharts.Castthevalueasaninteger.",
"examples":[263,126,207,133,213],
},
{
"name":"inDeezerPlaylists",
"column_name":"in_deezer_playlists",
"type":"array",
"description":"ThelistofDeezerplaylistIDsthetrackisin.Maintainthearrayformat.",
"examples":["45","58","91","125","87"],
},
{
"name":"inDeezerCharts",
"column_name":"in_deezer_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheDeezercharts.Castthevalueasaninteger.",
"examples":[10,14,12,15,17],
},
{
"name":"inShazamCharts",
"column_name":"in_shazam_charts",
"type":"array",
"description":"ThelistofShazamchartIDsthetrackisin.Maintainthearrayformat.",
"examples":["826","382","949","548","425"],
},
{
"name":"bpm",
"column_name":"bpm",
"type":"integer",
"description":"Thebeatsperminuteofthetrack.Castthevalueasaninteger.",
"examples":[125,92,138,170,144],
},
{
"name":"key",
"column_name":"key",
"type":"string",
"description":"Themusicalkeyofthetrack.Castthevalueasastring.",
"examples":["B","C#","F","A","D"],
},
{
"name":"mode",
"column_name":"mode",
"type":"string",
"description":"Themodeofthetrack(e.g.,Major,Minor).Castthevalueasastring.",
"examples":["Major","Minor"],
},
{
"name":"danceability",
"column_name":"danceability_%",
"type":"integer",
"description":"Thedanceabilitypercentageofthetrack.Castthevalueasaninteger.",
"examples":[80,71,51,55,65],
},
{
"name":"valence",
"column_name":"valence_%",
"type":"integer",
"description":"Thevalencepercentageofthetrack.Castthevalueasaninteger.",
"examples":[89,61,32,58,23],
},
{
"name":"energy",
"column_name":"energy_%",
"type":"integer",
"description":"Theenergypercentageofthetrack.Castthevalueasaninteger.",
"examples":[83,74,53,72,80],
},
{
"name":"acousticness",
"column_name":"acousticness_%",
"type":"integer",
"description":"Theacousticnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[31,7,17,11,14],
},
{
"name":"instrumentalness",
"column_name":"instrumentalness_%",
"type":"integer",
"description":"Theinstrumentalnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[0,63,17,2,19],
},
{
"name":"liveness",
"column_name":"liveness_%",
"type":"integer",
"description":"Thelivenesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[8,10,31,11,28],
},
{
"name":"speechiness",
"column_name":"speechiness_%",
"type":"integer",
"description":"Thespeechinesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[4,6,15,24,3],
},
{
"name":"coverUrl",
"column_name":"cover_url",
"type":"string",
"description":"TheURLofthetrack'scoverimage.Ifthevalueis'NotFound',itshouldbecastasanemptystring.",
"examples":[
"https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05",
"https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d",
],
},
],
},
{
"title":"Artist",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"artist(s)_name",
"type":"string",
"description":"Thenameoftheartist.Splitvaluesincolumnbyacomma",
"examples":[
"Latto",
"JungKook",
"MykeTowers",
"OliviaRodrigo",
"TaylorSwift",
"BadBunny",
],
}
],
},
{
"title":"PERFORMED_BY",
"type":"object",
"description":"Relationship",
"properties":[
{
"name":"_from",
"type":"string",
"description":"Thelabelofthenodethisrelationshipstartsat",
"examples":["Track"],
},
{
"name":"_to",
"type":"string",
"description":"Thelabelofthenodethisrelationshipendsat",
"examples":["Artist"],
},
],
},
]

添加唯一标识符

我注意到模式中没有包含任何唯一标识符,这在导入关系时可能会成为问题。很显然,不同的艺术家可能发布同名歌曲,两个艺术家也可能拥有相同的名字。

因此,为曲目创建一个标识符,以便在更大的数据集中进行区分,这一点非常重要:

#Addprimarykey/uniqueidentifiers
uid_prompt=PromptTemplate.from_template("""
Youareagraphdatabaseexpertreviewingasingleentityfromadatamodelgeneratedbyacolleague.
Youwanttoensurethatallofthenodesimportedintothedatabaseareunique.

##Example

AschemacontainsActorswithanumberofpropertiesincludingname,dateofbirth.

Twoactorsmayhavethesamenamethenaddanewcompoundpropertycombiningthenameanddateofbirth.
Ifcombiningvalues,includetheinstructiontoconvertthevaluetoslugcase.Callthenewproperty'id'.

Ifyouhaveidentifiedanewproperty,addittothelistofpropertiesleavingtherestintact.
Includeinthedescriptionthefieldsthataretobeconcatenated.


##ExampleOutput

Hereisanexampleofagoodoutput:
'''
{example_output}
'''

##CurrentEntitySchema
'''
{entity}
'''

""",partial_variables=dict(example_output=dumps(example_output)))

uid_chain=uid_prompt|llm.with_structured_output(JSONSchemaSpecification)

这一步实际上只需要对节点执行,因此我从模式中提取了节点,为每个节点运行链式操作,然后将关系与更新后的定义结合起来:

#extractnodesandrelationships
nodes=[nforninexisting_modelif"node"inn["description"].lower()]
rels=[nforninexisting_modelif"node"notinn["description"].lower()]

#generateauniqueidfornodes
with_uids=[]

forentityinnodes:
res=uid_chain.invoke(dict(entity=dumps(entity)))
json=loads(res.jsonschema)

with_uids=with_uids+jsoniftype(json)==listelsewith_uids+[json]

#combinenodesandrelationships
with_uids=with_uids+rels

数据模型审查

为了确保模型的合理性,值得检查模型是否可以进行优化。model_prompt 在识别名词和动词方面做得很好,但在更复杂的模型中。

某次迭代将 *_playlists 和 _charts 列视为 ID,并尝试创建 Stream 节点和 IN_PLAYLIST 关系。我猜测这可能是因为超过 1000 的数字使用了逗号格式(例如 1,001)。

这是个不错的想法,但可能有点过于聪明了。不过这表明了让一个了解数据结构的人参与其中的重要性。

#Addprimarykey/uniqueidentifiers
review_prompt=PromptTemplate.from_template("""
Youareagraphdatabaseexpertreviewingadatamodelgeneratedbyacolleague.

Yourtaskistoreviewthedatamodelandensurethatitisfitforpurpose.
Checkfor:


##Checkfornestedobjects

RememberthatNeo4jcannotstorearraysofobjectsornestedobjects.
Thesemustbeconvertedintointoseparatenodeswithrelationshipsbetweenthem.
Youmustincludethenewnodeandareferencetotherelationshiptotheoutputschema.

##CheckforEntitiesinproperties

IfthereisapropertythatrepresentsanarrayofIDs,anewnodeshouldbecreatedforthatentity.
Youmustincludethenewnodeandareferencetotherelationshiptotheoutputschema.

#KeepInstructions

Ensurethattheinstructionsforthenodes,relationships,andpropertiesareclearandconcise.
Youmayimprovethembutthedetailmustnotberemovedinanycircumstances.


##CurrentEntitySchema

'''
{entity}
'''
""")

review_chain=review_prompt|llm.with_structured_output(JSONSchemaSpecification)

review_nodes=[nforninwith_uidsif"node"inn["description"].lower()]
review_rels=[nforninwith_uidsif"node"notinn["description"].lower()]

reviewed=[]

forentityinreview_nodes:
res=review_chain.invoke(dict(entity=dumps(entity)))
json=loads(res.jsonschema)

reviewed=reviewed+json

#addrelationshipsbackin
reviewed=reviewed+review_rels

len(reviewed)

reviewed=with_uids

在现实场景中,我会希望多次运行此过程,以迭代改进数据模型。我会设置一个最大限制,然后迭代到该点或数据模型对象不再变化为止。

生成导入语句

此时,模式应该足够健壮,并包含尽可能多的信息,以允许 LLM 生成一组导入脚本。

根据 Neo4j 数据导入建议,文件应多次处理,每次只导入一个节点或关系,以避免急切操作和锁定。

import_prompt=PromptTemplate.from_template("""
Basedonthedatamodel,writeaCypherstatementtoimportthefollowingdatafromaCSVfileintoNeo4j.

DonotuseLOADCSVasthisdatawillbeimportedusingtheNeo4jPythonDriver,useUNWINDonthe$rowsparameterinstead.
Youarewritingamulti-passimportprocess,soconcentrateontheentitymentioned.
Whenimportingdata,youmustusethefollowingguidelines:
*followtheinstructionsinthedescriptionwhenidentifyingprimarykeys.
*Usetheinstructionsinthedescriptiontodeterminetheformatofpropertieswhenafinding.
*WhencombiningfieldsintoanID,usetheapoc.text.slugfunctiontoconvertanytexttoslugcaseandtoLowertoconvertthestringtolowercase-apoc.text.slug(toLower(row.`name`))
*Ifyousplitaproperty,convertittoastringandusethetrimfunctiontoremoveanywhitespace-trim(toString(row.`name`))
*Whencombiningproperties,wrapeachpropertyinthecoalescefunctionsothepropertyisnotnullifoneofthevaluesisnotset-coalesce(row.`id`,'')+'--'+coalsece(row.`title`)
*Usethe`column_name`fieldtomaptheCSVcolumntothepropertyinthedatamodel.
*WrapallcolumnnamesfromtheCSVinbackticks-forexamplerow.`column_name`.
*Whenyoumergenodes,mergeontheuniqueidentifierandnothingelse.Allotherpropertiesshouldbesetusing`SET`.
*Donotuseapoc.periodic.iterate,thefileswillbebatchedintheapplication.
DataModel:
'''
{data_model}
'''
CurrentEntity:
'''
{entity}
'''
""")

此链式操作需要与前几个步骤不同的输出对象。在这种情况下,cypher 成员最为重要,但我还想包括一个 chain_of_thought 键,以鼓励思维过程:

classCypherOutputSpecification(BaseModel):
chain_of_thought:str=Field(description="AnyreasoningusedtowritetheCypherstatement")
cypher:str=Field(description="TheCypherstatementtoimportthedata")
notes:Optional[str]=Field(description="AnynotesorclosingremarksabouttheCypherstatement")

import_chain=import_prompt|llm.with_structured_output(CypherOutputSpecification)

然后同样的过程适用于遍历每个已审查的定义并生成 Cypher:

import_cypher=[]
forninreviewed:
print('\n\n------',n['title'])
res=import_chain.invoke(dict(
data_model=dumps(reviewed),
entity=n
))
import_cypher.append((
res.cypher
))
print(res.cypher)

控制台输出:

------Track
UNWIND$rowsASrow
MERGE(t:Track{id:apoc.text.slug(toLower(coalesce(row.`track_name`,'')+'-'+coalesce(row.`released_year`,'')))})
SETt.name=trim(toString(row.`track_name`)),
t.artist_count=toInteger(row.`artist_count`),
t.released_year=toInteger(row.`released_year`),
t.released_month=toInteger(row.`released_month`),
t.released_day=toInteger(row.`released_day`),
t.inSpotifyPlaylists=toInteger(row.`in_spotify_playlists`),
t.inSpotifyCharts=toInteger(row.`in_spotify_charts`),
t.streams=row.`streams`,
t.inApplePlaylists=toInteger(row.`in_apple_playlists`),
t.inAppleCharts=toInteger(row.`in_apple_charts`),
t.inDeezerPlaylists=row.`in_deezer_playlists`,
t.inDeezerCharts=toInteger(row.`in_deezer_charts`),
t.inShazamCharts=row.`in_shazam_charts`,
t.bpm=toInteger(row.`bpm`),
t.key=trim(toString(row.`key`)),
t.mode=trim(toString(row.`mode`)),
t.danceability=toInteger(row.`danceability_%`),
t.valence=toInteger(row.`valence_%`),
t.energy=toInteger(row.`energy_%`),
t.acousticness=toInteger(row.`acousticness_%`),
t.instrumentalness=toInteger(row.`instrumentalness_%`),
t.liveness=toInteger(row.`liveness_%`),
t.speechiness=toInteger(row.`speechiness_%`),
t.coverUrl=CASErow.`cover_url`WHEN'NotFound'THEN''ELSEtrim(toString(row.`cover_url`))END


------Artist
UNWIND$rowsASrow
WITHrow,split(row.`artist(s)_name`,',')ASartistNames
UNWINDartistNamesASartistName
MERGE(a:Artist{id:apoc.text.slug(toLower(trim(artistName)))})
SETa.name=trim(artistName)


------PERFORMED_BY
UNWIND$rowsASrow
UNWINDsplit(row.`artist(s)_name`,',')ASartist_name
MERGE(t:Track{id:apoc.text.slug(toLower(row.`track_name`))+'-'+trim(toString(row.`released_year`))})
MERGE(a:Artist{id:apoc.text.slug(toLower(trim(artist_name)))})
MERGE(t)-[ERFORMED_BY]->(a)

这个提示词经过了一些工程调整才获得一致的结果:

将此分为两个提示词——一个用于节点,一个用于关系——也会有所帮助。但这是留待另一天的任务。

创建唯一约束

接下来,导入脚本可以作为在数据库中创建唯一约束的基础:

constraint_prompt=PromptTemplate.from_template("""
Youareanexpertgraphdatabaseadministrator.
UsethefollowingCypherstatementtowriteaCypherstatementto
createuniqueconstraintsonanypropertiesusedinaMERGEstatement.

Thecorrectsyntaxforauniqueconstraintis:
CREATECONSTRAINTmovie_title_idIFNOTEXISTSFOR(m:Movie)REQUIREm.titleISUNIQUE;

Cypher:
'''
{cypher}
'''
""")

constraint_chain=constraint_prompt|llm.with_structured_output(CypherOutputSpecification)

constraint_queries=[]

forstatementinimport_cypher:
res=constraint_chain.invoke(dict(cypher=statement))

statements=res.cypher.split(";")

forcypherinstatements:
constraint_queries.append(cypher)

控制台输出:

CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE

CREATECONSTRAINTstream_idIFNOTEXISTSFOR(s:Stream)REQUIREs.idISUNIQUE

CREATECONSTRAINTplaylist_idIFNOTEXISTSFOR(playlist)REQUIREp.idISUNIQUE

CREATECONSTRAINTchart_idIFNOTEXISTSFOR(c:Chart)REQUIREc.idISUNIQUE

CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE

CREATECONSTRAINTstream_id_uniqueIFNOTEXISTSFOR(s:Stream)REQUIREs.idISUNIQUE

CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE

CREATECONSTRAINTplaylist_id_uniqueIFNOTEXISTSFOR(playlist)REQUIREp.idISUNIQUE

CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(track:Track)REQUIREtrack.idISUNIQUE

CREATECONSTRAINTchart_id_uniqueIFNOTEXISTSFOR(chart:Chart)REQUIREchart.idISUNIQUE

有时这个提示词会返回索引和约束的语句,因此需要用分号来分割。

运行导入

一切就绪后,便是执行 Cypher 语句的时间:

fromosimportgetenv
fromneo4jimportGraphDatabase

driver=GraphDatabase.driver(
getenv("NEO4J_URI"),
auth=(
getenv("NEO4J_USERNAME"),
getenv("NEO4J_PASSWORD")
)
)

withdriver.session()assession:
#truncatethedb
session.run("MATCH(n)DETACHDELETEn")

#createconstraints
forqinconstraint_queries:
ifq.strip()!="":
session.run(q)

#importthedata
forqinimport_cypher:
ifq.strip()!="":
res=session.run(q,rows=rows).consume()
print(q)
print(res.counters)

数据集的质量保证

如果没有对数据集进行一些质量保证,这篇文章就不完整了,使用 GraphCypherQAChain:

fromlangchain.chainsimportGraphCypherQAChain
fromlangchain_community.graphsimportNeo4jGraph

graph=Neo4jGraph(
url=getenv("NEO4J_URI"),
username=getenv("NEO4J_USERNAME"),
password=getenv("NEO4J_PASSWORD"),
enhanced_schema=True
)

qa=GraphCypherQAChain.from_llm(
llm,
graph=graph,
allow_dangerous_requests=True,
verbose=True
)

最受欢迎的艺术家

数据库中最受欢迎的艺术家是谁?

qa.invoke({"query":"Whoarethemostpopularartists?"})

>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(:Track)-[ERFORMED_BY]->(a:Artist)
RETURNa.name,COUNT(*)ASpopularity
ORDERBYpopularityDESC
LIMIT10

FullContext:
[{'a.name':'BadBunny','popularity':40},{'a.name':'TaylorSwift','popularity':38},{'a.name':'TheWeeknd','popularity':36},{'a.name':'SZA','popularity':23},{'a.name':'KendrickLamar','popularity':23},{'a.name':'Feid','popularity':21},{'a.name':'Drake','popularity':19},{'a.name':'HarryStyles','popularity':17},{'a.name':'PesoPluma','popularity':16},{'a.name':'21Savage','popularity':14}]

>Finishedchain.

{
"query":"Whoarethemostpopularartists?",
"result":"BadBunny,TaylorSwift,andTheWeekndarethemostpopularartists."
}

LLM 似乎是根据艺术家参与的曲目数量来判断受欢迎程度的,而不是根据他们的总播放量。

每分钟节拍数

哪首曲子的 BPM 最高?

qa.invoke({"query":"WhichtrackhasthehighestBPM?"})

>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(t:Track)
RETURNt
ORDERBYt.bpmDESC
LIMIT1

FullContext:
[{'t':{'id':'seven-feat-latto-explicit-ver--2023'}}]

>Finishedchain.

{
"query":"WhichtrackhasthehighestBPM?",
"result":"Idon'tknowtheanswer."
}

改进 Cypher 生成提示词

在这种情况下,Cypher 看起来不错,并且提示中包含了正确的结果,但 gpt-4o 无法解释答案。看来传递给 GraphCypherQAChain 的 CYPHER_GENERATION_PROMPT 需要添加额外的说明,以使列名更详细。

始终在 Cypher 语句中使用详细的列名,使用标签和属性名。例如,使用‘person_name’而不是‘name’。

带有自定义提示词的 GraphCypherQAChain:

CYPHER_GENERATION_TEMPLATE="""Task:GenerateCypherstatementtoqueryagraphdatabase.
Instructions:
Useonlytheprovidedrelationshiptypesandpropertiesintheschema.
Donotuseanyotherrelationshiptypesorpropertiesthatarenotprovided.
Schema:
{schema}
Noteonotincludeanyexplanationsorapologiesinyourresponses.
DonotrespondtoanyquestionsthatmightaskanythingelsethanforyoutoconstructaCypherstatement.
DonotincludeanytextexceptthegeneratedCypherstatement.

AlwaysuseverbosecolumnnamesintheCypherstatementusingthelabelandpropertynames.Forexample,use'person_name'insteadof'name'.
Includedatafromtheimmediatenetworkaroundthenodeintheresulttoprovideextracontext.Forexample,includetheMoviereleaseyear,alistofactorsandtheirroles,orthedirectorofamovie.
Whenorderingbyaproperty,addan`ISNOTNULL`checktoensurethatonlynodeswiththatpropertyarereturned.

Examples:HereareafewexamplesofgeneratedCypherstatementsforparticularquestions:

#HowmanypeopleactedinTopGun?
MATCH(m:Movie{{name:"TopGun"}})
RETURNCOUNT{(m)<-[:ACTED_IN]-()}ASnumberOfActors

Thequestionis:
{question}"""

CYPHER_GENERATION_PROMPT=PromptTemplate(
input_variables=["schema","question"],template=CYPHER_GENERATION_TEMPLATE
)

qa=GraphCypherQAChain.from_llm(
llm,
graph=graph,
allow_dangerous_requests=True,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)

由最多艺术家表演的曲目

图在返回按类型和方向计数的关系时表现出色。

qa.invoke({"query":"Whichtracksareperformedbythemostartists?"})

>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(t:Track)
WITHt,COUNT{(t)-[ERFORMED_BY]->(:Artist)}asartist_count
WHEREartist_countISNOTNULL
RETURNt.idAStrack_id,t.nameAStrack_name,artist_count
ORDERBYartist_countDESC

FullContext:
[{'track_id':'los-del-espacio-2023','track_name':'LosdelEspacio','artist_count':8},{'track_id':'se-le-ve-2021','track_name':'SeLeVe','artist_count':8},{'track_id':'we-don-t-talk-about-bruno-2021','track_name':"WeDon'tTalkAboutBruno",'artist_count':7},{'track_id':'cayï-ï-la-noche-feat-cruz-cafunï-ï-abhir-hathi-bejo-el-ima--2022','track_name':None,'artist_count':6},{'track_id':'jhoome-jo-pathaan-2022','track_name':'JhoomeJoPathaan','artist_count':6},{'track_id':'besharam-rang-from-pathaan--2022','track_name':None,'artist_count':6},{'track_id':'nobody-like-u-from-turning-red--2022','track_name':None,'artist_count':6},{'track_id':'ultra-solo-remix-2022','track_name':'ULTRASOLOREMIX','artist_count':5},{'track_id':'angel-pt-1-feat-jimin-of-bts-jvke-muni-long--2023','track_name':None,'artist_count':5},{'track_id':'link-up-metro-boomin-don-toliver-wizkid-feat-beam-toian-spider-verse-remix-spider-man-across-the-spider-verse--2023','track_name':None,'artist_count':5}]

>Finishedchain.

{
"query":"Whichtracksareperformedbythemostartists?",
"result":"Thetracks\"LosdelEspacio\"and\"SeLeVe\"areperformedbythemostartists,witheachtrackhaving8artists."
}

总结

CSV 分析和建模是最耗时的部分。生成可能需要超过五分钟。

成本本身相当便宜。在八小时的实验中,我必须发送了数百个请求,最终我花费了大约一美元左右。

在达到这个目标之前,我遇到了许多挑战:

我可以看到这种方法在 LangGraph 实现中运行良好,在该实现中,操作按顺序执行,使大语言模型能够构建和精细化模型。随着新模型的不断发布,这些模型也可能受益于微调。






欢迎光临 链载Ai (https://www.lianzai.com/) Powered by Discuz! X3.5