•名词变成标签——它们描述节点所代表的事物。
•动词变成关系类型——它们描述事物之间的连接方式。
•其他内容变成属性(尤其是副词)——你有一个名字,并且可能驾驶一辆灰色的车。
我确信这可以被提炼成一个提示词,以创建一种零样本学习的图数据建模方法。
几个月前我简要尝试过这个方法,发现我使用的模型在处理较大模式时容易分心,提示词很快达到了 LLM 的 Token 限制。
这次我决定尝试一种迭代方法,一次只处理一个键。这应该有助于避免分心,因为 LLM 只需要一次考虑一项内容。
最终的方法使用了以下步骤:
1.将 CSV 文件加载到 Pandas 数据框中。
2.分析 CSV 中的每一列,并将其附加到一个基于 JSON Schema 的数据模型中。
3.为每个实体识别并添加缺失的唯一 ID。
4.审查数据模型的准确性。
5.生成导入节点和关系的 Cypher 语句。
6.生成支持导入语句的唯一约束。
7.创建约束并运行导入。
我在 Kaggle 上快速浏览了一下,寻找一个有趣的数据集。脱颖而出的是 Spotify 最多播放的歌曲数据集。
importpandasaspd
csv_file='/Users/adam/projects/datamodeller/data/spotify/spotify-most-streamed-songs.csv'
df=pd.read_csv(csv_file)
df.head()
track_nameartist(s)_nameartist_countreleased_yearreleased_monthreleased_dayin_spotify_playlistsin_spotify_chartsstreamsin_apple_playlists…keymodedanceability_%valence_%energy_%acousticness_%instrumentalness_%liveness_%speechiness_%cover_url
0Seven(feat.Latto)(ExplicitVer.)Latto,JungKook2202371455314714138170343…BMajor80898331084NotFound
1LALAMykeTowers1202332314744813371628648…C#Major71617470104https://i.scdn.co/image/ab67616d0000b2730656d5…
2vampireOliviaRodrigo12023630139711314000397494…FMajor513253170316https://i.scdn.co/image/ab67616d0000b273e85259…
3CruelSummerTaylorSwift120198237858100800840817116…AMajor5558721101115https://i.scdn.co/image/ab67616d0000b273e787cf…
4WHERESHEGOESBadBunny1202351831335030323632284…AMinor6523801463116https://i.scdn.co/image/ab67616d0000b273ab5c9c…这个数据集相对简单,但我马上就能看出应该存在曲目和艺术家之间的关系。
同时,还需要克服数据清理方面的挑战,尤其是列名和艺术家名称列中的艺术家是以逗号分隔的值。
我真的很想使用本地 LLM,但我很快发现 Llama 3 无法胜任。如果有疑问,退回到 OpenAI:
fromlangchain_core.promptsimportPromptTemplate
fromlangchain_core.pydantic_v1importBaseModel,Field
fromtypingimportList
fromlangchain_core.output_parsersimportJsonOutputParser
fromlangchain_openaiimportChatOpenAI
llm=ChatOpenAI(model="gpt-4o")我使用了一组简化的建模说明来创建数据建模提示词。我不得不多次调整提示词以获得一致的输出。
零样本学习的例子效果相对较好,但我发现输出不一致。定义一个结构化的输出来保存 JSON 输出确实有帮助:
classJSONSchemaSpecification(BaseModel):
notes:str=Field(description="Anynotesorcommentsabouttheschema")
jsonschema:str=Field(description="AJSONarrayofJSONschemaspecificationsthatdescribetheentitiesinthedatamodel")JSON 本身也不一致,所以我最终基于电影推荐数据集定义了一个模式。
示例输出:
example_output=[
dict(
title="Person",
type="object",
description="Node",
properties=[
dict(name="name",column_name="person_name",type="string",description="Thenameoftheperson",examples=["TomHanks"]),
dict(name="date_of_birth",column_name="person_dob",type="date",description="Thedateofbirthfortheperson",examples=["1987-06-05"]),
dict(name="id",column_name="person_name,date_of_birth",type="string",description="TheIDisacombinationofnameanddateofbirthtoensureuniqueness",examples=["tom-hanks-1987-06-05"]),
],
),
dict(
title="Director",
type="object",
description="Node",
properties=[
dict(name="name",column_name="director_names",type="string",description="Thenameofthedirectors.Splitvaluesincolumnbyacomma",examples=["FrancisFordCoppola"]),
],
),
dict(
title="Movie",
type="object",
description="Node",
properties=[
dict(name="title",column_name="title",type="string",description="Thetitleofthemovie",examples=["ToyStory"]),
dict(name="released",column_name="released",type="integer",description="Theyearthemoviewasreleased",examples=["1990"]),
],
),
dict(
title="ACTED_IN",
type="object",
description="Relationship",
properties=[
dict(name="_from",column_name="od",type="string",description="Personfoundbythe`id`.TheIDisacombinationofnameanddateofbirthtoensureuniqueness",examples=["Person"]),
dict(name="_to",column_name="title",type="string",description="Themovietitle",examples=["Movie"]),
dict(name="roles",type="string",column_name="person_roles",description="Therolesthepersonplayedinthemovie",examples=["Woody"]),
],
),
dict(
title="DIRECTED",
type="object",
description="Relationship",
properties=[
dict(name="_from",type="string",column_name="director_names",description="Directornamesarecommaseparated",examples=["Director"]),
dict(name="_to",type="string",column_name="title",description="Thelabelofthenodethisrelationshipendsat",examples=["Movie"]),
],
),
]我不得不偏离严格的 JSON Schema,并在输出中添加 column_name 字段,以帮助 LLM 生成导入脚本。提供描述的示例在这方面也有帮助,否则在 MATCH 子句中使用的属性会不一致。
这是最终的提示词:
model_prompt=PromptTemplate.from_template("""
YouareanexpertGraphDatabaseadministrator.
Yourtaskistodesignadatamodelbasedontheinformationprovidedfromanexistingdatasource.
Youmustdecidewherethefollowingcolumnfitsinwiththeexistingdatamodel.Consider:
*Doesthecolumnrepresentanentity,forexampleaPerson,Place,orMovie?Ifso,thisshouldbeanodeinitsownright.
*Doesthecolumnrepresentarelationshipbetweentwoentities?Ifso,thisshouldbearelationshipbetweentwonodes.
*Doesthecolumnrepresentanattributeofanentityorrelationship?Ifso,thisshouldbeapropertyofanodeorrelationship.
*Doesthecolumnrepresentasharedattributethatcouldbeinterestingtoquerythroughtofindsimilarnodes,forexampleaGenre?Ifso,thisshouldbeanodeinitsownright.
##InstructionsforNodes
*Nodelabelsaregenerallynouns,forexamplePerson,Place,orMovie
*NodetitlesshouldbeinUpperCamelCase
##InstructionsforRelationships
*Relationshopsaregenerallyverbs,forexampleACTED_IN,DIRECTED,orPURCHASED
*Examplesofgoodrelationshipsare(
erson)-[:ACTED_IN]->(:Movie)or(
erson)-[
URCHASED]->(
roduct)
*RelationshipsshouldbeinUPPER_SNAKE_CASE
*Provideanyspecificinstructionsforthefieldinthedescription.Forexample,doesthefieldcontainalistofcommaseparatedvaluesorasinglevalue?
##InstructionsforProperties
*RelationshipsshouldbeinlowerPascalCase
*Prefertheshorternamewherepossible,forexample"person_id"and"personId"shouldsimplybe"id"
*Ifyouarechangingthepropertynamefromtheoriginalfieldname,mentionthecolumnnameinthedescription
*Donotincludeexamplesforintegerordatefields
*Alwaysincludeinstructionsondatapreparationforthefield.Doesitneedtobecastasastringorsplitintomultiplefieldsonadelimitingvalue?
*Propertykeysshouldbelettersonly,nonumbersorspecialcharacters.
##Important!
Considertheexamplesprovided.Doesanydatapreparationneedtobedonetoensurethedataisinthecorrectformat?
Youmustincludeanyinformationaboutdatapreparationinthedescription.
##ExampleOutput
Hereisanexampleofagoodoutput:
'''
{example_output}
'''
##NewData:
Key:{key}
DataType:{type}
ExampleValues:{examples}
##ExistingDataModel
Hereistheexistingdatamodel:
'''
{existing_model}
'''
##KeepExistingDataModel
Applyyourchangestotheexistingdatamodelbutneverremoveanyexistingdefinitions.
""",partial_variables=dict(example_output=dumps(example_output)))
model_chain=model_prompt|llm.with_structured_output(JSONSchemaSpecification)为了迭代更新模型,我遍历数据框中的键,并将每个键、其数据类型及前五个唯一值传递给提示词:
fromjson_repairimportdumps,loads
existing_model={}
fori,keyinenumerate(df):
print("\n",i,key)
print("----------------")
try:
res=try_chain(model_chain,dict(
existing_model=dumps(existing_model),
key=key,
type=df[key].dtype,
examples=dumps(df[key].unique()[:5].tolist())
))
print(res.notes)
existing_model=loads(res.jsonschema)
print([n['title']forninexisting_model])
exceptExceptionase:
print(e)
pass
existing_model控制台输出:
0track_name
----------------
Adding'track_name'toanexistingdatamodel.Thisrepresentsamusictrackentity.
['Track']
1artist(s)_name
----------------
Addinganewcolumn'artist(s)_name'totheexistingdatamodel.Thiscolumnrepresentsmultipleartistsassociatedwithtracksandshouldbemodeledasanewnode'Artist'andarelationship'PERFORMED_BY'from'Track'to'Artist'.
['Track','Artist','PERFORMED_BY']
2artist_count
----------------
Addedartist_countasapropertyofTracknode.Thispropertyindicatesthenumberofartistsperforminginthetrack.
['Track','Artist','PERFORMED_BY']
3released_year
----------------
Addthereleased_yearcolumntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']
4released_month
----------------
Addingthe'released_month'columntotheexistingdatamodel,consideringitasanattributeoftheTracknode.
['Track','Artist','PERFORMED_BY']
5released_day
----------------
Addedanewproperty'released_day'tothe'Track'nodetocapturethedayofthemonthatrackwasreleased.
['Track','Artist','PERFORMED_BY']
6in_spotify_playlists
----------------
Addingthenewcolumn'in_spotify_playlists'totheexistingdatamodelasapropertyofthe'Track'node.
['Track','Artist','PERFORMED_BY']
7in_spotify_charts
----------------
Addingthe'in_spotify_charts'columntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']
8streams
----------------
Addinganewcolumn'streams'totheexistingdatamodel,representingthenumberofstreamsforatrack.
['Track','Artist','PERFORMED_BY']
9in_apple_playlists
----------------
Addingnewcolumn'in_apple_playlists'totheexistingdatamodel
['Track','Artist','PERFORMED_BY']
10in_apple_charts
----------------
Adding'in_apple_charts'asapropertytothe'Track'node,representingthenumberoftimesthetrackappearedintheApplecharts.
['Track','Artist','PERFORMED_BY']
11in_deezer_playlists
----------------
Add'in_deezer_playlists'totheexistingdatamodelforamusictrackdatabase.
['Track','Artist','PERFORMED_BY']
12in_deezer_charts
----------------
Addinganewproperty'inDeezerCharts'totheexisting'Track'nodetorepresentthenumberoftimesthetrackappearedinDeezercharts.
['Track','Artist','PERFORMED_BY']
13in_shazam_charts
----------------
Addingnewdata'in_shazam_charts'totheexistingdatamodel.Thisappearstobeanattributeofthe'Track'node,indicatingthenumberoftimesatrackappearedintheShazamcharts.
['Track','Artist','PERFORMED_BY']
14bpm
----------------
AddedbpmcolumnasapropertytotheTracknodeasitrepresentsacharacteristicofthetrack.
['Track','Artist','PERFORMED_BY']
15key
----------------
Addingthe'key'columntotheexistingdatamodel.The'key'representsthemusicalkeyofatrack,whichisasharedattributethatcanbeinterestingtoquerythroughtofindsimilartracks.
['Track','Artist','PERFORMED_BY']
16mode
----------------
Adding'mode'totheexistingdatamodel.Itrepresentsamusicalcharacteristicofatrack,whichisbestcapturedasanattributeoftheTracknode.
['Track','Artist','PERFORMED_BY']
17danceability_%
----------------
Added'danceability_%'totheexistingdatamodelasapropertyoftheTracknode.Thefieldrepresentsthedanceabilitypercentageofthetrack.
['Track','Artist','PERFORMED_BY']
18valence_%
----------------
AddingthevalencepercentagecolumntotheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']
19energy_%
----------------
Integrationofthenewcolumn'energy_%'intotheexistingdatamodel.ThiscolumnrepresentsanattributeoftheTrackentityandshouldbeaddedasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']
20acousticness_%
----------------
Addingacousticness_%totheexistingdatamodelasapropertyoftheTracknode.
['Track','Artist','PERFORMED_BY']
21instrumentalness_%
----------------
Addingthenewcolumn'instrumentalness_%'totheexistingTracknodeasitrepresentsanattributeoftheTrackentity.
['Track','Artist','PERFORMED_BY']
22liveness_%
----------------
Addingthenewcolumn'liveness_%'totheexistingdatamodelasanattributeoftheTracknode
['Track','Artist','PERFORMED_BY']
23speechiness_%
----------------
Addingthenewcolumn'speechiness_%'totheexistingdatamodelasapropertyofthe'Track'node.
['Track','Artist','PERFORMED_BY']
24cover_url
----------------
Addinganewproperty'cover_url'totheexisting'Track'node.ThispropertyrepresentstheURLofthetrack'scoverimage.
['Track','Artist','PERFORMED_BY']在对提示词进行了一些调整以处理用例后,我得到了一个我相当满意的模型。LLM 成功确定了数据集由 Track、Artist 以及连接两者的 PERFORMED_BY 关系组成:
[
{
"title":"Track",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"track_name",
"type":"string",
"description":"Thenameofthetrack",
"examples":[
"Seven(feat.Latto)(ExplicitVer.)",
"LALA",
"vampire",
"CruelSummer",
"WHERESHEGOES",
],
},
{
"name":"artist_count",
"column_name":"artist_count",
"type":"integer",
"description":"Thenumberofartistsperforminginthetrack",
"examples":[2,1,3,8,4],
},
{
"name":"released_year",
"column_name":"released_year",
"type":"integer",
"description":"Theyearthetrackwasreleased",
"examples":[2023,2019,2022,2013,2014],
},
{
"name":"released_month",
"column_name":"released_month",
"type":"integer",
"description":"Themonththetrackwasreleased",
"examples":[7,3,6,8,5],
},
{
"name":"released_day",
"column_name":"released_day",
"type":"integer",
"description":"Thedayofthemonththetrackwasreleased",
"examples":[14,23,30,18,1],
},
{
"name":"inSpotifyPlaylists",
"column_name":"in_spotify_playlists",
"type":"integer",
"description":"ThenumberofSpotifyplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[553,1474,1397,7858,3133],
},
{
"name":"inSpotifyCharts",
"column_name":"in_spotify_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheSpotifycharts.Castthevalueasaninteger.",
"examples":[147,48,113,100,50],
},
{
"name":"streams",
"column_name":"streams",
"type":"array",
"description":"ThelistofstreamIDsforthetrack.Maintainthearrayformat.",
"examples":[
"141381703",
"133716286",
"140003974",
"800840817",
"303236322",
],
},
{
"name":"inApplePlaylists",
"column_name":"in_apple_playlists",
"type":"integer",
"description":"ThenumberofAppleplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[43,48,94,116,84],
},
{
"name":"inAppleCharts",
"column_name":"in_apple_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheApplecharts.Castthevalueasaninteger.",
"examples":[263,126,207,133,213],
},
{
"name":"inDeezerPlaylists",
"column_name":"in_deezer_playlists",
"type":"array",
"description":"ThelistofDeezerplaylistIDsthetrackisin.Maintainthearrayformat.",
"examples":["45","58","91","125","87"],
},
{
"name":"inDeezerCharts",
"column_name":"in_deezer_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheDeezercharts.Castthevalueasaninteger.",
"examples":[10,14,12,15,17],
},
{
"name":"inShazamCharts",
"column_name":"in_shazam_charts",
"type":"array",
"description":"ThelistofShazamchartIDsthetrackisin.Maintainthearrayformat.",
"examples":["826","382","949","548","425"],
},
{
"name":"bpm",
"column_name":"bpm",
"type":"integer",
"description":"Thebeatsperminuteofthetrack.Castthevalueasaninteger.",
"examples":[125,92,138,170,144],
},
{
"name":"key",
"column_name":"key",
"type":"string",
"description":"Themusicalkeyofthetrack.Castthevalueasastring.",
"examples":["B","C#","F","A","D"],
},
{
"name":"mode",
"column_name":"mode",
"type":"string",
"description":"Themodeofthetrack(e.g.,Major,Minor).Castthevalueasastring.",
"examples":["Major","Minor"],
},
{
"name":"danceability",
"column_name":"danceability_%",
"type":"integer",
"description":"Thedanceabilitypercentageofthetrack.Castthevalueasaninteger.",
"examples":[80,71,51,55,65],
},
{
"name":"valence",
"column_name":"valence_%",
"type":"integer",
"description":"Thevalencepercentageofthetrack.Castthevalueasaninteger.",
"examples":[89,61,32,58,23],
},
{
"name":"energy",
"column_name":"energy_%",
"type":"integer",
"description":"Theenergypercentageofthetrack.Castthevalueasaninteger.",
"examples":[83,74,53,72,80],
},
{
"name":"acousticness",
"column_name":"acousticness_%",
"type":"integer",
"description":"Theacousticnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[31,7,17,11,14],
},
{
"name":"instrumentalness",
"column_name":"instrumentalness_%",
"type":"integer",
"description":"Theinstrumentalnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[0,63,17,2,19],
},
{
"name":"liveness",
"column_name":"liveness_%",
"type":"integer",
"description":"Thelivenesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[8,10,31,11,28],
},
{
"name":"speechiness",
"column_name":"speechiness_%",
"type":"integer",
"description":"Thespeechinesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[4,6,15,24,3],
},
{
"name":"coverUrl",
"column_name":"cover_url",
"type":"string",
"description":"TheURLofthetrack'scoverimage.Ifthevalueis'NotFound',itshouldbecastasanemptystring.",
"examples":[
"https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05",
"https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d",
],
},
],
},
{
"title":"Artist",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"artist(s)_name",
"type":"string",
"description":"Thenameoftheartist.Splitvaluesincolumnbyacomma",
"examples":[
"Latto",
"JungKook",
"MykeTowers",
"OliviaRodrigo",
"TaylorSwift",
"BadBunny",
],
}
],
},
{
"title":"PERFORMED_BY",
"type":"object",
"description":"Relationship",
"properties":[
{
"name":"_from",
"type":"string",
"description":"Thelabelofthenodethisrelationshipstartsat",
"examples":["Track"],
},
{
"name":"_to",
"type":"string",
"description":"Thelabelofthenodethisrelationshipendsat",
"examples":["Artist"],
},
],
},
]
[
{
"title":"Track",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"track_name",
"type":"string",
"description":"Thenameofthetrack",
"examples":[
"Seven(feat.Latto)(ExplicitVer.)",
"LALA",
"vampire",
"CruelSummer",
"WHERESHEGOES",
],
},
{
"name":"artist_count",
"column_name":"artist_count",
"type":"integer",
"description":"Thenumberofartistsperforminginthetrack",
"examples":[2,1,3,8,4],
},
{
"name":"released_year",
"column_name":"released_year",
"type":"integer",
"description":"Theyearthetrackwasreleased",
"examples":[2023,2019,2022,2013,2014],
},
{
"name":"released_month",
"column_name":"released_month",
"type":"integer",
"description":"Themonththetrackwasreleased",
"examples":[7,3,6,8,5],
},
{
"name":"released_day",
"column_name":"released_day",
"type":"integer",
"description":"Thedayofthemonththetrackwasreleased",
"examples":[14,23,30,18,1],
},
{
"name":"inSpotifyPlaylists",
"column_name":"in_spotify_playlists",
"type":"integer",
"description":"ThenumberofSpotifyplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[553,1474,1397,7858,3133],
},
{
"name":"inSpotifyCharts",
"column_name":"in_spotify_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheSpotifycharts.Castthevalueasaninteger.",
"examples":[147,48,113,100,50],
},
{
"name":"streams",
"column_name":"streams",
"type":"array",
"description":"ThelistofstreamIDsforthetrack.Maintainthearrayformat.",
"examples":[
"141381703",
"133716286",
"140003974",
"800840817",
"303236322",
],
},
{
"name":"inApplePlaylists",
"column_name":"in_apple_playlists",
"type":"integer",
"description":"ThenumberofAppleplayliststhetrackisin.Castthevalueasaninteger.",
"examples":[43,48,94,116,84],
},
{
"name":"inAppleCharts",
"column_name":"in_apple_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheApplecharts.Castthevalueasaninteger.",
"examples":[263,126,207,133,213],
},
{
"name":"inDeezerPlaylists",
"column_name":"in_deezer_playlists",
"type":"array",
"description":"ThelistofDeezerplaylistIDsthetrackisin.Maintainthearrayformat.",
"examples":["45","58","91","125","87"],
},
{
"name":"inDeezerCharts",
"column_name":"in_deezer_charts",
"type":"integer",
"description":"ThenumberoftimesthetrackappearedintheDeezercharts.Castthevalueasaninteger.",
"examples":[10,14,12,15,17],
},
{
"name":"inShazamCharts",
"column_name":"in_shazam_charts",
"type":"array",
"description":"ThelistofShazamchartIDsthetrackisin.Maintainthearrayformat.",
"examples":["826","382","949","548","425"],
},
{
"name":"bpm",
"column_name":"bpm",
"type":"integer",
"description":"Thebeatsperminuteofthetrack.Castthevalueasaninteger.",
"examples":[125,92,138,170,144],
},
{
"name":"key",
"column_name":"key",
"type":"string",
"description":"Themusicalkeyofthetrack.Castthevalueasastring.",
"examples":["B","C#","F","A","D"],
},
{
"name":"mode",
"column_name":"mode",
"type":"string",
"description":"Themodeofthetrack(e.g.,Major,Minor).Castthevalueasastring.",
"examples":["Major","Minor"],
},
{
"name":"danceability",
"column_name":"danceability_%",
"type":"integer",
"description":"Thedanceabilitypercentageofthetrack.Castthevalueasaninteger.",
"examples":[80,71,51,55,65],
},
{
"name":"valence",
"column_name":"valence_%",
"type":"integer",
"description":"Thevalencepercentageofthetrack.Castthevalueasaninteger.",
"examples":[89,61,32,58,23],
},
{
"name":"energy",
"column_name":"energy_%",
"type":"integer",
"description":"Theenergypercentageofthetrack.Castthevalueasaninteger.",
"examples":[83,74,53,72,80],
},
{
"name":"acousticness",
"column_name":"acousticness_%",
"type":"integer",
"description":"Theacousticnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[31,7,17,11,14],
},
{
"name":"instrumentalness",
"column_name":"instrumentalness_%",
"type":"integer",
"description":"Theinstrumentalnesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[0,63,17,2,19],
},
{
"name":"liveness",
"column_name":"liveness_%",
"type":"integer",
"description":"Thelivenesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[8,10,31,11,28],
},
{
"name":"speechiness",
"column_name":"speechiness_%",
"type":"integer",
"description":"Thespeechinesspercentageofthetrack.Castthevalueasaninteger.",
"examples":[4,6,15,24,3],
},
{
"name":"coverUrl",
"column_name":"cover_url",
"type":"string",
"description":"TheURLofthetrack'scoverimage.Ifthevalueis'NotFound',itshouldbecastasanemptystring.",
"examples":[
"https://i.scdn.co/image/ab67616d0000b2730656d5ce813ca3cc4b677e05",
"https://i.scdn.co/image/ab67616d0000b273e85259a1cae29a8d91f2093d",
],
},
],
},
{
"title":"Artist",
"type":"object",
"description":"Node",
"properties":[
{
"name":"name",
"column_name":"artist(s)_name",
"type":"string",
"description":"Thenameoftheartist.Splitvaluesincolumnbyacomma",
"examples":[
"Latto",
"JungKook",
"MykeTowers",
"OliviaRodrigo",
"TaylorSwift",
"BadBunny",
],
}
],
},
{
"title":"PERFORMED_BY",
"type":"object",
"description":"Relationship",
"properties":[
{
"name":"_from",
"type":"string",
"description":"Thelabelofthenodethisrelationshipstartsat",
"examples":["Track"],
},
{
"name":"_to",
"type":"string",
"description":"Thelabelofthenodethisrelationshipendsat",
"examples":["Artist"],
},
],
},
]我注意到模式中没有包含任何唯一标识符,这在导入关系时可能会成为问题。很显然,不同的艺术家可能发布同名歌曲,两个艺术家也可能拥有相同的名字。
因此,为曲目创建一个标识符,以便在更大的数据集中进行区分,这一点非常重要:
#Addprimarykey/uniqueidentifiers
uid_prompt=PromptTemplate.from_template("""
Youareagraphdatabaseexpertreviewingasingleentityfromadatamodelgeneratedbyacolleague.
Youwanttoensurethatallofthenodesimportedintothedatabaseareunique.
##Example
AschemacontainsActorswithanumberofpropertiesincludingname,dateofbirth.
Twoactorsmayhavethesamenamethenaddanewcompoundpropertycombiningthenameanddateofbirth.
Ifcombiningvalues,includetheinstructiontoconvertthevaluetoslugcase.Callthenewproperty'id'.
Ifyouhaveidentifiedanewproperty,addittothelistofpropertiesleavingtherestintact.
Includeinthedescriptionthefieldsthataretobeconcatenated.
##ExampleOutput
Hereisanexampleofagoodoutput:
'''
{example_output}
'''
##CurrentEntitySchema
'''
{entity}
'''
""",partial_variables=dict(example_output=dumps(example_output)))
uid_chain=uid_prompt|llm.with_structured_output(JSONSchemaSpecification)这一步实际上只需要对节点执行,因此我从模式中提取了节点,为每个节点运行链式操作,然后将关系与更新后的定义结合起来:
#extractnodesandrelationships
nodes=[nforninexisting_modelif"node"inn["description"].lower()]
rels=[nforninexisting_modelif"node"notinn["description"].lower()]
#generateauniqueidfornodes
with_uids=[]
forentityinnodes:
res=uid_chain.invoke(dict(entity=dumps(entity)))
json=loads(res.jsonschema)
with_uids=with_uids+jsoniftype(json)==listelsewith_uids+[json]
#combinenodesandrelationships
with_uids=with_uids+rels为了确保模型的合理性,值得检查模型是否可以进行优化。model_prompt 在识别名词和动词方面做得很好,但在更复杂的模型中。
某次迭代将 *_playlists 和 _charts 列视为 ID,并尝试创建 Stream 节点和 IN_PLAYLIST 关系。我猜测这可能是因为超过 1000 的数字使用了逗号格式(例如 1,001)。
这是个不错的想法,但可能有点过于聪明了。不过这表明了让一个了解数据结构的人参与其中的重要性。
#Addprimarykey/uniqueidentifiers
review_prompt=PromptTemplate.from_template("""
Youareagraphdatabaseexpertreviewingadatamodelgeneratedbyacolleague.
Yourtaskistoreviewthedatamodelandensurethatitisfitforpurpose.
Checkfor:
##Checkfornestedobjects
RememberthatNeo4jcannotstorearraysofobjectsornestedobjects.
Thesemustbeconvertedintointoseparatenodeswithrelationshipsbetweenthem.
Youmustincludethenewnodeandareferencetotherelationshiptotheoutputschema.
##CheckforEntitiesinproperties
IfthereisapropertythatrepresentsanarrayofIDs,anewnodeshouldbecreatedforthatentity.
Youmustincludethenewnodeandareferencetotherelationshiptotheoutputschema.
#KeepInstructions
Ensurethattheinstructionsforthenodes,relationships,andpropertiesareclearandconcise.
Youmayimprovethembutthedetailmustnotberemovedinanycircumstances.
##CurrentEntitySchema
'''
{entity}
'''
""")
review_chain=review_prompt|llm.with_structured_output(JSONSchemaSpecification)
review_nodes=[nforninwith_uidsif"node"inn["description"].lower()]
review_rels=[nforninwith_uidsif"node"notinn["description"].lower()]
reviewed=[]
forentityinreview_nodes:
res=review_chain.invoke(dict(entity=dumps(entity)))
json=loads(res.jsonschema)
reviewed=reviewed+json
#addrelationshipsbackin
reviewed=reviewed+review_rels
len(reviewed)
reviewed=with_uids在现实场景中,我会希望多次运行此过程,以迭代改进数据模型。我会设置一个最大限制,然后迭代到该点或数据模型对象不再变化为止。
此时,模式应该足够健壮,并包含尽可能多的信息,以允许 LLM 生成一组导入脚本。
根据 Neo4j 数据导入建议,文件应多次处理,每次只导入一个节点或关系,以避免急切操作和锁定。
import_prompt=PromptTemplate.from_template("""
Basedonthedatamodel,writeaCypherstatementtoimportthefollowingdatafromaCSVfileintoNeo4j.
DonotuseLOADCSVasthisdatawillbeimportedusingtheNeo4jPythonDriver,useUNWINDonthe$rowsparameterinstead.
Youarewritingamulti-passimportprocess,soconcentrateontheentitymentioned.
Whenimportingdata,youmustusethefollowingguidelines:
*followtheinstructionsinthedescriptionwhenidentifyingprimarykeys.
*Usetheinstructionsinthedescriptiontodeterminetheformatofpropertieswhenafinding.
*WhencombiningfieldsintoanID,usetheapoc.text.slugfunctiontoconvertanytexttoslugcaseandtoLowertoconvertthestringtolowercase-apoc.text.slug(toLower(row.`name`))
*Ifyousplitaproperty,convertittoastringandusethetrimfunctiontoremoveanywhitespace-trim(toString(row.`name`))
*Whencombiningproperties,wrapeachpropertyinthecoalescefunctionsothepropertyisnotnullifoneofthevaluesisnotset-coalesce(row.`id`,'')+'--'+coalsece(row.`title`)
*Usethe`column_name`fieldtomaptheCSVcolumntothepropertyinthedatamodel.
*WrapallcolumnnamesfromtheCSVinbackticks-forexamplerow.`column_name`.
*Whenyoumergenodes,mergeontheuniqueidentifierandnothingelse.Allotherpropertiesshouldbesetusing`SET`.
*Donotuseapoc.periodic.iterate,thefileswillbebatchedintheapplication.
DataModel:
'''
{data_model}
'''
CurrentEntity:
'''
{entity}
'''
""")此链式操作需要与前几个步骤不同的输出对象。在这种情况下,cypher 成员最为重要,但我还想包括一个 chain_of_thought 键,以鼓励思维过程:
classCypherOutputSpecification(BaseModel):
chain_of_thought:str=Field(description="AnyreasoningusedtowritetheCypherstatement")
cypher:str=Field(description="TheCypherstatementtoimportthedata")
notes:Optional[str]=Field(description="AnynotesorclosingremarksabouttheCypherstatement")
import_chain=import_prompt|llm.with_structured_output(CypherOutputSpecification)然后同样的过程适用于遍历每个已审查的定义并生成 Cypher:
import_cypher=[]
forninreviewed:
print('\n\n------',n['title'])
res=import_chain.invoke(dict(
data_model=dumps(reviewed),
entity=n
))
import_cypher.append((
res.cypher
))
print(res.cypher)
控制台输出:
------Track
UNWIND$rowsASrow
MERGE(t:Track{id:apoc.text.slug(toLower(coalesce(row.`track_name`,'')+'-'+coalesce(row.`released_year`,'')))})
SETt.name=trim(toString(row.`track_name`)),
t.artist_count=toInteger(row.`artist_count`),
t.released_year=toInteger(row.`released_year`),
t.released_month=toInteger(row.`released_month`),
t.released_day=toInteger(row.`released_day`),
t.inSpotifyPlaylists=toInteger(row.`in_spotify_playlists`),
t.inSpotifyCharts=toInteger(row.`in_spotify_charts`),
t.streams=row.`streams`,
t.inApplePlaylists=toInteger(row.`in_apple_playlists`),
t.inAppleCharts=toInteger(row.`in_apple_charts`),
t.inDeezerPlaylists=row.`in_deezer_playlists`,
t.inDeezerCharts=toInteger(row.`in_deezer_charts`),
t.inShazamCharts=row.`in_shazam_charts`,
t.bpm=toInteger(row.`bpm`),
t.key=trim(toString(row.`key`)),
t.mode=trim(toString(row.`mode`)),
t.danceability=toInteger(row.`danceability_%`),
t.valence=toInteger(row.`valence_%`),
t.energy=toInteger(row.`energy_%`),
t.acousticness=toInteger(row.`acousticness_%`),
t.instrumentalness=toInteger(row.`instrumentalness_%`),
t.liveness=toInteger(row.`liveness_%`),
t.speechiness=toInteger(row.`speechiness_%`),
t.coverUrl=CASErow.`cover_url`WHEN'NotFound'THEN''ELSEtrim(toString(row.`cover_url`))END
------Artist
UNWIND$rowsASrow
WITHrow,split(row.`artist(s)_name`,',')ASartistNames
UNWINDartistNamesASartistName
MERGE(a:Artist{id:apoc.text.slug(toLower(trim(artistName)))})
SETa.name=trim(artistName)
------PERFORMED_BY
UNWIND$rowsASrow
UNWINDsplit(row.`artist(s)_name`,',')ASartist_name
MERGE(t:Track{id:apoc.text.slug(toLower(row.`track_name`))+'-'+trim(toString(row.`released_year`))})
MERGE(a:Artist{id:apoc.text.slug(toLower(trim(artist_name)))})
MERGE(t)-[
ERFORMED_BY]->(a)这个提示词经过了一些工程调整才获得一致的结果:
•有时 Cypher 会包含定义了多个字段的 MERGE 语句,这在最佳情况下也是次优的。如果任何列为空,整个导入将失败。
•有时结果会包含 apoc.period.iterate,但我不再需要它,并且我希望能够使用 Python 驱动程序执行代码。
•我不得不反复强调,在创建关系时应使用指定的列名。
•当使用关系两端节点的唯一标识符时,LLM 就不会遵循指示,因此我多次尝试让它遵循描述中的指示。这个提示词和 model_prompt 之间有些往返。
•对于包含特殊字符的列名(如 energy_%),需要使用反引号。
将此分为两个提示词——一个用于节点,一个用于关系——也会有所帮助。但这是留待另一天的任务。
接下来,导入脚本可以作为在数据库中创建唯一约束的基础:
constraint_prompt=PromptTemplate.from_template("""
Youareanexpertgraphdatabaseadministrator.
UsethefollowingCypherstatementtowriteaCypherstatementto
createuniqueconstraintsonanypropertiesusedinaMERGEstatement.
Thecorrectsyntaxforauniqueconstraintis:
CREATECONSTRAINTmovie_title_idIFNOTEXISTSFOR(m:Movie)REQUIREm.titleISUNIQUE;
Cypher:
'''
{cypher}
'''
""")
constraint_chain=constraint_prompt|llm.with_structured_output(CypherOutputSpecification)
constraint_queries=[]
forstatementinimport_cypher:
res=constraint_chain.invoke(dict(cypher=statement))
statements=res.cypher.split(";")
forcypherinstatements:
constraint_queries.append(cypher)控制台输出:
CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE
CREATECONSTRAINTstream_idIFNOTEXISTSFOR(s:Stream)REQUIREs.idISUNIQUE
CREATECONSTRAINTplaylist_idIFNOTEXISTSFOR(p
laylist)REQUIREp.idISUNIQUE
CREATECONSTRAINTchart_idIFNOTEXISTSFOR(c:Chart)REQUIREc.idISUNIQUE
CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE
CREATECONSTRAINTstream_id_uniqueIFNOTEXISTSFOR(s:Stream)REQUIREs.idISUNIQUE
CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(t:Track)REQUIREt.idISUNIQUE
CREATECONSTRAINTplaylist_id_uniqueIFNOTEXISTSFOR(p
laylist)REQUIREp.idISUNIQUE
CREATECONSTRAINTtrack_id_uniqueIFNOTEXISTSFOR(track:Track)REQUIREtrack.idISUNIQUE
CREATECONSTRAINTchart_id_uniqueIFNOTEXISTSFOR(chart:Chart)REQUIREchart.idISUNIQUE有时这个提示词会返回索引和约束的语句,因此需要用分号来分割。
一切就绪后,便是执行 Cypher 语句的时间:
fromosimportgetenv
fromneo4jimportGraphDatabase
driver=GraphDatabase.driver(
getenv("NEO4J_URI"),
auth=(
getenv("NEO4J_USERNAME"),
getenv("NEO4J_PASSWORD")
)
)
withdriver.session()assession:
#truncatethedb
session.run("MATCH(n)DETACHDELETEn")
#createconstraints
forqinconstraint_queries:
ifq.strip()!="":
session.run(q)
#importthedata
forqinimport_cypher:
ifq.strip()!="":
res=session.run(q,rows=rows).consume()
print(q)
print(res.counters)如果没有对数据集进行一些质量保证,这篇文章就不完整了,使用 GraphCypherQAChain:
fromlangchain.chainsimportGraphCypherQAChain
fromlangchain_community.graphsimportNeo4jGraph
graph=Neo4jGraph(
url=getenv("NEO4J_URI"),
username=getenv("NEO4J_USERNAME"),
password=getenv("NEO4J_PASSWORD"),
enhanced_schema=True
)
qa=GraphCypherQAChain.from_llm(
llm,
graph=graph,
allow_dangerous_requests=True,
verbose=True
)
数据库中最受欢迎的艺术家是谁?
qa.invoke({"query":"Whoarethemostpopularartists?"})
>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(:Track)-[
ERFORMED_BY]->(a:Artist)
RETURNa.name,COUNT(*)ASpopularity
ORDERBYpopularityDESC
LIMIT10
FullContext:
[{'a.name':'BadBunny','popularity':40},{'a.name':'TaylorSwift','popularity':38},{'a.name':'TheWeeknd','popularity':36},{'a.name':'SZA','popularity':23},{'a.name':'KendrickLamar','popularity':23},{'a.name':'Feid','popularity':21},{'a.name':'Drake','popularity':19},{'a.name':'HarryStyles','popularity':17},{'a.name':'PesoPluma','popularity':16},{'a.name':'21Savage','popularity':14}]
>Finishedchain.
{
"query":"Whoarethemostpopularartists?",
"result":"BadBunny,TaylorSwift,andTheWeekndarethemostpopularartists."
}LLM 似乎是根据艺术家参与的曲目数量来判断受欢迎程度的,而不是根据他们的总播放量。
哪首曲子的 BPM 最高?
qa.invoke({"query":"WhichtrackhasthehighestBPM?"})
>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(t:Track)
RETURNt
ORDERBYt.bpmDESC
LIMIT1
FullContext:
[{'t':{'id':'seven-feat-latto-explicit-ver--2023'}}]
>Finishedchain.
{
"query":"WhichtrackhasthehighestBPM?",
"result":"Idon'tknowtheanswer."
}在这种情况下,Cypher 看起来不错,并且提示中包含了正确的结果,但 gpt-4o 无法解释答案。看来传递给 GraphCypherQAChain 的 CYPHER_GENERATION_PROMPT 需要添加额外的说明,以使列名更详细。
始终在 Cypher 语句中使用详细的列名,使用标签和属性名。例如,使用‘person_name’而不是‘name’。
带有自定义提示词的 GraphCypherQAChain:
CYPHER_GENERATION_TEMPLATE="""Task:GenerateCypherstatementtoqueryagraphdatabase.
Instructions:
Useonlytheprovidedrelationshiptypesandpropertiesintheschema.
Donotuseanyotherrelationshiptypesorpropertiesthatarenotprovided.
Schema:
{schema}
Note
onotincludeanyexplanationsorapologiesinyourresponses.
DonotrespondtoanyquestionsthatmightaskanythingelsethanforyoutoconstructaCypherstatement.
DonotincludeanytextexceptthegeneratedCypherstatement.
AlwaysuseverbosecolumnnamesintheCypherstatementusingthelabelandpropertynames.Forexample,use'person_name'insteadof'name'.
Includedatafromtheimmediatenetworkaroundthenodeintheresulttoprovideextracontext.Forexample,includetheMoviereleaseyear,alistofactorsandtheirroles,orthedirectorofamovie.
Whenorderingbyaproperty,addan`ISNOTNULL`checktoensurethatonlynodeswiththatpropertyarereturned.
Examples:HereareafewexamplesofgeneratedCypherstatementsforparticularquestions:
#HowmanypeopleactedinTopGun?
MATCH(m:Movie{{name:"TopGun"}})
RETURNCOUNT{(m)<-[:ACTED_IN]-()}ASnumberOfActors
Thequestionis:
{question}"""
CYPHER_GENERATION_PROMPT=PromptTemplate(
input_variables=["schema","question"],template=CYPHER_GENERATION_TEMPLATE
)
qa=GraphCypherQAChain.from_llm(
llm,
graph=graph,
allow_dangerous_requests=True,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)图在返回按类型和方向计数的关系时表现出色。
qa.invoke({"query":"Whichtracksareperformedbythemostartists?"})
>EnteringnewGraphCypherQAChainchain...
GeneratedCypher:
cypher
MATCH(t:Track)
WITHt,COUNT{(t)-[
ERFORMED_BY]->(:Artist)}asartist_count
WHEREartist_countISNOTNULL
RETURNt.idAStrack_id,t.nameAStrack_name,artist_count
ORDERBYartist_countDESC
FullContext:
[{'track_id':'los-del-espacio-2023','track_name':'LosdelEspacio','artist_count':8},{'track_id':'se-le-ve-2021','track_name':'SeLeVe','artist_count':8},{'track_id':'we-don-t-talk-about-bruno-2021','track_name':"WeDon'tTalkAboutBruno",'artist_count':7},{'track_id':'cayï-ï-la-noche-feat-cruz-cafunï-ï-abhir-hathi-bejo-el-ima--2022','track_name':None,'artist_count':6},{'track_id':'jhoome-jo-pathaan-2022','track_name':'JhoomeJoPathaan','artist_count':6},{'track_id':'besharam-rang-from-pathaan--2022','track_name':None,'artist_count':6},{'track_id':'nobody-like-u-from-turning-red--2022','track_name':None,'artist_count':6},{'track_id':'ultra-solo-remix-2022','track_name':'ULTRASOLOREMIX','artist_count':5},{'track_id':'angel-pt-1-feat-jimin-of-bts-jvke-muni-long--2023','track_name':None,'artist_count':5},{'track_id':'link-up-metro-boomin-don-toliver-wizkid-feat-beam-toian-spider-verse-remix-spider-man-across-the-spider-verse--2023','track_name':None,'artist_count':5}]
>Finishedchain.
{
"query":"Whichtracksareperformedbythemostartists?",
"result":"Thetracks\"LosdelEspacio\"and\"SeLeVe\"areperformedbythemostartists,witheachtrackhaving8artists."
}CSV 分析和建模是最耗时的部分。生成可能需要超过五分钟。
成本本身相当便宜。在八小时的实验中,我必须发送了数百个请求,最终我花费了大约一美元左右。
在达到这个目标之前,我遇到了许多挑战:
•提示词经过了多次迭代才正确。这一问题可以通过对模型进行微调或提供少样本示例来解决。
•GPT-4o 的 JSON 响应可能不一致。有人向我推荐了 json-repair (https://github.com/mangiucugna/json_repair),它比让 LLM 自己验证 JSON 输出效果更好。
我可以看到这种方法在 LangGraph 实现中运行良好,在该实现中,操作按顺序执行,使大语言模型能够构建和精细化模型。随着新模型的不断发布,这些模型也可能受益于微调。
| 欢迎光临 链载Ai (https://www.lianzai.com/) | Powered by Discuz! X3.5 |