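In the previous lessons we covered how to load documents and split them into chunks as needed. With a traditional database there is normally no need to split content into many chunks, because any way of splitting loses some information. The main reason we split the imported content at all is the way a vector database works.
So once the text has been divided into individual chunks, the next step is to store them. Storage is essential for managing information: anything that is not stored cannot be retrieved later when we need it. Storing the chunks in a vector database also lays the groundwork for accurate retrieval and efficient use later on.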
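For each chunk, the first step is to compute its embedding. An embedding turns a piece of text into a long vector of numbers, which is a numerical representation of the text. In other words, the text is translated into a numeric form that a machine can understand and operate on. When these vectors are mapped into a high-dimensional space, two passages with similar meaning end up closer together in that space. This closeness is the key property of embeddings: it makes semantic similarity something we can measure.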
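To make "closer in the semantic space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are invented for illustration only; real embeddings have hundreds or thousands of dimensions and come from a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example.
weather_sunny = [0.9, 0.1, 0.0]
weather_clear = [0.8, 0.2, 0.1]
stock_market = [0.0, 0.2, 0.9]

# The two weather vectors point in nearly the same direction,
# so their similarity is much higher than weather vs. stock market.
print(cosine_similarity(weather_sunny, weather_clear))
print(cosine_similarity(weather_sunny, stock_market))
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are unrelated.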
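Once every chunk has been converted into a vector, we place all of the vectors in one high-dimensional space, and that gives us a vector database. Each vector in it represents the semantic features of one chunk. When a user asks a question, the question is first converted into a vector as well. That vector is placed into the same space, and we compute which stored vectors lie closest to it. The closest vectors correspond to the chunks most semantically similar to the question. Using those vectors as an index, we look up the matching text and pass it to the large language model as input.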
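The lookup described above can be sketched as a brute-force nearest-neighbour search. This is only an illustration with made-up two-dimensional vectors and hypothetical chunk labels; real vector stores typically use approximate indexes rather than scanning every vector.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query_vec, store, k=3):
    """Return the texts of the k stored vectors closest to query_vec.

    `store` is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: euclidean_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical store: chunk labels and vectors are invented for this example.
toy_store = [
    ("chunk about computer vision", [0.9, 0.1]),
    ("chunk about object detection", [0.8, 0.3]),
    ("chunk about recurrent networks", [0.1, 0.9]),
    ("chunk about optimization", [0.4, 0.5]),
]

query_vector = [0.85, 0.15]  # stands in for the embedded user question
print(top_k_nearest(query_vector, toy_store, k=2))
```

The returned texts are what would be handed to the language model as context.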
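With this retrieval step, the large language model's answer no longer depends only on what it learned during pretraining; it can also draw on the external information stored in the vector database. As a result, answers can be more specialized and fit the actual use case and requirements more closely. Grounding the model in external data makes its answers more context-aware and improves both their usefulness and their accuracy.
This architecture lets us respond to complex questions with answers based on real data. It matters most when the information changes frequently or when a specific domain demands high precision. By combining the model's internal knowledge with an external database, we can deliver higher-quality, more tailored responses across a wider range of scenarios.
Now let's get hands-on and try out text embedding. The main package versions used are: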
langchain 0.3.0
langchain-community 0.3.0
pypdf 5.0.0
openai 1.47.0
beautifulsoup4 4.12.3
chromadb 0.5.15
Before computing embeddings, we first need to load a document and split it. We again use the content of a web page, and print the number of chunks at the end.
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the web page
loader = WebBaseLoader("https://zh.d2l.ai/")
docs = loader.load()
# Split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_documents(docs)
print(len(splits))
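To show what chunk_size and chunk_overlap mean, here is a deliberately simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class first tries to break on separators such as paragraphs and lines), so treat `split_with_overlap` as a hypothetical helper for illustration only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the overlap that helps preserve context across chunk boundaries.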
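Running the loading and splitting code shows that, with these settings, the page is split into 18 chunks. Next we need to choose an embedding model to turn the chunks into numerical form and store them in the vector space. The best-supported embedding models in LangChain are OpenAI's, but OpenAI's services cannot be reached directly from networks in mainland China, so we turn to the LangChain website to look for other available embedding models.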
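While searching, I found that LangChain supports embedding models from both Baidu Qianfan and Baichuan. Here we use the Baichuan embedding model for the demonstration. The first step is to register a new account on the Baichuan AI (百川智能) website, then follow the getting-started guide to create your own API key.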
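Finally, we need to complete the real-name verification on the site. After that, the account shows 80 yuan of free credit, which is more than enough for the text embedding work that follows.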
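Once the API key exists, one option is to keep it out of the source file by reading it from an environment variable. The sketch below assumes the variable is named BAICHUAN_API_KEY; that name and both helper functions are my own choices for illustration, not something the tutorial prescribes.

```python
import os


def load_api_key(env_var="BAICHUAN_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def mask_key(key, visible=4):
    """Hide all but the first few characters, e.g. for safe logging."""
    if len(key) <= visible:
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - visible)


# Usage sketch (not executed here, since it needs a real key and network access):
# embeddings = BaichuanTextEmbeddings(baichuan_api_key=load_api_key())
```

This keeps the key out of any code that gets shared or published.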
We can use a single short sentence to check that the embedding model runs correctly.
from langchain_community.embeddings import BaichuanTextEmbeddings
embeddings = BaichuanTextEmbeddings(baichuan_api_key="sk-*")  # put the API key you just created here
text_1 = "今天天气不错"  # "The weather is nice today"
query_result = embeddings.embed_query(text_1)
print(query_result)
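If a list of numbers like the following is printed, the model has been configured successfully (the middle of the list is omitted here):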
[-0.0075800973, 0.05336676, -0.017781364, 0.035153043, 0.05473809, ..., -0.047194887, 0.004529896, -0.042122014, -0.01235932, 0.027848661]
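Rather than reading the raw list, a quick sanity check is to look at the vector's length and norm. The helper below is a hypothetical sketch and runs on a toy vector; in practice it would be given the list returned by embed_query.

```python
import math


def describe_embedding(vector):
    """Return the dimension and L2 norm of a list of numbers."""
    if not vector:
        raise ValueError("embedding is empty")
    if not all(isinstance(x, (int, float)) for x in vector):
        raise TypeError("embedding must contain only numbers")
    norm = math.sqrt(sum(x * x for x in vector))
    return {"dimension": len(vector), "l2_norm": norm}


# Toy vector for illustration; replace with the real query_result.
toy_vector = [3.0, 4.0]
print(describe_embedding(toy_vector))
```

The same dimension for every text is what lets all the vectors live in one shared space.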
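With the embedding model ready, the remaining question is which vector database to use for storage. According to the official LangChain documentation, the four most commonly supported vector databases are Chroma, Pinecone, FAISS and Lance. Here I use Chroma as the example to show how it works:
First, install the Chroma integration package as the official documentation instructs.
pip install langchain-chroma
Once it is installed, we can choose where the vector database will be stored and then create it.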
from langchain_chroma import Chroma
# Directory where the database files are stored
persist_directory = r'D:\langchain'
# Create the vector database
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_directory
)
print(vectordb._collection.count())
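If this also prints 18, the vector database has been created. In the storage directory you will now find a file named chroma.sqlite3 and a folder holding the data. Together these are the vector database we just set up.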
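One thing to watch for: because the database is persisted on disk, running the creation code a second time will likely add the same chunks again, so the count would no longer be 18. A possible way around this is to give each chunk a stable ID. The sketch below builds such IDs from the chunk position and content; `make_chunk_ids` is a hypothetical helper, and the commented usage assumes the store overwrites entries whose IDs already exist.

```python
import hashlib


def make_chunk_ids(texts):
    """Build one stable ID per chunk from its position and content."""
    ids = []
    for index, text in enumerate(texts):
        # Including the index keeps IDs unique even if two chunks are identical.
        digest = hashlib.sha256(f"{index}:{text}".encode("utf-8")).hexdigest()
        ids.append(digest[:32])
    return ids


# Usage sketch (not executed here; assumes `splits`, `embeddings` and
# `persist_directory` are defined as in the tutorial code):
# ids = make_chunk_ids([doc.page_content for doc in splits])
# vectordb = Chroma.from_documents(
#     documents=splits,
#     embedding=embeddings,
#     ids=ids,
#     persist_directory=persist_directory,
# )
```

The same input always produces the same IDs, so repeated runs refer to the same entries instead of creating new ones.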
We can now try searching the database we built and see whether it finds the most relevant chunks.
# Retrieval
question = "图像识别"  # "image recognition"
docs = vectordb.similarity_search(question, k=3)
print(len(docs))
print(docs[0].page_content)
The retrieved content includes the computer vision chapter (计算机视觉), so the search did find reasonably relevant material:
13.计算机视觉
13.1.图像增广
13.2.微调
13.3.目标检测和边界框
13.4.锚框
13.5.多尺度目标检测
13.6.目标检测数据集
13.7.单发多框检测(SSD)
13.8.区域卷积神经网络(R-CNN)系列
13.9.语义分割和数据集
13.10.转置卷积
13.11.全卷积网络
13.12.风格迁移
13.13. 实战 Kaggle 比赛:图像分类(CIFAR-10)
13.14. 实战Kaggle比赛:狗的品种识别(ImageNet Dogs)
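The tutorial prints only the first result. To glance at all k results at once, a small formatting helper can be handy. The sketch below works on any objects that expose `page_content` and `metadata`, which is how LangChain documents are shaped; the stand-in objects and their text are invented for the example.

```python
from types import SimpleNamespace


def preview_results(docs, width=60):
    """Return one short line per result: rank, source, start of the text."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        # Collapse newlines and repeated spaces, then truncate.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"{rank}. [{source}] {snippet}")
    return lines


# Stand-in objects for illustration; real code would pass the list
# returned by similarity_search.
fake_docs = [
    SimpleNamespace(page_content="first chunk\nwith a line break",
                    metadata={"source": "https://zh.d2l.ai/"}),
    SimpleNamespace(page_content="second chunk", metadata={}),
]

for line in preview_results(fake_docs):
    print(line)
```

Seeing all results side by side makes it easier to judge whether the ranking looks sensible.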
The complete code for this lesson is shown below:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BaichuanTextEmbeddings
from langchain_chroma import Chroma
# Load the web page
loader = WebBaseLoader("https://zh.d2l.ai/")
docs = loader.load()
# Split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_documents(docs)
print(len(splits))
# Embedding model
embeddings = BaichuanTextEmbeddings(baichuan_api_key="sk-*")  # put your own API key here
# Quick test of the embedding model
# text_1 = "今天天气不错"  # "The weather is nice today"
# query_result = embeddings.embed_query(text_1)
# print(query_result)
# Directory where the database files are stored
persist_directory = r'D:\langchain'
# Create the vector database
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_directory
)
print(vectordb._collection.count())
# Retrieval
question = "图像识别"  # "image recognition"
docs = vectordb.similarity_search(question, k=3)
print(len(docs))
print(docs[0].page_content)
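That covers the main content of this lesson. We looked at the concepts of the vector database (VectorStore) and embeddings, and built a vector database by combining LangChain with the Baichuan API. This lets us manage semantic information efficiently and gives solid support to the LLM-based dialogue systems we build next.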