from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter()

text = "..."  # the text to be processed
texts = text_splitter.split_text(text)
for doc in texts:
    print(doc)
(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py
... ...
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked arXiv:1810.04805v2 [cs.CL] 24 May 2019
----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:
• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al.
----------------------------------------------------------------------------------------------------
... ...
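The test script above uses semantic chunking; the next snippet switches to ModelScope's document-segmentation model. The pipeline object `p` is not shown in this excerpt, so here is a minimal sketch of how it is typically constructed (the `damo/nlp_bert_document-segmentation_english-base` model id is an assumption, not taken from this excerpt):

# Hedged sketch: build the ModelScope document-segmentation pipeline `p`
# that the following call uses. The model id is assumed, not shown above.
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_bert_document-segmentation_english-base',
)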
result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations.Unlike Radford et al.(2018),which uses unidirectional language models for pre-training,BERT uses masked language models to enable pretrained deep bidirectional representations.This is also in contrast to Peters et al.(2018a),which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.•We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures.BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks,outperforming many task-specific architectures.Today is a good day')
print(result[OutputKeys.TEXT])
A sentence, "Today is a good day", was appended to the end of the test data, yet in the `result` variable returned by the segmentation pipeline, "Today is a good day" is not split off in any way.
We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day
The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" introduces a new retrieval unit called the proposition. A proposition is defined as an atomic expression in the text (translator's note: a single semantic element that cannot be decomposed further and can be combined into larger semantic units). It is used to retrieve and express a distinct fact or a specific concept in the text, states it concisely, and presents a self-contained concept or fact in natural language, without requiring extra information to interpret it.
So how are these propositions obtained? In the paper, they are generated by constructing a prompt and calling an LLM.
Both LlamaIndex and LangChain implement the corresponding algorithm; the demonstration below uses LlamaIndex.
LlamaIndex's implementation uses the prompt provided in the paper to generate propositions:
PROPOSITIONS_PROMPT = PromptTemplate(
    """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content: The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in 1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were frequently seen in gardens in spring, and thus may have served as a convenient explanation for the origin of the colored eggs hidden there for children. Alternatively, there is a European tradition that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and both occur on grassland and are first seen in the spring. In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe. German immigrants then exported the custom to Britain and America where it evolved into the Easter Bunny."
Output: ["The earliest evidence for the Easter Hare was recorded in south-west Germany in 1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about the possible explanation for the connection between hares and the tradition during Easter", "Hares were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in Britain and America."]

Input: {node_text}
Output:"""
)
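The debugger session below stops inside the pack's __init__ (base.py, line 91), right after the propositions have been generated and wrapped into sub_nodes. To reach that point, the pack is first built from a set of documents. A minimal sketch, assuming the standard DenseXRetrievalPack entry point and a hypothetical local "data" directory:

# Hedged sketch: constructing the pack triggers proposition generation,
# which is where the __init__ breakpoint shown below is reached.
# The "data" directory and the query string are placeholders.
from llama_index.core import SimpleDirectoryReader
from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

documents = SimpleDirectoryReader("data").load_data()
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("What is the main contribution of BERT?")
print(response)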
> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/base.py(91)__init__()
     90
---> 91     all_nodes = nodes + sub_nodes
     92     all_nodes_dict = {n.node_id: n for n in all_nodes}
ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Radford et al.(2018)uses unidirectional language models for pre-training.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='eters et al.(2018a)uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='re-trained representations reduce the need for many heavily-engineered task-specific architectures.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
for output in outputs:
    if not output.strip():
        continue

    # Repair truncated LLM output: add the missing closing/opening
    # quote and bracket so the line can be parsed as a JSON list.
    if not output.strip().endswith("]"):
        if not output.strip().endswith('"') and not output.strip().endswith(","):
            output = output + '"'
        output = output + "]"
    if not output.strip().startswith("["):
        if not output.strip().startswith('"'):
            output = '"' + output
        output = "[" + output

    try:
        propositions = json.loads(output)
    except Exception:
        # fallback to yaml
        try:
            propositions = yaml.safe_load(output)
        except Exception:
            # fallback to next output
            continue
# Flatten list
return [node for sub_node in sub_nodes for node in sub_node]
For each original node, self._aget_proposition is called asynchronously: PROPOSITIONS_PROMPT is sent to the LLM to obtain initial_output, propositions are parsed out of initial_output, and a TextNode is built for each of them. Finally, those TextNodes are linked back to the original node, i.e. via [IndexNode.from_text_node(n, node.node_id) for n in nodes], as sketched below.
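As a reading aid, here is a simplified, hedged sketch of that per-node flow. It follows the pack's naming (PROPOSITIONS_PROMPT, IndexNode.from_text_node) but is not the pack's verbatim code; in particular, the output-repair and YAML fallback logic shown earlier is omitted.

# Hedged, simplified sketch of the per-node proposition flow described above.
import json

from llama_index.core.schema import IndexNode, TextNode


async def aget_propositions(llm, node: TextNode) -> list[IndexNode]:
    # Ask the LLM to decompose the node's text into propositions.
    initial_output = await llm.apredict(PROPOSITIONS_PROMPT, node_text=node.text)

    # Parse each returned line as a JSON list of proposition strings.
    propositions: list[str] = []
    for output in initial_output.split("\n"):
        try:
            parsed = json.loads(output)
        except Exception:
            continue
        if isinstance(parsed, list):
            propositions.extend(parsed)

    # Wrap each proposition in a TextNode, then link it back to the
    # original node via IndexNode.from_text_node(node, index_id).
    nodes = [TextNode(text=prop) for prop in propositions if prop]
    return [IndexNode.from_text_node(n, node.node_id) for n in nodes]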
... the evolution of the technology stack. This flexibility, combined with ease of use, helps teams move AI prototypes into production faster. Flexibility also matters at the architecture level, because different use cases have different requirements. For example, we work with many software companies as well as companies operating in regulated industries. They often need multi-tenancy to isolate data and maintain compliance. When building retrieval-augmented generation (RAG) applications that use account- or user-specific data to contextualize results, the data must stay within the tenant dedicated to that user group. Weaviate's native multi-tenant architecture excels for customers who need to prioritize such requirements.