{ "structure": [ { "nodes": [ { "title": "Abstract", "node_id": "0001", "summary": "This text discusses the increasing importance of fine-tuning large language models (LLMs) for human intent alignment, highlighting the need for efficient resource utilization. It contrasts Reinforcement Learning from Human or AI Preferences (RLHF/RLAIF), which is complex and unstable, with Direct Preference Optimization (DPO), a simpler alternative. The work introduces an active learning strategy for DPO, proposing an acquisition function that uses predictive entropy and the certainty of the implicit preference model to improve the efficiency and effectiveness of fine-tuning with pairwise preference data.", "end_index": 1, "start_index": 1 }, { "nodes": [ { "title": "3.1. Acquisition functions", "node_id": "0005", "summary": "### 3.1. Acquisition functions\n\nIn selecting scoring methods (step 8 in 1) we aim for options that are straightforward to implement and do not require modifications to the model architectures or the fine-tuning procedure itself. This allows for a drop in addition to existing implementations. As a result, we propose using the predictive entropy of $p_{\\theta_t}(y|x)$ as well as a measure of certainty under the Bradley-Terry preference model, which leverages the implicit reward model in DPO.\n", "end_index": 4, "start_index": 3 } ], "title": "3 Active Preference Learning", "node_id": "0004", "summary": "This text introduces Active Preference Learning (APL), a machine learning paradigm for efficiently selecting the most informative data points during training, specifically within a pool-based active learning setting. The APL training procedure involves iteratively sampling prompts, generating pairs of completions using the current model, ranking these pairs with an acquisition function, selecting the highest-ranked pairs for preference labeling by an oracle, and then fine-tuning the model with these labeled preferences. This approach augments the standard DPO fine-tuning loop with an outer data acquisition loop, where the number of acquisition steps is determined by the labeling budget and batch size. A key difference from traditional active learning is the necessity of generating completions for acquired data before scoring, especially if the acquisition function requires them. The text also outlines crucial design considerations, including the selection of acquisition functions, fine-tuning implementation details, the choice of oracle, and experimental settings for sampling parameters. Algorithm 1 provides a detailed step-by-step breakdown of the entire APL procedure.", "end_index": 3, "start_index": 2 } ] }
prompt = f""" You are given a list of documents with their IDs, file names, and descriptions. Your task is to select documents that may contain information relevant to answering the user query.
Response Format: {{ "thinking": "<Your reasoning for document selection>", "answer": <ython list of relevant doc_ids>, e.g. ['doc_id1', 'doc_id2']. Return [] if no documents are relevant. }}
Return only the JSON structure, with no additional output. """
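The snippet does not show where the query and the document listing are interpolated or which client sends the prompt, so here is a hedged sketch of one way to wire it up: the field names (`doc_id`, `file_name`, `description`), the helper names `call_llm` and `select_documents`, the model name, and the use of the OpenAI Python SDK are all assumptions for illustration.

```python
import json
from openai import OpenAI

client = OpenAI()  # any chat-completion client would do; the OpenAI SDK is one concrete choice

def call_llm(message: str, model: str = "gpt-4o-mini") -> str:
    """Send a single user message and return the raw text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return resp.choices[0].message.content

def select_documents(query: str, documents: list[dict]) -> list[str]:
    """Ask the model which doc_ids are worth opening for this query."""
    # Assumed document fields and placement of the query/document listing;
    # the prompt above does not show where they are filled in.
    doc_listing = "\n".join(
        f"- doc_id: {d['doc_id']}, file: {d['file_name']}, description: {d['description']}"
        for d in documents
    )
    reply = call_llm(f"{prompt}\n\nQuery: {query}\n\nDocuments:\n{doc_listing}")
    parsed = json.loads(reply)  # assumes the model returns the requested JSON structure
    return parsed.get("answer", [])
```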
ToC Tree Retrieval

Have the LLM reason over the table-of-contents tree to identify the relevant nodes, fetch the content of those nodes, and then generate the answer iteratively.
prompt = f""" You are given a query and the tree structure of a document. You need to find all nodes that are likely to contain the answer.
Query: {query}
Document tree structure: {PageIndex_Tree}
Reply in the following JSON format: {{ "thinking": <your reasoning about which nodes are relevant>, "node_list": [node_id1, node_id2, ...] }} """
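Putting the pieces together, here is a minimal sketch of the loop described above: ask the model for relevant node_ids with the prompt, look those nodes up in the tree, and generate from their content. `build_tree_prompt` (which simply fills in the prompt shown above), `call_llm` (from the earlier sketch), `node_index` (from the helper after the tree example), the answer-prompt wording, and the use of node summaries as stand-in content are all assumptions for illustration, not part of PageIndex.

```python
import json

def answer_with_toc_tree(query: str, PageIndex_Tree: dict) -> str:
    """Single retrieve-then-generate round over the ToC tree (minimal sketch)."""
    # 1. Node selection: build_tree_prompt() fills the prompt above with the
    #    query and the tree; call_llm() is the wrapper from the earlier sketch.
    reply = json.loads(call_llm(build_tree_prompt(query, PageIndex_Tree)))
    node_list = reply.get("node_list", [])

    # 2. Fetch the content of the selected nodes. Here we only use the node
    #    summaries stored in the tree; a full system would instead pull the
    #    page range given by each node's start_index/end_index.
    index = node_index(PageIndex_Tree)
    context = "\n\n".join(
        f"[{nid}] {index[nid]['title']}\n{index[nid].get('summary', '')}"
        for nid in node_list
        if nid in index
    )

    # 3. Generation: in the iterative variant, this answer (or a follow-up
    #    request for more nodes) feeds back into step 1 until the model is done.
    return call_llm(
        f"Answer the query using only the context below.\n\nQuery: {query}\n\nContext:\n{context}"
    )
```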