(2) textbook/phi: Textbooks Are All You Need; Textbooks Are All You Need II: phi-1.5 technical report. The emphasis here is arguably less on generation than on filtering (selecting high-quality samples out of a large corpus), which is essentially also prompt engineering.
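A minimal sketch of the filtering idea: prompt a strong LLM to score the educational value of each web sample and keep only the high-scoring ones. (In phi-1 the quality annotations are actually used to train a lightweight classifier rather than LLM-scoring every document; the sketch collapses that into direct scoring. The prompt wording, threshold, and `generate` callable are placeholders, not the paper's pipeline.)

```python
from typing import Callable, Iterable

SCORE_PROMPT = (
    "Rate the educational value of the following text for a student, "
    "on a scale of 1 (useless) to 5 (textbook quality). "
    "Answer with a single digit.\n\n{sample}"
)

def filter_corpus(samples: Iterable[str],
                  generate: Callable[[str], str],
                  threshold: int = 4) -> list[str]:
    """Keep only samples an LLM judge rates as 'textbook quality'."""
    kept = []
    for sample in samples:
        reply = generate(SCORE_PROMPT.format(sample=sample[:4000]))
        digits = [c for c in reply if c.isdigit()]
        score = int(digits[0]) if digits else 0
        if score >= threshold:
            kept.append(sample)
    return kept
```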
2. Seed set + LLM generation (including iterative variants)
(1) Evol-Instruct: WizardLM: Empowering Large Language Models to Follow Complex Instructions; WizardCoder: Empowering Code Large Language Models with Evol-Instruct; WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct; etc.
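A rough sketch of one round of in-depth evolution, assuming a `generate` callable; the operation list and prompt wording only paraphrase the WizardLM idea rather than quoting its templates, and the paper's elimination step for failed evolutions is omitted.

```python
import random
from typing import Callable

DEEPEN_OPS = [
    "Add one more constraint or requirement to the instruction.",
    "Replace general concepts with more specific ones.",
    "Require multi-step reasoning to answer the instruction.",
    "Increase the required depth and breadth of the answer.",
]

def evolve_once(instruction: str, generate: Callable[[str], str]) -> str:
    """One in-depth evolution step: rewrite the instruction into a harder one."""
    op = random.choice(DEEPEN_OPS)
    prompt = (
        "Rewrite the following instruction into a more complex version. "
        f"{op} Keep it answerable and do not change its topic.\n\n"
        f"Instruction: {instruction}\nRewritten instruction:"
    )
    return generate(prompt).strip()

def evolve_pool(seed_instructions: list[str],
                generate: Callable[[str], str],
                rounds: int = 3) -> list[str]:
    """Iteratively grow the instruction pool from the seed set."""
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [evolve_once(ins, generate) for ins in frontier]
        pool.extend(frontier)
    return pool
```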
(2) How Do Humans Write Code? Large Models Do It the Same Way Too; TORA: A TOOL-INTEGRATED REASONING AGENT FOR MATHEMATICAL PROBLEM SOLVING. Essentially these recast existing answers as CoT or PoT traces; the data usually already has gold labels (with the label available, the LLM's CoT explanations are much better than without it; colloquially, "guessing the steps from the answer").
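A sketch of the "guess the steps from the answer" trick: show the model the question together with the gold label, ask for a step-by-step trace, and keep it only if the trace ends with the gold answer. The prompt and the `generate` callable are illustrative placeholders, not the exact ToRA pipeline (ToRA additionally interleaves tool calls in the trace).

```python
from typing import Callable, Optional

RATIONALIZE_PROMPT = (
    "Question: {question}\n"
    "The correct final answer is: {answer}\n"
    "Write the step-by-step reasoning that leads to this answer, "
    "ending with the line 'Final answer: {answer}'."
)

def rationalize(question: str, gold_answer: str,
                generate: Callable[[str], str],
                max_tries: int = 3) -> Optional[str]:
    """Turn a (question, gold answer) pair into a CoT training target,
    kept only if the trace is consistent with the label."""
    for _ in range(max_tries):
        trace = generate(RATIONALIZE_PROMPT.format(question=question,
                                                   answer=gold_answer))
        lines = trace.strip().splitlines()
        if lines and lines[-1].endswith(gold_answer):
            return trace
    return None  # drop samples that cannot be rationalized consistently
```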
4. Iterating the dataset and the model together
This category excludes model self-iteration that produces no new data.
(1) Self-Alignment with Instruction Backtranslation
① Self-augmentation: use LLaMA plus the seed set to annotate unlabeled web data with prompts (instructions); ② Self-curation: the model itself selects the high-quality pairs. The two steps are run iteratively.
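A condensed sketch of one backtranslation round. Caveats: in the paper the instruction-prediction step uses a separate backward model trained on (output, instruction) pairs from the seed set, and curation asks the model to rate pairs on a 5-point scale, keeping only the best; the sketch below uses a single `generate` callable for both and paraphrases the prompts.

```python
from typing import Callable

BACKTRANSLATE_PROMPT = (
    "Below is a web document. Write the instruction to which this document "
    "would be a good answer.\n\nDocument:\n{doc}\n\nInstruction:"
)
CURATE_PROMPT = (
    "Rate how well the response answers the instruction on a scale of 1-5. "
    "Answer with a single digit.\n\nInstruction: {inst}\nResponse: {resp}"
)

def backtranslation_round(web_docs: list[str],
                          generate: Callable[[str], str],
                          finetune: Callable[[list[tuple[str, str]]], Callable[[str], str]],
                          keep_score: int = 5) -> Callable[[str], str]:
    # 1) Self-augmentation: predict an instruction for each unlabeled document.
    candidates = [(generate(BACKTRANSLATE_PROMPT.format(doc=d)).strip(), d)
                  for d in web_docs]
    # 2) Self-curation: the model scores its own (instruction, response) pairs.
    curated = []
    for inst, resp in candidates:
        reply = generate(CURATE_PROMPT.format(inst=inst, resp=resp))
        digits = [c for c in reply if c.isdigit()]
        if digits and int(digits[0]) >= keep_score:
            curated.append((inst, resp))
    # 3) Fine-tune on the curated pairs; the resulting model drives the next round.
    return finetune(curated)
```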
(2) Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
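As I read it, the core of SPIN is a DPO-style objective in which the "chosen" response is the human (SFT) target and the "rejected" response is what the previous-iteration model generated for the same prompt, with that previous model also acting as the reference. A minimal loss sketch, assuming log-probabilities already summed over response tokens:

```python
import torch.nn.functional as F

def spin_loss(logp_real_cur, logp_real_prev,
              logp_gen_cur, logp_gen_prev, beta: float = 0.1):
    """DPO-style self-play loss.
    logp_*_cur:  summed token log-probs under the model being trained.
    logp_*_prev: summed token log-probs under the frozen previous-iteration model.
    'real' = ground-truth response, 'gen' = response sampled from the previous model."""
    margin = beta * ((logp_real_cur - logp_real_prev)
                     - (logp_gen_cur - logp_gen_prev))
    return -F.logsigmoid(margin).mean()
```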
(3) Reinforced Self-Training (ReST) for Language Modeling
ReST is essentially a counterpart to RLHF, but it is genuinely self-iterative: an outer Grow loop and an inner Improve loop, with filtering based on a reward model.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models: ReST^EM, a simplified and modified version of ReST, used mainly for fine-tuning.
REST MEETS REACT: SELF-IMPROVEMENT FOR MULTI-STEP REASONING LLM AGENT: ReST + ReAct, aimed at letting the agent in ReAct self-iterate.
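A schematic of the ReST Grow/Improve structure, assuming `policy_generate`, `reward`, and `finetune` callables; in ReST^EM the reward-model score is essentially replaced by a binary correctness check against the gold answer.

```python
from typing import Callable

def rest_loop(prompts: list[str],
              policy_generate: Callable[[str], list[str]],  # sample k responses per prompt
              reward: Callable[[str, str], float],          # RM score (or 0/1 correctness)
              finetune: Callable[[list[tuple[str, str]]], Callable[[str], list[str]]],
              grow_steps: int = 3, improve_steps: int = 2,
              threshold: float = 0.7) -> Callable[[str], list[str]]:
    for _ in range(grow_steps):                  # outer loop: Grow
        # Grow: sample a fresh dataset from the current policy.
        dataset = [(p, y) for p in prompts for y in policy_generate(p)]
        for _ in range(improve_steps):           # inner loop: Improve
            # Improve: keep only high-reward samples, raise the bar each pass,
            # and fine-tune the policy on the filtered set.
            kept = [(p, y) for p, y in dataset if reward(p, y) >= threshold]
            policy_generate = finetune(kept)
            threshold = min(1.0, threshold + 0.1)
    return policy_generate
```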
(4) The Orca series
Each paper in the Orca series is actually somewhat different; after some thought I am still grouping them here.
Orca-Math: Unlocking the potential of SLMs in Grade School Math: this is the main one. The dataset is built with Agent-Instruct: three agents, one that expands the problems, one that proposes modification suggestions, and one that applies the edits; the process can be iterated. Training is SFT first, followed by what amounts to iterative RLAIF (GPT-4 generates preference data, which is then used for alignment via the KTO method).
Orca: Progressive Learning from Complex Explanation Traces of GPT-4: Orca itself is simpler, directly collecting GPT-4 responses; the novelty is adding explanation traces.
Orca 2: Teaching Small Language Models How to Reason: builds on Orca 1's explanation tuning and introduces cautious reasoning. During training the model is told which solution strategy to use (answer directly, step by step, explain then answer); at inference time the instruction is withheld (prompt erasure), so the model learns to pick a strategy on its own.
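A toy sketch of the three-agent rewriting loop described above for Orca-Math's data generation; the agent prompts and the `generate` callable are my own placeholders, not the paper's Agent-Instruct templates.

```python
from typing import Callable

def agent_instruct_round(seed_problem: str,
                         generate: Callable[[str], str],
                         iterations: int = 2) -> list[str]:
    """Expander -> Suggester -> Editor rewriting, optionally iterated."""
    # Agent 1: expand the seed into a new, related problem.
    current = generate(
        "Create a new, different math word problem inspired by this one:\n"
        + seed_problem)
    problems = [current]
    for _ in range(iterations):
        # Agent 2: suggest one concrete way to make the problem harder.
        suggestion = generate(
            "Suggest one concrete way to make this problem more challenging "
            "without making it unsolvable:\n" + current)
        # Agent 3: apply the suggestion and rewrite the problem.
        current = generate(
            "Rewrite the problem by applying the suggestion.\n"
            f"Problem: {current}\nSuggestion: {suggestion}\nRewritten problem:")
        problems.append(current)
    return problems
```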