LLMs Made Simple: Pre-training, Post-training, and Fine-tuning


Today we use plain, everyday analogies to walk through the three stages of training a large model: ❶ pre-training, ❷ post-training, and ❸ fine-tuning.

First, pre-training

Pre-training means giving the model its initial training on a huge, general-purpose dataset so that it acquires foundational knowledge and skills, such as general language ability and broad world knowledge. The recently released Llama 4, for instance, was pre-trained on 200 languages. The process resembles primary and secondary school: by systematically studying core subjects such as language, mathematics, and English, we lay a solid foundation for later study and work.

The data at this stage is enormous, so training is expensive and slow, easily consuming tens of thousands of GPU-days of compute. Pre-training Llama 4 Scout, for example, used about 40 trillion tokens of data.
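To make that scale concrete, here is a back-of-the-envelope sketch, not an official figure. It uses the common rule of thumb that training costs roughly 6 FLOPs per parameter per token. The 40 trillion tokens come from the text above; the active-parameter count, the sustained per-GPU throughput, and the function names are assumptions chosen purely for illustration.

```python
# Rough pre-training compute estimate using the rule of thumb
# FLOPs ≈ 6 × (parameters used per token) × (training tokens).


def training_flops(active_params: float, tokens: float) -> float:
    """Approximate total training FLOPs (forward plus backward pass)."""
    return 6.0 * active_params * tokens


def gpu_days(total_flops: float, flops_per_gpu_per_sec: float) -> float:
    """Convert total FLOPs into GPU-days at a sustained per-GPU throughput."""
    seconds = total_flops / flops_per_gpu_per_sec
    return seconds / 86_400  # 86,400 seconds in a day


TOKENS = 40e12          # ~40 trillion tokens, the figure quoted in the text
ACTIVE_PARAMS = 17e9    # ASSUMPTION: parameters active per token
SUSTAINED_FLOPS = 4e14  # ASSUMPTION: 400 TFLOP/s sustained on each GPU

flops = training_flops(ACTIVE_PARAMS, TOKENS)
days = gpu_days(flops, SUSTAINED_FLOPS)
print(f"{flops:.2e} FLOPs, about {days:,.0f} GPU-days")
```

Under these assumptions the estimate lands on the order of a hundred thousand GPU-days. Real runs achieve lower utilization and include restarts and experiments, so treat the number as an order-of-magnitude illustration only.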

It is like all the exercises we worked through as children, the setbacks we endured, the hours we put in, and the scoldings we took... Seen through those experiences, the cost and time that pre-training demands suddenly feel tangible.

Next, post-training

Post-training is the further, targeted training a model receives after pre-training is complete. Its core goal is to make the model fit specific real-world tasks or application needs more precisely. Think of it as going to university after high school: guided by a chosen major, you deepen your specialist knowledge and skills.

In post-training the dataset is usually much smaller and is concentrated on a particular domain, like the foundation and specialist courses of a major. The training cycle is also relatively short: under a credit system, you are done once you have earned the required credits. Looking back, university life did feel a good deal more relaxed than the tense years of schooling before it.

However, post-training is rarely finished in one pass; it usually has to be repeated and refined as real needs dictate. Likewise, after a bachelor's degree we may go on to a master's or even a doctorate, and through continued study make our professional skills more solid and refined.


Reinforcement learning (RL) is currently a favored method for post-training. The release notes for the DeepSeek-V3 minor version update, for example, specifically highlighted its use of reinforcement learning in post-training.

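Put simply, reinforcement learning keeps steering the model: ① when the model performs well, it receives positive feedback that encourages it to keep doing so; ② when it performs poorly, it receives negative feedback that pushes it to correct course. This continuous feedback loop can markedly improve the model's performance and accuracy.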
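The feedback loop can be sketched with a toy example. This is a minimal illustration, not how any production model is trained: the "model" is just three logits choosing among three canned answers, the reward values are made up, and the update is the plain REINFORCE policy-gradient rule.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # HYPOTHETICAL setup: answer 2 is good (+1), answers 0 and 1 are bad (-1).
    rewards = [-1.0, -1.0, 1.0]
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        reward = rewards[action]
        # REINFORCE: gradient of log pi(action) w.r.t. logit i
        # is 1[i == action] - probs[i]; scale it by the reward.
        for i in range(3):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


final_probs = train()
print([round(p, 3) for p in final_probs])
```

Positive reward raises the probability of the answer that was just given, and negative reward lowers it, so over many rounds the probability mass shifts toward the rewarded answer.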

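This reward-and-punishment mechanism lets the model learn in a more targeted way and improve. But the carrot-and-stick approach can sometimes make the model collapse, because it chases reward so hard that it goes to extremes.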

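To keep the model from going to extremes, a reinforcement learning method called GRPO (Group Relative Policy Optimization) has recently become popular; the training of DeepSeek R1, for example, used it.

GRPO's core idea is to sample a group of answers for each prompt and score every answer relative to the group's average reward, which removes the need for a separate value model. Like other policy-optimization methods used in post-training, it also adds a constraint to the objective (a regularization term in the form of a KL penalty) so that the final policy does not drift too far from a reference model that already performs well.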

In this way the model can make progress while staying stable: it earns higher reward without going to extremes.
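The two ingredients of GRPO, as published by DeepSeek, can be sketched in a few lines: a group-relative advantage (each sampled answer's reward compared with the mean of its group) and a per-token KL penalty against a reference model. This is an illustrative sketch only. The function names and the sample numbers are invented, and using the population standard deviation is an assumption; implementations differ on that detail.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_policy.

    It is zero when the two models agree and positive otherwise.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# HYPOTHETICAL rewards for 8 answers sampled from the same prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# HYPOTHETICAL log-probabilities of one token under the policy and the reference.
print(kl_penalty(logp_policy=-1.0, logp_ref=-2.0))
```

Answers that beat their group's average get a positive advantage and are reinforced, while those below average are discouraged. The KL term grows as the policy moves away from the reference model, which is what holds the policy back from extreme behavior.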

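As a result, GRPO has become one of the most popular reinforcement learning techniques for post-training large models today. It improves a model's behavior more safely and stably, so that the content it generates better matches human preferences and expectations.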

Finally, fine-tuning

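Strictly speaking, treating fine-tuning as a separate stage is not quite rigorous, because fine-tuning is itself one form of post-training.

That said, post-training in the usual sense (such as the reinforcement learning methods described above) happens at the model provider. Once pre-training is finished, the provider runs several rounds of post-training and finally shapes the model into a deliverable product or service.

Fine-tuning, by contrast, is post-training that usually happens on the model user's side, especially in industry customer scenarios.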

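The reason is that a freshly graduated large model has rich foundational knowledge and first-rate professional ability but no hands-on experience, so it cannot start work directly in an industry setting.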

Take repairing computers as an example.

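What can be done? Give it on-the-job training, and that is fine-tuning.

Fine-tuning is training for a specific task (repairing computers, say). The dataset is small but precise and concrete: a veteran passes his hands-on repair experience on to you, making your knowledge more down-to-earth.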
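The idea of starting from an already trained model and adapting it with a handful of precise examples can be shown with a deliberately tiny stand-in. Everything here is hypothetical: a two-parameter linear model plays the role of the large model, the data is synthetic, and freezing the weight while updating only the bias is just one possible choice, loosely in the spirit of parameter-efficient fine-tuning.

```python
def predict(w, b, x):
    return w * x + b


def sgd(w, b, data, lr, epochs, train_w=True):
    """Plain stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y
            if train_w:
                w -= lr * err * x
            b -= lr * err
    return w, b


# "Pre-training": many general examples of the relation y = 2x.
general_data = [(i / 100, 2 * i / 100) for i in range(-100, 101)]
w, b = sgd(0.0, 0.0, general_data, lr=0.05, epochs=20)

# "Fine-tuning": three task-specific examples of y = 2x + 1.
# ASSUMPTION: the weight w is frozen and only the bias b is updated.
task_data = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
w_ft, b_ft = sgd(w, b, task_data, lr=0.1, epochs=200, train_w=False)

print(f"pre-trained: w={w:.3f}, b={b:.3f}")
print(f"fine-tuned:  w={w_ft:.3f}, b={b_ft:.3f}")
```

The general knowledge (the slope) is kept as learned, and three examples are enough to pick up the task-specific offset. That mirrors the point above: fine-tuning data is small, but it has to be precise.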

At this point, the large model has gone through pre-training, post-training, and fine-tuning.

It is finally ready to start work.

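A quick summary:

Pre-training: learn broad foundational knowledge;

Post-training: study a specialist field in depth;

Fine-tuning: learn hands-on skills before starting the job.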

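That covers the basic concepts.

Judging by current trends in China, fewer and fewer companies will do large-scale pre-training (rumor has it that only two or three companies were actually doing pre-training in the first half of this year).

Going forward, most training demand will be for post-training and fine-tuning (though the larger demand, of course, is for inference).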