Complete Training Run: GRPO-Training a Qwen Base Model on a Single 4090, Following the DeepSeek-R1 Approach

Both posts (A and B below) went straight to the point and dove right into the work, so here is a brief explanation:

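A: a LoRA fine-tune of DeepSeek-R1-Distill-Qwen-32B, the model obtained by distilling DeepSeek-R1 into Qwen2.5-32B. LoRA is one specific method of PEFT (Parameter-Efficient Fine-Tuning). Put simply, it freezes the bulk of the model's weights and trains only a small number of weights on a domain-specific dataset to improve results. The advantage is that it saves resources. With unsloth optimizations and int4 quantization on top, a 4090 with only 24 GB of VRAM can train this model, whose weight files alone total 62 GB.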
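
To make the resource argument concrete, here is a rough back-of-the-envelope sketch. The parameter count (about 32.5e9), the 5120 x 5120 layer shape, and the LoRA rank of 16 are illustrative assumptions, not values taken from the original training script. Real int4 quantization also stores scaling factors, and training needs memory for activations and optimizer state, so treat the numbers as lower bounds.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two small matrices per adapted layer:
    A with shape (rank, d_in) and B with shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Assumption: roughly 32.5e9 parameters for a "32B" model.
NUM_PARAMS = 32.5e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"int4 weights: {int4_gib:.1f} GiB")

# Hypothetical 5120 x 5120 projection layer adapted with LoRA rank 16.
full = 5120 * 5120
lora = lora_trainable_params(5120, 5120, rank=16)
print(f"full layer: {full:,}  LoRA adapter: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the 16-bit weights land in the same range as the 62 GB file size mentioned above, while the 4-bit weights fit within 24 GB with room left for the adapter and activations.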

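B: trains a reasoning model starting from Qwen's original base model, Qwen2.5-3B. It applies the key method of DeepSeek-R1: reinforcement learning with simple reward functions plus the GRPO algorithm, so that the model acquires better reasoning ability. During this process all of the original model's weights take part in training, so the demands on VRAM and compute are higher. Even with this many optimization buffs stacked on through unsloth, only a 3B model can be trained. In my tests a 7B model runs out of VRAM; I am still working on optimizing that.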
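
The core idea behind GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each is scored by the reward functions, and each completion's advantage is its reward relative to the rest of its group. The sketch below is a simplification under stated assumptions: the six reward values are made up, the function name is hypothetical, and the population standard deviation is used for simplicity, which may differ from what a given trainer implementation does.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for 6 completions sampled from one prompt.
rewards = [3.5, 3.5, 0.5, 3.5, 0.0, 3.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:4.1f}  advantage={a:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. No separate value model is needed, which is part of why the method fits on a single card.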

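In the previous post, "Reproducing the key idea of DeepSeek-R1 by GRPO-training a Qwen2.5 base model on a single 4090", the maximum number of training steps, max_steps, was set to only 250 so the test would finish quickly. I later removed the max_steps setting, let the trainer compute the step count from the amount of data, and ran the training to completion. The results: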

Total steps: 22,419
Epochs: 3.0
Training time: 17.3 hours in total (62352.0686 seconds)
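
These figures can be cross-checked against the values printed in the training logs (7,473 examples, batch size 1, gradient accumulation 1). The sketch below assumes a single GPU and that the number of steps per epoch is the number of examples divided by the effective batch size, rounded up; the variable names are my own.

```python
import math

num_examples = 7473        # "Num examples" in the training start log
num_epochs = 3
per_device_batch = 1       # "Batch size per device"
grad_accum_steps = 1       # "Gradient Accumulation steps"
train_runtime_s = 62352.0686  # "train_runtime" in the training end log

effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

hours = train_runtime_s / 3600
sec_per_step = train_runtime_s / total_steps

print(f"total steps: {total_steps}")
print(f"{hours:.1f} h, {sec_per_step:.2f} s/step, {1 / sec_per_step:.2f} steps/s")
```

With a batch size of 1, every example in every epoch is its own optimizer step, which is why the step count equals examples times epochs.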

Peak resource usage:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
| 30%   56C    P2             251W / 450W |  18142MiB / 24564MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Training start log:

INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00, 1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 22,419
 "-____-"     Number of trainable parameters = 59,867,136
  0%|     | 5/22419 [00:14<18:12:54, 2.93s/it]
-------------------- Question:
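
The vLLM memory lines in this log can be reproduced with simple arithmetic, which helps when tuning `gpu_memory_utilization` on a 24 GB card. The sketch below uses only numbers printed in the log, plus one assumption: a KV-cache block size of 16 tokens, which is not shown in the log. Small differences from the logged values come from rounding in the printed figures.

```python
# Values copied from the vLLM log lines above.
total_gpu_gib = 23.65
gpu_memory_utilization = 0.59
weights_gib = 5.77
non_torch_gib = 0.08
activation_peak_gib = 1.23
num_gpu_blocks = 12541
tokens_per_request = 512

# Assumption: each KV-cache block holds 16 tokens (not printed in the log).
block_size_tokens = 16

# Memory vLLM is allowed to use, and what is left over for the KV cache.
budget_gib = total_gpu_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# How many 512-token requests the KV cache could hold at once.
max_concurrency = num_gpu_blocks * block_size_tokens / tokens_per_request

print(f"budget ~ {budget_gib:.2f} GiB, KV cache ~ {kv_cache_gib:.2f} GiB")
print(f"max concurrency ~ {max_concurrency:.2f}x")
```

The remainder of the card (roughly 40% of VRAM here) is what the training side has to work with, which is one reason larger base models run out of memory.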

Training end log:

s/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>
Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.
</reasoning>
<answer>
299
</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}
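
The log shows an "Extracted" answer and several per-function rewards. The training script itself is not included in this post, so the following is only a sketch of what such reward functions might look like. The function names, the reward values (0.5, 0.5, 2.0), and the exact format pattern are all assumptions for illustration.

```python
import re

# Assumed reward values; the original training script is not shown here.
FORMAT_REWARD = 0.5
INT_REWARD = 0.5
CORRECT_REWARD = 2.0

# Assumed strict format: each tag on its own line, reasoning before answer.
STRICT_PATTERN = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$", re.DOTALL
)


def extract_answer(text: str) -> str:
    """Return the text between the last <answer> tag and its closing tag."""
    answer = text.split("<answer>")[-1]
    return answer.split("</answer>")[0].strip()


def score(response: str, reference: str) -> dict[str, float]:
    """Score one completion with three simple rule-based rewards."""
    extracted = extract_answer(response)
    return {
        "strict_format": FORMAT_REWARD if STRICT_PATTERN.match(response) else 0.0,
        "int": INT_REWARD if extracted.isdigit() else 0.0,
        "correctness": CORRECT_REWARD if extracted == reference else 0.0,
    }


response = (
    "<reasoning>\n380 - 57 = 323, then 323 - 24 = 299.\n</reasoning>\n"
    "<answer>\n299\n</answer>\n"
)
scores = score(response, "299")
print(extract_answer(response), scores, sum(scores.values()))
```

Rewards like these need no learned reward model: a regular expression checks the format, and a string comparison against the dataset answer checks correctness.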

Test question 1: Calculate pi

Answer from the original model:

Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (i.e., points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:

import random

n = 10**8  # Number of random points to generate
count = 0  # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi=%f" % pi)

This code generates 10^8 random points and counts the number of points that fall within the circle. It then calculates an estimate of pi and prints the result. Note that this method is not particularly fast or efficient, and other methods (such as the Chudnovsky algorithm) exist for computing pi with greater accuracy and speed.

Answer from the trained model:

<reasoning>

We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.

</reasoning>

<answer>

3.14

</answer>

As you can see, the trained model's answer follows the required format, is fairly concise, and already shows reasoning ability.

Test question 2: Which is bigger? 9.919 or 9.92?

Answer from the original model:

9.919 is smaller than 9.92.

】,【】,【】,【】,【】,【】,...

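(This time the original model answered correctly. The blank characters after the answer are presumably pre-filled padding; because this is not an instruct model, the formatting tends to be messy.)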

Answer from the trained model:

<reasoning>

9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.

</reasoning>

<answer>

9.92

</answer>

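As you can see, the trained model's answer fully conforms to the required format, includes a reasoning process, and is concise and accurate. This is a clear improvement over the earlier run that trained for only 250 steps.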