
Full training run on a single RTX 4090: training a Qwen base model with GRPO, following the DeepSeek-R1 approach

Posted by 链载Ai, 3 hours ago


Both of my earlier posts jumped straight into the hands-on work, so here is a brief explanation of the two setups:

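A: LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-32B, the model obtained by distilling DeepSeek-R1 into Qwen2.5-32B. LoRA is one specific method of PEFT (Parameter-Efficient Fine-Tuning). Put simply, it freezes the model's original weights and trains only a small number of additional low-rank adapter weights on a domain-specific dataset to improve results; the advantage is that it saves resources. With Unsloth's optimizations and int4 quantization on top, a model whose weight files alone take up 62 GB can be trained on an RTX 4090 with only 24 GB of VRAM.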
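The parameter saving is easy to quantify: for a frozen weight W of shape d_out × d_in, LoRA learns B (d_out × r) and A (r × d_in) and uses W + (α/r)·B·A, so each adapted matrix adds only r·(d_in + d_out) trainable parameters. The sketch below is illustrative only; it assumes the published Qwen2.5-3B shapes (the model in setup B, not the 32B model of setup A) and adapters on all seven linear projections in every layer. Under those assumptions, rank r = 32 gives exactly the 59,867,136 trainable parameters Unsloth reports in the training log further down, which suggests, but does not prove, that configuration.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one frozen d_out x d_in weight: A (r x d_in) + B (d_out x r)."""
    return rank * (d_in + d_out)

# Assumed Qwen2.5-3B shapes (from its published config): hidden size 2048,
# MLP intermediate size 11008, 16 query heads and 2 KV heads of dim 128, 36 layers.
HIDDEN, INTER, HEAD_DIM, N_HEADS, N_KV, LAYERS = 2048, 11008, 128, 16, 2, 36

# (d_in, d_out) of each adapted projection; module names follow the HF Qwen2 naming.
PROJECTIONS = {
    "q_proj": (HIDDEN, N_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, N_KV * HEAD_DIM),
    "v_proj": (HIDDEN, N_KV * HEAD_DIM),
    "o_proj": (N_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTER),
    "up_proj": (HIDDEN, INTER),
    "down_proj": (INTER, HIDDEN),
}

def total_lora_params(rank: int) -> int:
    # Sum over the projections in one layer, then multiply by the number of layers.
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS

print(f"{total_lora_params(32):,}")  # 59,867,136
```

Doubling the rank doubles the adapter size, while the frozen base weights stay untouched either way.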

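B: Starting from Qwen2.5-3B, Qwen's original base model, train a reasoning model. This applies DeepSeek-R1's key method: reinforcement learning with simple reward functions and the GRPO algorithm, so that the model gains better reasoning ability. The whole original model takes part in training here (the updates themselves still go into adapter weights; the log below reports 59,867,136 trainable parameters), so the demands on GPU memory and compute are higher. Even with all of Unsloth's optimizations stacked together, 3B is the largest model I could train; in my tests a 7B model ran out of GPU memory, and I am still working on that.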
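GRPO (introduced in DeepSeekMath and used for DeepSeek-R1) samples a group of completions for each prompt and scores every completion relative to its own group, so no separate value model is needed. A minimal sketch of that normalization, assuming the `(r - mean) / (std + eps)` form with a sample standard deviation. The six rewards are a hypothetical group, chosen to be consistent with the per-function averages in the final step of the training log (five correct answers scoring 3.5, one wrong answer scoring 1.5); with them the mean and standard deviation match that step's logged `reward` and `reward_std`.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward against the other completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation (n - 1 in the denominator)
    # Above-average completions get a positive advantage, below-average ones a negative one.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 6 completions for one prompt:
# 5 correct (0.5 xml + 0.5 strict + 0.5 int + 2.0 correct = 3.5), 1 wrong (1.5).
rewards = [3.5] * 5 + [1.5]
print(f"mean={statistics.mean(rewards):.4f} std={statistics.stdev(rewards):.4f}")  # mean=3.1667 std=0.8165
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward {r:.1f} -> advantage {a:+.3f}")
```

Because the advantages are centered on the group mean, a group in which every completion earns the same reward contributes almost no learning signal.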


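In the previous post, "Reproducing DeepSeek-R1's key idea by training a Qwen2.5 base model with GRPO on a single RTX 4090", I set max_steps to only 250 so the test would finish quickly. Afterwards I removed the max_steps setting, let the trainer compute the number of steps from the dataset size, and ran the full training. The results: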


  • Total steps: 22419

  • Epochs: 3.0

  • Training time: 17.3 hours in total (62352.0686 seconds)
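With max_steps unset, the step count follows from the dataset size. A simplified sketch of that arithmetic, assuming the Hugging Face Trainer rule (optimizer steps per epoch = dataloader batches ÷ gradient accumulation steps) and the settings shown in the Unsloth banner of the training start log (7,473 examples, 3 epochs, batch size 1, gradient accumulation 1). Newer TRL releases count GRPO batches differently, so treat this as a sanity check rather than the trainer's exact code.

```python
import math

def total_training_steps(num_examples: int, num_epochs: float, per_device_batch_size: int = 1,
                         grad_accum_steps: int = 1, num_gpus: int = 1) -> int:
    """Optimizer steps a trainer runs when max_steps is not set (simplified)."""
    # Batches the dataloader yields per epoch across all GPUs.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    # One optimizer step per grad_accum_steps batches, at least one per epoch.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * steps_per_epoch)

steps = total_training_steps(num_examples=7473, num_epochs=3)
runtime_s = 62352.0686  # train_runtime from the end-of-training log
print(steps)  # 22419
print(f"{runtime_s / 3600:.1f} h, {runtime_s / steps:.2f} s/step, {steps / runtime_s:.2f} steps/s")
```

The last line gives 17.3 h and about 2.78 s per step, consistent with the `train_steps_per_second: 0.36` reported at the end of training.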


Peak resource usage:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
| 30%   56C    P2             251W / 450W |  18142MiB / 24564MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Training start log:

```
INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00, 1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 22,419
 "-____-"     Number of trainable parameters = 59,867,136
  0%|     | 5/22419 [00:14<18:12:54, 2.93s/it]
-------------------- Question:
```
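The vLLM memory lines above are plain arithmetic, which helps when tuning `gpu_memory_utilization`. The sketch reproduces them under two assumptions not stated in the log: the KV-cache size per token uses Qwen2.5-3B's published shape (36 layers, 2 KV heads, head dimension 128, bf16), and the block size is 16 tokens, vLLM's usual default on CUDA. With a 16-token block, the 12,541 logged blocks give exactly the logged 391.91x concurrency; the other figures agree with the log to within the rounding of the printed GiB values.

```python
GIB = 1024 ** 3

def kv_cache_gib(usable_gib: float, weights_gib: float, non_torch_gib: float,
                 activation_peak_gib: float) -> float:
    """Memory vLLM leaves for the KV cache after weights, non-torch memory and activations."""
    return usable_gib - weights_gib - non_torch_gib - activation_peak_gib

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # One key and one value vector per layer and per KV head; bf16 = 2 bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrency(num_blocks: int, block_size: int, tokens_per_request: int) -> float:
    # Total tokens the cache can hold, divided by the per-request token budget.
    return num_blocks * block_size / tokens_per_request

BLOCK_SIZE = 16  # assumed: vLLM's default KV-cache block size on CUDA

usable = 23.65 * 0.59                                    # total_gpu_memory x gpu_memory_utilization
kv = kv_cache_gib(13.96, 5.77, 0.08, 1.23)               # the log prints 6.89 GiB (rounded inputs)
per_block = kv_bytes_per_token(36, 2, 128) * BLOCK_SIZE  # assumed Qwen2.5-3B shape
print(f"usable ~{usable:.2f} GiB, KV cache ~{kv:.2f} GiB, ~{int(kv * GIB // per_block)} blocks")
print(f"concurrency for 512-token requests: {max_concurrency(12541, BLOCK_SIZE, 512):.2f}x")  # 391.91x
```

Raising `gpu_memory_utilization` enlarges only the KV-cache term, because the weights and activation peak stay fixed; in GRPO the rest of the GPU must still hold the training step, which is why a value well below 1.0 is used here.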

Training end log:

```
s/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>
Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.
</reasoning>
<answer>
299
</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}
```
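The log names five reward functions (`xmlcount_reward_func`, `soft_format_reward_func`, `strict_format_reward_func`, `int_reward_func`, `correctness_reward_func`), but the post does not show their code. The following is a hedged sketch of what such functions commonly look like, modeled on the widely shared GRPO GSM8K demo; the weights (2.0 for a correct answer, 0.5 per format check, 0.125 per XML tag) are assumptions that are consistent with the averages logged above. For simplicity the functions take plain strings, whereas TRL's `GRPOTrainer` passes chat-formatted completions and dataset columns as keyword arguments.

```python
import re

# Assumed format checks and weights, modeled on the common GRPO GSM8K demo.
STRICT_FORMAT = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$", re.DOTALL)
SOFT_FORMAT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

def extract_xml_answer(text: str) -> str:
    """Return the text between the last <answer> and the following </answer>."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward(completions: list[str], answers: list[str]) -> list[float]:
    # Largest reward: the extracted answer exactly matches the reference answer.
    return [2.0 if extract_xml_answer(c) == a else 0.0 for c, a in zip(completions, answers)]

def int_reward(completions: list[str]) -> list[float]:
    # Small bonus when the answer is a bare non-negative integer.
    return [0.5 if extract_xml_answer(c).isdigit() else 0.0 for c in completions]

def strict_format_reward(completions: list[str]) -> list[float]:
    # The whole completion must follow the exact tag-and-newline layout.
    return [0.5 if STRICT_FORMAT.match(c) else 0.0 for c in completions]

def soft_format_reward(completions: list[str]) -> list[float]:
    # Looser check: tags appear in order, any whitespace between them.
    return [0.5 if SOFT_FORMAT.match(c) else 0.0 for c in completions]

def xmlcount_reward(completions: list[str]) -> list[float]:
    # Partial credit (0.125) for each expected tag that appears exactly once.
    # Simplified: the demo version also penalizes text after </answer>.
    tags = ["<reasoning>\n", "\n</reasoning>\n", "\n<answer>\n", "\n</answer>"]
    return [sum(0.125 for t in tags if c.count(t) == 1) for c in completions]

response = ("<reasoning>\nNellie had 380 legos initially. She lost 57 legos, so she now has "
            "380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has "
            "323 - 24 = 299 legos.\n</reasoning>\n<answer>\n299\n</answer>\n")
rewards = [f([response])[0] for f in (xmlcount_reward, soft_format_reward,
                                      strict_format_reward, int_reward)]
rewards.append(correctness_reward([response], ["299"])[0])
print(rewards, sum(rewards))  # [0.5, 0.5, 0.5, 0.5, 2.0] 4.0
```

If the soft-format pattern were compiled without `re.DOTALL`, `.` would not match newlines and any multi-line response would score 0.0 on it. That is one possible explanation for the constant `soft_format_reward_func: 0.0` in the log while the strict check scores 0.5, though the post does not confirm it.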

Test question 1: Calculate pi

Original model's answer:
Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (i.e., points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:
```python
import random

n = 10**8  # Number of random points to generate
count = 0  # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi=%f" % pi)
```
This code generates 10^8 random points and counts the number of points that fall within the circle. It then calculates an estimate of pi and prints the result. Note that this method is not particularly fast or efficient, and other methods (such as the Chudnovsky algorithm) exist for computing pi with greater accuracy and speed.
Trained model's answer:

<reasoning>

We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.

</reasoning>

<answer>

3.14

</answer>

The trained model's answer follows the required format and is concise, showing that it has already picked up the ability to reason.

Test question 2: Which is bigger? 9.919 or 9.92?
Original model's answer:

9.919 is smaller than 9.92.

】,【】,【】,【】,【】,【】,...

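(This time the original model answered correctly. The trailing junk characters are probably padding; because this is a base model rather than an instruct model, its output format is fairly messy.)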

Trained model's answer:

<reasoning>

9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.

</reasoning>

<answer>

9.92

</answer>

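The trained model's answer fully follows the required format, includes a reasoning process, and is concise and accurate: a clear improvement over the earlier run that trained for only 250 steps.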
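The trained model's reasoning works by padding both numbers to the same number of decimal places before comparing them. A small sketch of that check, using hypothetical helpers that are not part of the training code and assuming both inputs are plain decimal strings with a fractional part:

```python
from decimal import Decimal

def pad_fractions(a: str, b: str) -> tuple[str, str]:
    """Right-pad the fractional parts with zeros so both numbers have the same length."""
    int_a, frac_a = a.split(".")
    int_b, frac_b = b.split(".")
    width = max(len(frac_a), len(frac_b))
    return f"{int_a}.{frac_a.ljust(width, '0')}", f"{int_b}.{frac_b.ljust(width, '0')}"

def bigger(a: str, b: str) -> str:
    # Decimal compares the exact decimal values parsed from the strings.
    return a if Decimal(a) > Decimal(b) else b

print(pad_fractions("9.919", "9.92"))  # ('9.919', '9.920')
print(bigger("9.919", "9.92"))         # 9.92
```

Once the fractional parts have equal length, comparing them digit by digit gives the same result as comparing the numbers, which is the step the trained model spells out.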
