链载Ai

标题: 单卡4090上一键GRPO微调Qwen3最新模型 [打印本页]

作者: 链载Ai 时间: 昨天 18:30
标题: 单卡4090上一键GRPO微调Qwen3最新模型

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">模型和数据集下载

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">因为国内的网络环境，造成我有 connection timeout 恐惧症，所以第一件事就是把该下载的下载好，不要在运行中去动态下载。本文用到的模型和数据集地址：

https://modelscope.cn/models/Qwen/Qwen3-4B-Base
https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini
https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed

modelscopedownload--modelQwen/Qwen3-4B-Base--revisionmaster--local_dir/models/Qwen/Qwen3-4B-Basehuggingface-clidownload--resume-download--repo-typedatasetunsloth/OpenMathReasoning-mini--local-dirunsloth/OpenMathReasoning-minihuggingface-clidownload--resume-download--repo-typedatasetopen-r1/DAPO-Math-17k-Processed--local-diropen-r1/DAPO-Math-17k-Processed

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">启动容器

# docker run --name unsloth0517 -itd --gpus '"device=4"'  \ -v /data/ai/models:/models \ -v /data/ai/datasets:/datasets \ -v /data/ai/workspace/unsloth:/workspace \ unsloth:20250517_4cd5_cu121 bash
# docker exec -it unsloth0517 bashroot@ 1855d8235e1a:/home/unsloth# cd /workspace/scripts

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">其中docker镜像 unsloth:20250517_4cd5_cu121 的构建方法在上一篇：《RTX4090单卡微调Qwen3-32B完整步骤》中有详细描述。

root@1855d8235e1a:/workspace/scripts#pythonunsloth-grpo-qwen3.py?Unsloth:Willpatchyourcomputertoenable2xfasterfreefinetuning.?UnslothZoowillnowpatcheverythingtomaketrainingfaster!Traceback(mostrecentcalllast):File"/workspace/scripts/unsloth-grpo-qwen3.py",line17,in<module>model,tokenizer=FastLanguageModel.from_pretrained(^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File"/opt/conda/lib/python3.11/site-packages/unsloth/models/loader.py",line138,infrom_pretrainedraiseImportError(ImportError:UnslothleaseinstallvLLMbeforeenabling`fast_inference`!Youcandothisinaterminalvia`pipinstallvllm`

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">因为 unsloth:20250517_4cd5_cu121 这个容器镜像并未包含 vllm，所以会报这个错误。

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">我们可以基于 unsloth:20250517_4cd5_cu121 镜像再制作一个包含了 vllm 的镜像。为了简单起见，这里直接在容器中安装 vllm：

root@1855d8235e1a:/workspace/scripts# export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/root@1855d8235e1a:/workspace/scripts# pip install vllm。。。Successfully installed airportsdata-20250224annotated-types-0.7.0anyio-4.9.0astor-0.8.1blake3-1.0.5cachetools-5.5.2cloudpickle-3.1.1compressed-tensors-0.9.3cupy-cuda12x-13.4.1deprecated-1.2.18depyf-0.18.0diskcache-5.6.3einops-0.8.1email-validator-2.2.0fastapi-0.115.12fastapi-cli-0.0.7fastrlock-0.8.3gguf-0.16.3googleapis-common-protos-1.70.0h11-0.16.0hf-xet-1.1.2httpcore-1.0.9httptools-0.6.4httpx-0.28.1importlib_metadata-8.0.0interegular-0.3.3jinja2-3.1.6jiter-0.10.0lark-1.2.2llguidance-0.7.22llvmlite-0.44.0lm-format-enforcer-0.10.11mistral_common-1.5.5msgpack-1.1.0nest_asyncio-1.6.0numba-0.61.2nvidia-cublas-cu12-12.4.5.8nvidia-cuda-cupti-cu12-12.4.127nvidia-cuda-nvrtc-cu12-12.4.127nvidia-cuda-runtime-cu12-12.4.127nvidia-cufft-cu12-11.2.1.3nvidia-curand-cu12-10.3.5.147nvidia-cusolver-cu12-11.6.1.9nvidia-cusparse-cu12-12.3.1.170nvidia-cusparselt-cu12-0.6.2nvidia-nvjitlink-cu12-12.4.127nvidia-nvtx-cu12-12.4.127openai-1.81.0opencv-python-headless-4.11.0.86opentelemetry-api-1.26.0opentelemetry-exporter-otlp-1.26.0opentelemetry-exporter-otlp-proto-common-1.26.0opentelemetry-exporter-otlp-proto-grpc-1.26.0opentelemetry-exporter-otlp-proto-http-1.26.0opentelemetry-proto-1.26.0opentelemetry-sdk-1.26.0opentelemetry-semantic-conventions-0.47b0opentelemetry-semantic-conventions-ai-0.4.9outlines-0.1.11outlines_core-0.1.26partial-json-parser-0.2.1.1.post5 pillow-11.2.1prometheus-fastapi-instrumentator-7.1.0prometheus_client-0.22.0py-cpuinfo-9.0.0pycountry-24.6.1pydantic-2.11.4pydantic-core-2.33.2python-dotenv-1.1.0python-json-logger-3.3.0python-multipart-0.0.20pyzmq-26.4.0ray-2.46.0rich-toolkit-0.14.6scipy-1.15.3shellingham-1.5.4sniffio-1.3.1starlette-0.46.2tiktoken-0.9.0torch-2.6.0torchaudio-2.6.0torchvision-0.21.0triton-3.2.0typer-0.15.4typing-inspection-0.4.1uvicorn-0.34.2uvloop-0.21.0vllm-0.8.5.post1 watchfiles-1.0.5websockets-15.0.1wrapt-1.17.2xformers-0.0.29.post2 xgrammar-0.1.18WARNING: Running pip as the'root'user can result in broken permissionsandconflicting behaviour with thesystempackagemanager, possibly rendering yoursystemunusable.It is recommended tousea virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action optionifyou know what you are doingandwant to suppress this warning.

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">启动训练

ingFang SC", Cambria, Cochin, Georgia, Times, "Times New Roman", serif;font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;letter-spacing: normal;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;background-color: rgb(255, 255, 255);text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;">进入容器后，确保容器中能看到如下目录：

要训练的基础模型目录：/models/Qwen/Qwen3-4B-Base
数据集1：/datasets/unsloth/OpenMathReasoning-mini/data/cot-00000-of-00001.parquet
数据集2：/datasets/open-r1/DAPO-Math-17k-Processed/en/train-00000-of-00001.parquet
工作目录和训练代码文件：/workspace/scripts/unsloth-grpo-qwen3.py

在容器的 /workspace/scripts/ 目录下执行如下代码启动训练：

catunsloth-grpo-qwen3.py>unsloth-grpo-qwen3.py.log&&\nohuppythonunsloth-grpo-qwen3.py>>unsloth-grpo-qwen3.py.log2>&1&

我们有意将训练代码刷到了训练日志的前面做固定，这样做的好处是，方便代码迭代过程中做问题排查。

其中训练代码 unsloth-grpo-qwen3.py 的最新版本已经针对 24G显存的4090卡做了参数边界优化，并且做了详细注释。代码内容在上一篇文章中：https://mp.weixin.qq.com/s/olblI2gE3HHDSEGnejGBrw 需要的小伙伴可自行取用。本文为了快速跑完测试，对 max_steps 等参数做了限制。下面是训练代码执行中各阶段对应的日志分析:

训练日志

加载模型

? Unsloth: Will patch your computertoenable2x fasterfreefinetuning.? Unsloth Zoo will now patch everythingtomake training faster!INFO05-2510:56:06[importing.py:53] Tritonmodulehas been replacedwitha placeholder.INFO05-2510:56:06[__init__.py:239] Automatically detected platform cuda.=====step1. 加载模型=======================================================================((====))== Unsloth2025.5.6: Fast Qwen3 patching. Transformers:4.51.3. vLLM:0.8.5.post1. \\ /|  NVIDIA GeForce RTX4090.Num GPUs=1.Max memory:23.65GB. Platform: Linux.O^O/\_/\  Torch:2.6.0+cu124. CUDA:8.9. CUDA Toolkit:12.4. Triton:3.2.0\    /  Bfloat16=TRUE. FA [Xformers=0.0.29.post2. FA2=False]"-____-"  Freelicense: http://github.com/unslothai/unslothUnsloth: Fast downloadingisenabled-ignore downloading bars whicharered colored!Unsloth: vLLM loading/models/Qwen/Qwen3-4B-Basewithactual GPU utilization=68.76%Unsloth: Your GPU has CUDA compute capability8.9withVRAM=23.65GB.Unsloth:Usingconservativeness=1.0. Chunked prefill tokens=2048.Num Sequences=224.Unsloth: vLLM's KV Cache can use up to 9.31 GB. Also swap space = 6 GB.INFO 05-25 10:59:17 [config.py:717] This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'.INFO 05-25 10:59:17 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.INFO 05-25 10:59:17 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: 。。。WARNING 05-25 10:59:17 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f19973bbd50>INFO 05-25 10:59:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0INFO 05-25 10:59:28 [cuda.py:221] Using Flash Attention backend on V1 engine.WARNING 05-25 10:59:28 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.INFO 05-25 10:59:28 [gpu_model_runner.py:1329] Starting to load model /models/Qwen/Qwen3-4B-Base...Loading safetensors checkpoint shards:  0% Completed | 0/3 [00:00<?, ?it/s]Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.20s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.60s/it]Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.52s/it]
INFO 05-25 10:59:33 [loader.py:458] Loading weights took 4.81 secondsINFO 05-25 10:59:33 [punica_selector.py:18] Using PunicaWrapperGPU.INFO 05-25 10:59:33 [gpu_model_runner.py:1347] Model loading took 7.6334 GiB and 5.084545 secondsINFO 05-25 10:59:48 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/f7b249c75c/rank_0_0 for vLLM's torch.compileINFO05-2510:59:48[backends.py:430] Dynamo bytecode transformtime:14.75sInductor Compilation:100%|██████████|6/6[00:01<00:00, 5.13it/s, triton_poi_fused_add_mul_sub_5]INFO05-2510:59:53[backends.py:136] Cache the graphofshapeNoneforlater use。。。Inductor Compilation:100%|██████████|5/5[00:00<00:00,28.37it/s, triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_4]INFO05-2511:00:39[backends.py:148] Compiling a graphforgeneral shape takes49.26sINFO05-2511:03:14[monitor.py:33] torch.compile takes64.01sintotalINFO05-2511:03:18[kv_cache_utils.py:634] GPU KV cache size:49,856tokensINFO05-2511:03:18[kv_cache_utils.py:637] Maximum concurrencyfor2,048tokensperrequest:24.34xINFO05-2511:04:19[gpu_model_runner.py:1686] Graph capturing finishedin61secs, took3.94GiBINFO05-2511:04:19[core.py:159] init engine (profile,createkv cache, warmup model) took286.49secondsUnsloth2025.5.6patched36layerswith36QKV layers,36O layersand36MLP layers.

模型结构

对加载模型，插入 lora 后的结构：

/models/Qwen/Qwen3-4B-Basedoesnothaveapaddingtoken!Willusepad_token=<|vision_pad|>.modeleftModelForCausalLM((base_model)oraModel((model)wen3ForCausalLM((model)wen3Model((embed_tokens):Embedding(151936,2560,padding_idx=151654)(layers):ModuleList((0-35):36xQwen3DecoderLayer((self_attn)wen3Attention((q_proj):lora.Linear((base_layer)inear(in_features=2560,out_features=4096,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default)inear(in_features=2560,out_features=32,bias=False))(lora_B):ModuleDict((default)inear(in_features=32,out_features=4096,bias=False))(lora_embedding_A)arameterDict()(lora_embedding_B)arameterDict()(lora_magnitude_vector):ModuleDict())(k_proj):lora.Linear((base_layer)inear(in_features=2560,out_features=1024,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default)inear(in_features=2560,out_features=32,bias=False))(lora_B):ModuleDict((default)inear(in_features=32,out_features=1024,bias=False))(lora_embedding_A)arameterDict()(lora_embedding_B)arameterDict()(lora_magnitude_vector):ModuleDict())(v_proj):lora.Linear((base_layer)inear(in_features=2560,out_features=1024,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default)inear(in_features=2560,out_features=32,bias=False))(lora_B):ModuleDict((default)inear(in_features=32,out_features=1024,bias=False))(lora_embedding_A)arameterDict()(lora_embedding_B)arameterDict()(lora_magnitude_vector):ModuleDict())(o_proj):lora.Linear((base_layer):Linear(in_features=4096,out_features=2560,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default):Linear(in_features=4096,out_features=32,bias=False))(lora_B):ModuleDict((default):Linear(in_features=32,out_features=2560,bias=False))(lora_embedding_A)arameterDict()(lora_embedding_B)arameterDict()(lora_magnitude_vector):ModuleDict())(q_norm)wen3RMSNorm((128,),eps=1e-06)(k_norm)wen3RMSNorm((128,),eps=1e-06)(rotary_emb):LlamaRotaryEmbedding())(mlp)wen3MLP((gate_proj):lora.Linear((base_layer):Linear(in_features=2560,out_features=9728,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default):Linear(in_features=2560,out_features=32,bias=False))(lora_B):ModuleDict((default):Linear(in_features=32,out_features=9728,bias=False))(lora_embedding_A):ParameterDict()(lora_embedding_B):ParameterDict()(lora_magnitude_vector):ModuleDict())(up_proj):lora.Linear((base_layer):Linear(in_features=2560,out_features=9728,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default):Linear(in_features=2560,out_features=32,bias=False))(lora_B):ModuleDict((default):Linear(in_features=32,out_features=9728,bias=False))(lora_embedding_A):ParameterDict()(lora_embedding_B):ParameterDict()(lora_magnitude_vector):ModuleDict())(down_proj):lora.Linear((base_layer):Linear(in_features=9728,out_features=2560,bias=False)(lora_dropout):ModuleDict((default):Identity())(lora_A):ModuleDict((default):Linear(in_features=9728,out_features=32,bias=False))(lora_B):ModuleDict((default):Linear(in_features=32,out_features=2560,bias=False))(lora_embedding_A):ParameterDict()(lora_embedding_B):ParameterDict()(lora_magnitude_vector):ModuleDict())(act_fn):SiLU())(input_layernorm)wen3RMSNorm((2560,),eps=1e-06)(post_attention_layernorm)wen3RMSNorm((2560,),eps=1e-06)))(norm)wen3RMSNorm((2560,),eps=1e-06)(rotary_emb):LlamaRotaryEmbedding())(lm_head):Linear(in_features=2560,out_features=151936,bias=False))))

GRPO对话模版

=====step2.准备GRPO对话模版================================================================对话模板输出样例:Youaregivenaproblem.Thinkabouttheproblemandprovideyourworkingout.Placeitbetween<start_working_out>and<end_working_out>.Then,provideyoursolutionbetween<SOLUTION>and</SOLUTION><|endoftext|>Whatis1+1?<start_working_out>Ithinkit's2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>Whatis2+2?<start_working_out>

格式遵循预微调

在 GRPO 训练之前，先用带推理过程的对话数据，对原始模型做一个简单的 SFT 训练，以让模型具备按我们的推理格式进行输出的能力。

先要对原始数据集进行清理，筛选出适合做格式遵循微调的数据来：

===== step3. 格式遵循预微调 ===============================================================----- 清洗后的数据集:   expected_answer ...                 generated_solution0         14 ... <think>\nOkay, let's see. I need to solve the ...6         -2 ... <think>\nOkay, so I need to find the value of ...9         18 ... <think>\nOkay, so I need to solve the equation...13         2 ... <think>\nOkay, so I need to evaluate the infin...17         30 ... <think>\nAlright, so I need to find the larges......        ... ...                        ...19243       244 ... <think>\nOkay, so I need to find the value of ...19245        1 ... <think>\nOkay, so I have this problem where a ...19247        4 ... <think>\nOkay, let's tackle this problem step ...19248       18 ... <think>\nOkay, let's see. I need to find the n...19250     0.8960 ... <think>\nOkay, so I need to find the probabili...
[7507 rows x 3 columns]

-----第1条OpenMathReasoning数据格式化后应用对话模板的输出:Youare given a problem.Thinkabout the problem and provide your working out.Placeit between<start_working_out>and<end_working_out>.Then, provide your solution between<SOLUTION>and</SOLUTION><|endoftext|>Given$\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and$x$ispositive, find all possible values of$x$.<start_working_out>Okay,let's see.Ineed to solve the equation√(x²+165)-√(x²-52)=7, and find all positive values of x.Hmm, radicals can be tricky, but maybeifIcan eliminate the square roots by squaring both sides.Letmetrythat.
First,letme write down the equation again to make sureIhave it right:
√(x²+165)-√(x²-52)=7.
Okay, so the ideaisto isolate one of the radicals and then square both sides.Letmetrymoving the second radical to the other side:
√(x²+165)=7+√(x²-52).
Now,ifIsquare both sides, maybeIcangetrid of the square roots.Let'sdothat:
(√(x²+165))²=(7+√(x²-52))².
Simplifyingthe left side:
x²+165=49+14√(x²-52)+(√(x²-52))².
Theright sideisexpanded using the formula (a+b)²=a²+2ab+b².Sothe right side becomes7²+2*7*√(x²-52)+(√(x²-52))², whichis49+14√(x²-52)+(x²-52).
Soputting it all together:
x²+165=49+14√(x²-52)+x²-52.
Hmm,let's simplify the right side.Thex² terms will cancel out, right?Let's subtract x² from both sides:
165=49+14√(x²-52)-52.
Simplifythe constants on the right:
49-52is-3, so:
165=-3+14√(x²-52).
Now, add3to both sides to isolate the radical term:
165+3=14√(x²-52).
So168=14√(x²-52).
Divideboth sides by14:
168/14=√(x²-52).
12=√(x²-52).
Now, square both sides again to eliminate the square root:
12²=x²-52.
144=x²-52.
Add52to both sides:
144+52=x².
196=x².
Sox=√196=14.
Butwait, since the problem states that xispositive, we only take the positive root.Sox=14.
Buthold on, when dealing with squaring equations, sometimes extraneous solutions can come up.Ishould checkifthis solution actually satisfies the original equation.
Let's plug x=14back into the original equation:
√(14²+165)-√(14²-52)=?
Calculateeachterm:
14²is196.
Sofirst radical:√(196+165)=√361=19.
Secondradical:√(196-52)=√144=12.
So19-12=7, whichisexactly the right-hand side.Soyes, it checks out.
Therefore, the only solutionisx=14.Sincethe problem says xispositive, we don't have to consider negative roots.SoIthink that's the answer.Tosolve the equation \(\sqrt{x^2+165}-\sqrt{x^2-52}=7\)forpositive \(x\), we proceedasfollows:
1.Startwith the given equation: \[ \sqrt{x^2+165}-\sqrt{x^2-52}=7 \]
2.Isolateone of the square roots by moving \(\sqrt{x^2-52}\) to the right side: \[ \sqrt{x^2+165}=7+\sqrt{x^2-52} \]
3.Squareboth sides to eliminate the square root on the left: \[ (\sqrt{x^2+165})^2=(7+\sqrt{x^2-52})^2 \] Simplifyingboth sides, weget: \[ x^2+165=49+14\sqrt{x^2-52}+(x^2-52) \]
4.Combinelike terms on the right side: \[ x^2+165=x^2-52+49+14\sqrt{x^2-52} \] Simplifyingfurther: \[ x^2+165=x^2-3+14\sqrt{x^2-52} \]
5.Subtract\(x^2\) from both sides: \[ 165=-3+14\sqrt{x^2-52} \]
6.Add3to both sides to isolate the term with the square root: \[ 168=14\sqrt{x^2-52} \]
7.Divideboth sides by14: \[ 12=\sqrt{x^2-52} \]
8.Squareboth sides again to eliminate the square root: \[ 12^2=x^2-52 \] Simplifying: \[ 144=x^2-52 \]
9.Add52to both sides to solvefor\(x^2\): \[ 196=x^2 \]
10.Takethe positive square root (since \(x\)ispositive):  \[  x=\sqrt{196}=14  \]
11.Verifythe solution by substituting \(x=14\) back into the original equation:  \[  \sqrt{14^2+165}-\sqrt{14^2-52}=\sqrt{196+165}-\sqrt{196-52}=\sqrt{361}-\sqrt{144}=19-12=7  \] Thesolution checks out.
Thus, the only positive solutionis:\[\boxed{14}\]<end_working_out><SOLUTION>14</SOLUTION><|endoftext|>num_proc must be<=58.Reducingnum_proc to58fordataset of size58.[2025-05-2511:05:41]WARNINGarrow_dataset.py:3010: num_proc must be<=58.Reducingnum_proc to58fordataset of size58.
dataset.shape58,5)

-----处理好的预微调数据集ataset({features:['expected_answer','problem','generated_solution','Messages','N','text','__index_level_0__'],num_rows:58})Unsloth:Tokenizing["text"](num_proc=58):100%|██████████|58/58[00:07<00:00,7.99examples/s]

然后开始预微调训练：

==((====))==Unsloth-2xfasterfreefinetuning|NumGPUsused=1\\/|Numexamples=58|NumEpochs=2|Totalsteps=116O^O/\_/\Batchsizeperdevice=1|Gradientaccumulationsteps=1\/DataParallelGPUs=1|Totalbatchsize(1x1x1)=1"-____-"Trainableparameters=66,060,288/4,088,528,384(1.62%trained)100%|██████████|116/116[00:47<00:00,2.46it/s]Unsloth:WillsmartlyoffloadgradientstosaveVRAM!{'loss':0.7447,'grad_norm':0.6478227376937866,'learning_rate':0.00016,'epoch':0.09}{'loss':0.6066,'grad_norm':0.640754759311676,'learning_rate':0.00019279279279279282,'epoch':0.17}{'loss':0.4543,'grad_norm':0.6311891674995422,'learning_rate':0.0001837837837837838,'epoch':0.26}{'loss':0.4684,'grad_norm':0.5015860199928284,'learning_rate':0.00017477477477477476,'epoch':0.34}{'loss':0.4063,'grad_norm':0.5008582472801208,'learning_rate':0.00016576576576576578,'epoch':0.43}{'loss':0.3979,'grad_norm':0.5995965600013733,'learning_rate':0.00015675675675675676,'epoch':0.52}{'loss':0.4248,'grad_norm':0.4734836518764496,'learning_rate':0.00014774774774774775,'epoch':0.6}{'loss':0.4197,'grad_norm':0.5012277960777283,'learning_rate':0.00013873873873873876,'epoch':0.69}{'loss':0.4511,'grad_norm':0.548245906829834,'learning_rate':0.00012972972972972974,'epoch':0.78}{'loss':0.3974,'grad_norm':0.42141056060791016,'learning_rate':0.00012072072072072073,'epoch':0.86}{'loss':0.3317,'grad_norm':0.4644368886947632,'learning_rate':0.0001117117117117117,'epoch':0.95}{'loss':0.3846,'grad_norm':0.3927017152309418,'learning_rate':0.0001027027027027027,'epoch':1.03}{'loss':0.2501,'grad_norm':0.5447007417678833,'learning_rate':9.36936936936937e-05,'epoch':1.12}{'loss':0.278,'grad_norm':0.4823240339756012,'learning_rate':8.468468468468469e-05,'epoch':1.21}{'loss':0.2645,'grad_norm':0.5164972543716431,'learning_rate':7.567567567567568e-05,'epoch':1.29}{'loss':0.2584,'grad_norm':0.5759400725364685,'learning_rate':6.666666666666667e-05,'epoch':1.38}{'loss':0.2121,'grad_norm':0.5618821978569031,'learning_rate':5.765765765765766e-05,'epoch':1.47}{'loss':0.2322,'grad_norm':0.5534489154815674,'learning_rate':4.8648648648648654e-05,'epoch':1.55}{'loss':0.2256,'grad_norm':0.6181885600090027,'learning_rate':3.963963963963964e-05,'epoch':1.64}{'loss':0.1841,'grad_norm':0.48197486996650696,'learning_rate':3.063063063063063e-05,'epoch':1.72}{'loss':0.2789,'grad_norm':0.6069267988204956,'learning_rate':2.1621621621621624e-05,'epoch':1.81}{'loss':0.2148,'grad_norm':0.5475031137466431,'learning_rate':1.2612612612612611e-05,'epoch':1.9}{'loss':0.2263,'grad_norm':0.6717495918273926,'learning_rate':3.603603603603604e-06,'epoch':1.98}{'train_runtime':47.1361,'train_samples_per_second':2.461,'train_steps_per_second':2.461,'train_loss':0.35218508824192246,'epoch':2.0}-----训练后模型的model.dtype:torch.bfloat16

-----格式遵循预微调训练完成，检测下模型是否学会了遵循我们自定义的格式-----Youare given a problem.Thinkabout the problem and provide your working out.Placeit between<start_working_out>and<end_working_out>.Then, provide your solution between<SOLUTION>and</SOLUTION><|endoftext|>Jeniferhas82centsinpennies and nickels.Heryounger brother mistook all her nickelsfordimes and counted the totalas$1.47.Howmany pennies doesJeniferhave?<start_working_out><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|>ollet's start by setting up the problem.Let's denote the number of penniesas\( p \) and the number of nickelsas\( n \).Weknow that the total value of the coinsis82cents, so we can write the equation:\[ p+5n=82\]
Next, we need to consider the mistake made byJenifer's younger brother.Hemistook all the nickelsfordimes, so he counted the totalas$1.47, whichis147cents.Thisgives us another equation:\[ p+10n=147\]
Now, we have a system of two equations:1. \( p+5n=82\)2. \( p+10n=147\)
Tosolve this system, we can subtract the first equation from the second to eliminate \( p \):\[ (p+10n)-(p+5n)=147-82\]\[ 5n=65\]\[ n=13\]
Nowthat we have \( n=13\), we can substitute this value back into the first equation to find \( p \):\[ p+5(13)=82\]\[ p+65=82\]\[ p=17\]
So,Jeniferhas17pennies.Let's verify the solution:-Thevalue of17penniesis\(17\times1=17\) cents.-Thevalue of13nickelsis\(13\times5=65\) cents.-Thetotal valueis\(17+65=82\) cents, which matches the given total.
Thus, the number of penniesJeniferhasis\(\boxed{17}\).<end_working_out><SOLUTION>17</SOLUTION><|endoftext|>

可以看到，预微调之后的模型输出，满足期望。推理过程放在了我们指定的标签<start_working_out> 和 <end_working_out> 之间，最终答案放在了指定的标签之间。（其中 <start_working_out> 会添加在对话问题中，引导模型输出推理。所以模型没有输出这个开始标签，而是直接输出思考内容，然后以<end_working_out>结束思考）

处理GRPO数据集

=====step4.加载并处理数据集==============================================================-----数据集DAPO-Math-17k-Processedataset({features:['prompt','solution','data_source','source_prompt','ability','reward_model','extra_info'],num_rows:14116})-----第一条的prompt:Intriangle$ABC$,$\sin\angleA=\frac{4}{5}$and$\angleA<90^\circ$.Let$D$beapointoutsidetriangle$ABC$suchthat$\angleBAD=\angleDAC$and$\angleBDC=90^\circ$.Supposethat$AD=1$andthat$\frac{BD}{CD}=\frac{3}{2}$.If$AB+AC$canbeexpressedintheform$\frac{a\sqrt{b}}{c}$where$a,b,c$arepairwiserelativelyprimeintegers,find$a+b+c$.-----第一条的solution:34Map:100%|██████████|14116/14116[00:01<00:00,11664.38examples/s]-----第1条对话格式的内容：{'prompt':[{'content':'Youaregivenaproblem.\nThinkabouttheproblemandprovideyourworkingout.\nPlaceitbetween<start_working_out>and<end_working_out>.\nThen,provideyoursolutionbetween<SOLUTION>and</SOLUTION>','role':'system'},{'content':'Intriangle$ABC$,$\\sin\\angleA=\\frac{4}{5}$and$\\angleA<90^\\circ$.Let$D$beapointoutsidetriangle$ABC$suchthat$\\angleBAD=\\angleDAC$and$\\angleBDC=90^\\circ$.Supposethat$AD=1$andthat$\\frac{BD}{CD}=\\frac{3}{2}$.If$AB+AC$canbeexpressedintheform$\\frac{a\\sqrt{b}}{c}$where$a,b,c$arepairwiserelativelyprimeintegers,find$a+b+c$.','role':'user'}],'solution':'34','data_source':'math_dapo','source_prompt':[{'content':'Solvethefollowingmathproblemstepbystep.ThelastlineofyourresponseshouldbeoftheformAnswerAnswer(withoutquotes)where$Answeristheanswertotheproblem.\n\nIntriangle$ABC$,$\\sin\\angleA=\\frac{4}{5}$and$\\angleA<90^\\circ$.Let$D$beapointoutsidetriangle$ABC$suchthat$\\angleBAD=\\angleDAC$and$\\angleBDC=90^\\circ$.Supposethat$AD=1$andthat$\\frac{BD}{CD}=\\frac{3}{2}$.If$AB+AC$canbeexpressedintheform$\\frac{a\\sqrt{b}}{c}$where$a,b,c$arepairwiserelativelyprimeintegers,find$a+b+c$.\n\nRemembertoputyouransweronitsownlineafter"Answer:".','role':'user'}],'ability':'MATH','reward_model':{'ground_truth':'34','style':'rule-lighteval/MATH_v2'},'extra_info':{'index':'9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'},'answer':'34'}Map:100%|██████████|14116/14116[00:04<00:00,3005.13examples/s]Youaregivenaproblem.Thinkabouttheproblemandprovideyourworkingout.Placeitbetween<start_working_out>and<end_working_out>.Then,provideyoursolutionbetween<SOLUTION>and</SOLUTION><|endoftext|>Intriangle$ABC$,$\sin\angleA=\frac{4}{5}$and$\angleA<90^\circ$.Let$D$beapointoutsidetriangle$ABC$suchthat$\angleBAD=\angleDAC$and$\angleBDC=90^\circ$.Supposethat$AD=1$andthat$\frac{BD}{CD}=\frac{3}{2}$.If$AB+AC$canbeexpressedintheform$\frac{a\sqrt{b}}{c}$where$a,b,c$arepairwiserelativelyprimeintegers,find$a+b+c$.<start_working_out>Map:100%|██████████|14116/14116[00:02<00:00,5114.02examples/s]Youaregivenaproblem.Thinkabouttheproblemandprovideyourworkingout.Placeitbetween<start_working_out>and<end_working_out>.Then,provideyoursolutionbetween<SOLUTION>and</SOLUTION><|endoftext|>Intriangle$ABC$,$\sin\angleA=\frac{4}{5}$and$\angleA<90^\circ$.Let$D$beapointoutsidetriangle$ABC$suchthat$\angleBAD=\angleDAC$and$\angleBDC=90^\circ$.Supposethat$AD=1$andthat$\frac{BD}{CD}=\frac{3}{2}$.If$AB+AC$canbeexpressedintheform$\frac{a\sqrt{b}}{c}$where$a,b,c$arepairwiserelativelyprimeintegers,find$a+b+c$.<start_working_out>Map:100%|██████████|14116/14116[00:02<00:00,5114.02examples/s]==((====))==Unsloth-2xfasterfreefinetuning|NumGPUsused=1\\/|Numexamples=12,709|NumEpochs=1|Totalsteps=100O^O/\_/\Batchsizeperdevice=4|Gradientaccumulationsteps=1\/DataParallelGPUs=1|Totalbatchsize(4x1x1)=4"-____-"Trainableparameters=66,060,288/4,088,528,384(1.62%trained)MaxLength=203

定义并测试奖励函数

训练脚本中定义了4个奖励函数，如下是对其中2个的打分样例：

===== step5. 定义并测试奖励函数 ============================================================match_format:re.compile('<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>[\\s]{0,}(?:<\\|endoftext\\|>)?[\\s]{0,}$', re.MULTILINE|re.DOTALL)
----- 奖励函数 check_answer 评分样例：
Case | Question    | Response             | Answer  | Extracted | Score------------------------------------------------------------------------------------------1  | Q: 2+2 = ?   | Let me think!<end_working_out><SOLUTION>4</SOLUTION>| 4    | 4     | 5.02  | Q: Hello?    |<start_working_out>think..<end_working_out><SOLUTION> yes </SOLUTION>| yes   |  yes   | 3.53  | Q: Value?    | !<end_working_out><SOLUTION>9.5</SOLUTION>| 10    | 9.5    | 2.04  | Q: Value?    | !<end_working_out><SOLUTION>8.3</SOLUTION>| 10    | 8.3    | 1.55  | Q: Value?    | i!<end_working_out><SOLUTION>5</SOLUTION>| 10    | 5     | -2.56  | Q: Answer?   | i!<end_working_out><SOLUTION>no digit</SOLUTION>| 42    | no digit  | -4.57  | Q: String?   | f!<end_working_out><SOLUTION>oobar</SOLUTION>| baz   | oobar   | -4.5
----- 奖励函数 check_numbers 评分样例：
Case | Question       | Response(s)     | Answer(s)    | Score(s)--------------------------------------------------------------------------------------1  | Q: 2+2=?       |<SOLUTION>4     | 4        | 3.52  | 问：总量？        |<SOLUTION>1,234.00 | 1234.0     | 3.53  | Q: 输出吧        |<SOLUTION>没有数字    | 0        | -2.54  | Q: 10-3=?      |<SOLUTION>5     | 7        | -1.55  | Q: 1+1=?,2+2=?    |<SOLUTION>2,4    | 2,4       | 3.5,-2.5

进行GRPO训练

===== step6. 训练模型 =====================================================================。。。 5%|▌     | 5/100 [02:19<40:11, 25.38s/it]********************Question:。。。100%|██████████| 100/100 [46:13<00:00, 27.74s/it]can fire at a single
Extracted:None{'loss': 0.0067,'grad_norm': 0.2612026631832123,'learning_rate': 2.7777777777777776e-07,'rewards/match_format_exactly': 2.25,'rewards/match_format_approximately': 0.375,'rewards/check_answer': 3.25,'rewards/check_numbers': 2.0,'reward': 7.875,'reward_std': 10.25,'completion_length': 1379.25,'kl': 0.1681276112794876,'epoch': 0.01}{'loss': 0.0052,'grad_norm': 0.22037602961063385,'learning_rate': 2.2222222222222224e-07,'rewards/match_format_exactly': 1.5,'rewards/match_format_approximately': -0.75,'rewards/check_answer': 1.5,'rewards/check_numbers': 0.5,'reward': 2.75,'reward_std': 11.83568000793457,'completion_length': 1665.75,'kl': 0.1294582337141037,'epoch': 0.01}{'loss': 0.0043,'grad_norm': 0.18991202116012573,'learning_rate': 1.6666666666666668e-07,'rewards/match_format_exactly': 0.75,'rewards/match_format_approximately': -1.875,'rewards/check_answer': -0.25,'rewards/check_numbers': -1.0,'reward': -2.375,'reward_std': 10.25,'completion_length': 1775.5,'kl': 0.1086689755320549,'epoch': 0.01}{'loss': 0.0046,'grad_norm': 0.025854697450995445,'learning_rate': 1.1111111111111112e-07,'rewards/match_format_exactly': 0.0,'rewards/match_format_approximately': -3.0,'rewards/check_answer': -2.0,'rewards/check_numbers': -2.5,'reward': -7.5,'reward_std': 0.0,'completion_length': 1844.0,'kl': 0.11534835398197174,'epoch': 0.01}{'loss': 0.0052,'grad_norm': 0.05398529767990112,'learning_rate': 5.555555555555556e-08,'rewards/match_format_exactly': 0.0,'rewards/match_format_approximately': -3.0,'rewards/check_answer': -2.0,'rewards/check_numbers': -2.5,'reward': -7.5,'reward_std': 0.0,'completion_length': 1844.0,'kl': 0.1298024207353592,'epoch': 0.01}{'train_runtime': 2773.8914,'train_samples_per_second': 0.144,'train_steps_per_second': 0.036,'train_loss': 0.005772246685810387,'epoch': 0.01}

这里我们设置了 max_steps = 100 以便尽快完成测试。正式训练时，可以通过设置 epochs 来控制训练轮数，并根据 loss 收敛情况等提前结束训练。

测试训练好的模型

Processedprompts:100%|██████████|1/1[00:11<00:00,11.75s/it,est.speedinput:0.85toks/s,output:87.15toks/s]-----基础模型的回答:-AnswersMathandArithmeticWhatisthesqrtof101?WikiUser∙2010-05-2922:38:13BestAnswerCopyThesquarerootof101is10.0498756approximatelyWikiUser∙2010-05-2922:38:13Thisansweris:?0?0?0Whatisthesquarerootof101?Itisapprox.10.0498756211.Whatisthesquarerootof-101?Thesquarerootof-101canbewrittenastheproductofthepositivesquarerootof101andi(whereiisanimaginarynumber).Thesquarerootof101isapproximately10.04987751.Whatisthesquarerootof101simplified?sqrt(101)isalreadysimplifiedsince101isnotaperfectsquare.Also,wecannotsimplifyitsince101isaprimenumber.(Inotherwords,101=1x101,soitsonlyfactorizationis1and101)Indecimalformitis:10.049875621120891586572919348985505109596599416484785647300807046Whatisthesquarerootof20200?sqrt(20200)=sqrt(4x100x505)=sqrt(4)xsqrt(100)xsqrt(505)=2x10xsqrt(505)=20xsqrt(505)=10xsqrt(4)xsqrt(505)=50sqrt(149)Is17asquareroot?17isasquareroot.Whatisthesquarerootof3in101?sqrt(3in101)=sqrt(101)xsqrt(3)=sqrt(101x3)=sqrt(303)=17.4069...approx.Whatisanirrationalnumber?-101asanumber.Whyisonesixthofonethirdthesameasonesquarerootofonehundredsixtynine?sqrt(169)=131/6of1/3=(1/6)(1/3)=1/(6x3)=1/18=(1/13)(1/13)=1/sqrt(169)Whatnumberwhensquaredequalssix?Ifyoumeantsqrt(6)2thenthis=6andsqrt(6)=2.4494...Forthenumbertobeasquarerootyouneedthe6tobeinthedenominatororthesquarerootof6.Whatis-20sqrtof101?-20squarerootof(101)-20*sqrt(101)-20*sqrt(101)isarealnumberandcannotbesimplifiedanyfurther.Whatisthesquarerootof101.8?9.046921979920897...Whatisthesquarerootof7561?Assqrt(7561)=sqrt(169)*sqrt(41)=13*sqrt(41)~=274.84779...Whatisthesqrtof0.25?Thesqrtof0.25is0.5Whatisthesquarerootof169over10?Itis1.3Howdoyoufindthesquarerootof51?Thesquarerootof51isapprox.7.1414Theeasiestwaytodothatistouseacalculator.Ifyoudonothaveacalculator,Istronglysuggestusingone,sincethesqrtof51isanirrationalnumberwithaninfiniteamountofdecimalplaces.Squareandcuberootscanbecalculatedtheold-fashionedmannerbyusingtrialanderror.72=49whichistoosmall;82=64whichistoobig,etc.Ifyouneedtogothatroute,youneedtoknowyourbasicProcessedprompts:100%|██████████|1/1[00:26<00:00,26.65s/it,est.speedinput:2.25toks/s,output:74.61toks/s]

----- GRPO-LoRA模型的回答:Okay, so I need to find the square root of101.Hmm,letme think. The square root of a numberisthevaluethat,whenmultipliedbyitself, gives the original number. But101seems like it's not a perfect square, right? I remember that perfect squares like 100, 121, 144, etc., are numbers that have exact square roots since they're squares of integers.101isjust3less than100, whichis10squared. So maybe √101isclose to10butnotexactly10.Let me calculate10squared first.10×10is100.So √101isa little more than10.How much more? Well,10.5squaredis110.25, whichishigher than101.So it has to be between10and10.5. Maybe10.05? Let metrythat.10.05squaredis10.05×10.05. Let me compute that.10×10is100,10×0.05is0.5,0.05×10is0.5,and0.05×0.05is0.0025. Adding them up:100+0.5+0.5+0.0025=101.0025. That's very close to 101, so √101 is approximately 10.05. But wait, 10.04 squared might be slightly less. Let me check 10.04: 10.04 × 10.04. This is 10.04 × 10 = 100.4, then 0.04 × 10 = 0.4, and 0.04×0.04 = 0.0016. So total: 100.4 + 0.4 = 100.8, plus 0.0016, which is 100.8016. That's less than101.So √101isbetween10.04and10.05. Maybe10.045? Let metry10.045:10.045squared. Hmm,10×10=100,10×0.045=0.45,0.045×10=0.45,0.045×0.045=0.002025. Adding:100+0.45+0.45=100.9, plus0.002025is100.902025, whichisstill less than101.So it's closer to 10.04. Maybe 10.042? Let me try 10.042: 10.042 squared. 10×10=100, 10×0.042=0.42, 0.042×10=0.42, 0.042×0.042=0.001764. Adding: 100 + 0.42 + 0.42 = 100.84, plus 0.001764 is 100.841764, which is still below 101. So it's closer to10.043. Let metry10.043:10.043squared.10×10=100,10×0.043=0.43,0.043×10=0.43,0.043×0.043=0.001849. Adding:100+0.43+0.43=100.86, plus0.001849is100.861849, still less than101.So √101isapproximately10.043. But I wonderifthere's a better way. Maybe using the Newton-Raphson method for square roots? Let me recall that method. Let x be the number we want to find the square root of. Start with an initial guess, say 10. Then the method is x_(new) = (x + N/x)/2, where N is the number, which is 101. So first guess is 10. Then, (10 + 101/10)/2 = (10 + 10.1)/2 = 20.1/2 = 10.05. Next iteration: (10.05 + 101/10.05)/2. 101 divided by 10.05 is approximately 10.049751. So 10.05 + 10.049751 = 20.099751, divided by 2 is 10.049875. So after two iterations, the approximation is approximately 10.049875. That's more accurate than my previous guesses. So the square root of101isapproximately10.049875. But the problem didn't specify how precise the answer needs to be, so maybe just the decimal approximation is acceptable. So let me express that in decimal form. 10.049875... Hmm, four decimal places would be 10.0499. But to be precise, I should keep more. So maybe 10.049875? But that's a bit too far. Let me verifywiththe Newton-Raphson method again. Startwith10.(10+101/10)/2=10.05. Then (10.05+101/10.05)/2.101/10.05is10.04975. So10.05+10.04975=20.09975. Dividedby2is10.049875. Then next iteration: (10.049875+101/10.049875)/2.101dividedby10.049875isapproximately10.049874. So10.049875+10.049874=20.099749, dividedby2is10.0498745. So after three iterations, it's approximately 10.0498745. So the square root of 101 is approximately 10.049875. Therefore, I think the answer is around 10.05. But for a better approximation, maybe using more iterations or a calculator. But since the problem doesn't specify, I'll go with 10.05. Let me check if 10.05 squared is 101. 10.05 × 10.05. Let me multiply that out. 10×10=100, 10×0.05=0.5, 0.05×10=0.5, 0.05×0.05=0.0025. Adding: 100 + 0.5 + 0.5 = 101, plus 0.0025 is 101.0025. So √101 is slightly less than 10.05. Therefore, the square root is approximately 10.049875. So in decimal form, that's about10.050. Therefore, the square root of101isapproximately10.05.To find the square root of101, we can use the Newton-Raphson methodforapproximation. The method startswithan initial guessanditeratively refines it.
1.**Initial Guess**: Startwith\( x_0

这里看到回答因为长度问题被截断了。因为本文测试代码在推理时设置了最大生成token数 max_tokens = 1024；这些问题在最新的训练代码中已经解决了。参见：

《单卡4090上一键GRPO微调Qwen3最新模型的训练代码》

合并及保存模型

Lora训练完后，不会更改原来的模型，也不会生成完整的新模型。而是生成一个额外的较小的 Lora 权重，里面就是训练好的内容。在如上测试中，是把这个 Lora 权重作为外挂，和原始权重一起加载做的推理测试。测试完成后，我们需要把外挂lora权重，和原始模型做合并，生成一个新的完整的模型。

=====step8.合并及保存模型===============================================================Unsloth:Merging4bitandLoRAweightsto16bit...Unsloth:Willuseupto336.32outof503.72RAMforsaving.Unsloth:Savingmodel...Thismighttake5minutes...17%|█▋|6/36[00:00<00:00,57.02it/s]WewillsavetoDiskandnotRAMnow.100%|██████████|36/36[00:05<00:00,6.11it/s]Unsloth:Savingtokenizer...Done.Done.

得到的模型还可以进一步做需要的量化处理。

欢迎光临链载Ai (https://www.lianzai.com/)