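A reproduction by a research group of fourth-year undergraduates at USTC (University of Science and Technology of China), run on the Logic Puzzle Dataset. It shows that the original base model only produces basic step-by-step logic on the test set. With no Long CoT cold-start distillation, after three stages of rule-based RL, the model learned to:
- Hesitate (flag the step it is currently unsure about and wait for later verification)
- Explore multiple paths (Let's test both possibilities)
- Backtrack to its earlier analysis (Analyze .. statement again)
- Summarize at intervals (Let's summarize, Now we have determined)
- Verify the answer one last time before giving it (Let's verify all statements)

https://github.com/Unakar/Logic-RL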
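One rough way to check whether a model response shows these behaviors is to count the marker phrases quoted above. The sketch below is only an illustration, not code from the Logic-RL repo: the names `BEHAVIOR_MARKERS` and `count_behaviors` are hypothetical, the patterns are just the example phrases from the post, and hesitation is left out because the post quotes no phrase for it.

```python
import re
from collections import Counter

# Hypothetical marker patterns, copied from the example phrases in the post.
# Real model outputs will vary, so this is a crude keyword heuristic.
BEHAVIOR_MARKERS = {
    "multi_path_exploration": [r"let's test both possibilities"],
    "backtracking": [r"analyze .*? statement again"],
    "staged_summary": [r"let's summarize", r"now we have determined"],
    "final_verification": [r"let's verify all statements"],
}


def count_behaviors(response: str) -> Counter:
    """Count how often each behavior's marker phrases occur in a response."""
    counts = Counter()
    text = response.lower()  # patterns are lowercase, so match case-insensitively
    for behavior, patterns in BEHAVIOR_MARKERS.items():
        for pattern in patterns:
            counts[behavior] += len(re.findall(pattern, text))
    return counts


if __name__ == "__main__":
    sample = (
        "Let's test both possibilities. Analyze Alice's statement again. "
        "Let's summarize. Let's verify all statements."
    )
    for behavior, n in count_behaviors(sample).items():
        print(behavior, n)
```

Running the same counter over responses from the base model and from the RL-trained model would give a quick, if coarse, comparison of how often each behavior shows up.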