1. grounding split: the authors convert the meta information of the following datasets into a unified pyautogui command format.
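As a rough illustration of what this unification could look like, the sketch below turns one bounding-box annotation into a pyautogui-style command string. The record fields (`instruction`, `bbox`, `image_size`), the helper name, and the use of normalized click coordinates are assumptions for illustration; the notes do not specify the exact conversion rules.

```python
def bbox_to_pyautogui_click(bbox, image_size):
    """Turn a pixel bbox (left, top, right, bottom) into a click command string.

    Hypothetical helper: the click target is the bbox center, normalized
    to [0, 1] by the screenshot size (an assumed convention).
    """
    left, top, right, bottom = bbox
    width, height = image_size
    cx = (left + right) / 2 / width
    cy = (top + bottom) / 2 / height
    return f"pyautogui.click(x={cx:.4f}, y={cy:.4f})"


# Made-up annotation record standing in for one dataset entry.
sample = {
    "instruction": "Click the Submit button",
    "bbox": (100, 200, 300, 260),
    "image_size": (1000, 1000),
}

command = bbox_to_pyautogui_click(sample["bbox"], sample["image_size"])
print(command)  # pyautogui.click(x=0.2000, y=0.2300)
```

The command is only built as text here, so the sketch does not need pyautogui installed.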
2. planning & reasoning split
"Thanks to our detailed inner monologue trajectory data, we implement areasoning mixture approach, where the model is exposed tovarious levels of cognitive complexity,from straightforward low-level action instructions to full inner monologuesthat include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. Thisdiversityin reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."
… reduces overall GPU hours from 6 hours to 1 hour. Moreover, this strategy even marginally improves performance on the ScreenSpot website split, from 73.3 to 76.8.
A 72B VLM can be fine-tuned in about 2 days on a 16-node cluster.
"We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7B uses 8 nodes and completes the grounding training within 5 hours and planning & reasoning training within 1 hour. AGUVIS-72B uses 16 nodes and completes the grounding training within 30 hours and planning & reasoning training within 6 hours."
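The figures in this quote can be tallied into wall-clock time and node-hours; a small sketch is below. The quoted durations are upper bounds ("within"), and the number of GPUs per node is not stated, so the result is expressed in node-hours rather than GPU-hours.

```python
# Figures quoted from the paper: (nodes, grounding hours, planning & reasoning hours).
runs = {
    "AGUVIS-7B": (8, 5, 1),
    "AGUVIS-72B": (16, 30, 6),
}


def summarize(nodes, grounding_h, planning_h):
    """Total wall-clock time and node-hours across both training stages."""
    total_h = grounding_h + planning_h
    return {
        "wall_clock_hours": total_h,
        "wall_clock_days": total_h / 24,
        "node_hours": nodes * total_h,
    }


summary = {name: summarize(*cfg) for name, cfg in runs.items()}
for name, stats in summary.items():
    print(name, stats)
```

For the 72B model this gives at most 36 hours (1.5 days) on 16 nodes, which is consistent with the roughly 2-day figure noted above.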
"Thanks to our detailed inner monologue trajectory data, we implement areasoning mixture approach, where the model is exposed tovarious levels of cognitive complexity,from straightforward low-level action instructions to full inner monologuesthat include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. Thisdiversityin reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."
Question: does the second-stage (planning & reasoning) training data also mix in low-level instruction data?
Enforced Plan & Self Plan
<|recipient|>all: predict the inner monologue (IM); <|recipient|>os: predict the concrete action
Enforced Plan: employ the <|recipient|>all\nThought prompt to compel the model to first generate a planning phase, and then a pyautogui command.
Self Plan: do not add any word after <|recipient|>, so the model can choose to generate os to directly produce a pyautogui command, or generate all to first create natural language reasoning and then generate a pyautogui command.
The authors find that Enforced Plan gives better results, reducing grounding error by 20%.
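The two prompting modes can be sketched as the text appended to the prompt before generation starts. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and `\n` is assumed to be a literal newline between `all` and `Thought`.

```python
RECIPIENT_TOKEN = "<|recipient|>"


def build_generation_prefix(mode):
    """Return the assistant-side prefix the model continues from."""
    if mode == "enforced":
        # Enforced Plan: pre-fill "all\nThought" so a planning phase must come first.
        return RECIPIENT_TOKEN + "all\nThought"
    if mode == "self":
        # Self Plan: nothing follows the token; the model picks "os" or "all" itself.
        return RECIPIENT_TOKEN
    raise ValueError(f"unknown mode: {mode!r}")


def classify_self_plan(continuation):
    """Label what the model chose, given the text generated right after the token."""
    if continuation.startswith("os"):
        return "direct_action"      # a pyautogui command follows immediately
    if continuation.startswith("all"):
        return "plan_then_action"   # natural-language reasoning, then a command
    return "unknown"


print(repr(build_generation_prefix("enforced")))
print(classify_self_plan("os\npyautogui.click(x=0.2, y=0.23)"))
```

Pre-filling the prefix means Enforced Plan needs no change to the model itself, only to how generation is started.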