Aguvis：提升的不仅是 UI Agent 的规划推理能力

显示全部楼层

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">Home^[1]|GitHub^[2]|Twitter^[3]|Youtube^[4]|Bilibili^[5]

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.1em;color: rgb(63, 63, 63);">本文介绍ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">来自 HKU & Salesforce 的 Aguvis。如我之前所说，这篇论文（数据、代码都会开源）至少值 2 个算法工程师 1 个月的工资。论文里面有很多细节都值得深挖，属于外行看热闹，内行看门道的那种。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.1em;color: rgb(63, 63, 63);">本文是视频UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架^[6]对应的文字版，建议与视频对照着看。

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.1em;color: rgb(63, 63, 63);">ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">Aguvis 相关资料：

[2412.04454] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction^[7], HKU & Salesforce
https://aguvis-project.github.io^[8]
ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">【视频分享】UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架^[9]

ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.1em;color: rgb(63, 63, 63);">ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">Aguvis这个词应该是作者造的，没查到什么意思。发现这个工作的作者跟ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;font-size: inherit;color: rgb(85, 201, 234);">OS-Copilot^[10]还有耦合，而OS-Copilot^[11]跟OS-Atlas^[12]是相同的一作。

Aguvis基于 Qwen2-VL-7B 和 Qwen2-VL-72B 进行全量微调（只 freeze ViT 部分），设置最大序列长度为 8192，max pixels 为 1280 x 720。

本文主要贡献：

生成了 IM（observation、thought、low-level instruction）数据，相当于 planning & reasoning 数据，用于第二阶段的模型微调。验证了 IM 数据能大幅提升模型的效果
构建了统一的 grounding 和 reasoning 大数据集，数据即将开源

利用 pyautogui 统一了不同平台的动作空间，这样来自不同平台的数据可以统一使用

训练数据使用 grounding packing strategy 方法，把训练效率提升了 5 倍

把多个单轮的 grounding 任务合成一个多轮的单个任务

统一了 grounding 和 planning & reasoning 2 个训练阶段的数据格式

论文详解

比较标准的两阶段训练方式。第一阶段主要针对 grounding 能力，第二阶段主要针对 planning & reasoning 能力。

Inner Monologue（内心独白，简称IM）包括 3 个部分：

1.observation description
2.internal reasoning (thought)
3.low-level action instruction

决策过程可以分为 2 步完成：Planner生成 IM 内容，然后Grounder按照产生具体的 grounding 信息。

可插拔的动作空间

把动作执行统一成了函数调用（可以借力 base 模型的 function call 能力）：

类似函数调用的方式在 prompt 中告知有哪些函数是可调用的。

Aguvis Collection数据集

Aguvis Collection 数据集是作者汇总其他数据集构建的训练数据集；包括以下 2 部分，顾名思义，对应上面的两阶段训练；后续会开源

1.grounding split：作者把以下数据集中的 Meta 信息都统一成 pyautogui 命令格式的数据

2.planning & reasoning split

"Thanks to our detailed inner monologue trajectory data, we implement areasoning mixture approach, where the model is exposed tovarious levels of cognitive complexity,from straightforward low-level action instructions to full inner monologuesthat include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. Thisdiversityin reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

Grounding Stage

以下是 grounding 阶段训练使用的数据格式：

⁉️疑问：
1. 对于 grounding 数据，Prompt 中的overall_goal和previous_actions分别是什么？
2.<|diff_marker|>这个标记的用途是什么？
模型可以利用这个标记来识别需要关注的特定部分，从而生成更加相关和准确的内容。例如，在进行内容编辑或补全时，模型能够基于此标记理解上下文中的变化。

Grounding Packing Strategy

效率提升了 5 倍，效果还稍微有点提升。

reduces overall GPU hours from 6 hours to 1 hour. Moreover, this strategy even marginally improve the performance of ScreenSpot website split from 73.3 to 76.8.
可以在 16 个节点的机器上花费 2 天微调 72B VLM。

⛔ "We train AGUVIS on a cluster of H100-80G GPUs:AGUVIS-7Buses8 nodesand completes the grounding training within5 hoursandplanning & reasoning trainingwithin1 hour.AGUVIS-72B uses 16 nodesand completes the grounding training within30 hoursandplanning & reasoning trainingwithin6 hours."

Planning & Reasoning Stage

IM 是用户自己通过 GPT-4o 构造出来的。

使用 GPT-4o 生成 planning & reasoning 数据，以下是 prompt 和示例：

上面获得的增强数据需要满足以下条件才被认为是成功的：

Match the action type and action target elements of the ground truth
Correctly describe the step’s intention
Establish a clear connection between the step’s intention and the overall goal
Assist the agent in successfully completing the task

在抽样的数据当中，作者发现86.7％展现出了与真实动作和总体目标的动作意图相一致的中间推理。剩下的7.8％的案例受到数据集噪声的影响（任务中的不相关或不必要动作），5.5％的案例则是由于在干净数据下对动作意图的误读。

作者分析发现，训练数据中的非必要动作可能致使 VLM 无法在这些多余动作和总体目标之间建立关联，最终造成不正确的推理和规划。

以下是此阶段训练使用的数据格式：

<|recipient|>all：预测 IM；<|recipient|>os：预测具体动作

作为对比，以下是上面给出的 Grounding 阶段的数据格式：

一些注意点：

planning 阶段的具体动作选择，形式上和 grounding 阶段是一样的
"Thanks to our detailed inner monologue trajectory data, we implement areasoning mixture approach, where the model is exposed tovarious levels of cognitive complexity,from straightforward low-level action instructions to full inner monologuesthat include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. Thisdiversityin reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

第二阶段的训练数据中，也混合了 low-level instructions 数据？

Enforced Plan & Self Plan

<|recipient|>all：预测 IM；<|recipient|>os：预测具体动作

Enforced Plan: employ the<|recipient|>all\nThoughtprompt to compel the model tofirst generate a planning phase, andthen a pyautogui command.

Self Plan: do not add any word after<|recipient|>, so themodel can chooseto generateosto directly produce a pyautogui command, or generateallto first create natural language reasoning and then generate a pyautogui command.

作者发现使用 Enforced Plan 能获得更好的效果，把 grounding Error 降低 20%。

各阶段训练效果

Grounding 能力：

Planning 能力：

消融实验

省略第二阶段（规划和推理）对模型的步骤成功率有更显著的负面影响，表明规划训练对于提高代理处理复杂 GUI 任务的能力至关重要。

提升可归因于两个关键因素：使用 IM 让模型能够引出对当前步骤的推理，同时推理作为背景也有助于为后续步骤进行更有效的规划。

另外，将训练数据中的 low-level instructions 纳入进来提高了模型动作执行的准确性。