Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
TL;DR Summary
This work introduces query inference bridging coordinate grounding and action reasoning in MLLM-powered GUI agents, achieving superior performance with minimal data and enhanced reasoning by integrating semantic information.
Abstract
Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves comparable or even better performance to large-scale grounding-enhanced OS-Atlas with less than 0.1% of training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
1. Bibliographic Information
1.1. Title
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
1.2. Authors
Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu. All authors are affiliated with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. Their research backgrounds appear to be in areas related to multi-modal machine learning, large language models, graphical user interface (GUI) automation, and agent systems.
1.3. Journal/Conference
This paper was published on arXiv, a preprint server, with the specified publication date. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating early research in various scientific fields, particularly computer science. Papers published on arXiv often undergo subsequent peer review and publication in reputable conferences (e.g., ACL, EMNLP, NeurIPS, ICLR) or journals. The listed references include publications in top-tier NLP and ML venues, suggesting the authors are active in high-impact research.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the challenge of enhancing graphical user interface (GUI) agents powered by multi-modal large language models (MLLMs) in resource-constrained scenarios. It highlights a limitation where coordinate-oriented grounding (identifying element coordinates for queries) struggles to effectively support action-oriented reasoning (predicting actions to achieve goals) due to their format discrepancy, especially with limited training data. To bridge this gap, the authors propose a query-oriented pivot approach called query inference. This method infers potential user queries from a screenshot and its associated element coordinates, thereby improving the understanding of coordinates and better aligning with reasoning tasks. Experimental results demonstrate that query inference outperforms previous grounding techniques under the same training data scale. Notably, it achieves comparable or superior performance to large-scale grounding-enhanced OS-Atlas using less than 0.1% of the training data. The paper further explores the impact of reasoning formats, showing that integrating additional semantic information into the input boosts reasoning performance.
1.6. Original Source Link
https://arxiv.org/abs/2503.00401 The publication status is a preprint on arXiv (version 2), indicating it has not yet undergone formal peer review for a conference or journal publication. The PDF link is https://arxiv.org/pdf/2503.00401v2.pdf.
2. Executive Summary
2.1. Background & Motivation
The development of multi-modal large language models (MLLMs) has opened promising avenues for creating sophisticated GUI agents. These agents aim to interact with digital interfaces like humans do, performing actions such as clicking, typing, and scrolling to achieve user-specified goals. However, a significant challenge arises because most MLLMs are not inherently pre-trained on GUI screenshots, leading to difficulties in perceiving and understanding complex GUI environments.
To overcome this, perception-enhanced pre-training techniques, particularly grounding, have been widely adopted. Grounding involves training models to identify the coordinates of GUI elements corresponding to specific user queries. While effective, traditional grounding typically requires vast amounts of training data, often exceeding millions of samples. This reliance on large datasets poses a problem in resource-constrained scenarios, such as personalized agents, where computational resources and available training data are limited.
The core problem the paper aims to solve is the format discrepancy between coordinate-oriented grounding and action-oriented reasoning in resource-constrained scenarios. Grounding focuses on localizing coordinates based on low-level queries, whereas reasoning requires a deeper understanding of high-level user intent to predict a sequence of actions. This discrepancy, combined with limited training data, results in minimal improvements in reasoning performance when relying solely on grounding. This highlights a significant gap in current approaches, underscoring the need for a more efficient and aligned perception-enhancement strategy for GUI agents operating under constraints.
The paper's innovative idea is to introduce query inference as a query-oriented pivot task. Instead of just localizing elements from queries, query inference attempts to deduce the user's intended query from the element's coordinates. This "reverse grounding" process aims to improve the model's understanding of both GUI layouts and user intent, thereby creating a smoother bridge between low-level coordinate perception and high-level action reasoning.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of MLLM-powered GUI agents, particularly in resource-constrained scenarios:
- Identification of a Critical Gap: The authors investigate how effective grounding is for reasoning tasks when training data is limited. They find that grounding provides minimal improvement in such scenarios, exposing a gap that stems from the task format discrepancy between coordinate-oriented grounding and action-oriented reasoning. This is a fundamental limitation of existing perception-enhanced pre-training methods under resource constraints.
- Proposal of Query Inference as a Pivot Task: To bridge this gap, the paper proposes query inference, a novel query-oriented pivot approach. The method deduces the intended user queries corresponding to action coordinates. In doing so, query inference aims to enhance the MLLM's understanding of coordinates and GUI layouts while aligning more closely with the requirements of action-oriented reasoning. The task implicitly improves query comprehension and smooths the transition from perception to action.
- Validation of Effectiveness and Efficiency: Through extensive experiments, the paper validates the effectiveness and potential of query inference in resource-constrained scenarios. Key findings include:
  - Query inference consistently outperforms grounding when using the same scale of training data.
  - When used as a pivot task (i.e., combining grounding and query inference), it further enhances action prediction performance.
  - Query inference achieves comparable or even better performance than the large-scale grounding-enhanced OS-Atlas (a model trained on over 13 million samples) using less than 0.1% of the training data, which demonstrates its data efficiency.
  - Integrating additional semantic information (e.g., screen descriptions, previous action results) into the input further boosts reasoning performance, especially for CLICK actions.

These findings collectively demonstrate that query inference offers an effective and data-efficient alternative for enhancing MLLM-powered GUI agents in resource-constrained environments, improving their reasoning capabilities by smoothing the interface between perception and action.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the paper, a beginner should understand the following key concepts:
- Multi-modal Large Language Models (MLLMs): An MLLM is an advanced type of Large Language Model (LLM) that can process and understand information from multiple modalities, not just text. In the context of GUI agents, MLLMs typically integrate vision modules (to interpret screenshots or visual data) with language models (to understand instructions and generate actions). They leverage the text generation and reasoning capabilities of LLMs but extend them with visual perception, making them suitable for tasks that involve both seeing and understanding.
- Graphical User Interface (GUI) Agents: A GUI agent is an AI system designed to interact with digital interfaces (like mobile apps, websites, or desktop applications) in a human-like manner. It perceives the GUI (e.g., through screenshots), understands user goals, and executes actions (e.g., CLICK, TYPE, SCROLL, OPENAPP) to achieve those goals. These agents aim to automate tasks that humans perform by interacting with GUIs.
- Perception-Enhanced Pre-training: Since MLLMs are often initially trained on general-purpose data (e.g., natural images, text from the internet), they may not be optimized for the specific visual characteristics of GUIs (e.g., high-density elements, specific UI conventions). Perception-enhanced pre-training involves additional training on GUI-specific datasets to improve the MLLM's ability to accurately perceive and understand GUI layouts, elements, and their functionalities.
- Grounding (GUI Grounding): In GUI agents, grounding is a perception-enhanced pre-training task where the model learns to identify and localize specific GUI elements on a screenshot based on a textual query. For example, given a screenshot and the query "click the search icon," the model should output the coordinates (e.g., bounding box) of the search icon. It is coordinate-oriented because its primary output is a spatial location.
- Reasoning (Action-Oriented Reasoning): Reasoning in GUI agents refers to the process of interpreting a high-level user goal (e.g., "find a restaurant nearby") and breaking it down into a sequence of executable actions (e.g., CLICK search bar, TYPE "restaurants", CLICK search button, SCROLL down, CLICK on a restaurant listing). This process requires understanding user intent and GUI context, and predicting appropriate action types and parameters. It is action-oriented because its primary output is a plan of actions.
- Coordinate-Oriented vs. Action-Oriented Tasks:
  - Coordinate-Oriented: Tasks like grounding where the output is primarily spatial coordinates (points or bounding boxes) on a screen, responding to explicit element references.
  - Action-Oriented: Tasks like reasoning where the output is a sequence of actions with their parameters, aiming to achieve a high-level goal. This requires understanding intent and context beyond simple element localization.

  The paper highlights the format discrepancy between these two types of tasks as a key challenge.
- Chain-of-Thought (CoT) Reasoning: CoT is a prompting technique used with LLMs and MLLMs that encourages the model to generate a series of intermediate reasoning steps before arriving at a final answer or action. Instead of directly outputting an action, the model might first generate "thoughts" or "plans," which mimic human problem-solving. This makes the reasoning process more transparent and often improves accuracy on complex tasks. CoT components can be integrated into inputs (e.g., screen descriptions, previous action results) or outputs (e.g., action thoughts, next action descriptions).
- Supervised Fine-Tuning (SFT): SFT is a common machine learning technique where a pre-trained model (like an MLLM) is further trained on a smaller, task-specific dataset with labeled examples. The goal is to adapt the pre-trained model's general knowledge to perform well on a particular downstream task, such as reasoning for GUI agents.
3.2. Previous Works
The paper contextualizes its work within three main areas: MLLM-powered GUI agents, perception-enhanced pre-training, and CoT enhanced reasoning.
3.2.1. MLLM-powered GUI Agents
Traditionally, GUI agents often relied on text-based perception, requiring programmatic access to GUI element hierarchies or accessibility trees (e.g., Zhou et al., 2024; Deng et al., 2024). These methods can be brittle or require system-level permissions.
With the advent of MLLMs (e.g., Yin et al., 2024; Chen et al., 2024b; Wang et al., 2024), the paradigm has shifted. MLLM-powered GUI agents (e.g., Cheng et al., 2024; Hong et al., 2024; Gou et al., 2025) can directly perceive GUI environments using their vision modules from screenshots, much like humans. They then interact through human-like actions (CLICK, TYPE, SCROLL) without needing explicit programmatic interactions or API calls (Sun et al., 2024a; Wu et al., 2024b; Zhang et al., 2024b). This direct visual perception makes them more generalizable across different GUI systems.
3.2.2. Perception-Enhanced Pre-training
Most MLLMs are pre-trained on natural images, which differ significantly from the structured, information-dense GUI layouts (Wu et al., 2025). To overcome this, perception-enhanced pre-training tasks are crucial for improving GUI understanding.
- Grounding: This is the most prevalent task (Wu et al., 2025; Qian et al., 2024). It involves localizing GUI elements (outputting coordinates) based on textual queries. The paper formulates grounding as: $ \mathcal{G} : \{ \langle s, q \rangle \} \to \{ c \} $ where $s$ is the screenshot, $q$ is the low-level query (e.g., "click the clock icon"), and $c$ represents the coordinates (points or bounding boxes). An important aspect highlighted is that $q$ can be direct or require additional reasoning for context.
- GUI Referring: Generating descriptions for specific GUI elements (Zhang et al., 2024d; You et al., 2025).
- Screen Question Answering (Screen QA): Answering questions about screen contents and functionalities (Baechler et al., 2024; Chen et al., 2024a).

The paper notes that perception-enhanced pre-training typically demands large-scale training data, and that its efficacy in resource-constrained scenarios is underexplored.
3.2.3. CoT Enhanced Reasoning
Chain-of-Thought (CoT) has been adopted to improve reasoning in GUI agents (Zhang et al., 2024c; Sun et al., 2024b). This involves leveraging proprietary MLLMs as annotation models (Achiam et al., 2023; Bai et al., 2023) to automatically generate semantic information to enrich training data.
- Input Enhancements: Explanations of GUI environments such as screen descriptions (SD), previous action results (PAR), and GUI layouts (Ma et al., 2024) are integrated into the model's input to enhance perception.
- Output Enhancements: Intermediate reasoning results such as action thoughts (AT) and next action descriptions (AD) are included in the model's output to guide the reasoning process.

The paper formulates reasoning at step $i$ as: $ \mathscr{R} : \{ s_i, \{a_{<i}\}, g \} \to \{ a_i \} $ where $s_i$ is the current screenshot, $\{a_{<i}\}$ are the historical actions, $g$ is the final goal, and $a_i$ is the current action. $a_i$ typically includes an action type and action parameters (e.g., typed text, coordinates). CoT components can also be part of $a_i$.
3.3. Technological Evolution
The field of GUI agents has evolved from early rule-based systems and programmatic interaction methods to data-driven approaches. Initially, agents might rely on parsing XML layouts or accessibility trees, which are exact but require specific system access and are not visually intuitive.
The rise of deep learning and computer vision enabled visual recognition of GUI elements. However, these were often specialized models.
The advent of large language models (LLMs) brought powerful reasoning capabilities, but they lacked visual perception.
The current frontier, MLLMs, combines the best of both worlds: visual perception and language-based reasoning. This allows GUI agents to see the screen (like humans), understand instructions (like LLMs), and act accordingly.
Within this MLLM paradigm, the challenge shifted from basic perception to refining perception for action. Perception-enhanced pre-training (like grounding) became critical to adapt MLLMs to GUI specifics.
This paper fits into this timeline by addressing a nuance within perception-enhanced pre-training: specifically, how to make grounding more effective and data-efficient for action-oriented reasoning when resources are limited. It proposes query inference as an evolution of grounding that is better aligned with the ultimate goal of GUI agents—performing coherent actions.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's query inference approach are:
- Bridging the Format Discrepancy: Previous grounding techniques are inherently coordinate-oriented (outputting coordinates from a query), while reasoning tasks are action-oriented (outputting actions from a goal). This paper explicitly identifies and addresses the format discrepancy between the two, especially in resource-constrained scenarios. Query inference acts as a pivot by performing a "reverse grounding": inferring a query from coordinates. This makes the perception task more query-centric, which aligns better with the language-based instructions and goals of reasoning.
- Enhanced Query Comprehension for Reasoning: While grounding gives MLLMs the ability to locate elements, it does not necessarily deepen their understanding of the intent behind interacting with those elements. Query inference, by deducing the intended user query for an action coordinate, forces the model to understand the purpose of an interaction. This enhanced query comprehension directly benefits reasoning, which relies heavily on understanding user intent.
- Data Efficiency in Resource-Constrained Scenarios: The most significant differentiation is its effectiveness with limited data. Grounding is known to require large-scale datasets (millions of samples) to be truly effective. This paper demonstrates that query inference achieves comparable or even superior reasoning performance to large-scale grounding models (like OS-Atlas) using less than 0.1% of the training data. This makes it a practical solution for resource-constrained environments where large-scale data collection or training is infeasible.
- Leveraging Proprietary MLLMs for Data Construction: The methodology for creating query inference data involves a three-step pipeline (query refinement, re-grounding, accuracy analysis) that uses proprietary MLLMs (like Qwen-VL-Max) as annotation and refinement models. The pipeline is designed to keep only generated queries that are intended, properly formatted, and verifiable, addressing the "unintended" queries often found in raw grounding datasets. This data curation process contributes to the method's data efficiency.
- Focus on Specificity of GUI Interaction: By moving from "finding an element" to "understanding the intention of interacting with an element at these coordinates," query inference pushes the MLLM to learn more semantically rich representations of GUI interactions.

In essence, while grounding teaches an MLLM "where an element is," query inference teaches it "what a user intends to do with an element at this location," which is a more direct pathway to action-oriented reasoning.
4. Methodology
4.1. Principles
The core idea of the method is to address the inherent task format discrepancy between coordinate-oriented grounding and action-oriented reasoning in MLLM-powered GUI agents, particularly in resource-constrained scenarios. Grounding tasks focus on identifying coordinates from a query, while reasoning tasks aim to predict actions based on high-level goals. This paper proposes query inference as a query-oriented pivot approach to smooth this transition.
The theoretical basis and intuition behind query inference are rooted in the idea that for an MLLM to effectively perform action-oriented reasoning, it needs a profound comprehension of user intent. While grounding helps the model "see" the GUI, it doesn't necessarily deepen its "understanding" of why a user might interact with a particular element. By reversing the traditional grounding process, query inference trains the model to deduce the intended user query from given action coordinates. This forces the model to learn a more semantic understanding of GUI elements in relation to potential user goals. This query-centric understanding, which focuses on generating descriptive, intentional queries, naturally aligns better with the action-oriented nature of reasoning tasks, where high-level instructions dictate the required actions. It acts as a bridge by improving the MLLM's ability to interpret GUI elements not just as visual objects, but as targets for specific user intentions, thereby smoothing the flow from low-level perception to high-level action prediction.
4.2. Core Methodology In-depth
The proposed query inference approach involves a three-step pipeline to construct high-quality training data and subsequently train the foundation MLLM. This pipeline leverages proprietary MLLMs to refine grounding data into query inference samples, ensuring that the inferred queries are both accurate and reflect user intent.
4.2.1. Grounding and Reasoning Formulations
Before diving into query inference, it's crucial to understand the formal definitions of grounding and reasoning as presented in the paper.
Grounding:
Grounding is a perception-enhanced pre-training task that aims to localize the coordinates $c$ of specific GUI elements based on the perception of screenshots $s$ and low-level queries $q$. The query $q$ can be explicit (e.g., "click the clock icon") or implicit, requiring additional reasoning for context (e.g., "click on the home button at top left"). The coordinates $c$ can be represented as points or bounding boxes. Formally, grounding is represented as:
$
\mathcal{G} : \{ \langle s, q \rangle \} \to \{ c \}
$
Here:
- $\mathcal{G}$ represents the grounding function or model.
- $s$ is the input screenshot of the GUI.
- $q$ is the textual query, which can be a low-level instruction.
- $c$ is the output, representing the localized coordinates (e.g., bounding box) of the GUI element corresponding to $q$ on $s$.
Reasoning:
Reasoning predicts a chain of actions to achieve a high-level final goal based on the perception of GUI environments. At step $i$, the agent perceives the current screenshot $s_i$ along with the historical actions $\{a_{<i}\}$ to predict the current action $a_i$ toward the final goal $g$. The action $a_i$ typically comprises an action type and action parameters (which may include typed text or coordinates $c$). Optionally, CoT components (e.g., intermediate reasoning thoughts) can also be introduced into $a_i$ to enhance reasoning. Formally, reasoning at step $i$ is formulated as:
$
\mathscr{R} : \{ s_i, \{a_{<i}\}, g \} \to \{ a_i \}
$
Here:
- $\mathscr{R}$ represents the reasoning function or model.
- $s_i$ is the screenshot at the current step $i$.
- $\{a_{<i}\}$ represents the set of historical actions performed before step $i$.
- $g$ is the high-level final goal the agent needs to achieve.
- $a_i$ is the predicted action at step $i$, consisting of an action type, action parameters, and optionally CoT components.

The paper emphasizes that reasoning is action-oriented and requires understanding high-level user intent, while grounding is coordinate-oriented and aligns low-level queries with coordinates, lacking high-level intent perception. This creates the format discrepancy.
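To make the format discrepancy concrete, the following is a minimal sketch of what one grounding sample and one reasoning sample might look like as data. The class names, field names, file name, goal text, and coordinate values are all illustrative assumptions, not structures taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingSample:
    """Coordinate-oriented: <screenshot, query> -> coordinates."""
    screenshot: str  # path to the screenshot s
    query: str       # low-level query q
    coords: Box      # ground-truth coordinates c


@dataclass
class Action:
    """One action: a type plus optional parameters."""
    action_type: str                            # e.g. CLICK, TYPE, SCROLL
    text: Optional[str] = None                  # typed text or app name, if any
    point: Optional[Tuple[float, float]] = None  # click location, if any


@dataclass
class ReasoningSample:
    """Action-oriented: screenshot, history, goal -> current action."""
    screenshot: str                                       # s_i
    goal: str                                             # high-level goal
    target: Action                                        # a_i to predict
    history: List[Action] = field(default_factory=list)   # a_{<i}


# Made-up instances, only to contrast the two formats.
grounding = GroundingSample("clock_app.png", "click the clock icon",
                            (40.0, 60.0, 88.0, 108.0))
reasoning = ReasoningSample(
    screenshot="clock_app.png",
    goal="Set an alarm for 7 am",
    target=Action("CLICK", point=(64.0, 84.0)),
    history=[Action("OPENAPP", text="Clock")],
)
```

The grounding sample maps a short query to a box, whereas the reasoning sample must map a goal and a history to a typed action; coordinates appear only as one possible parameter of that action.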
4.2.2. Query Inference Pipeline
The query inference methodology constructs samples by transforming existing grounding data into a format that deduces intended user queries from action coordinates. This is achieved through a three-step pipeline: query refinement, re-grounding, and analyzing the accuracy of re-grounding.
The overall three-step pipeline for constructing samples for query inference is illustrated in Figure 2.
Figure 2: Schematic of the pipeline that converts triplets into quadruplets for multimodal input through query refinement and re-grounding of coordinates. Prompted models perform the query refinement and the re-grounding, and an analysis of re-grounding accuracy decides whether each sample is saved or discarded.
4.2.2.1. Query Refinement
The first step addresses the issue that existing grounding queries ($q$) are often "unintended" or too low-level, making them unsuitable for directly inferring user intent for reasoning. To overcome this, a proprietary MLLM is used as a refinement model ($\mathcal{M}_r$) to transform these low-level unintended queries into intended, properly formatted queries ($q_r$).
The refinement model used is Qwen-VL-Max (Bai et al., 2023). The model is prompted to generate $q_r$ in a specific format: "click on the [element_name] for [purpose]". This format encourages the MLLM to deduce the underlying intention of an action interacting with the coordinate-specified elements. The inputs to $\mathcal{M}_r$ are the original screenshot, the low-level query, and the ground-truth coordinates.
Formally, the refinement process can be represented as:
$
\mathcal{M}_r : \{ \langle s, q, c \rangle \} \to \{ q_r \}
$
Here:
- $\mathcal{M}_r$ denotes the refinement model (e.g., Qwen-VL-Max).
- $s$ is the screenshot.
- $q$ is the original low-level, unintended query from the grounding data.
- $c$ is the corresponding ground-truth coordinates (e.g., bounding box) for $q$ on $s$.
- $q_r$ is the refined, intended, and properly formatted user query generated by $\mathcal{M}_r$.
4.2.2.2. Re-grounding
Automated query refinement can introduce errors or generate queries that do not accurately reflect the GUI content or user intent. To ensure data quality and filter out such incorrect information, a second step, re-grounding, is performed. In this step, Qwen-VL-Max is used again, this time as a grounding model ($\mathcal{M}_g$).
The re-grounding model is prompted to localize coordinates ($c_r$) based on the newly refined queries ($q_r$) and the original screenshot ($s$). This process checks whether the refined query still points back to the original element.
The re-grounding process is formulated as:
$
\mathcal{M}_g : \{ \langle s, q_r \rangle \} \to \{ c_r \}
$
Here:
- $\mathcal{M}_g$ denotes the grounding model (e.g., Qwen-VL-Max).
- $s$ is the screenshot.
- $q_r$ is the refined query obtained from the query refinement step.
- $c_r$ is the coordinates localized by $\mathcal{M}_g$ based on $q_r$ and $s$.
4.2.2.3. Analyzing the Accuracy of Re-grounding
The final step analyzes the accuracy of the re-grounded coordinates ($c_r$) against the original ground-truth coordinates ($c$) to filter for high-quality query inference samples. This ensures that only refined queries that accurately refer to the original GUI element are retained.
An indicator $\mathcal{T}$ is established to determine accuracy. It checks whether the center point of the re-grounded coordinates $c_r$ lies within the bounding box represented by the ground-truth coordinates $c$.
The accuracy analysis is defined by the following indicator function:
$
\mathcal{T}(c_r, c) = \left\{ \begin{array}{ll} 1, & \mathrm{if~the~center~of~} c_r \mathrm{~is~inside~} c, \\ 0, & \mathrm{otherwise.} \end{array} \right.
$
Here:
- $\mathcal{T}$ is a binary indicator function.
- $c_r$ are the coordinates localized by the re-grounding model.
- $c$ are the original ground-truth coordinates.
- A value of 1 indicates that the re-grounded coordinates are considered accurate, and the sample is retained.
- A value of 0 indicates inaccuracy, and the sample is discarded.

If $\mathcal{T}(c_r, c) = 1$, the triplet $\langle s, c, q_r \rangle$ is retained as a data sample for query inference. Otherwise, the sample is discarded. This process results in a high-quality dataset consisting of triplets $\langle s, c, q_r \rangle$ for training the query inference model.
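The indicator and the resulting filter can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, points on the box boundary are counted as inside, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Center point of a box; a point can be passed as a zero-area box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0


def regrounding_indicator(c_r: Box, c: Box) -> int:
    """Return 1 if the center of c_r lies inside c (boundary inclusive), else 0."""
    cx, cy = center(c_r)
    x0, y0, x1, y1 = c
    return 1 if (x0 <= cx <= x1 and y0 <= cy <= y1) else 0


def filter_samples(
    candidates: Iterable[Tuple[str, Box, str, Box]]
) -> List[Tuple[str, Box, str]]:
    """Keep (s, c, q_r) for every candidate (s, c, q_r, c_r) with indicator 1."""
    return [
        (s, c, q_r)
        for s, c, q_r, c_r in candidates
        if regrounding_indicator(c_r, c) == 1
    ]


# Made-up candidates: the first re-grounds correctly, the second does not.
candidates = [
    ("a.png", (10, 10, 20, 20), "click on the map icon for navigating",
     (12, 12, 18, 18)),
    ("b.png", (10, 10, 20, 20), "click on the menu icon for opening settings",
     (30, 30, 40, 40)),
]
kept = filter_samples(candidates)
```

Using only the center of the re-grounded coordinates makes the check tolerant to differences in predicted box size, since a box that is larger or smaller than the ground truth still passes as long as it is centered on the element.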
Subsequently, this dataset is used to train the foundation MLLM on the query inference task prior to reasoning SFT. The query inference task itself can be formally represented as:
$
\mathcal{Q} : \{ \langle s, c \rangle \} \to \{ q_r \}
$
Here:
- $\mathcal{Q}$ represents the query inference function or model.
- $s$ is the input screenshot.
- $c$ are the input coordinates (e.g., of an action target).
- $q_r$ is the predicted refined user query corresponding to the action at $c$ on $s$.

Training the MLLM on this query inference task enhances the model's comprehension of user intention, aligning it more closely with the needs of reasoning, while maintaining its sensitivity to coordinates. This bridges the gap between coordinate-oriented grounding and action-oriented reasoning.
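Since query inference reverses the direction of grounding, the same screenshot, coordinates, and query can be rendered as training text in either direction. The sketch below shows this with hypothetical prompt templates and dictionary keys; the actual templates used for training are not given in this section and are assumed here purely for illustration.

```python
from typing import Dict, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def format_box(box: Box) -> str:
    """Render a box as text for use in a prompt or a target."""
    return "[{}, {}, {}, {}]".format(*box)


def to_grounding_example(screenshot: str, query: str, box: Box) -> Dict[str, str]:
    """Grounding direction: <s, q> -> c (coordinates are the target)."""
    return {
        "image": screenshot,
        "prompt": f"Locate the element for the query: {query}",
        "target": format_box(box),
    }


def to_query_inference_example(screenshot: str, box: Box,
                               refined_query: str) -> Dict[str, str]:
    """Query inference direction: <s, c> -> q_r (the query is the target)."""
    return {
        "image": screenshot,
        "prompt": f"Infer the user query for the element at {format_box(box)}",
        "target": refined_query,
    }


box = (10, 20, 50, 60)
grounding_ex = to_grounding_example("map.png", "click on location icon", box)
inference_ex = to_query_inference_example(
    "map.png", box, "click on the map icon for navigating"
)
```

In the query inference direction the model's output is natural language describing intent, which is closer in form to the language-conditioned outputs required by reasoning than a bare coordinate string.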
5. Experimental Setup
5.1. Datasets
The experiments primarily use UIBERT for perception-enhanced pre-training and two mobile agent benchmarks, AndroidControl and AITZ, for reasoning supervised fine-tuning (SFT) and evaluation.
- UIBERT (Bai et al., 2021):
  - Source & Characteristics: This dataset is selected as the grounding dataset for resource-constrained scenarios. It contains approximately 10,000 instances of grounding data. UIBERT is a subset of the larger OS-Atlas grounding dataset (which has over 13 million samples).
  - Usage: For fairness in comparison, samples from the original UIBERT dataset are extracted as grounding training data. The query inference dataset is constructed by refining UIBERT samples using the proposed three-step pipeline. The final refined query inference dataset consists of 9,570 triplets $\langle s, c, q_r \rangle$.
  - Example Data Sample: An example of a triplet $\langle s, c, q_r \rangle$ from the refined UIBERT dataset, along with the original query $q$, is provided in Figure 4. It shows how a low-level query like "click on location icon" is refined into an intended query like "click on the map icon for navigating."

  Figure 4: Example triplets $\langle s, c, q_r \rangle$ from the refined UIBERT dataset, along with the original query $q$. After refinement, the action intent has been inferred, such as "selecting the 24h format". Training on the triplets with intended queries enhances the comprehension of user intention to align with reasoning while maintaining sensitivity to the coordinates.
- AndroidControl (Li et al., 2024):
  - Source & Characteristics: This is a mobile agent dataset comprising 15,283 demonstrations with step-wise instructions. It is collected from human raters performing various tasks across 833 different applications spanning 40 app categories on Android devices.
  - Scale: The training subset contains 89,144 step-wise samples.
  - Settings: Evaluated in two settings:
    - AndroidControl-L: Both low-level step instructions and high-level goals are provided as inputs.
    - AndroidControl-H: Only high-level goals are provided as inputs.
  - Usage: Used for SFT and evaluation of reasoning performance.
- AITZ (Zhang et al., 2024c):
  - Source & Characteristics: This is a mobile agent dataset derived from a subset of AITW (Rawles et al., 2023) and annotated by proprietary MLLMs with Chain-of-Action-Thought (CoAT) components. It consists of 2,504 operation trajectories across 18,643 steps. It is categorized into five subsets: General, Install, GoogleApps, Single, and Web Shopping.
  - Scale: The training subset contains 13,919 step-wise samples.
  - Usage: Used for SFT and evaluation of reasoning performance, particularly for exploring CoT-enhanced reasoning.

The action type distributions for the test subsets of AndroidControl and AITZ are shown below, reproduced from Table 4 of the original paper:
| Dataset | SCROLL | CLICK | TYPE | PRESS | WAIT | OPENAPP | COMPLETE | Others | Total |
|---|---|---|---|---|---|---|---|---|---|
| AndroidControl | 1,211 | 5,074 | 632 | 343 | 567 | 608 | 1,543 | 9 | 9,987 |
| AITZ | 601 | 2,736 | 500 | 265 | √ | → | 504 | 118 | 4,724 |
Note on AITZ Table 4: The original table shows √ and → in the WAIT and OPENAPP columns for AITZ rather than numerical counts. Given that the table reports action type distributions, these symbols are unusual and are most likely formatting errors or placeholders for "not applicable" / "included elsewhere". The six numeric AITZ entries already sum to the stated total of 4,724, so no separate counts appear to be missing for these two columns. The symbols are transcribed here as they appear in the source.
The datasets were chosen to represent mobile agent benchmarks and cover resource-constrained scenarios (via UIBERT's small scale relative to OS-Atlas), making them effective for validating the proposed method's performance under these specific conditions.
5.2. Evaluation Metrics
The final action prediction accuracy is used to assess the impact of grounding and query inference on reasoning performance. Two commonly used metrics for GUI agents are adopted: Action Type Match Rate (TMR) and Exact Action Match Rate (AMR).
5.2.1. Action Type Match Rate (TMR)
- Conceptual Definition: TMR measures the percentage of predicted actions whose action type (e.g., CLICK, TYPE, SCROLL) exactly matches the ground-truth action type. It assesses whether the model correctly identifies what kind of action needs to be performed.
- Mathematical Formula: $ \mathrm{TMR} = \frac{\text{Number of actions with correctly predicted type}}{\text{Total number of actions}} \times 100\% $
- Symbol Explanation:
  - Number of actions with correctly predicted type: The count of instances where the model's predicted action type is identical to the true action type.
  - Total number of actions: The total count of actions in the evaluation set.
5.2.2. Exact Action Match Rate (AMR)
- Conceptual Definition: AMR is a more stringent metric that evaluates whether the predicted action exactly matches the ground truth within a single step. It considers both the action type and all optional parameters (e.g., coordinates, typed text, app names). An action counts as an exact match only when both the action type and its parameters align with the ground truth, which makes AMR the more accurate measure of step-wise action prediction success.
- Mathematical Formula: $ \mathrm{AMR} = \frac{\text{Number of actions with exactly predicted type and parameters}}{\text{Total number of actions}} \times 100\% $
- Symbol Explanation:
  - Number of actions with exactly predicted type and parameters: The count of instances where both the model's predicted action type and all its associated parameters match the true action type and parameters.
  - Total number of actions: The total count of actions in the evaluation set.
- Calculation Variations for AMR (Parameter Handling): The calculation of AMR varies with the action type to account for different parameter requirements:
  - Actions without additional parameters (WAIT, COMPLETE, PRESS): For these actions, AMR is equivalent to TMR, as only the action type needs to be matched.
  - SCROLL actions: AMR evaluates both the action type (SCROLL) and the scroll direction (up, down, left, right).
  - Text-based actions (TYPE, OPENAPP): The predicted action is an exact match only if both the action type and the corresponding text (e.g., typed content for TYPE, app name for OPENAPP) align exactly with the ground truth.
  - CLICK actions: This case is more involved and is adapted from prior work:
    - If both the predicted and ground-truth actions are CLICK, the corresponding screenshot layout information is used to locate the element bounding box that contains the ground-truth coordinates.
    - If such a bounding box is found, it is checked whether the predicted coordinates fall within it. If they do, the CLICK action is deemed correctly predicted.
    - If no bounding box is found (or the previous check fails), the relative distance between the predicted and ground-truth coordinates is computed. The CLICK action is considered correct if this relative distance is below 14% of the screen.
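The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: the `Action` record, the function names, and the choice to measure the relative distance as Euclidean distance divided by the normalized screen extent (1000) are all assumptions made for the example. Only the 14% threshold and the per-type rules come from the text.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]


@dataclass
class Action:
    """Hypothetical action record used only for this sketch."""
    type: str                      # e.g. "CLICK", "TYPE", "SCROLL", "WAIT"
    point: Optional[Point] = None  # CLICK target, normalized to [0, 1000]
    text: Optional[str] = None     # typed text, app name, or scroll direction


def contains(box: Box, point: Point) -> bool:
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def click_match(pred: Point, gold: Point, boxes: Sequence[Box],
                threshold: float = 0.14, extent: float = 1000.0) -> bool:
    # Rule 1: the prediction lands in an element box that holds the ground truth.
    for box in boxes:
        if contains(box, gold) and contains(box, pred):
            return True
    # Rule 2 (fallback): relative distance below 14% of the screen.
    # Assumption: Euclidean distance over the normalized extent.
    return math.dist(pred, gold) / extent < threshold


def exact_match(pred: Action, gold: Action, boxes: Sequence[Box] = ()) -> bool:
    if pred.type != gold.type:
        return False
    if gold.type == "CLICK":
        if pred.point is None or gold.point is None:
            return False
        return click_match(pred.point, gold.point, boxes)
    if gold.type in {"SCROLL", "TYPE", "OPENAPP"}:
        return pred.text == gold.text  # direction, typed text, or app name
    return True  # WAIT, COMPLETE, PRESS: matching the type is enough


def match_rates(preds, golds, layouts):
    """Return (TMR, AMR) in percent over aligned prediction/ground-truth lists."""
    n = len(golds)
    tmr = 100.0 * sum(p.type == g.type for p, g in zip(preds, golds)) / n
    amr = 100.0 * sum(exact_match(p, g, b)
                      for p, g, b in zip(preds, golds, layouts)) / n
    return tmr, amr
```

Because every exact match is also a type match, AMR can never exceed TMR under this scheme, which is consistent with the TOTAL columns reported in the results tables.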
5.3. Baselines
The paper compares its query inference approach against several baselines and state-of-the-art MLLM configurations:
- Qwen2-VL-7B-Instruct (Wang et al., 2024): This MLLM (dubbed Qwen in the paper) serves as the foundation MLLM.
  - Direct SFT: The Qwen model is directly fine-tuned on the mobile agent benchmarks (AndroidControl, AITZ) without any perception-enhanced pre-training. This serves as a strong baseline to show the performance of reasoning without explicit grounding or query inference.
  - Grounding (G): The Qwen model is first trained for grounding on the UIBERT dataset (approx. 10,000 instances), followed by reasoning SFT on the mobile agent benchmarks. This evaluates the effectiveness of grounding with limited data.
  - Query Inference (Q): The Qwen model is first trained for query inference on the refined UIBERT dataset (9,570 instances), followed by reasoning SFT. This evaluates the proposed query inference as an alternative perception-enhanced task.
  - Grounding + Query Inference (G + Q): The Qwen model is trained on a combined dataset: half of the refined UIBERT dataset for grounding and the other half for query inference. This evaluates query inference as a pivot task alongside grounding.
- OS-Atlas-Base-7B (Wu et al., 2025): This is a large-scale grounding-enhanced model (dubbed Atlas in the paper).
  - Source: Atlas is pre-trained on over 13 million grounding samples, representing a state-of-the-art model with extensive grounding knowledge.
  - Usage: It is fine-tuned on the mobile agent benchmarks to establish a performance ceiling for large-scale grounding. The aim is to compare the data efficiency of query inference against this large-scale model.

These baselines allow for a comprehensive evaluation, comparing query inference against models without perception pre-training, models with limited-data grounding, and a strong large-scale grounding baseline.
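To make the difference between the G, Q, and G + Q configurations concrete, the sketch below builds the three pre-training mixtures from the same pool of triplets. It is an illustration only: the `Triplet` record, the prompt wording, and the seeded random half/half split are assumptions, since the actual prompts are given in Appendix B of the paper and are not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Triplet:
    """Hypothetical container for one refined sample."""
    screenshot: str         # path to the screenshot image
    point: Tuple[int, int]  # element coordinates, normalized to [0, 1000]
    query: str              # refined, intent-bearing query


def grounding_sample(t: Triplet) -> Dict[str, str]:
    # Grounding direction: query -> coordinates.
    return {
        "image": t.screenshot,
        "prompt": f"Locate the element for: {t.query}",  # placeholder wording
        "target": f"({t.point[0]}, {t.point[1]})",
    }


def query_inference_sample(t: Triplet) -> Dict[str, str]:
    # Query inference direction: coordinates -> intended query.
    return {
        "image": t.screenshot,
        "prompt": f"Infer the user query for the element at "
                  f"({t.point[0]}, {t.point[1]})",       # placeholder wording
        "target": t.query,
    }


def build_mixture(triplets: List[Triplet], mode: str,
                  seed: int = 0) -> List[Dict[str, str]]:
    """Build the pre-training set for mode 'G', 'Q', or 'G+Q'."""
    if mode == "G":
        return [grounding_sample(t) for t in triplets]
    if mode == "Q":
        return [query_inference_sample(t) for t in triplets]
    if mode == "G+Q":
        # Half of the data for grounding, the other half for query inference.
        # Assumption: the split is a seeded random shuffle.
        shuffled = list(triplets)
        random.Random(seed).shuffle(shuffled)
        half = len(shuffled) // 2
        return ([grounding_sample(t) for t in shuffled[:half]]
                + [query_inference_sample(t) for t in shuffled[half:]])
    raise ValueError(f"unknown mode: {mode}")
```

Note that all three mixtures have the same number of samples, so the comparison between G, Q, and G + Q holds the pre-training data scale fixed.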
5.4. Implementation Details
- Coordinate Normalization: All coordinates are normalized to the range [0, 1000]. This standardization helps in consistent processing across different screen resolutions and GUI scales.
- Action Space Unification: For reasoning SFT, the action space is unified into three basic actions (CLICK, TYPE, SCROLL), along with custom actions like OPENAPP for AndroidControl and AITZ. This provides a consistent set of interactions for the GUI agents.
- Training Framework: The LLaMA-Factory (Zheng et al., 2024) framework is used for training on grounding, query inference, and SFT on mobile agent benchmarks. This ensures a standardized and efficient training environment.
- Learning Rate: A single learning rate is used uniformly across all training stages.
- Training Epochs:
  - Grounding and query inference: 5 epochs.
  - SFT on reasoning benchmarks: 3 epochs.
- Acceleration: During testing, flashattn (Dao, 2024) is adopted for acceleration, indicating a focus on efficient inference.
- Hardware: All experiments are conducted on 4 × NVIDIA A100 GPUs, each with 80GB GPU memory.
- Training Times:
  - Query inference and grounding training: approximately 2 hours.
  - Fine-tuning on AITZ: approximately 2 hours.
  - Fine-tuning on AndroidControl: approximately 14 hours.
- Prompts: Detailed prompts for query refinement, grounding, query inference, and action prediction (for AndroidControl-L, AndroidControl-H, and AITZ with CoAT components) are provided in Appendix B of the paper. This transparency supports reproducibility and a deeper understanding of how the MLLMs are guided.
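As an illustration of the coordinate normalization step, the following sketch maps pixel coordinates to the [0, 1000] range and back. The function names, the rounding to integers, and the clamping of out-of-screen points are assumptions; the source only states the target range.

```python
from typing import Tuple


def normalize_point(x_px: float, y_px: float, width_px: int, height_px: int,
                    scale: int = 1000) -> Tuple[int, int]:
    """Map a pixel coordinate to the resolution-independent [0, scale] range."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("screen dimensions must be positive")
    nx = round(x_px / width_px * scale)
    ny = round(y_px / height_px * scale)
    # Assumption: points outside the screen are clamped to the valid range.
    return max(0, min(scale, nx)), max(0, min(scale, ny))


def denormalize_point(nx: float, ny: float, width_px: int, height_px: int,
                      scale: int = 1000) -> Tuple[float, float]:
    """Map a normalized coordinate back to pixels for a given screen size."""
    return nx / scale * width_px, ny / scale * height_px
```

With this convention, the centre of a 1080 × 2400 screen and the centre of a 720 × 1280 screen both map to the same normalized point, so one model output format covers devices with different resolutions.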
6. Results & Analysis
6.1. Core Results Analysis
The paper presents comprehensive results comparing grounding, query inference (as an alternative or pivot task), and large-scale grounding (Atlas) on mobile agent benchmarks. The main results on overall and type-wise action prediction performance are summarized in Table 2.
The following are the results from Table 2 of the original paper:
| Dataset | Foundation Model | Approach | SCROLL TMR↑ | CLICK TMR↑ | CLICK AMR↑ | TYPE TMR↑ | TYPE AMR↑ | PRESS TMR↑ | OPENAPP TMR↑ | OPENAPP AMR↑ | TOTAL TMR↑ | TOTAL AMR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AndroidControl-L | Qwen | SFT | 91.49 | 97.26 | 75.07 | 98.55 | 88.95 | 97.96 | 99.84 | 83.55 | 96.84 | 84.33 |
| | | G | 91.25 | 97.42 | 76.01 | 96.99 | 77.69 | 97.67 | 99.34 | 85.86 | 96.85 | 83.88 |
| | | Q | 91.08 | 97.32 | 78.95 | 97.78 | 79.59 | 97.67 | 99.51 | 86.02 | 96.79 | 85.45 |
| | | G + Q | 91.08 | 96.49 | 78.87 | 97.31 | 79.91 | 97.08 | 99.67 | 88.16 | 96.48 | 85.70 |
| | Atlas | SFT | 91.58 | 97.48 | 85.69 | 97.38 | 79.59 | 97.67 | 99.84 | 83.39 | 94.96 | 86.80 |
| AndroidControl-H | Qwen | SFT | 60.94 | 85.26 | 59.83 | 87.82 | 69.92 | 56.27 | 90.13 | 75.66 | 80.38 | 65.23 |
| | | G | 59.95 | 85.87 | 61.17 | 90.51 | 55.22 | 61.52 | 92.76 | 75.99 | 81.37 | 65.57 |
| | | Q | 57.64 | 87.31 | 63.11 | 71.77 | 54.11 | 58.69 | 91.78 | 77.14 | 81.68 | 66.11 |
| | | G + Q | 58.79 | 87.76 | 63.83 | 89.72 | 53.32 | 57.14 | 90.95 | 76.48 | 81.59 | 66.24 |
| | Atlas | SFT | 61.85 | 85.28 | 65.43 | 91.77 | 55.70 | 67.93 | 94.74 | 82.24 | 81.78 | 68.65 |
| AITZ | Qwen | SFT | 59.73 | 81.40 | 63.23 | 86.40 | 48.60 | 71.32 | / | / | 75.76 | 61.43 |
| | | G | 60.39 | 86.51 | 66.88 | 86.80 | 50.40 | 73.58 | / | / | 81.58 | 63.48 |
| | | Q | 60.23 | 87.57 | 67.80 | 88.20 | 48.60 | 77.36 | / | / | 82.26 | 66.62 |
| | | G + Q | 63.06 | 87.54 | 67.65 | 87.80 | 48.80 | 78.49 | / | / | 82.54 | 66.91 |
| | Atlas | SFT | 65.39 | 86.37 | 67.54 | 88.40 | 49.80 | 76.60 | / | / | 82.03 | 67.04 |
Key Findings:
- Query Inference Outperforms Grounding with Same Data Scale:
  - On AndroidControl-L (low-level instructions), grounding (G) brings no AMR gain over direct SFT: AMR drops slightly from 84.33% to 83.88%, while TMR is essentially unchanged (96.85% vs. 96.84%). Query inference (Q) as an alternative task achieves 85.45% AMR, outperforming both SFT and grounding. The combination G + Q reaches 85.70% AMR.
  - On AndroidControl-H (high-level goals), grounding (G) yields 65.57% AMR, a small improvement over SFT (65.23%). Query inference (Q) reaches 66.11% AMR, and G + Q achieves 66.24% AMR, showing consistent outperformance.
  - On AITZ, grounding (G) improves AMR from 61.43% (SFT) to 63.48%. Query inference (Q) shows a more substantial jump to 66.62% AMR, and G + Q achieves 66.91% AMR.
  - This demonstrates that query inference is more effective than traditional grounding when only a limited amount of training data is available, with the largest margin on AITZ (66.62% vs. 63.48% AMR).
- Query Inference as the Pivot Task Further Improves Reasoning:
  - Across all benchmarks (AndroidControl-L, AndroidControl-H, AITZ), the G + Q approach (query inference as a pivot task) achieves the highest AMR among the Qwen configurations, although the margins over Q alone are small. On AndroidControl-L, G + Q (85.70% AMR) is slightly better than Q alone (85.45% AMR). On AndroidControl-H, G + Q (66.24% AMR) is marginally better than Q (66.11% AMR). On AITZ, G + Q (66.91% AMR) is again slightly better than Q (66.62% AMR).
  - This suggests that integrating both grounding (for coordinate sensitivity) and query inference (for query comprehension and intent alignment) is complementary, improving both coordinate understanding and user query interpretation and leading to better reasoning performance.
- Query Inference Achieves Performance Comparable to Large-Scale Grounding (Atlas):
  - Atlas, pre-trained on over 13 million grounding samples, serves as a strong benchmark. On AndroidControl-L, Atlas achieves 86.80% AMR, while G + Q with Qwen achieves 85.70% AMR, a gap of 1.10 points.
  - On AndroidControl-H, Atlas reaches 68.65% AMR, while G + Q with Qwen achieves 66.24% AMR, a larger gap of 2.41 points but still competitive given the data difference.
  - On AITZ, Atlas achieves 67.04% AMR, and G + Q with Qwen achieves 66.91% AMR, a minimal discrepancy of only 0.13 percentage points.
  - Crucially, query inference achieves this performance with less than 0.1% of the training data used for Atlas (9,570 samples from the refined UIBERT dataset vs. over 13 million for OS-Atlas). This highlights the data efficiency of query inference and its potential as a more effective approach in resource-constrained scenarios.
  - Furthermore, the TMR of G + Q on AndroidControl-L (96.48%) and AITZ (82.54%) even surpasses that of directly fine-tuning Atlas (94.96% and 82.03%, respectively), indicating robust action type prediction.
- Action Type-Specific Performance:
  - Query inference shows its clearest improvements for CLICK actions, consistently yielding the best or second-best results when adopted as an alternative or pivot task. This is expected, as CLICK actions often involve precise coordinate identification and understanding the intent behind clicking a particular element.
  - For other action types like SCROLL, PRESS, and OPENAPP, query inference generally demonstrates superior or comparable performance to SFT and grounding.
  - However, for TYPE actions, AMR degrades significantly across most models, including Atlas, compared to directly fine-tuning Qwen on mobile agent benchmarks. For example, on AndroidControl-L, SFT achieves 88.95% TYPE AMR, while G, Q, G + Q, and Atlas all show lower TYPE AMR (77.69%, 79.59%, 79.91%, and 79.59%, respectively). This degradation is attributed to vertical tuning on GUI scenarios, which might hinder the model's general instruction-following capability, especially for tasks like TYPE that involve generating text. Despite this, the overall AMR improves with query inference.
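The comparisons above reduce to a few subtractions over the TOTAL AMR column of Table 2. The short script below reproduces them as a worked check. The dictionary layout and function name are illustrative choices, and the data-fraction figure treats "over 13 million" as exactly 13 million, so it is an upper bound on the true fraction.

```python
# TOTAL AMR (%) values copied from Table 2.
amr = {
    "AndroidControl-L": {"SFT": 84.33, "G": 83.88, "Q": 85.45, "G+Q": 85.70, "Atlas": 86.80},
    "AndroidControl-H": {"SFT": 65.23, "G": 65.57, "Q": 66.11, "G+Q": 66.24, "Atlas": 68.65},
    "AITZ":             {"SFT": 61.43, "G": 63.48, "Q": 66.62, "G+Q": 66.91, "Atlas": 67.04},
}


def summarize(table):
    """Return per-dataset AMR differences in percentage points."""
    summary = {}
    for dataset, row in table.items():
        summary[dataset] = {
            "Q_minus_G": round(row["Q"] - row["G"], 2),            # gain of Q over G
            "pivot_minus_Q": round(row["G+Q"] - row["Q"], 2),      # extra gain of G + Q
            "Atlas_minus_pivot": round(row["Atlas"] - row["G+Q"], 2),  # remaining gap
        }
    return summary


# Share of pre-training data: 9,570 samples vs. (at least) 13 million for Atlas.
data_fraction_percent = 9_570 / 13_000_000 * 100

for dataset, deltas in summarize(amr).items():
    print(dataset, deltas)
print(f"data fraction: {data_fraction_percent:.4f}%")
```

The remaining gap to Atlas is smallest on AITZ and largest on AndroidControl-H, while the pre-training data share stays below 0.1% in all three cases.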
6.2. Ablation Studies / Parameter Analysis
6.2.1. Influence of Training Data Scale
To understand the robustness of query inference under varying data constraints, the authors investigate its effectiveness across different training data scales (1,000, 2,000, and 5,000 samples) by randomly extracting subsets from the refined query inference dataset and subsequently fine-tuning on AITZ.
The results are presented in Figure 3.
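The image is a chart showing the overall action prediction performance (TMR and AMR) on the AITZ dataset for different training tasks (labeled G, Q, and G + Q) under different pre-training data scales. It compares how the performance of the three approaches changes as the amount of data grows.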
Figure 3: The overall action prediction performance on AITZ when trained with grounding, query inference as the alternative task, and query inference as the pivot task across various data scales.
Conclusions from Figure 3:
- Query Inference is Generally More Effective and Data-Efficient:
  - Grounding (G) performance increases gradually with more data, showing steady but slower improvements. It benefits from larger datasets.
  - Query inference (Q) improves much faster. It reaches its peak performance with approximately 2,000 training samples and then plateaus. This highlights its efficiency with limited data: it consistently outperforms grounding across all tested data scales (1k, 2k, 5k samples).
- Query Inference as Pivot Task (G + Q) Benefits from Larger Datasets:
  - When 5,000 or more training samples are used, query inference as the pivot task (G + Q) yields better performance than query inference as the alternative task (Q). This suggests that the combined approach (grounding + query inference) leverages the strengths of both tasks more effectively as data availability increases.
  - Conversely, with smaller datasets (1k, 2k samples), query inference as the alternative task (Q) performs slightly better than G + Q, indicating that the single query inference task might be more robust with very sparse data.
- Grounding is More Sensitive to Data Scale:
  - A significant performance increase is observed for grounding when training data exceeds 5,000 samples. This is consistent with previous research (Wu et al., 2025; Qin et al., 2025) showing that grounding substantially benefits from large-scale training data, so its full effect appears only when sufficient resources are available. The current paper focuses on the limited-data regime, where query inference is the stronger option.
6.2.2. Combination with CoT-enhanced Reasoning
The paper explores the impact of Chain-of-Thought (CoT) components on 7B-level perception-enhanced MLLMs in resource-constrained scenarios, using the CoAT dataset AITZ. Four components are considered: screen descriptions (SD) and previous action results (PAR) as input semantic information, and action thoughts (AT) and next action descriptions (AD) as intermediate reasoning results in outputs. Experiments are grouped into four categories: (i) no CoAT, (ii) only input components, (iii) only output components, and (iv) both input and output components.
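A small sketch may help clarify how the four CoAT components enter a training example: SD and PAR extend the model input, while AT and AD extend the target the model must generate before the final action. The `Step` record, the field labels, and the text templates below are hypothetical; the actual prompts are in Appendix B of the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Step:
    """Hypothetical record for one annotated AITZ step."""
    goal: str
    screen_description: str      # SD
    previous_action_result: str  # PAR
    action_thought: str          # AT
    action_description: str      # AD
    action: str                  # final action string


def build_example(step: Step, sd: bool = False, par: bool = False,
                  at: bool = False, ad: bool = False) -> Tuple[str, str]:
    """Return (model input text, target output text) for one configuration."""
    inputs = [f"Goal: {step.goal}"]
    if sd:   # input-side semantic information
        inputs.append(f"Screen description: {step.screen_description}")
    if par:
        inputs.append(f"Previous action result: {step.previous_action_result}")

    outputs = []
    if at:   # output-side intermediate reasoning, generated before the action
        outputs.append(f"Thought: {step.action_thought}")
    if ad:
        outputs.append(f"Next action: {step.action_description}")
    outputs.append(f"Action: {step.action}")
    return "\n".join(inputs), "\n".join(outputs)
```

Under this framing, input components only lengthen what the model reads, whereas output components lengthen what it must write before committing to an action, which is the distinction the four experiment categories are designed to isolate.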
The overall and type-wise results are presented in Table 3. The following are the results from Table 3 of the original paper:
| Pre-training | ID | Input | Output | SCROLL | CLICK | TYPE | PRESS | TOTAL | |||||
| SD | PAR | AT | AD | TMR↑ | TMR↑ | AMR↑ | TMR↑ | AMR↑ | TMR↑ | TMR↑ | AMR↑ | ||
| G | 1 | 60.39 | 86.51 | 66.88 | 86.80 | 48.60 | 73.58 | 81.58 | 63.48 | ||||
| 2 | ✓ | 60.40 | 85.96 | 66.56 | 88.80 | 49.00 | 73.96 | 81.22 | 65.77 | ||||
| 3 | ✓ | 60.73 | 86.95 | 67.32 | 86.40 | 47.20 | 75.47 | 81.75 | 66.23 | ||||
| 4 | √ | ✓ | 60.23 | 85.64 | 66.04 | 88.80 | 51.00 | 73.58 | 81.14 | 65.79 | |||
| 5 | ✓ | 53.24 | 84.28 | 61.51 | 83.80 | 48.00 | 72.08 | 77.10 | 60.12 | ||||
| 6 | √ | 60.57 | 88.67 | 65.83 | 85.20 | 48.00 | 73.96 | 82.36 | 65.09 | ||||
| 7 | ✓ | ✓ | 50.75 | 72.33 | 52.12 | 80.60 | 45.00 | 69.81 | 69.60 | 54.13 | |||
| 8 | √ | ✓ | ✓ | 50.42 | 73.61 | 53.76 | 82.00 | 44.40 | 69.81 | 70.07 | 54.59 | ||
| 9 | V | ✓ | ✓ | 50.92 | 72.40 | 52.56 | 82.40 | 46.40 | 70.57 | 70.00 | 54.59 | ||
| 10 | √ | ✓ | ✓ | ✓ | 50.58 | 73.90 | 54.09 | 84.00 | 45.20 | 69.81 | 70.17 | 54.59 | |
| Q | 1 | 60.23 | 87.57 | 67.80 | 88.20 | 48.60 | 77.36 | 82.26 | 66.62 | ||||
| 2 | ✓ | 61.73 | 87.61 | 68.46 | 88.80 | 49.40 | 76.98 | 82.77 | 66.62 | ||||
| 3 | ✓ | 61.23 | 87.76 | 67.84 | 89.60 | 49.20 | 76.98 | 82.87 | 67.06 | ||||
| 4 | ✓ | ✓ | 63.89 | 85.78 | 66.89 | 90.60 | 50.20 | 77.36 | 82.13 | 66.91 | |||
| 5 | ✓ | 50.25 | 84.61 | 63.71 | 84.20 | 47.40 | 72.08 | 77.05 | 61.05 | ||||
| 6 | ✓ | 58.74 | 89.00 | 66.52 | 86.40 | 47.00 | 74.72 | 82.15 | 64.97 | ||||
| 7 | ✓ | ✓ | 49.42 | 73.65 | 53.11 | 81.60 | 45.40 | 72.83 | 70.58 | 54.85 | |||
| 8 | √ | ✓ | ✓ | 52.75 | 72.77 | 52.92 | 82.60 | 46.80 | 70.19 | 70.03 | 55.21 | ||
| 9 | V | ✓ | ✓ | 51.41 | 73.21 | 53.33 | 82.40 | 46.40 | 73.21 | 70.53 | 54.74 | ||
| 10 | √ | ✓ | ✓ | ✓ | 50.42 | 72.84 | 52.81 | 81.80 | 43.80 | 69.43 | 69.71 | 54.09 | |
| G + Q | 1 | 63.06 | 87.54 | 67.65 | 87.80 | 48.80 | 78.49 | 82.54 | 66.91 | ||||
| 2 | ✓ | 61.73 | 87.61 | 67.98 | 89.00 | 50.20 | 75.85 | 82.35 | 67.27 | ||||
| 3 | ✓ | 60.73 | 87.43 | 67.25 | 89.80 | 51.60 | 75.85 | 82.88 | 66.62 | ||||
| 4 | √ | ✓ | 61.23 | 87.06 | 67.95 | 88.60 | 47.60 | 76.23 | 80.88 | 65.47 | |||
| 5 | ✓ | 52.25 | 84.14 | 62.83 | 81.60 | 47.40 | 72.83 | 77.34 | 61.39 | ||||
| 6 | ✓ | 60.40 | 88.78 | 65.57 | 86.60 | 49.20 | 71.70 | 82.26 | 64.86 | ||||
| 7 | ✓ | ✓ | 52.91 | 72.04 | 52.12 | 82.20 | 48.20 | 75.47 | 70.17 | 55.03 | |||
| 8 | ✓ | ✓ | ✓ | 50.25 | 72.62 | 52.81 | 81.80 | 46.60 | 70.57 | 69.39 | 54.19 | ||
| 9 | ✓ | ✓ | ✓ | 50.42 | 72.95 | 54.02 | 83.80 | 49.00 | 71.70 | 70.41 | 55.75 | ||
| 10 | √ | ✓ | ✓ | ✓ | 51.58 | 73.83 | 53.03 | 82.00 | 46.00 | 70.94 | 70.62 | 54.76 | |
Findings:
- Input Semantic Information Boosts Performance:
  - Incorporating additional semantic information into inputs (i.e., SD, PAR, or both) generally improves action prediction performance.
  - For grounding-enhanced models (G): Adding SD (ID 2) or PAR (ID 3) increases AMR from 63.48% (ID 1) to 65.77% and 66.23%, respectively. Combining both (SD + PAR, ID 4) also shows improvement (65.79%).
  - For query inference (Q): Adding SD (ID 2) leaves AMR unchanged at 66.62% while raising TMR from 82.26% to 82.77%, and adding PAR (ID 3) reaches 67.06% AMR. Combining both (SD + PAR, ID 4) yields 66.91%. Notably, Q with PAR (67.06%) and G + Q with SD (67.27%) surpass the Atlas baseline (67.04% from Table 2).
  - For query inference as pivot (G + Q): Adding SD (ID 2) increases AMR from 66.91% (ID 1) to 67.27%. Adding PAR (ID 3) gives 66.62%, slightly below the ID 1 baseline.
  - This indicates that providing MLLMs with richer contextual information about the GUI environment (screen descriptions, previous action outcomes) tends to enhance their perception and lead to more accurate action decisions.
- Output CoT Components Degrade Performance:
  - Incorporating intermediate reasoning results into outputs (i.e., AT, AD, or both) generally leads to a significant degradation in action prediction performance.
  - For all three pre-training approaches (G, Q, G + Q), adding AT (ID 5) or AT + AD (ID 7) consistently lowers AMR compared to the no-CoAT baseline (ID 1). For example, Q with AT (ID 5) drops AMR from 66.62% to 61.05%. With AT + AD (ID 7), AMR falls to around 54-55% across all models. AD alone (ID 6) lowers AMR for Q and G + Q, whereas for G it is slightly above the ID 1 baseline (65.09% vs. 63.48%).
  - The degradation persists when both input and output components are combined (IDs 8-10), with AMR remaining at roughly 54-56%, well below 60%.
  - The paper attributes this decline to the relatively small scale of the 7B model. It suggests that such models struggle to process complex reasoning effectively. When lengthy intermediate reasoning results are introduced into the output, the model may become overly focused on the reasoning chain itself rather than generating the correct final action. This implies a limitation in the reasoning capacity of smaller MLLMs to handle verbose CoT outputs.
- Query Inference Outperforms Grounding with CoT:
  - Within each group of CoAT configurations (same ID), query inference (either Q or G + Q) generally outperforms grounding (G). This holds even when CoT components are integrated, further highlighting the effectiveness of query inference and its ability to integrate with CoT-enhanced reasoning when only input information is leveraged.

In summary, for resource-constrained 7B-level MLLMs, enriching the input with semantic information boosts reasoning performance when combined with query inference. However, forcing these smaller models to generate verbose CoT outputs can be detrimental, indicating a need to tailor CoT strategies to model scale.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively identifies and addresses a crucial challenge in the development of MLLM-powered GUI agents: the performance gap between coordinate-oriented grounding and action-oriented reasoning within resource-constrained scenarios. The authors propose query inference, a novel query-oriented pivot approach, as a solution. By training MLLMs to deduce intended user queries from given action coordinates and screenshots, query inference significantly enhances the model's comprehension of user intent while maintaining sensitivity to grounding coordinates.
The experimental results show that query inference outperforms traditional grounding techniques at the same limited data scale. Notably, query inference achieves performance comparable to, and on some metrics better than, the large-scale grounding-enhanced OS-Atlas (pre-trained on over 13 million grounding samples) while using less than 0.1% of the training data. This underscores its data efficiency and practical utility for scenarios where extensive data collection and computational resources are prohibitive. Furthermore, the research shows that integrating additional semantic information (such as screen descriptions and previous action results) into the input further boosts reasoning performance, providing another avenue for improvement in resource-constrained settings.
7.2. Limitations & Future Work
The authors acknowledge two primary limitations of their proposed approach:
- Reliance on SFT for Specific Benchmarks: The method focuses on enhancing perception for reasoning with a small-scale dataset. This design choice, while boosting performance in resource-constrained scenarios, may weaken the zero-shot capability of the MLLM. Consequently, the model requires supervised fine-tuning (SFT) on specific agent benchmarks to perform well, rather than being able to generalize immediately to unseen tasks. Future work could explore how to retain or improve zero-shot generalization while still benefiting from the data efficiency of query inference.
- Scope Limited to Resource-Constrained Scenarios: The study explicitly focuses on resource-constrained scenarios. The authors caution that the observed results and conclusions might differ when large-scale training data is available. They reiterate that grounding has been proven effective in such data-rich settings. This implies that for scenarios without resource limitations, large-scale grounding might still be the preferred or equally effective approach. Future research could investigate the optimal synergy or transition point between query inference and large-scale grounding as data availability increases.
7.3. Personal Insights & Critique
This paper presents an elegant and highly practical solution to a critical problem in GUI agent development. The identification of the format discrepancy between coordinate-oriented grounding and action-oriented reasoning is a keen insight, and the query inference mechanism serves as a logically sound and empirically effective bridge.
Inspirations:
- The "Reverse" Thinking: The idea of "reverse grounding" by inferring queries from coordinates is particularly inspiring. It pushes the model to learn a deeper, more intentional understanding of GUI elements, which is a more natural fit for action-oriented tasks. This approach could be applied to other domains where low-level perception needs to be aligned with high-level intent (e.g., inferring user intent from gaze data or touch inputs in AR/VR).
- Data Efficiency for Small Models: The data efficiency of query inference (achieving performance comparable to OS-Atlas with <0.1% of the data) is a significant contribution. It makes capable GUI agents more accessible to researchers and developers with limited computational resources, and it shows that well-structured tasks can partly compensate for raw data volume.
- Leveraging Proprietary MLLMs for Data Curation: The three-step pipeline for constructing query inference data (refinement, re-grounding, accuracy analysis) is a clever way to leverage larger, more capable MLLMs (like Qwen-VL-Max) for high-quality data annotation and filtering. This method could be generalized to create training data for other complex MLLM tasks, reducing manual annotation effort and improving data quality.
Potential Issues/Critique:
- "TYPE" Action Degradation: The consistent AMR degradation for TYPE actions across grounding, query inference, and Atlas models is a notable weakness. While the authors hypothesize that vertical tuning harms instruction-following, this warrants deeper investigation. TYPE actions are fundamental for GUI agents, and a robust solution remains crucial. Perhaps query inference could be adapted to infer "what to type" in addition to "where to click."
- CoT Output Issues: The finding that CoT outputs degrade performance for 7B-level MLLMs is a critical insight. It suggests that while CoT is powerful for very large models, smaller models might be overwhelmed by the verbosity, indicating a need for more concise or structured intermediate reasoning outputs, or perhaps a different CoT strategy (e.g., abstract thought tokens rather than full sentences). This highlights the importance of model-scale awareness when designing CoT prompts.
- Generalizability of Refined Queries: The query refinement step uses a specific format: "click on the [element_name] for [purpose]". While effective, the generalizability of this format across diverse GUI types and user intents should be explored. Could some intents be poorly captured by this structure?
- Ethical Considerations: While the ethics statement addresses privacy and system security, the potential for GUI agents to be exploited for harmful purposes (as mentioned in the paper) implies a continuous need for robust safety mechanisms. As query inference makes these agents more capable with less data, the risk surface potentially broadens if not mitigated by strong ethical guardrails.

Overall, this paper provides a valuable contribution to the field, offering a data-efficient and conceptually sound approach to improve MLLM-powered GUI agents. Its findings on task format discrepancy and on the limits of CoT for smaller models are particularly useful and will likely influence future research in GUI automation and multi-modal AI.