Paper status: completed

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

Published:03/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces query inference bridging coordinate grounding and action reasoning in MLLM-powered GUI agents, achieving superior performance with minimal data and enhanced reasoning by integrating semantic information.

Abstract

Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves comparable or even better performance to large-scale grounding-enhanced OS-Atlas with less than 0.1% of training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.

1. Bibliographic Information

1.1. Title

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

1.2. Authors

Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu. All authors are affiliated with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. Their research backgrounds appear to be in areas related to multi-modal machine learning, large language models, graphical user interface (GUI) automation, and agent systems.

1.3. Journal/Conference

This paper was published on arXiv, a preprint server. arXiv is not a peer-reviewed journal or conference, but it is a widely recognized platform for disseminating early research, particularly in computer science. Papers posted on arXiv often undergo subsequent peer review and publication at conferences (e.g., ACL, EMNLP, NeurIPS, ICLR) or in journals. The paper's references include publications in top-tier NLP and ML venues, suggesting the authors are active in high-impact research.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the challenge of enhancing graphical user interface (GUI) agents powered by multi-modal large language models (MLLMs) in resource-constrained scenarios. It highlights a limitation where coordinate-oriented grounding (identifying element coordinates for queries) struggles to effectively support action-oriented reasoning (predicting actions to achieve goals) due to their format discrepancy, especially with limited training data. To bridge this gap, the authors propose a query-oriented pivot approach called query inference. This method infers potential user queries from a screenshot and its associated element coordinates, thereby improving the understanding of coordinates and better aligning with reasoning tasks. Experimental results demonstrate that query inference outperforms previous grounding techniques under the same training data scale. Notably, it achieves comparable or superior performance to large-scale grounding-enhanced OS-Atlas using less than 0.1% of the training data. The paper further explores the impact of reasoning formats, showing that integrating additional semantic information into the input boosts reasoning performance.

Original source: https://arxiv.org/abs/2503.00401. The paper is a preprint on arXiv (version 2), indicating it has not yet undergone formal peer review for a conference or journal publication. The PDF link is https://arxiv.org/pdf/2503.00401v2.pdf.

2. Executive Summary

2.1. Background & Motivation

The development of multi-modal large language models (MLLMs) has opened promising avenues for creating sophisticated GUI agents. These agents aim to interact with digital interfaces like humans do, performing actions such as clicking, typing, and scrolling to achieve user-specified goals. However, a significant challenge arises because most MLLMs are not inherently pre-trained on GUI screenshots, leading to difficulties in perceiving and understanding complex GUI environments.

To overcome this, perception-enhanced pre-training techniques, particularly grounding, have been widely adopted. Grounding involves training models to identify the coordinates of GUI elements corresponding to specific user queries. While effective, traditional grounding typically requires vast amounts of training data, often exceeding millions of samples. This reliance on large datasets poses a problem in resource-constrained scenarios, such as personalized agents, where computational resources and available training data are limited.

The core problem the paper aims to solve is the format discrepancy between coordinate-oriented grounding and action-oriented reasoning in resource-constrained scenarios. Grounding focuses on localizing coordinates based on low-level queries, whereas reasoning requires a deeper understanding of high-level user intent to predict a sequence of actions. This discrepancy, combined with limited training data, results in minimal improvements in reasoning performance when relying solely on grounding. This highlights a significant gap in current approaches, underscoring the need for a more efficient and aligned perception-enhancement strategy for GUI agents operating under constraints.

The paper's innovative idea is to introduce query inference as a query-oriented pivot task. Instead of just localizing elements from queries, query inference attempts to deduce the user's intended query from the element's coordinates. This "reverse grounding" process aims to improve the model's understanding of both GUI layouts and user intent, thereby creating a smoother bridge between low-level coordinate perception and high-level action reasoning.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of MLLM-powered GUI agents, particularly in resource-constrained scenarios:

  1. Identification of a Critical Gap: The authors rigorously investigate the effectiveness of grounding with limited training data for reasoning tasks. They find that grounding provides minimal improvements in such scenarios, exposing a significant gap stemming from the task format discrepancy between coordinate-oriented grounding and action-oriented reasoning. This highlights a fundamental limitation of existing perception-enhanced pre-training methods under resource constraints.
  2. Proposal of Query Inference as a Pivot Task: To bridge this identified gap, the paper proposes query inference, a novel query-oriented pivot approach. This method deduces the intended user queries corresponding to action coordinates. By doing so, query inference aims to enhance the MLLM's understanding of coordinates and GUI layouts while simultaneously aligning more closely with the requirements of action-oriented reasoning. This task implicitly improves query comprehension and serves as a bridge to smooth the transition from perception to action.
  3. Validation of Effectiveness and Efficiency: Through extensive experiments, the paper validates the effectiveness and potential of query inference in resource-constrained scenarios. Key findings include:
    • Query inference consistently outperforms grounding when using the same scale of training data.

    • When used as a pivot task (i.e., combining grounding and query inference), it further enhances action prediction performance.

    • Remarkably, query inference achieves comparable or even better performance than large-scale grounding-enhanced OS-Atlas (a model trained on over 13 million samples) using less than 0.1% of the training data. This demonstrates its exceptional data efficiency.

    • The paper also shows that integrating additional semantic information (e.g., screen descriptions, previous action results) into the input further boosts reasoning performance, especially for CLICK actions.

These findings collectively demonstrate that query inference offers an effective and data-efficient alternative for enhancing MLLM-powered GUI agents in resource-constrained environments, improving their reasoning capabilities by smoothing the interface between perception and action.
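The "less than 0.1%" figure can be checked with a line of arithmetic. The sketch below assumes the two counts quoted in this analysis: the 9,570 refined triplets reported in the experimental setup, and 13 million as a lower bound on the OS-Atlas grounding data (the text says "over 13 million", so the true share is slightly smaller). The variable names are illustrative only.

```python
# Counts quoted in this analysis; OS-Atlas is described as "over 13 million"
# samples, so the computed share is an upper bound on the real one.
QUERY_INFERENCE_SAMPLES = 9_570
OS_ATLAS_SAMPLES = 13_000_000

share_percent = 100 * QUERY_INFERENCE_SAMPLES / OS_ATLAS_SAMPLES
print(f"{share_percent:.3f}% of the OS-Atlas grounding data")  # prints 0.074% ...
```

The result, roughly 0.074%, is consistent with the claim of using less than 0.1% of the training data.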

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the paper, a beginner should understand the following key concepts:

  • Multi-modal Large Language Models (MLLMs): An MLLM is an advanced type of Large Language Model (LLM) that can process and understand information from multiple modalities, not just text. In the context of GUI agents, MLLMs typically integrate vision modules (to interpret screenshots or visual data) with language models (to understand instructions and generate actions). They leverage the powerful text generation and reasoning capabilities of LLMs but extend them to include visual perception, making them suitable for tasks that involve both seeing and understanding.

  • Graphical User Interface (GUI) Agents: A GUI agent is an AI system designed to interact with digital interfaces (like mobile apps, websites, or desktop applications) in a human-like manner. It perceives the GUI (e.g., through screenshots), understands user goals, and executes actions (e.g., CLICK, TYPE, SCROLL, OPENAPP) to achieve those goals. These agents aim to automate tasks that humans perform by interacting with GUIs.

  • Perception-Enhanced Pre-training: Since MLLMs are often initially trained on general-purpose data (e.g., natural images, text from the internet), they may not be optimized for the specific visual characteristics of GUIs (e.g., high-density elements, specific UI conventions). Perception-enhanced pre-training involves additional training on GUI-specific datasets to improve the MLLM's ability to accurately perceive and understand GUI layouts, elements, and their functionalities. This specialized training helps the MLLM become more adept at GUI interpretation.

  • Grounding (GUI Grounding): In GUI agents, grounding is a perception-enhanced pre-training task where the model learns to identify and localize specific GUI elements on a screenshot based on a textual query. For example, given a screenshot and the query "click the search icon," the model should output the coordinates (e.g., bounding box) of the search icon. It's coordinate-oriented because its primary output is spatial location.

  • Reasoning (Action-Oriented Reasoning): Reasoning in GUI agents refers to the process of interpreting a high-level user goal (e.g., "find a restaurant nearby") and breaking it down into a sequence of executable actions (e.g., CLICK search bar, TYPE "restaurants", CLICK search button, SCROLL down, CLICK on a restaurant listing). This process requires understanding user intent, GUI context, and predicting appropriate action types and parameters. It's action-oriented because its primary output is a plan of actions.

  • Coordinate-Oriented vs. Action-Oriented Tasks:

    • Coordinate-Oriented: Tasks like grounding where the output is primarily spatial coordinates (points or bounding boxes) on a screen, responding to explicit element references.
    • Action-Oriented: Tasks like reasoning where the output is a sequence of actions with their parameters, aiming to achieve a high-level goal. This requires understanding intent and context beyond simple element localization.

    The paper highlights the format discrepancy between these two types of tasks as a key challenge.

  • Chain-of-Thought (CoT) Reasoning: CoT is a prompting technique used with LLMs and MLLMs that encourages the model to generate a series of intermediate reasoning steps before arriving at a final answer or action. Instead of directly outputting an action, the model might first generate "thoughts" or "plans," which mimic human problem-solving. This makes the reasoning process more transparent and often improves the accuracy of complex tasks. CoT components can be integrated into inputs (e.g., screen descriptions, previous action results) or outputs (e.g., action thoughts, next action descriptions).

  • Supervised Fine-Tuning (SFT): SFT is a common technique in machine learning where a pre-trained model (like an MLLM) is further trained on a smaller, task-specific dataset with labeled examples. The goal is to adapt the pre-trained model's general knowledge to perform well on a particular downstream task, such as reasoning for GUI agents.

3.2. Previous Works

The paper contextualizes its work within three main areas: MLLM-powered GUI agents, perception-enhanced pre-training, and CoT enhanced reasoning.

3.2.1. MLLM-powered GUI Agents

Traditionally, GUI agents often relied on text-based perception, requiring programmatic access to GUI element hierarchies or accessibility trees (e.g., Zhou et al., 2024; Deng et al., 2024). These methods can be brittle or require system-level permissions. With the advent of MLLMs (e.g., Yin et al., 2024; Chen et al., 2024b; Wang et al., 2024), the paradigm has shifted. MLLM-powered GUI agents (e.g., Cheng et al., 2024; Hong et al., 2024; Gou et al., 2025) can directly perceive GUI environments using their vision modules from screenshots, much like humans. They then interact through human-like actions (CLICK, TYPE, SCROLL) without needing explicit programmatic interactions or API calls (Sun et al., 2024a; Wu et al., 2024b; Zhang et al., 2024b). This direct visual perception makes them more generalizable across different GUI systems.

3.2.2. Perception-Enhanced Pre-training

Most MLLMs are pre-trained on natural images, which differ significantly from the structured, information-dense GUI layouts (Wu et al., 2025). To overcome this, perception-enhanced pre-training tasks are crucial for improving GUI understanding.

  • Grounding: This is the most prevalent task (Wu et al., 2025; Qian et al., 2024). It involves localizing GUI elements (outputting coordinates) based on textual queries. The paper formulates grounding as: $ \mathcal{G} : \{ \langle s, q \rangle \} \to \{ c \} $ where $s$ is the screenshot, $q$ is the low-level query (e.g., "click the clock icon"), and $c$ represents the coordinates (points or bounding boxes). An important aspect highlighted is that $q$ can be direct or require additional reasoning for context.
  • GUI Referring: Generating descriptions for specific GUI elements (Zhang et al., 2024d; You et al., 2025).
  • Screen Question Answering (Screen QA): Answering questions about screen contents and functionalities (Baechler et al., 2024; Chen et al., 2024a).

The paper notes that perception-enhanced pre-training typically demands large-scale training data, and its efficacy in resource-constrained scenarios is underexplored.

3.2.3. CoT Enhanced Reasoning

Chain-of-Thought (CoT) has been adopted to improve reasoning in GUI agents (Zhang et al., 2024c; Sun et al., 2024b). This involves leveraging proprietary MLLMs as annotation models (Achiam et al., 2023; Bai et al., 2023) to automatically generate semantic information to enrich training data.

  • Input Enhancements: Explanations of GUI environments like screen descriptions (SD), previous action results (PAR), and GUI layouts (Ma et al., 2024) are integrated into the model's input to enhance perception.
  • Output Enhancements: Intermediate reasoning results such as action thoughts (AT) and next action descriptions (AD) are included in the model's output to guide the reasoning process.

The paper formulates reasoning at step $i$ as: $ \mathscr{R} : \{ s_i, \{a_{<i}\}, g \} \to \{ a_i \} $ where $s_i$ is the current screenshot, $\{a_{<i}\}$ are the historical actions, $g$ is the final goal, and $a_i$ is the current action. $a_i$ typically includes an action type $t$ and action parameters $p$ (e.g., typed text, coordinates). CoT components $r$ can also be part of $a_i$.
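To make the input/output split concrete, here is a minimal sketch of how SD and PAR could be appended to the model input while AT and AD are prepended to the target output. The function names, field labels, and text layout are assumptions for illustration; the paper's actual prompt format is not reproduced in this analysis.

```python
from typing import List, Optional


def build_reasoning_input(
    goal: str,
    history: List[str],
    screen_description: Optional[str] = None,      # SD (input-side CoT)
    previous_action_result: Optional[str] = None,  # PAR (input-side CoT)
) -> str:
    """Assemble the text part of the input; the screenshot is passed separately."""
    parts = [
        f"Goal: {goal}",
        "History: " + ("; ".join(history) if history else "none"),
    ]
    if screen_description:
        parts.append(f"Screen description: {screen_description}")
    if previous_action_result:
        parts.append(f"Previous action result: {previous_action_result}")
    return "\n".join(parts)


def build_reasoning_target(
    action: str,
    action_thought: Optional[str] = None,      # AT (output-side CoT)
    action_description: Optional[str] = None,  # AD (output-side CoT)
) -> str:
    """Assemble the training target: optional CoT components, then the action."""
    parts = []
    if action_thought:
        parts.append(f"Thought: {action_thought}")
    if action_description:
        parts.append(f"Next action: {action_description}")
    parts.append(f"Action: {action}")
    return "\n".join(parts)


example_input = build_reasoning_input(
    goal="find a restaurant nearby",
    history=["CLICK search bar"],
    screen_description="A maps app with an empty search field.",
)
example_target = build_reasoning_target('TYPE "restaurants"')
```

Keeping the semantic hints on the input side leaves the target as a bare action, which is the configuration the analysis reports as helping reasoning performance.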

3.3. Technological Evolution

The field of GUI agents has evolved from early rule-based systems and programmatic interaction methods to data-driven approaches. Initially, agents might rely on parsing XML layouts or accessibility trees, which are exact but require specific system access and are not visually intuitive. The rise of deep learning and computer vision enabled visual recognition of GUI elements. However, these were often specialized models. The advent of large language models (LLMs) brought powerful reasoning capabilities, but they lacked visual perception.

The current frontier, MLLMs, combines the best of both worlds: visual perception and language-based reasoning. This allows GUI agents to see the screen (like humans), understand instructions (like LLMs), and act accordingly. Within this MLLM paradigm, the challenge shifted from basic perception to refining perception for action. Perception-enhanced pre-training (like grounding) became critical to adapt MLLMs to GUI specifics.

This paper fits into this timeline by addressing a nuance within perception-enhanced pre-training: how to make grounding more effective and data-efficient for action-oriented reasoning when resources are limited. It proposes query inference as an evolution of grounding that is better aligned with the ultimate goal of GUI agents: performing coherent actions.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's query inference approach are:

  1. Bridging the Format Discrepancy: Previous grounding techniques are inherently coordinate-oriented (outputting coordinates from a query), while reasoning tasks are action-oriented (outputting actions from a goal). This paper explicitly identifies and addresses the format discrepancy between these two, especially in resource-constrained scenarios. Query inference acts as a pivot by performing a "reverse grounding" — inferring a query from coordinates. This makes the perception task more query-centric, which naturally aligns better with the language-based instructions and goals of reasoning.

  2. Enhanced Query Comprehension for Reasoning: While grounding provides MLLMs with the ability to locate elements, it doesn't necessarily deepen the understanding of the intent behind interacting with those elements. Query inference, by deducing the intended user query for an action coordinate, forces the model to understand the purpose of an interaction. This enhanced query comprehension is directly beneficial for reasoning, which relies heavily on understanding user intent.

  3. Data Efficiency in Resource-Constrained Scenarios: The most significant differentiation is its effectiveness with limited data. Grounding is known to require large-scale datasets (millions of samples) to be truly effective. This paper demonstrates that query inference achieves comparable or even superior reasoning performance to large-scale grounding models (like OS-Atlas) using less than 0.1% of the training data. This makes it a highly practical and efficient solution for resource-constrained environments where large-scale data collection or training is infeasible.

  4. Leveraging Proprietary MLLMs for Data Construction: The methodology for creating query inference data involves a novel three-step pipeline (query refinement, re-grounding, accuracy analysis) that uses powerful proprietary MLLMs (like Qwen-VL-Max) as annotation/refinement models. This ensures the generated queries are high-quality, intended, and properly formatted, addressing the issue of "unintended" queries often found in raw grounding datasets. This intelligent data curation process contributes to its data efficiency.

  5. Focus on Specificity of GUI Interaction: By moving from simply "finding an element" to "understanding the intention of interacting with an element at these coordinates," query inference pushes the MLLM to learn more semantically rich representations of GUI interactions. This semantic grounding is a key innovation for GUI agents.

In essence, while grounding teaches an MLLM "where an element is," query inference teaches it "what a user intends to do with an element at this location," which is a more direct pathway to action-oriented reasoning.

4. Methodology

4.1. Principles

The core idea of the method is to address the inherent task format discrepancy between coordinate-oriented grounding and action-oriented reasoning in MLLM-powered GUI agents, particularly in resource-constrained scenarios. Grounding tasks focus on identifying coordinates from a query, while reasoning tasks aim to predict actions based on high-level goals. This paper proposes query inference as a query-oriented pivot approach to smooth this transition.

The theoretical basis and intuition behind query inference are rooted in the idea that for an MLLM to effectively perform action-oriented reasoning, it needs a profound comprehension of user intent. While grounding helps the model "see" the GUI, it doesn't necessarily deepen its "understanding" of why a user might interact with a particular element. By reversing the traditional grounding process, query inference trains the model to deduce the intended user query from given action coordinates. This forces the model to learn a more semantic understanding of GUI elements in relation to potential user goals. This query-centric understanding, which focuses on generating descriptive, intentional queries, naturally aligns better with the action-oriented nature of reasoning tasks, where high-level instructions dictate the required actions. It acts as a bridge by improving the MLLM's ability to interpret GUI elements not just as visual objects, but as targets for specific user intentions, thereby smoothing the flow from low-level perception to high-level action prediction.

4.2. Core Methodology In-depth

The proposed query inference approach involves a three-step pipeline to construct high-quality training data and subsequently train the foundation MLLM. This pipeline leverages proprietary MLLMs to refine grounding data into query inference samples, ensuring that the inferred queries are both accurate and reflect user intent.

4.2.1. Grounding and Reasoning Formulations

Before diving into query inference, it's crucial to understand the formal definitions of grounding and reasoning as presented in the paper.

Grounding: Grounding is a perception-enhanced pre-training task that aims to localize the coordinates $c$ of specific GUI elements based on the perception of screenshots $s$ and low-level queries $q$. The query $q$ can be explicit (e.g., "click the clock icon") or implicit, requiring additional reasoning for context (e.g., "click on the home button at top left"). The coordinates $c$ can be represented as points or bounding boxes. Formally, grounding is represented as: $ \mathcal{G} : \{ \langle s, q \rangle \} \to \{ c \} $ Here:

  • $\mathcal{G}$ represents the grounding function or model.
  • $s$ is the input screenshot of the GUI.
  • $q$ is the textual query, which can be a low-level instruction.
  • $c$ is the output, representing the localized coordinates (e.g., bounding box) of the GUI element corresponding to $q$ on $s$.

Reasoning: Reasoning predicts a chain of actions to achieve a high-level final goal based on the perception of GUI environments. At step $i$, the agent perceives the current screenshot $s_i$ along with historical actions $\{a_{<i}\}$ to predict the current action $a_i$ to achieve the final goal $g$. The action $a_i$ typically comprises an action type $t$ and action parameters $p$ (which may include typed text or coordinates $c$). Optionally, CoT components $r$ (e.g., intermediate reasoning thoughts) can also be introduced into $a_i$ to enhance reasoning. Formally, reasoning at step $i$ is formulated as: $ \mathscr{R} : \{ s_i, \{a_{<i}\}, g \} \to \{ a_i \} $ Here:

  • $\mathscr{R}$ represents the reasoning function or model.
  • $s_i$ is the screenshot at the current step $i$.
  • $\{a_{<i}\}$ represents the set of historical actions performed before step $i$.
  • $g$ is the high-level final goal the agent needs to achieve.
  • $a_i$ is the predicted action at step $i$, consisting of action type $t$, action parameters $p$, and optionally CoT components $r$.

The paper emphasizes that reasoning is action-oriented and requires understanding high-level user intent, while grounding is coordinate-oriented and aligns low-level queries with coordinates, lacking high-level intent perception. This creates the format discrepancy.
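The format discrepancy is easier to see when the two formulations are written as data types. The following is a minimal sketch; the class names, field names, and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration and are not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class GroundingSample:
    """Grounding maps <s, q> to c: the output is only a location."""
    screenshot: str  # s, e.g. a path to the image
    query: str       # q, a low-level query
    coords: BBox     # c, the target element's coordinates


@dataclass
class Action:
    """One action a_i: type t, parameters p, and optional CoT components r."""
    action_type: str                                       # t
    params: Dict[str, Any] = field(default_factory=dict)   # p
    cot: Optional[str] = None                              # r


@dataclass
class ReasoningInput:
    """Reasoning maps {s_i, {a_<i}, g} to a_i: the output is an action."""
    screenshot: str        # s_i
    history: List[Action]  # {a_<i}
    goal: str              # g


grounding = GroundingSample("home.png", "click the clock icon", (120, 48, 168, 96))
step = ReasoningInput(
    screenshot="step_3.png",
    history=[Action("CLICK", {"point": (540, 120)})],
    goal="find a restaurant nearby",
)
prediction = Action("TYPE", {"text": "restaurants"})
```

A grounding target is a bare box, while a reasoning target is a typed action conditioned on a goal and a history. Query inference, described next, keeps the box on the input side and makes the output linguistic.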

4.2.2. Query Inference Pipeline

The query inference methodology constructs samples by transforming existing grounding data into a format that deduces intended user queries from action coordinates. This is achieved through a three-step pipeline: query refinement, re-grounding, and analyzing the accuracy of re-grounding.

The overall three-step pipeline for constructing samples for query inference is illustrated in Figure 2.

Figure 2: The image is a schematic diagram illustrating the process of transforming triplets to quadruplets for multimodal input through query refinement and re-grounding of coordinates, including the use of prompt models for query refinement and re-grounding, as well as analyzing the accuracy of re-grounding to decide saving or discarding triplets.

4.2.2.1. Query Refinement

The first step addresses the issue that existing grounding queries ($q$) are often "unintended" or too low-level, making them unsuitable for directly inferring user intent for reasoning. To overcome this, a proprietary MLLM is used as a refinement model ($\mathcal{M}_r$) to transform these low-level unintended queries into intended, properly formatted queries ($q_r$).

The refinement model used is Qwen-VL-Max (Bai et al., 2023). The model is prompted to generate $q_r$ in a specific format: "click on the [element_name] for [purpose]". This format encourages the MLLM to deduce the underlying intention of an action interacting with the coordinate-specified elements. The inputs to $\mathcal{M}_r$ are the original screenshot, the low-level query, and the ground-truth coordinates.

Formally, the refinement process can be represented as: $ \mathcal{M}_r : \{ \langle s, q, c \rangle \} \to \{ q_r \} $ Here:

  • $\mathcal{M}_r$ denotes the refinement model (e.g., Qwen-VL-Max).
  • $s$ is the screenshot.
  • $q$ is the original low-level, unintended query from the grounding data.
  • $c$ is the corresponding ground-truth coordinates (e.g., bounding box) for $q$ on $s$.
  • $q_r$ is the refined, intended, and properly formatted user query generated by $\mathcal{M}_r$.
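The refinement step can be sketched as building a text prompt from $q$ and $c$ and then checking that the model's reply follows the "click on the [element_name] for [purpose]" format. The prompt wording below is a stand-in (the actual template is the one shown in Figure 5 of the paper), and the helper names and regular expression are assumptions for illustration. The call to the refinement model itself is omitted.

```python
import re
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max)

# Matches the target format "click on the [element_name] for [purpose]".
REFINED_QUERY_PATTERN = re.compile(
    r"^click on the (?P<element>.+?) for (?P<purpose>.+)$", re.IGNORECASE
)


def build_refinement_prompt(query: str, coords: BBox) -> str:
    """Hypothetical prompt text; the screenshot would be attached as an image."""
    return (
        f"The element inside the bounding box {list(coords)} is the target of "
        f"the query: '{query}'. Rewrite the query in the form "
        "'click on the [element_name] for [purpose]'."
    )


def parse_refined_query(reply: str) -> Optional[Tuple[str, str]]:
    """Return (element_name, purpose) if the reply follows the format, else None."""
    match = REFINED_QUERY_PATTERN.match(reply.strip().rstrip("."))
    if match is None:
        return None
    return match.group("element"), match.group("purpose")


prompt = build_refinement_prompt("click on location icon", (120, 48, 168, 96))
parsed = parse_refined_query("click on the map icon for navigating")
```

A format check of this kind only verifies the surface form of $q_r$; whether the refined query still refers to the right element is what the re-grounding step tests.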

4.2.2.2. Re-grounding

Automated query refinement can introduce errors or generate queries that don't accurately reflect the GUI content or user intent. To ensure data quality and filter out such incorrect information, a second step, re-grounding, is performed. In this step, Qwen-VL-Max is again used, but this time as a grounding model ($\mathcal{M}_g$).

The re-grounding model is prompted to localize coordinates ($c_r$) based on the newly refined queries ($q_r$) and the original screenshot ($s$). This process checks if the refined query can still accurately point back to the original element.

The re-grounding process is formulated as: $ \mathcal{M}_g : \{ \langle s, q_r \rangle \} \to \{ c_r \} $ Here:

  • $\mathcal{M}_g$ denotes the grounding model (e.g., Qwen-VL-Max).
  • $s$ is the screenshot.
  • $q_r$ is the refined query obtained from the query refinement step.
  • $c_r$ is the coordinates localized by $\mathcal{M}_g$ based on $s$ and $q_r$.

4.2.2.3. Analyzing the Accuracy of Re-grounding

The final step involves analyzing the accuracy of the re-grounded coordinates ($c_r$) against the original ground-truth coordinates ($c$) to filter for high-quality query inference samples. This ensures that only refined queries that accurately refer to the original GUI element are retained.

An indicator $\mathcal{T}$ is established to determine accuracy. It checks if the center point of the re-grounded coordinates $c_r$ lies within the bounding box represented by the ground-truth coordinates $c$.

The accuracy analysis is defined by the following indicator function: $ \mathcal{T}(c_r, c) = \begin{cases} 1, & \text{if the center of } c_r \text{ is inside } c, \\ 0, & \text{otherwise.} \end{cases} $ Here:

  • $\mathcal{T}$ is a binary indicator function.
  • $c_r$ are the coordinates localized by the re-grounding model.
  • $c$ are the original ground-truth coordinates.
  • A value of 1 indicates that the re-grounded coordinates are considered accurate, and the sample is retained.
  • A value of 0 indicates inaccuracy, and the sample is discarded.

If $\mathcal{T}(c_r, c) = 1$, the triplet $\langle s, q_r, c \rangle$ is retained as a data sample for query inference. Otherwise, the sample is discarded. This process results in a high-quality dataset consisting of triplets $\langle s, q_r, c \rangle$ for training the query inference model.
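The re-grounding check and the keep-or-discard decision can be sketched as follows. This is a minimal illustration under stated assumptions: boxes use an `(x_min, y_min, x_max, y_max)` layout, points on the box border count as inside, and the grounding model is represented by a hypothetical callable `ground(screenshot, refined_query)` rather than a real Qwen-VL-Max call.

```python
from typing import Callable, Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)
Triplet = Tuple[str, str, BBox]           # (s, q_r, c)


def center(box: BBox) -> Tuple[float, float]:
    """Center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2, (y_min + y_max) / 2


def indicator(c_r: BBox, c: BBox) -> int:
    """T(c_r, c): 1 if the center of c_r lies inside c (border inclusive), else 0."""
    cx, cy = center(c_r)
    x_min, y_min, x_max, y_max = c
    return int(x_min <= cx <= x_max and y_min <= cy <= y_max)


def filter_triplets(
    candidates: Iterable[Triplet],
    ground: Callable[[str, str], BBox],  # hypothetical stand-in for M_g
) -> List[Triplet]:
    """Keep <s, q_r, c> only when re-grounding q_r lands back on c."""
    kept = []
    for screenshot, refined_query, gt_box in candidates:
        regrounded = ground(screenshot, refined_query)  # c_r
        if indicator(regrounded, gt_box) == 1:
            kept.append((screenshot, refined_query, gt_box))
    return kept


def toy_ground(screenshot: str, refined_query: str) -> BBox:
    """Toy grounding model: finds the map icon, misses everything else."""
    if "map icon" in refined_query:
        return (125.0, 50.0, 165.0, 90.0)
    return (600.0, 600.0, 640.0, 640.0)


candidates = [
    ("a.png", "click on the map icon for navigating", (120.0, 48.0, 168.0, 96.0)),
    ("b.png", "click on the clock icon for setting an alarm", (10.0, 10.0, 40.0, 40.0)),
]
kept = filter_triplets(candidates, toy_ground)
```

With the toy model, only the first candidate survives, because the second re-grounded box has its center outside the ground-truth box.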

Subsequently, this dataset is used to train the foundation MLLM on the query inference task prior to reasoning SFT. The query inference task itself can be formally represented as: $ \mathcal{Q} : \{ \langle s, c \rangle \} \to \{ q_r \} $ Here:

  • $\mathcal{Q}$ represents the query inference function or model.
  • $s$ is the input screenshot.
  • $c$ are the input coordinates (e.g., of an action target).
  • $q_r$ is the predicted refined user query corresponding to the action at $c$ on $s$.

By training the MLLM on this query inference task, the model's comprehension of user intention is enhanced, aligning it more closely with the needs of reasoning, while simultaneously maintaining its sensitivity to coordinates. This effectively bridges the gap between coordinate-oriented grounding and action-oriented reasoning.
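Because query inference reverses grounding, the same retained triplet $\langle s, q_r, c \rangle$ can be turned into a training example for either task by swapping which element is the input and which is the target. The sketch below illustrates this; the prompt wording, dictionary keys, and coordinate serialization are assumptions, not the paper's actual training format.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max)


def to_query_inference_example(screenshot: str, refined_query: str, coords: BBox) -> Dict[str, str]:
    """Q: <s, c> -> q_r. Coordinates go in, the intended query comes out."""
    prompt = f"What would a user intend by interacting with the element at {list(coords)}?"
    return {"image": screenshot, "prompt": prompt, "target": refined_query}


def to_grounding_example(screenshot: str, refined_query: str, coords: BBox) -> Dict[str, str]:
    """G: <s, q> -> c. The query goes in, coordinates come out."""
    prompt = f"Locate the element for: {refined_query}"
    return {"image": screenshot, "prompt": prompt, "target": str(list(coords))}


triplet = ("a.png", "click on the map icon for navigating", (120, 48, 168, 96))
qi_example = to_query_inference_example(*triplet)
g_example = to_grounding_example(*triplet)
```

The query inference target is natural language that states an intent, which is closer in form to the goal-conditioned text that reasoning SFT consumes, while the coordinates still appear in the input so the model keeps attending to them.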

5. Experimental Setup

5.1. Datasets

The experiments primarily use UIBERT for perception-enhanced pre-training and two mobile agent benchmarks, AndroidControl and AITZ, for reasoning supervised fine-tuning (SFT) and evaluation.

  • UIBERT (Bai et al., 2021):

    • Source & Characteristics: This dataset is selected as the grounding dataset for resource-constrained scenarios. It contains approximately 10,000 instances of grounding data. UIBERT is a subset of the larger OS-Atlas grounding dataset (which has over 13 million samples).

    • Usage: For fairness in comparison, samples from the original UIBERT dataset are extracted for grounding training data. The query inference dataset is constructed by refining UIBERT samples using the proposed three-step pipeline. The final refined query inference dataset consists of 9,570 triplets of s,qr,c \langle s, q_r, c \rangle .

    • Example Data Sample: An example of a triplet s,qr,c \langle s, q_r, c \rangle from the refined UIBERT dataset, along with the original query qq, is provided in Figure 4. This shows how a low-level query like "click on location icon" is refined into an intended query like "click on the map icon for navigating."

      Figure 5: The prompt template for query refinement. 该图像是论文中展示的查询细化提示模板示意图,描述了如何根据给定UI截图及动作生成查询,包括识别边界框位置、内容、语境相关性和任务意图,最终输出规范化查询。

    Figure 4: Example triplets $ \langle s, q_r, c \rangle $ from the refined UIBERT dataset, along with the original query $ q $. After refinement, the action intent has been inferred, such as "selecting the 24h format". By training on the triplets $ \langle s, q_r, c \rangle $ with intended queries, the comprehension of user intention would be enhanced to align with reasoning while maintaining sensitivity to the coordinates.

  • AndroidControl (Li et al., 2024):

    • Source & Characteristics: This is a mobile agent dataset comprising 15,283 demonstrations with step-wise instructions. It's collected from human raters performing various tasks across 833 different applications spanning 40 app categories on Android devices.
    • Scale: The training subset contains 89,144 step-wise samples.
    • Settings: Evaluated in two settings:
      • AndroidControl-L: Both low-level step instructions and high-level goals are provided as inputs.
      • AndroidControl-H: Only high-level goals are provided as inputs.
    • Usage: Used for SFT and evaluation of reasoning performance.
  • AITZ (Zhang et al., 2024c):

    • Source & Characteristics: This is a mobile agent dataset derived from a subset of AITW (Rawles et al., 2023) and annotated by proprietary MLLMs for Chain-of-Action-Thought (CoAT) components. It consists of 2,504 operation trajectories across 18,643 steps. It is categorized into five subsets: General, Install, GoogleApps, Single, and Web Shopping.

    • Scale: The training subset contains 13,919 step-wise samples.

    • Usage: Used for SFT and evaluation of reasoning performance, particularly for exploring CoT-enhanced reasoning.

      The action type distributions for the test subsets of AndroidControl and AITZ are given in Table 4 of the original paper:

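| Dataset | SCROLL | CLICK | TYPE | PRESS | WAIT | OPENAPP | COMPLETE | Others | Total |
|---|---|---|---|---|---|---|---|---|---|
| AndroidControl | 1,211 | 5,074 | 632 | 343 | 567 | 608 | 1,543 | 9 | 9,987 |
| AITZ | 601 | 2,736 | 500 | 265 | | | 504 | 118 | 4,724 |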

Note on Table 4: In the original table, the AITZ row shows placeholder symbols rather than counts in the WAIT and OPENAPP columns; those symbols are not reproduced here. They most plausibly mark action types that are not counted for AITZ, which would be consistent with the "/" entries for OPENAPP on AITZ in Table 2. The six AITZ counts that are listed sum to the stated total of 4,724.

The datasets were chosen to represent mobile agent benchmarks and cover resource-constrained scenarios (via UIBERT's small scale relative to OS-Atlas), making them effective for validating the proposed method's performance under these specific conditions.

5.2. Evaluation Metrics

The final action prediction accuracy is used to assess the impact of grounding and query inference on reasoning performance. Two commonly used metrics for GUI agents are adopted: Action Type Match Rate (TMR) and Exact Action Match Rate (AMR).

5.2.1. Action Type Match Rate (TMR)

  • Conceptual Definition: TMR measures the percentage of predicted actions where the action type (e.g., CLICK, TYPE, SCROLL) exactly matches the ground-truth action type. It assesses whether the model correctly identifies what kind of action needs to be performed.
  • Mathematical Formula: $ \mathrm{TMR} = \frac{\text{Number of actions with correctly predicted type}}{\text{Total number of actions}} \times 100\% $
  • Symbol Explanation:
    • Number of actions with correctly predicted type: The count of instances where the model's predicted action type is identical to the true action type.
    • Total number of actions: The total count of actions in the evaluation set.
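The TMR definition above can be sketched in a few lines. This is an illustrative implementation of the formula only; the function name and the list-of-strings input format are assumptions, not the paper's evaluation code.

```python
def type_match_rate(predicted_types, gold_types):
    """Percentage of steps whose predicted action type equals the gold type."""
    if len(predicted_types) != len(gold_types):
        raise ValueError("predictions and ground truth must have equal length")
    if not gold_types:
        raise ValueError("cannot compute TMR on an empty evaluation set")
    # Count steps where the action type string matches exactly.
    matches = sum(p == g for p, g in zip(predicted_types, gold_types))
    return 100.0 * matches / len(gold_types)


# Three of the four predicted types agree with the ground truth.
tmr = type_match_rate(
    ["CLICK", "TYPE", "SCROLL", "CLICK"],
    ["CLICK", "TYPE", "CLICK", "CLICK"],
)
```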

5.2.2. Exact Action Match Rate (AMR)

  • Conceptual Definition: AMR is a more stringent metric that evaluates whether the predicted action exactly matches the ground truth within a single step. It considers both the action type and all optional parameters (e.g., coordinates, typed text, app names). An action is considered an exact match only when both the action type and its parameters perfectly align with the ground truth. It is a more accurate measure of step-wise action prediction success.

  • Mathematical Formula: $ \mathrm{AMR} = \frac{\text{Number of actions with exactly predicted type and parameters}}{\text{Total number of actions}} \times 100\% $

  • Symbol Explanation:

    • Number of actions with exactly predicted type and parameters: The count of instances where both the model's predicted action type and all its associated parameters are identical to the true action type and parameters.
    • Total number of actions: The total count of actions in the evaluation set.
  • Calculation Variations for AMR (Parameter Handling): The calculation of AMR varies depending on the action type to account for different parameter requirements:

    • Actions without additional parameters (WAIT, COMPLETE, PRESS): For these actions, AMR is equivalent to TMR, as only the action type needs to be matched.
    • SCROLL actions: AMR evaluates both the action type (SCROLL) and the scroll direction (up, down, left, right) to ensure a perfect match.
    • Text-based actions (TYPE, OPENAPP): A rigorous examination is performed. The predicted action is an exact match only if both the action type and the corresponding text (e.g., typed content for TYPE, app name for OPENAPP) perfectly align with the ground truth.
    • CLICK actions: This is more complex and adapted from Wu et al. (2025):
      1. If both the predicted and ground-truth actions are CLICK, the corresponding screenshot layout information is used to locate the element bounding box that contains the ground-truth coordinates.
      2. If such a bounding box is found, it is checked whether the predicted coordinates fall within it. If they do, the CLICK action is deemed correctly predicted.
      3. If no bounding box is found (or the previous check fails), the relative distance between the predicted and ground-truth coordinates is computed. The CLICK action is considered correct if this relative distance is below 14% of the screen.
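The per-type matching rules above can be sketched as a single step-level check. This is a hedged illustration, not the paper's evaluation script: the dictionary keys (`type`, `direction`, `text`, `xy`), the box format `(left, top, right, bottom)`, and the choice to measure the fallback distance as Euclidean distance divided by the normalized screen extent of 1000 are all assumptions. The source only states that the relative distance must be below 14% of the screen.

```python
import math

PARAMETER_FREE = {"WAIT", "COMPLETE", "PRESS"}
TEXT_BASED = {"TYPE", "OPENAPP"}


def _inside(point, box):
    left, top, right, bottom = box
    return left <= point[0] <= right and top <= point[1] <= bottom


def click_matches(pred_xy, gold_xy, boxes, threshold=0.14, screen_extent=1000.0):
    """CLICK rule: bounding-box hit first, relative-distance fallback second."""
    for box in boxes:
        # Steps 1-2: a layout box that contains the gold point must also
        # contain the predicted point.
        if _inside(gold_xy, box) and _inside(pred_xy, box):
            return True
    # Step 3: no box found, or the box check failed. Assumed normalization:
    # Euclidean distance over coordinates in [0, 1000].
    return math.dist(pred_xy, gold_xy) / screen_extent < threshold


def action_matches(pred, gold, boxes=()):
    """Return True when the predicted step counts as an exact match."""
    if pred["type"] != gold["type"]:
        return False
    kind = gold["type"]
    if kind in PARAMETER_FREE:
        return True
    if kind == "SCROLL":
        return pred.get("direction") == gold.get("direction")
    if kind in TEXT_BASED:
        return pred.get("text") == gold.get("text")
    if kind == "CLICK":
        return click_matches(pred["xy"], gold["xy"], boxes)
    # Other action types: type-only matching (an assumption; the source
    # does not describe them).
    return True


def exact_match_rate(preds, golds, boxes_per_step):
    hits = sum(action_matches(p, g, b) for p, g, b in zip(preds, golds, boxes_per_step))
    return 100.0 * hits / len(golds)
```

Because the type check comes first, every step that counts toward AMR also counts toward TMR, so AMR can never exceed TMR on the same evaluation set.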

5.3. Baselines

The paper compares its query inference approach against several baselines and state-of-the-art MLLM configurations:

  • Qwen2-VL-7B-Instruct (Wang et al., 2024): This MLLM (dubbed as Qwen in the paper) serves as the foundation MLLM.

    • Direct SFT: The Qwen model is directly fine-tuned on the mobile agent benchmarks (AndroidControl, AITZ) without any perception-enhanced pre-training. This serves as a strong baseline to show the performance of reasoning without explicit grounding or query inference.
    • Grounding (G): The Qwen model is first trained for grounding on the UIBERT dataset (approx. 10,000 instances), followed by reasoning SFT on the mobile agent benchmarks. This evaluates the effectiveness of grounding with limited data.
    • Query Inference (Q): The Qwen model is first trained for query inference on the refined UIBERT dataset (9,570 instances), followed by reasoning SFT. This evaluates the proposed query inference as an alternative perception-enhanced task.
    • Grounding + Query Inference (G + Q): The Qwen model is trained on a combined dataset: half of the refined UIBERT dataset for grounding and the other half for query inference. This evaluates query inference as a pivot task alongside grounding.
  • OS-Atlas-Base-7B (Wu et al., 2025): This is a large-scale grounding-enhanced model (dubbed as Atlas in the paper).

    • Source: Atlas is pre-trained on over 13 million grounding samples, representing a state-of-the-art model with extensive grounding knowledge.

    • Usage: It is fine-tuned on the mobile agent benchmarks to establish a performance ceiling for large-scale grounding. The aim is to compare query inference's data efficiency against this large-scale model.

      These baselines allow for a comprehensive evaluation, comparing query inference against models without perception pre-training, models with limited-data grounding, and a strong large-scale grounding baseline.
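The G + Q configuration, in which half of the refined data is used for grounding and the other half for query inference, could be assembled along the following lines. This is a sketch under stated assumptions: the random split, the final shuffle, the fixed seed, and the record layout are illustrative choices that the source does not specify.

```python
import random


def build_pivot_mixture(triplets, seed=0):
    """Split (screenshot, refined_query, coords) triplets into two task halves.

    First half  -> grounding:        query  -> coordinates
    Second half -> query inference:  coords -> query
    """
    rng = random.Random(seed)
    shuffled = list(triplets)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    grounding = [
        {"task": "grounding", "screenshot": s, "input": q, "target": c}
        for s, q, c in shuffled[:half]
    ]
    query_inference = [
        {"task": "query_inference", "screenshot": s, "input": c, "target": q}
        for s, q, c in shuffled[half:]
    ]

    mixture = grounding + query_inference
    rng.shuffle(mixture)  # interleave the two tasks for training
    return mixture


toy_triplets = [(f"s{i}.png", f"query {i}", (i * 10, i * 20)) for i in range(10)]
mixture = build_pivot_mixture(toy_triplets)
```

Each screenshot appears in exactly one task, so the total number of pre-training samples stays the same as in the G-only and Q-only settings.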

5.4. Implementation Details

  • Coordinate Normalization: All coordinates are normalized to the range [0, 1000]. This standardization helps in consistent processing across different screen resolutions and GUI scales.
  • Action Space Unification: For reasoning SFT, the action space is unified into three basic actions (CLICK, TYPE, SCROLL), along with custom actions like OPENAPP for AndroidControl and AITZ. This provides a consistent set of interactions for the GUI agents.
  • Training Framework: The LLaMa-Factory (Zheng et al., 2024) framework is used for training on grounding, query inference, and SFT on mobile agent benchmarks. This ensures a standardized and efficient training environment.
  • Learning Rate: Uniformly set to $ 1 \times 10^{-5} $. This is a common learning rate for fine-tuning large models.
  • Training Epochs:
    • Grounding and query inference: 5 epochs.
    • SFT on reasoning benchmarks: 3 epochs.
  • Acceleration: During testing, flashattn (Dao, 2024) is adopted for acceleration, indicating a focus on efficient inference.
  • Hardware: All experiments are conducted on 4 × NVIDIA A100 GPUs, each with 80GB GPU memory.
  • Training Times:
    • Query inference and grounding training: approximately 2 hours.
    • Fine-tuning on AITZ: approximately 2 hours.
    • Fine-tuning on AndroidControl: approximately 14 hours.
  • Prompts: Detailed prompts for query refinement, grounding, query inference, and action prediction (for AndroidControl-L, AndroidControl-H, and AITZ with CoAT components) are provided in Appendix B of the paper. This transparency allows for reproducibility and deeper understanding of how MLLMs are guided.
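The coordinate normalization mentioned in the list above can be illustrated with a short helper. This is a minimal sketch: the function name, the rounding to integers, and the clamping are assumptions; the source only states that coordinates are normalized to the range [0, 1000].

```python
def normalize_point(x_px, y_px, width_px, height_px, scale=1000):
    """Map a pixel coordinate to the resolution-independent range [0, scale]."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("screen dimensions must be positive")
    nx = round(x_px / width_px * scale)
    ny = round(y_px / height_px * scale)
    # Clamp so that slightly out-of-bounds annotations stay in range.
    nx = min(max(nx, 0), scale)
    ny = min(max(ny, 0), scale)
    return nx, ny


# The centre of a 1080 x 2400 screen maps to the centre of the normalized grid.
centre = normalize_point(540, 1200, 1080, 2400)
```

After this mapping, the same element yields the same coordinates regardless of device resolution, which is what allows a single threshold to be used when comparing click positions.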

6. Results & Analysis

6.1. Core Results Analysis

The paper presents comprehensive results comparing grounding, query inference (as an alternative or pivot task), and large-scale grounding (Atlas) on mobile agent benchmarks. The main results on overall and type-wise action prediction performance are summarized in Table 2.

The following are the results from Table 2 of the original paper:

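| Dataset | Foundation Model | Approach | SCROLL TMR↑ | CLICK TMR↑ | CLICK AMR↑ | TYPE TMR↑ | TYPE AMR↑ | PRESS TMR↑ | OPENAPP TMR↑ | OPENAPP AMR↑ | TOTAL TMR↑ | TOTAL AMR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AndroidControl-L | Qwen | SFT | 91.49 | 97.26 | 75.07 | 98.55 | 88.95 | 97.96 | 99.84 | 83.55 | 96.84 | 84.33 |
| | Qwen | G | 91.25 | 97.42 | 76.01 | 96.99 | 77.69 | 97.67 | 99.34 | 85.86 | 96.85 | 83.88 |
| | Qwen | Q | 91.08 | 97.32 | 78.95 | 97.78 | 79.59 | 97.67 | 99.51 | 86.02 | 96.79 | 85.45 |
| | Qwen | G + Q | 91.08 | 96.49 | 78.87 | 97.31 | 79.91 | 97.08 | 99.67 | 88.16 | 96.48 | 85.70 |
| | Atlas | | 91.58 | 97.48 | 85.69 | 97.38 | 79.59 | 97.67 | 99.84 | 83.39 | 94.96 | 86.80 |
| AndroidControl-H | Qwen | SFT | 60.94 | 85.26 | 59.83 | 87.82 | 69.92 | 56.27 | 90.13 | 75.66 | 80.38 | 65.23 |
| | Qwen | G | 59.95 | 85.87 | 61.17 | 90.51 | 55.22 | 61.52 | 92.76 | 75.99 | 81.37 | 65.57 |
| | Qwen | Q | 57.64 | 87.31 | 63.11 | 71.77 | 54.11 | 58.69 | 91.78 | 77.14 | 81.68 | 66.11 |
| | Qwen | G + Q | 58.79 | 87.76 | 63.83 | 89.72 | 53.32 | 57.14 | 90.95 | 76.48 | 81.59 | 66.24 |
| | Atlas | | 61.85 | 85.28 | 65.43 | 91.77 | 55.70 | 67.93 | 94.74 | 82.24 | 81.78 | 68.65 |
| AITZ | Qwen | SFT | 59.73 | 81.40 | 63.23 | 86.40 | 48.60 | 71.32 | / | / | 75.76 | 61.43 |
| | Qwen | G | 60.39 | 86.51 | 66.88 | 86.80 | 50.40 | 73.58 | / | / | 81.58 | 63.48 |
| | Qwen | Q | 60.23 | 87.57 | 67.80 | 88.20 | 48.60 | 77.36 | / | / | 82.26 | 66.62 |
| | Qwen | G + Q | 63.06 | 87.54 | 67.65 | 87.80 | 48.80 | 78.49 | / | / | 82.54 | 66.91 |
| | Atlas | | 65.39 | 86.37 | 67.54 | 88.40 | 49.80 | 76.60 | / | / | 82.03 | 67.04 |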

Key Findings:

  1. Query Inference Outperforms Grounding with Same Data Scale:

    • On AndroidControl-L (low-level instructions), grounding (G) does not improve AMR over direct SFT: AMR drops slightly from 84.33% to 83.88%, while TMR is essentially unchanged (96.85% vs. 96.84%). Query inference (Q) as an alternative task achieves 85.45% AMR, 1.12 points above SFT and 1.57 points above grounding. The G + Q combination reaches 85.70% AMR.
    • On AndroidControl-H (high-level goals), grounding (G) yields 65.57% AMR, a small improvement over SFT (65.23%). Query inference (Q) reaches 66.11% AMR, and G + Q achieves 66.24% AMR, so both query inference variants outperform grounding.
    • On AITZ, grounding (G) improves AMR from 61.43% (SFT) to 63.48%. Query inference (Q) shows a more substantial jump to 66.62% AMR, and G + Q achieves 66.91% AMR.
    • This demonstrates that query inference is significantly more effective than traditional grounding when using a limited amount of training data, especially when reasoning requires understanding high-level goals.
  2. Query Inference as the Pivot Task Further Improves Reasoning:

    • Across all benchmarks (AndroidControl-L, AndroidControl-H, AITZ), the G + Q approach (query inference as a pivot task) generally achieves the optimal or near-optimal AMRs among the Qwen model configurations. For instance, on AndroidControl-L, G + Q (85.70% AMR) is slightly better than Q alone (85.45% AMR). On AndroidControl-H, G + Q (66.24% AMR) is also marginally better than Q (66.11% AMR). On AITZ, G + Q (66.91% AMR) is again slightly better than Q (66.62% AMR).
    • This suggests that integrating both grounding (for coordinate sensitivity) and query inference (for query comprehension and intent alignment) provides a synergistic effect, enhancing both coordinate understanding and user query interpretation, leading to superior reasoning performance.
  3. Query Inference Achieves Performance Comparable to Large-Scale Grounding (Atlas):

    • Atlas, pre-trained on over 13 million grounding samples, serves as a strong benchmark. On AndroidControl-L, Atlas achieves 86.80% AMR. G + Q with Qwen achieves 85.70% AMR, a gap of 1.10 percentage points.
    • On AndroidControl-H, Atlas reaches 68.65% AMR, while G + Q with Qwen achieves 66.24% AMR, a larger gap of 2.41 percentage points but still competitive given the data difference.
    • On AITZ, Atlas achieves 67.04% AMR, and G + Q with Qwen achieves 66.91% AMR, a minimal gap of only 0.13 percentage points.
    • Crucially, query inference achieves this performance with less than 0.1% of the training data used for Atlas (9,570 samples for UIBERT vs. 13 million for OS-Atlas). This highlights the exceptional data efficiency of query inference and its potential as a more effective approach in resource-constrained scenarios.
    • Furthermore, the TMR of G + Q on AndroidControl-L (96.48%) and AITZ (82.54%) even surpasses that of directly fine-tuning Atlas (94.96% and 82.03%, respectively), indicating robust action type prediction.
  4. Action Type-Specific Performance:

    • Query inference shows the most significant improvements for CLICK actions, yielding the best or second-best results among the Qwen-based configurations when adopted as an alternative or pivot task. This is expected, as CLICK actions often involve precise coordinate identification and understanding the intent behind clicking a particular element.
    • For other action types like SCROLL, PRESS, and OPENAPP, query inference generally demonstrates superior or comparable performance to SFT and grounding.
    • However, for TYPE actions, AMR experiences significant degradation across most models, including Atlas, compared to directly fine-tuning Qwen on mobile agent benchmarks. For example, on AndroidControl-L, SFT achieves 88.95% TYPE AMR, while G, Q, G + Q, and Atlas all show lower TYPE AMR (77.69%, 79.59%, 79.91%, and 79.59%, respectively). This degradation is attributed to vertical tuning on GUI scenarios, which might hinder the model's general instruction-following capability, especially for tasks like TYPE that involve generating text. Despite this, the overall AMR improvements from query inference are substantial.
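The data-efficiency and AMR-gap figures quoted in the findings above can be checked with a few lines of arithmetic. The numbers are taken from the text and Table 2; the variable names are illustrative, and 13,000,000 is used as a lower bound for "over 13 million", so the computed fraction is an upper bound.

```python
QUERY_INFERENCE_SAMPLES = 9_570
ATLAS_GROUNDING_SAMPLES = 13_000_000  # lower bound for "over 13 million"

# Share of the Atlas grounding data used by query inference, in percent.
fraction_pct = 100 * QUERY_INFERENCE_SAMPLES / ATLAS_GROUNDING_SAMPLES

# Total AMR (%) from Table 2: (G + Q with Qwen, Atlas).
total_amr = {
    "AndroidControl-L": (85.70, 86.80),
    "AndroidControl-H": (66.24, 68.65),
    "AITZ": (66.91, 67.04),
}

# Gap in percentage points between Atlas and G + Q on each benchmark.
gaps = {
    name: round(atlas - pivot, 2)
    for name, (pivot, atlas) in total_amr.items()
}
```

The fraction comes out at roughly 0.074%, which is consistent with the "less than 0.1%" claim, and the gaps are 1.10, 2.41, and 0.13 percentage points.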

6.2. Ablation Studies / Parameter Analysis

6.2.1. Influence of Training Data Scale

To understand the robustness of query inference under varying data constraints, the authors investigate its effectiveness across different training data scales (1,000, 2,000, and 5,000 samples) by randomly extracting subsets from the refined query inference dataset and subsequently fine-tuning on AITZ.

The results are presented in Figure 3.

Figure 3: The overall action prediction performance on AITZ when trained with grounding, query inference as the alternative task, and query inference as the pivot task across various data scales. The chart plots overall TMR and AMR on the AITZ dataset for the three training tasks (labeled 𝒢, 𝒬, and 𝒢+𝒬) at different pre-training data scales, comparing how each method's performance changes as the data volume grows.

Conclusions from Figure 3:

  1. Query Inference is Generally More Effective and Data-Efficient:

    • Grounding (G) performance increases gradually with more data, showing steady but slower improvements. It benefits from larger datasets.
    • Query inference (Q) exhibits a much faster rate of performance improvement. It reaches its peak performance with approximately 2,000 training samples and then plateaus. This highlights its efficiency with limited data, consistently outperforming grounding across all tested data scales (1k, 2k, 5k samples).
  2. Query Inference as Pivot Task (G + Q) Benefits from Larger Datasets:

    • When more than 5,000 training samples are utilized (i.e., at the 5k mark), query inference as the pivot task (G + Q) yields better performance than query inference as the alternative task (Q). This suggests that the combined approach (grounding + query inference) leverages the strengths of both tasks more effectively as data availability increases.
    • Conversely, with smaller datasets (1k, 2k samples), query inference as the alternative task (Q) performs slightly better than G + Q, indicating that a simpler query inference task might be more robust with very sparse data.
  3. Grounding is More Sensitive to Data Scale:

    • A significant performance increase is observed for grounding when training data exceeds 5,000 samples. This corroborates previous research (Wu et al., 2025; Qin et al., 2025) that grounding substantially benefits from large-scale training data, confirming its effectiveness only when sufficient resources are available. The current paper focuses on the limited data regime where query inference shines.
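The data-scale ablation relies on randomly drawn subsets of the refined dataset. A minimal sketch of such subsampling follows; the seed and the choice to draw each subset independently (rather than nesting the smaller subsets inside the larger ones) are assumptions, since the source only says the subsets are randomly extracted.

```python
import random


def sample_subsets(dataset, sizes=(1000, 2000, 5000), seed=0):
    """Draw one random subset per requested size, without replacement."""
    rng = random.Random(seed)
    return {n: rng.sample(dataset, n) for n in sizes}


# Stand-in for the 9,570 refined triplets: integer ids instead of real samples.
refined_dataset = list(range(9570))
subsets = sample_subsets(refined_dataset)
```

Each subset would then be used for pre-training (G, Q, or G + Q) before the same reasoning SFT on AITZ, so that only the pre-training data scale varies between runs.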

6.2.2. Combination with CoT-enhanced Reasoning

The paper explores the impact of Chain-of-Thought (CoT) components on 7B-level perception-enhanced MLLMs in resource-constrained scenarios, using the CoAT dataset AITZ. Four components are considered: screen descriptions (SD) and previous action results (PAR) as input semantic information, and action thoughts (AT) and next action descriptions (AD) as intermediate reasoning results in outputs. Experiments are grouped into four categories: (i) no CoAT, (ii) only input components, (iii) only output components, and (iv) both input and output components.

The overall and type-wise results are presented in Table 3. The following are the results from Table 3 of the original paper:

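Components are abbreviated as in the text: SD and PAR are input components, AT and AD are output components. The assignments for IDs 1-7 follow the findings below; IDs 8-10 combine input and output components.

| Pre-training | ID | Components | SCROLL TMR↑ | CLICK TMR↑ | CLICK AMR↑ | TYPE TMR↑ | TYPE AMR↑ | PRESS TMR↑ | TOTAL TMR↑ | TOTAL AMR↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| G | 1 | none | 60.39 | 86.51 | 66.88 | 86.80 | 48.60 | 73.58 | 81.58 | 63.48 |
| | 2 | SD | 60.40 | 85.96 | 66.56 | 88.80 | 49.00 | 73.96 | 81.22 | 65.77 |
| | 3 | PAR | 60.73 | 86.95 | 67.32 | 86.40 | 47.20 | 75.47 | 81.75 | 66.23 |
| | 4 | SD + PAR | 60.23 | 85.64 | 66.04 | 88.80 | 51.00 | 73.58 | 81.14 | 65.79 |
| | 5 | AT | 53.24 | 84.28 | 61.51 | 83.80 | 48.00 | 72.08 | 77.10 | 60.12 |
| | 6 | AD | 60.57 | 88.67 | 65.83 | 85.20 | 48.00 | 73.96 | 82.36 | 65.09 |
| | 7 | AT + AD | 50.75 | 72.33 | 52.12 | 80.60 | 45.00 | 69.81 | 69.60 | 54.13 |
| | 8 | input + output | 50.42 | 73.61 | 53.76 | 82.00 | 44.40 | 69.81 | 70.07 | 54.59 |
| | 9 | input + output | 50.92 | 72.40 | 52.56 | 82.40 | 46.40 | 70.57 | 70.00 | 54.59 |
| | 10 | input + output | 50.58 | 73.90 | 54.09 | 84.00 | 45.20 | 69.81 | 70.17 | 54.59 |
| Q | 1 | none | 60.23 | 87.57 | 67.80 | 88.20 | 48.60 | 77.36 | 82.26 | 66.62 |
| | 2 | SD | 61.73 | 87.61 | 68.46 | 88.80 | 49.40 | 76.98 | 82.77 | 66.62 |
| | 3 | PAR | 61.23 | 87.76 | 67.84 | 89.60 | 49.20 | 76.98 | 82.87 | 67.06 |
| | 4 | SD + PAR | 63.89 | 85.78 | 66.89 | 90.60 | 50.20 | 77.36 | 82.13 | 66.91 |
| | 5 | AT | 50.25 | 84.61 | 63.71 | 84.20 | 47.40 | 72.08 | 77.05 | 61.05 |
| | 6 | AD | 58.74 | 89.00 | 66.52 | 86.40 | 47.00 | 74.72 | 82.15 | 64.97 |
| | 7 | AT + AD | 49.42 | 73.65 | 53.11 | 81.60 | 45.40 | 72.83 | 70.58 | 54.85 |
| | 8 | input + output | 52.75 | 72.77 | 52.92 | 82.60 | 46.80 | 70.19 | 70.03 | 55.21 |
| | 9 | input + output | 51.41 | 73.21 | 53.33 | 82.40 | 46.40 | 73.21 | 70.53 | 54.74 |
| | 10 | input + output | 50.42 | 72.84 | 52.81 | 81.80 | 43.80 | 69.43 | 69.71 | 54.09 |
| G + Q | 1 | none | 63.06 | 87.54 | 67.65 | 87.80 | 48.80 | 78.49 | 82.54 | 66.91 |
| | 2 | SD | 61.73 | 87.61 | 67.98 | 89.00 | 50.20 | 75.85 | 82.35 | 67.27 |
| | 3 | PAR | 60.73 | 87.43 | 67.25 | 89.80 | 51.60 | 75.85 | 82.88 | 66.62 |
| | 4 | SD + PAR | 61.23 | 87.06 | 67.95 | 88.60 | 47.60 | 76.23 | 80.88 | 65.47 |
| | 5 | AT | 52.25 | 84.14 | 62.83 | 81.60 | 47.40 | 72.83 | 77.34 | 61.39 |
| | 6 | AD | 60.40 | 88.78 | 65.57 | 86.60 | 49.20 | 71.70 | 82.26 | 64.86 |
| | 7 | AT + AD | 52.91 | 72.04 | 52.12 | 82.20 | 48.20 | 75.47 | 70.17 | 55.03 |
| | 8 | input + output | 50.25 | 72.62 | 52.81 | 81.80 | 46.60 | 70.57 | 69.39 | 54.19 |
| | 9 | input + output | 50.42 | 72.95 | 54.02 | 83.80 | 49.00 | 71.70 | 70.41 | 55.75 |
| | 10 | input + output | 51.58 | 73.83 | 53.03 | 82.00 | 46.00 | 70.94 | 70.62 | 54.76 |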

Findings:

  1. Input Semantic Information Boosts Performance:

    • Incorporating additional semantic information into inputs (i.e., SD, PAR, or both) generally improves action prediction performance.
    • For grounding-enhanced models (G): Adding SD (ID 2) or PAR (ID 3) increases AMR from 63.48% (ID 1) to 65.77% and 66.23%, respectively. Combining both (SD + PAR, ID 4) also shows improvement (65.79%).
    • For query inference (Q): Adding SD (ID 2) leaves AMR unchanged at 66.62% while raising TMR from 82.26% to 82.77%, and adding PAR (ID 3) reaches 67.06% AMR. Combining both (SD + PAR, ID 4) yields 66.91%. Notably, Q with PAR (67.06%) and G + Q with SD (67.27%) surpass the Atlas baseline (67.04% from Table 2).
    • For query inference as pivot (G + Q): Adding SD (ID 2) increases AMR from 66.91% (ID 1) to 67.27%. Adding PAR (ID 3) gives 66.62% and combining both (ID 4) gives 65.47%, both slightly below the no-CoAT baseline.
    • This indicates that providing MLLMs with richer contextual information about the GUI environment (screen descriptions, previous action outcomes) significantly enhances their perception and leads to more accurate action decisions.
  2. Output CoT Components Degrade Performance:

    • Incorporating intermediate reasoning results into outputs (i.e., AT, AD, or both) generally leads to a significant degradation in action prediction performance.
    • For all three pre-training approaches (G, Q, G + Q), adding AT (ID 5) or AT + AD (ID 7) lowers AMR compared to the no-CoAT baseline (ID 1). Adding AD alone (ID 6) lowers AMR for Q and G + Q, but not for G (65.09% vs. 63.48%). For example, Q with AT (ID 5) drops AMR from 66.62% to 61.05%. With AT + AD (ID 7), AMR plummets to around 54-55% across all models.
    • The degradation persists when both input and output components are combined (IDs 8-10), with AMR remaining at roughly 54-56%.
    • The paper attributes this decline to the relatively small scale of the 7B model. It suggests that such models struggle to process complex reasoning effectively. When lengthy intermediate reasoning results are introduced into the output, the model may become overly focused on the reasoning chain itself rather than generating the correct final action. This implies a limitation in the reasoning capacity of smaller MLLMs to handle verbose CoT outputs.
  3. Query Inference Outperforms Grounding with CoT:

    • Within each group of CoAT configurations (same ID), query inference (either Q or G + Q) generally outperforms grounding (G). This holds true even when CoT components are integrated, further highlighting the effectiveness of query inference and its ability to robustly integrate with CoT-enhanced reasoning when only input information is leveraged.

      In summary, for resource-constrained 7B-level MLLMs, enriching the input with semantic information significantly boosts reasoning performance when combined with query inference. However, attempting to force these smaller models to generate verbose CoT outputs can be detrimental, indicating a need to tailor CoT strategies to model scale.
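The four CoAT configurations discussed above differ only in which components are placed in the model input and which are placed in the target output. The following sketch shows one way such examples might be assembled. The field names, the text layout, and the action string format are hypothetical; the paper's actual prompts are in its Appendix B.

```python
def build_coat_example(step, use_sd=False, use_par=False, use_at=False, use_ad=False):
    """Compose (input_text, target_text) for one step under a CoAT configuration.

    SD and PAR extend the input; AT and AD are emitted before the final action.
    """
    input_parts = [f"Goal: {step['goal']}"]
    if use_sd:
        input_parts.append(f"Screen description: {step['screen_description']}")
    if use_par:
        input_parts.append(f"Previous action result: {step['previous_action_result']}")

    target_parts = []
    if use_at:
        target_parts.append(f"Thought: {step['action_thought']}")
    if use_ad:
        target_parts.append(f"Next action: {step['action_description']}")
    target_parts.append(f"Action: {step['action']}")  # always the last line

    return "\n".join(input_parts), "\n".join(target_parts)


step = {
    "goal": "Set the clock to 24-hour format",
    "screen_description": "Settings page with a Date & time entry",
    "previous_action_result": "The Settings app was opened",
    "action_thought": "The time format option is under Date & time",
    "action_description": "Tap the Date & time entry",
    "action": "CLICK(500, 300)",
}

# Input-only configuration (SD + PAR): the target stays a bare action.
input_only = build_coat_example(step, use_sd=True, use_par=True)
# Output-only configuration (AT + AD): the target grows by two lines.
output_only = build_coat_example(step, use_at=True, use_ad=True)
```

In the input-only setting the model still has to generate just the action, which matches the configurations that helped in Table 3. In the output-only setting it has to generate the intermediate reasoning first, which is where the reported degradation appears.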

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively identifies and addresses a crucial challenge in the development of MLLM-powered GUI agents: the performance gap between coordinate-oriented grounding and action-oriented reasoning within resource-constrained scenarios. The authors propose query inference, a novel query-oriented pivot approach, as a solution. By training MLLMs to deduce intended user queries from given action coordinates and screenshots, query inference significantly enhances the model's comprehension of user intent while maintaining sensitivity to grounding coordinates.

The experimental results unequivocally demonstrate the superiority of query inference over traditional grounding techniques when operating under the same limited data scale. A particularly striking finding is that query inference achieves performance comparable to, or even better than, large-scale grounding-enhanced OS-Atlas (a model trained on millions of samples) using an astonishingly small fraction (less than 0.1%) of the training data. This underscores its exceptional data efficiency and practical utility for scenarios where extensive data collection and computational resources are prohibitive. Furthermore, the research shows that integrating additional semantic information (like screen descriptions and previous action results) into the input further boosts reasoning performance, providing another avenue for improvement in resource-constrained settings.

7.2. Limitations & Future Work

The authors acknowledge two primary limitations of their proposed approach:

  1. Reliance on SFT for Specific Benchmarks: The method focuses on enhancing perception for reasoning with a small-scale dataset. This design choice, while boosting performance in resource-constrained scenarios, may weaken the zero-shot capability of the MLLM. Consequently, the model requires supervised fine-tuning (SFT) on specific agent benchmarks to perform well, rather than being able to generalize immediately to unseen tasks. Future work could explore how to retain or improve zero-shot generalization while still benefiting from query inference's data efficiency.
  2. Scope Limited to Resource-Constrained Scenarios: The study explicitly focuses on resource-constrained scenarios. The authors caution that the observed results and conclusions might differ when large-scale training data is available. They reiterate that grounding has been proven effective in such data-rich settings. This implies that for scenarios without resource limitations, large-scale grounding might still be the preferred or equally effective approach. Future research could investigate the optimal synergy or transition point between query inference and large-scale grounding as data availability increases.

7.3. Personal Insights & Critique

This paper presents an elegant and highly practical solution to a critical problem in GUI agent development. The identification of the format discrepancy between coordinate-oriented grounding and action-oriented reasoning is a keen insight, and the query inference mechanism serves as a logically sound and empirically effective bridge.

Inspirations:

  • The "Reverse" Thinking: The idea of "reverse grounding" by inferring queries from coordinates is particularly inspiring. It forces the model to learn a deeper, more intentional understanding of GUI elements, which is a more natural fit for action-oriented tasks. This approach could be applied to other domains where low-level perception needs to be aligned with high-level intent (e.g., inferring user intent from gaze data or touch inputs in AR/VR).
  • Data Efficiency for Small Models: The astonishing data efficiency of query inference (achieving comparable performance to OS-Atlas with <0.1% of data) is a significant contribution. This makes advanced GUI agents accessible to researchers and developers with limited computational resources, democratizing access to powerful MLLM capabilities. This highlights that intelligently structured tasks can compensate for raw data volume.
  • Leveraging Proprietary MLLMs for Data Curation: The three-step pipeline for constructing query inference data (refinement, re-grounding, accuracy analysis) is a clever way to leverage the power of larger, more capable MLLMs (like Qwen-VL-Max) for high-quality data annotation and filtering. This method could be generalized to create high-quality training data for other complex MLLM tasks, reducing manual annotation effort and improving data quality.

Potential Issues/Critique:

  • "TYPE" Action Degradation: The consistent AMR degradation for TYPE actions across grounding, query inference, and Atlas models is a notable weakness. While the authors hypothesize vertical tuning harming instruction-following, it warrants deeper investigation. TYPE actions are fundamental for GUI agents, and a robust solution for this remains crucial. Perhaps query inference could be adapted to infer "what to type" in addition to "where to click."

  • CoT Output Issues: The finding that CoT outputs degrade performance for 7B-level MLLMs is a critical insight. It suggests that while CoT is powerful for very large models, smaller models might be overwhelmed by the verbosity, indicating a need for more concise or structured intermediate reasoning outputs, or perhaps a different CoT strategy (e.g., abstract thought tokens rather than full sentences). This highlights the importance of model-scale awareness when designing CoT prompts.

  • Generalizability of Refined Queries: The query refinement step uses a specific format: "click on the [element_name] for [purpose]". While effective, the generalizability of this specific format across diverse GUI types and user intents should be explored. Could some intents be poorly captured by this structure?

  • Ethical Considerations: While the ethics statement addresses privacy and system security, the potential for GUI agents to be exploited for harmful purposes (as mentioned in the paper) implies a continuous need for robust safety mechanisms. As query inference makes these agents more capable with less data, the risk surface potentially broadens if not mitigated by strong ethical guardrails.

    Overall, this paper provides a valuable contribution to the field, offering a data-efficient and conceptually sound approach to improve MLLM-powered GUI agents. Its findings on task format discrepancy and on CoT limitations for smaller models are particularly useful and will likely influence future research in GUI automation and multi-modal AI.
