WorldGUI Benchmark

Dynamic Testing for Comprehensive Desktop GUI Automation

Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou

National University of Singapore

What's new with WorldGUI Benchmark?

TL;DR: WorldGUI extends GUI evaluation from a static to a dynamic testing process, better reflecting the complex and changing nature of real GUI environments.

WorldGUI is an early effort to simulate the dynamism of real user-computer scenarios. As illustrated in the figure above, most GUI benchmarks focus only on initial and final states, measuring success rates while overlooking the changing initial conditions present in real GUI scenarios. These benchmarks often ignore situations where:
(1) The software interface is not in its default state.
(2) The agent might get user queries at any time.
(3) Agents with the same low success rate (e.g., 2%) may differ in their ability to self-verify or self-correct, but these abilities are not measured in a static setting.

Benchmark Overview

WorldGUI: The left side shows that for each task, WorldGUI provides a user query, an instructional video, and pre-actions; the pre-actions lead to different initial states. The key characteristic of WorldGUI is that the same task is tested from various initial states, simulating the real-world testing process. The right side shows the software included in the benchmark and the interactions involved in testing agents in our GUI environment.
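
To make this setup concrete, here is a minimal Python sketch of how one such task could be represented. The class name, field names, and the execute/prepare_initial_state helpers are our own illustrative assumptions, not the benchmark's released data format.

from dataclasses import dataclass, field

@dataclass
class WorldGUITask:
    """Illustrative task record; names are assumptions, not the actual schema."""
    user_query: str               # natural-language instruction for the agent
    instructional_video: str      # path to the demonstration video
    pre_actions: list[str] = field(default_factory=list)

def execute(action: str) -> None:
    """Placeholder for a GUI controller that replays a single action."""
    print(f"executing: {action}")

def prepare_initial_state(task: WorldGUITask) -> None:
    # Replaying the pre-actions moves the software away from its default
    # state, so the same query is evaluated from a varied initial state.
    for action in task.pre_actions:
        execute(action)

task = WorldGUITask(
    user_query="Make the title of slide 2 bold",
    instructional_video="videos/demo.mp4",
    pre_actions=["open slides.pptx", "go to slide 2"],
)
prepare_initial_state(task)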

Agent Overview

GUI-Thinker comprises five proposed components: Planner, Planner-Critic, Step-Check, Actor, and Actor-Critic. The Planner receives the user query and an instructional video as input and generates an initial plan, which the Planner-Critic refines before it is executed step by step. Before each step is passed to the Actor, it undergoes a Step-Check. After the Actor produces an action, the Actor-Critic iteratively verifies whether the action is complete and makes corrections if needed.
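
To make the control flow concrete, here is a minimal Python sketch of the loop described above, assuming duck-typed module objects. The method names (plan, refine, should_execute, act, verify, correct) and the retry bound are illustrative assumptions, not the actual GUI-Thinker API.

MAX_VERIFY_ITERS = 3  # assumed bound on Actor-Critic retries

def run_gui_thinker(query, video, planner, planner_critic,
                    step_check, actor, actor_critic):
    # Planner: draft an initial plan from the user query and the video.
    plan = planner.plan(query, video)
    # Planner-Critic: refine the draft plan before execution begins.
    plan = planner_critic.refine(plan)

    for step in plan:
        # Step-Check: skip steps already satisfied by the current GUI state.
        if not step_check.should_execute(step):
            continue
        # Actor: produce and perform an action for this step.
        action = actor.act(step)
        # Actor-Critic: iteratively verify completion and correct if needed.
        for _ in range(MAX_VERIFY_ITERS):
            if actor_critic.verify(step, action):
                break
            action = actor_critic.correct(step, action)

The key design point is that verification is interleaved with execution: every step is checked before the Actor runs and every action is re-checked after, rather than only grading the final state.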

Benchmark Comparison

Table 1: WorldGUI is a unique benchmark that provides various initial states for each task to simulate real-world agent-computer interactions.

Data Statistics

Table 2: This table lists all tasks, task activities, and project files of the desktop applications used in WorldGUI.

Figure 1: Distribution of collected tasks, selected queries, and task counts in WorldGUI. We gathered tasks across 10 desktop applications, focusing on productivity software as well as fundamental computer operations and settings.

A Successful Execution Example

An Example of Augmented Data Construction

Visualization of Parser Results

An Example of Planner-Critic

An Example of Step-Check

An Example of Actor-Critic

Visualization of Error Cases

Algorithm: GUI-Thinker Reasoning Loop

Citation

@misc{zhao2025worldguidynamictestingcomprehensive,
      title={WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation}, 
      author={Henry Hengyuan Zhao and Difei Gao and Mike Zheng Shou},
      year={2025},
      eprint={2502.08047},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.08047}, 
}

Acknowledgments: Thanks to Carlos & John for this webpage template. Thanks also to the SWE-bench team and their benchmark: https://www.swebench.com/multimodal.html.

Template Usage: If you would like to use this website template for your own leaderboard, please email Carlos & John to request permission. If granted, please make sure to acknowledge the SWE-bench team and link to this leaderboard on the home page of your website.