GUI Action Narrator:
Where and When Did That Action Take Place?

Qinchen Wu1,
Difei Gao1,
Kevin Qinghong Lin1,
Zhuoyu Wu2,

Xiangwu Guo1,
Peiran Li1,
Weichen Zhang1,
Hengxu Wang1,
Mike Zheng Shou1
1Show Lab, National University of Singapore, 2Chinese Academy of Sciences, Shenzhen

We introduce the GUI action dataset Act2Cap, along with GUI Narrator, an effective framework for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots.

Abstract

The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.
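As a rough illustration of the two-stage design described above (cursor detection, keyframe and key-region selection, then multimodal-LLM captioning), the following Python sketch shows one way such a pipeline could be wired together. It is a minimal, hypothetical example rather than the released implementation: the names CursorDetector, select_keyframes, crop_key_region, and caption_with_mllm are placeholders, and the detection and captioning bodies are stubs.

```python
# Hypothetical sketch of a cursor-as-visual-prompt captioning pipeline.
# Not the authors' released code; all names and heuristics are placeholders.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Keyframe:
    index: int        # position of the frame in the video
    cursor_box: Box   # detected cursor bounding box in that frame


class CursorDetector:
    """Stand-in for a cursor detector trained on Act2Cap-style annotations."""

    def detect(self, frame: np.ndarray) -> Box:
        # Dummy heuristic: place a fixed-size box at the brightest pixel.
        # A real detector would localize the actual cursor.
        y, x = np.unravel_index(int(np.argmax(frame.mean(axis=-1))), frame.shape[:2])
        return (max(x - 16, 0), max(y - 16, 0), x + 16, y + 16)


def select_keyframes(frames: List[np.ndarray], detector: CursorDetector,
                     motion_thresh: float = 20.0) -> List[Keyframe]:
    """Keep frames where the cursor jumps sharply, i.e. likely action boundaries."""
    keyframes, prev_center = [], None
    for i, frame in enumerate(frames):
        box = detector.detect(frame)
        center = ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)
        if prev_center is None or np.hypot(center[0] - prev_center[0],
                                           center[1] - prev_center[1]) > motion_thresh:
            keyframes.append(Keyframe(index=i, cursor_box=box))
        prev_center = center
    return keyframes


def crop_key_region(frame: np.ndarray, box: Box, margin: int = 112) -> np.ndarray:
    """Crop a high-resolution patch around the cursor to use as the visual prompt."""
    x1, y1, x2, y2 = box
    h, w = frame.shape[:2]
    return frame[max(y1 - margin, 0):min(y2 + margin, h),
                 max(x1 - margin, 0):min(x2 + margin, w)]


def caption_with_mllm(crops: List[np.ndarray]) -> str:
    # Placeholder for stage two: a multimodal LLM (fine-tuned or prompted)
    # that turns the cursor-centered crops into an action narration.
    return f"<caption generated from {len(crops)} cursor-centered crops>"


if __name__ == "__main__":
    video = [np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8) for _ in range(8)]
    detector = CursorDetector()
    keyframes = select_keyframes(video, detector)
    crops = [crop_key_region(video[kf.index], kf.cursor_box) for kf in keyframes]
    print(caption_with_mllm(crops))
```

In the actual framework, per the abstract, the detector is trained on the Act2Cap dataset and the captioner is a multimodal LLM that receives the selected keyframes and cursor-centered key regions as visual prompts.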

Main contributions

Our work emphasizes the following three aspects:

  • Dataset: Act2Cap contains 4K+ pairs of GUI videos (action frames) and captions, collected from GUI layouts including Word, Excel, PowerPoint, After Effects, Premiere Pro, and web pages, through both an automatic pipeline and human demonstration.
  • Benchmark: A metric for evaluating the quality of narrations generated by LLMs.
  • Model baseline: A two-stage model designed for narrating actions in GUIs.

