ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

Unifying discrete and continuous actions for free-form, drag-based computer use

Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou*

Show Lab, National University of Singapore
* Corresponding author

ShowUI-π teaser

ShowUI-π is a 450M flow-based vision-language-action model that treats GUI actions as continuous trajectories, generating smooth clicks and drags directly from screen observations. It unifies discrete and continuous actions, enabling precise drawing, rotation, sorting, and captcha solving without tokenized coordinates.

20K recorded drag trajectories
505 tasks across five domains
0.45B parameters in ShowUI-π

Abstract

Existing GUI agents predict discrete click coordinates, limiting them to short drags and preventing closed-loop, fine-grained control. ShowUI-π is a flow-based generative model that unifies clicks and continuous drags as sequences of cursor waypoints with mouse states. A lightweight action expert trained with flow matching outputs incremental cursor deltas conditioned on streaming visual observations, yielding stable trajectories for tasks such as drawing, rotation, and captcha solving. We also introduce ScreenDrag: 505 real drag tasks across PowerPoint, OS/file manager, Premiere Pro, captchas, and handwriting, plus 20K manually collected and synthesized trajectories with dense coordinates for training and evaluation. With only 450M parameters, ShowUI-π achieves 26.98% online success on ScreenDrag and the lowest trajectory error.
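
To make the action-expert idea concrete, below is a minimal flow-matching sketch in PyTorch. It is an illustration under assumed names and shapes (ActionExpert, CHUNK, ACT_DIM, OBS_DIM are ours, not the paper's): train a small velocity field on linear interpolants between Gaussian noise and recorded action chunks, then Euler-integrate it at inference to sample a chunk of incremental cursor deltas with a mouse state.

# Minimal flow-matching sketch, NOT the released ShowUI-π code.
# Assumed: CHUNK waypoints per action chunk, each (dx, dy, mouse_down);
# obs stands in for visual features from the VLA backbone.
import torch
import torch.nn as nn

CHUNK, ACT_DIM, OBS_DIM = 16, 3, 512

class ActionExpert(nn.Module):
    """Tiny velocity-field head v_theta(a_t, t | obs)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + CHUNK * ACT_DIM + 1, 256),
            nn.GELU(),
            nn.Linear(256, CHUNK * ACT_DIM),
        )

    def forward(self, obs, a_t, t):
        x = torch.cat([obs, a_t.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, CHUNK, ACT_DIM)

def flow_matching_loss(expert, obs, actions):
    # Regress the velocity (actions - a0) along the straight path a_t.
    a0 = torch.randn_like(actions)
    t = torch.rand(actions.size(0))
    a_t = (1 - t)[:, None, None] * a0 + t[:, None, None] * actions
    return ((expert(obs, a_t, t) - (actions - a0)) ** 2).mean()

@torch.no_grad()
def sample_chunk(expert, obs, steps=10):
    # Euler-integrate da/dt = v_theta from noise (t=0) to actions (t=1).
    a = torch.randn(obs.size(0), CHUNK, ACT_DIM)
    for i in range(steps):
        t = torch.full((obs.size(0),), i / steps)
        a = a + expert(obs, a, t) / steps
    return a

expert = ActionExpert()
print(sample_chunk(expert, torch.randn(1, OBS_DIM)).shape)  # (1, 16, 3)

In a closed loop, the agent would re-encode the screen after executing each chunk and sample the next one, which is what lets long drags stay on track under streaming observations.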

ShowUI-π architecture overview

ScreenDrag Benchmark

ScreenDrag targets continuous GUI manipulation with offline and online protocols. It covers 505 tasks (101 per domain) across PowerPoint, OS/file manager, Adobe Premiere Pro, captcha rotation, and handwriting. Training data mixes 20K human and synthetic trajectories with dense cursor coordinates, enabling models to learn long-horizon, smooth motion.
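
To make "dense cursor coordinates" concrete, here is a hypothetical trajectory record; the field names and layout are our assumptions, not the released format.

# Hypothetical ScreenDrag-style episode record (illustrative only).
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float          # cursor x in screen pixels
    y: float          # cursor y in screen pixels
    mouse_down: bool  # True while the button is held (dragging)
    t_ms: int         # milliseconds since episode start

@dataclass
class DragEpisode:
    domain: str                # e.g. "ppt", "os", "premiere", "captcha", "handwriting"
    instruction: str           # natural-language task description
    waypoints: list[Waypoint]  # dense cursor trace with mouse state

episode = DragEpisode(
    domain="captcha",
    instruction="rotate the dial until the image is upright",
    waypoints=[Waypoint(412.0, 305.0, True, 0), Waypoint(420.5, 298.0, True, 16)],
)
print(len(episode.waypoints))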

ScreenDrag data pipeline

ScreenDrag domains (inner ring) and task types (outer ring) with sample UI thumbnails.

Results

ShowUI-π surpasses the proprietary Gemini-2.5-CUA and the open-source OpenCUA-7B in online success on ScreenDrag while using far fewer parameters, and leads on the captcha and handwriting tasks that demand continuous actions and on-the-fly observations.

Proprietary models

Model            Params   OS      PPT     Premiere  Captcha  Handwriting  Overall
Operator         N/A      53.47    9.90    2.97      0.00      0.00       13.27
Seed-1.6-Vision  N/A      77.23    3.96    1.98      8.91      2.97       19.01
Gemini-2.5-CUA   N/A      86.14   20.79    0.00      3.96      0.00       22.18

Open-source models

Model            Params   OS      PPT     Premiere  Captcha  Handwriting  Overall
UI-TARS-1.5-7B   7B       73.27    1.98    1.98      7.92      0.00       17.03
OpenCUA-32B      32B      97.03    6.93    0.00      0.00      0.00       20.79
OpenCUA-7B       7B       99.01    4.95    0.00      5.94      0.00       21.98
Qwen3-VL-32B     32B      83.17    4.95    3.96      2.97      0.00       19.01
Qwen3-VL-8B      8B       24.75    5.94    6.93      0.00      0.00        7.52
Qwen3-VL-2B      2B       13.86    6.93    2.97      0.00      0.00        4.75
ShowUI-π-450M    0.45B    13.11   22.93    8.64     55.91     34.32       26.98

Online success rate (%) on ScreenDrag; ShowUI-π is the bottom row.
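
The abstract also reports the lowest trajectory error under ScreenDrag's offline protocol. The paper's exact definition is not reproduced here; one plausible formulation, offered purely as an illustrative assumption, is the mean L2 distance between predicted and reference traces after resampling both uniformly by arc length.

# Illustrative offline trajectory-error metric (an assumption, not the paper's).
import numpy as np

def resample(traj: np.ndarray, n: int = 64) -> np.ndarray:
    """Resample an (M, 2) polyline to n points spaced uniformly by arc length."""
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    u = np.linspace(0.0, s[-1], n)
    return np.stack([np.interp(u, s, traj[:, 0]), np.interp(u, s, traj[:, 1])], axis=1)

def trajectory_error(pred: np.ndarray, ref: np.ndarray) -> float:
    p, r = resample(pred), resample(ref)
    return float(np.linalg.norm(p - r, axis=1).mean())

pred = np.array([[0, 0], [5, 5], [10, 10]], dtype=float)
ref = np.array([[0, 0], [10, 10]], dtype=float)
print(trajectory_error(pred, ref))  # ~0: the traces coincide after resampling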

Visualization

Real ShowUI-π trajectories from the paper show continuous control across domains: rotation, captcha dialing, handwriting strokes, and video editing.

PowerPoint: rotation via a long circular drag.
OS desktop: drag-and-sort between folders.
Captcha: dial rotation with stable arcs.
Handwriting: strokes with minimal wobble.
Premiere Pro: drag-and-drop into the effect stack.

BibTeX

@misc{hu2025showuipi,
  title={ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands},
  author={Siyuan Hu and Kevin Qinghong Lin and Mike Zheng Shou},
  year={2025},
  note={arXiv preprint},
}