Unified discrete-continuous actions for free-form, drag-based computer use
Siyuan Hu†, Kevin Qinghong Lin†, Mike Zheng Shou*
Show Lab, National University of Singapore
† Equal contribution * Corresponding author
ShowUI-π is a 450M flow-based vision-language-action model that treats GUI actions as continuous trajectories, generating smooth clicks and drags directly from screen observations. It unifies discrete and continuous actions, enabling precise drawing, rotation, sorting, and captcha solving without tokenized coordinates.
Existing GUI agents predict discrete click coordinates, limiting them to short drags and preventing closed-loop, fine-grained control. ShowUI-π is a flow-based generative model that unifies clicks and continuous drags as sequences of cursor waypoints with mouse states. A lightweight action expert trained with flow matching outputs incremental cursor deltas conditioned on streaming visual observations, yielding stable trajectories for tasks such as drawing, rotation, and captcha solving. We also introduce ScreenDrag: 505 real drag tasks across PowerPoint, OS/file manager, Premiere Pro, captchas, and handwriting, plus 20K manually collected and synthesized trajectories with dense coordinates for training and evaluation. ShowUI-π achieves 26.98% online success and the lowest trajectory error with only 450M parameters.
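To make the action-expert idea concrete, here is a minimal sketch of conditional flow matching over a chunk of cursor deltas, assuming PyTorch. The MLP architecture, chunk length `H`, action dimension, and observation size are illustrative placeholders, not the paper's actual design.

```python
# Minimal flow-matching action expert for cursor control (illustrative sketch).
# ActionExpert, H, ACT_DIM, and obs_dim are assumptions, not the paper's spec.
import torch
import torch.nn as nn

H, ACT_DIM = 16, 3  # assumed waypoints per chunk: (dx, dy, mouse_down)

class ActionExpert(nn.Module):
    """Predicts the flow-matching velocity for a chunk of cursor deltas."""
    def __init__(self, obs_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + H * ACT_DIM + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, H * ACT_DIM),
        )

    def forward(self, obs, noisy_actions, t):
        # obs: (B, obs_dim) visual features; noisy_actions: (B, H, ACT_DIM); t: (B,)
        x = torch.cat([obs, noisy_actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, H, ACT_DIM)

def flow_matching_loss(model, obs, actions):
    """Standard conditional flow-matching objective: regress the straight-line
    velocity from a noise sample to the ground-truth action chunk."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], device=actions.device)
    x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * actions
    target_v = actions - noise
    return ((model(obs, x_t, t) - target_v) ** 2).mean()

@torch.no_grad()
def sample_actions(model, obs, steps=10):
    """Euler integration from noise to an action chunk at inference time."""
    x = torch.randn(obs.shape[0], H, ACT_DIM, device=obs.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0],), i * dt, device=obs.device)
        x = x + dt * model(obs, x, t)
    return x  # (B, H, ACT_DIM): incremental cursor deltas plus mouse state
```

In a closed-loop setup, the observation features would come from the vision-language backbone on each streamed frame, and the sampled chunk would be replayed as relative cursor motion before the screen is re-observed.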
ScreenDrag targets continuous GUI manipulation under both offline (trajectory error against reference paths) and online (live task success) evaluation protocols. It covers 505 tasks (101 per domain) across PowerPoint, OS/file manager, Adobe Premiere Pro, captcha rotation, and handwriting. The training split mixes 20K human-collected and synthesized trajectories with dense cursor coordinates, enabling models to learn long-horizon, smooth motion; a hypothetical record format is sketched below the figure.
ScreenDrag domains (inner ring) and task types (outer ring) with sample UI thumbnails.
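For illustration, here is a hypothetical ScreenDrag-style trajectory record together with a simple trajectory-error metric (mean pointwise L2 distance after resampling both paths uniformly in arc length). The field names and the exact metric are assumptions, not the benchmark's specification.

```python
# Hypothetical drag-trajectory record and trajectory-error metric (sketch only).
from dataclasses import dataclass

import numpy as np

@dataclass
class DragTrajectory:
    task: str               # e.g. "rotate the captcha dial upright" (illustrative)
    domain: str             # one of: os, ppt, premiere, captcha, handwriting
    points: np.ndarray      # (T, 2) dense cursor coordinates, normalized to [0, 1]
    mouse_down: np.ndarray  # (T,) button state at each waypoint

def resample(points: np.ndarray, n: int) -> np.ndarray:
    """Resample a polyline to n points, uniformly spaced in arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], n)
    x = np.interp(s_new, s, points[:, 0])
    y = np.interp(s_new, s, points[:, 1])
    return np.stack([x, y], axis=1)

def trajectory_error(pred: np.ndarray, gt: np.ndarray, n: int = 64) -> float:
    """Mean L2 distance between arc-length-aligned waypoints."""
    p, g = resample(pred, n), resample(gt, n)
    return float(np.linalg.norm(p - g, axis=1).mean())
```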
ShowUI-π surpasses the proprietary Gemini-2.5-CUA and the open-source OpenCUA-7B in ScreenDrag online success while using far fewer parameters, and leads on the captcha and handwriting tasks that require continuous actions and on-the-fly observations.
Proprietary models
| Model | Params | OS | PPT | Premiere | Captcha | Handwriting | Overall |
|---|---|---|---|---|---|---|---|
| Operator | N/A | 53.47 | 9.90 | 2.97 | 0.00 | 0.00 | 13.27 |
| Seed-1.6-Vision | N/A | 77.23 | 3.96 | 1.98 | 8.91 | 2.97 | 19.01 |
| Gemini-2.5-CUA | N/A | 86.14 | 20.79 | 0.00 | 3.96 | 0.00 | 22.18 |
Open-source models
| Model | Params | OS | PPT | Premiere | Captcha | Handwriting | Overall |
|---|---|---|---|---|---|---|---|
| UI-TARS-1.5-7B | 7B | 73.27 | 1.98 | 1.98 | 7.92 | 0.00 | 17.03 |
| OpenCUA-32B | 32B | 97.03 | 6.93 | 0.00 | 0.00 | 0.00 | 20.79 |
| OpenCUA-7B | 7B | 99.01 | 4.95 | 0.00 | 5.94 | 0.00 | 21.98 |
| Qwen3-VL-32B | 32B | 83.17 | 4.95 | 3.96 | 2.97 | 0.00 | 19.01 |
| Qwen3-VL-8B | 8B | 24.75 | 5.94 | 6.93 | 0.00 | 0.00 | 7.52 |
| Qwen3-VL-2B | 2B | 13.86 | 6.93 | 2.97 | 0.00 | 0.00 | 4.75 |
| **ShowUI-π-450M** | 0.45B | 13.11 | 22.93 | 8.64 | 55.91 | 34.32 | **26.98** |
Online success rates (%) on ScreenDrag; ShowUI-π is highlighted in bold.
Real ShowUI-π trajectories from the paper demonstrate continuous control across domains: rotation, captcha dialing, handwriting strokes, and video editing.
@misc{hu2025showuipi,
  title={ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands},
  author={Siyuan Hu and Kevin Qinghong Lin and Mike Zheng Shou},
  year={2025},
  note={arXiv preprint},
}