We introduce AUI-Gym, a benchmark of 52 applications with 1560 tasks.
Each screenshot compares the initial UI with the agent-optimized UI produced by each coder.
We evaluate our framework using Function Completeness (FC) and CUA Success Rate (SR). As summarized in the table below, our Coder-CUA collaboration improves both metrics over the baseline for every coder, with the strongest coders, GPT-5 and Gemini-3-Pro, reaching the highest absolute scores.
| Coder | Method | Func. Completeness (%) | CUA Success Rate (%) |
|---|---|---|---|
| GPT-5 | Baseline | 67.9 | 24.5 |
| | + Ours | 81.5 | 26.0 |
| Qwen3-Coder-30B | Baseline | 42.1 | 7.3 |
| | + Ours | 60.1 | 19.0 |
| GPT-4o | Baseline | 36.3 | 8.8 |
| | + Ours | 43.1 | 16.1 |
| Gemini-3-Pro | Baseline | 71.7 | 35.8 |
| | + Ours | 72.5 | 47.0 |
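To make the two aggregate metrics concrete, here is a minimal sketch of how FC and SR in the table above could be computed from per-task evaluation records. The record fields (`functions_required`, `functions_implemented`, `cua_succeeded`) and the micro-averaged aggregation are illustrative assumptions, not the benchmark's actual schema or official scoring script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Per-task evaluation record (illustrative fields, not the benchmark's actual schema)."""
    functions_required: int      # functions the task expects the generated UI to expose
    functions_implemented: int   # how many of those the coder's UI actually implements
    cua_succeeded: bool          # whether the computer-use agent completed the task on that UI


def function_completeness(results: List[TaskResult]) -> float:
    """Percentage of required functions implemented across all tasks (micro-average)."""
    required = sum(r.functions_required for r in results)
    implemented = sum(r.functions_implemented for r in results)
    return 100.0 * implemented / required if required else 0.0


def cua_success_rate(results: List[TaskResult]) -> float:
    """Percentage of tasks the computer-use agent completes end to end."""
    if not results:
        return 0.0
    return 100.0 * sum(r.cua_succeeded for r in results) / len(results)
```

Depending on the paper's exact definitions, FC may instead be macro-averaged per application; the sketch only illustrates the general shape of the computation.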
@misc{lin2025aui,
title={Computer-Use Agents as Judges for Generative User Interface},
author={Kevin Qinghong Lin and Siyuan Hu and Linjie Li and Zhengyuan Yang and Lijuan Wang and Philip Torr and Mike Zheng Shou},
year={2025},
eprint={2511.15567},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.15567},
}