FOCUSUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang1, Kevin Qinghong Lin2, Mike Zheng Shou1, Hwee Tou Ng1
1National University of Singapore 2University of Oxford
† Corresponding authors

TL;DR: FOCUSUI teaches VLMs where to look in UI screenshots.

Teaser: FOCUSUI selects instruction-relevant visual tokens while preserving positional continuity via POSPAD.
Comparison of vanilla UI grounding VLMs, VLMs with visual token pruning, and our FOCUSUI.

Key Motivation:
Modern user interfaces (UIs) are high-resolution and compositionally structured, with large homogeneous panes interspersed with small interactive widgets, yet humans naturally focus only on regions of interest when interacting with a UI.
Screen saliency focus: the examples below show how FOCUSUI views and prioritizes different regions of the interface.

Abstract

Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI.
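As a rough sanity check on that count (assuming a 2560×1440 screenshot and the 28×28-pixel effective patch size of a Qwen2.5-VL-style encoder after 2×2 patch merging; both figures are our illustrative assumptions, not values stated above), the number falls out of simple arithmetic:

```python
# Back-of-the-envelope visual token count for a 2K screenshot.
# Assumed: 2560x1440 pixels and 28x28-pixel effective patches
# (14x14 ViT patches + 2x2 merge, as in Qwen2.5-VL-style encoders).
width, height = 2560, 1440
patch = 28  # effective pixels per visual token along each axis

tokens = (width // patch) * (height // patch)
print(tokens)  # 4641 -- on the order of the ~4700 tokens quoted above
```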

In this work, we pioneer the task of efficient UI grounding. Guided by a practical analysis of the task's characteristics and challenges, we propose FOCUSUI, an efficient UI grounding framework that selects the patches most relevant to the instruction while preserving positional continuity for precise grounding.

(1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions, so that distinctive, instruction-relevant visual tokens are selected.

(2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer severe accuracy degradation on UI grounding tasks because they break positional information. We introduce POSPAD, a novel strategy that compresses each contiguous run of dropped visual tokens into a single special marker placed at the run's last index, preserving positional continuity.

Comprehensive experiments on four grounding benchmarks demonstrate that FOCUSUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FOCUSUI-7B improves over GUI-Actor-7B by 3.7%. Moreover, even with only 30% of visual tokens retained, FOCUSUI-7B drops by only 3.2% while achieving up to 1.44× faster inference and 17% lower peak GPU memory.

FOCUSUI: Overview of Efficient UI Grounding Framework

FOCUSUI overview: pipeline with Query-Guided selection and position-preserving transform.
Overview of our proposed FOCUSUI: (a) Illustration of how the Instruction-to-Patch saliency score is constructed. (b) Query-guided Saliency Scorer and token selection. (c) Overall UI grounding framework illustrating how POSPAD is applied to dropped sequences to preserve positional continuity.

Key Components: Instruction-to-Patch Saliency Score and POSPAD Transformation

Instruction-to-patch scoring.
Instruction-to-patch saliency scores for visual token selection: instruction-conditioned region highlights are fused with a rule-based UI graph that down-weights homogeneous regions.
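As an illustrative sketch of this fusion (not the paper's exact formulation: the min-max normalization, the additive mixing weight alpha, and the name fuse_saliency are our assumptions), the two per-patch scores can be combined as follows:

```python
import torch

def fuse_saliency(instr_score: torch.Tensor,
                  homogeneity: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Fuse an instruction-conditioned score with a rule-based UI-graph score.

    instr_score: (N,) per-patch relevance to the instruction, higher = more relevant.
    homogeneity: (N,) per-patch fraction of its UI-graph region that is visually
                 uniform; large flat panes score near 1 and get down-weighted.
    alpha:       illustrative mixing weight between the two terms.
    """
    instr = (instr_score - instr_score.min()) / (instr_score.max() - instr_score.min() + 1e-6)
    distinct = 1.0 - homogeneity  # down-weight large homogeneous regions
    return alpha * instr + (1.0 - alpha) * distinct
```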
POSPAD: position-preserving compression of dropped token runs.
POSPAD sequence transform: each contiguous sequence of dropped tokens is replaced with a single marker at the last index to preserve positional continuity.
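A minimal sketch of the POSPAD transform as described above, assuming a raster-order boolean keep mask over visual tokens (the marker id and function name are illustrative, not the released implementation):

```python
from typing import List, Tuple

PAD_MARKER = -1  # illustrative id for the special POSPAD marker token

def pospad(keep: List[bool]) -> List[Tuple[int, bool]]:
    """Return (position, is_real_token) pairs after POSPAD.

    Kept tokens retain their original positions; each contiguous run of
    dropped tokens is compressed into a single marker placed at the run's
    last index, so downstream position ids stay continuous.
    """
    out: List[Tuple[int, bool]] = []
    i, n = 0, len(keep)
    while i < n:
        if keep[i]:
            out.append((i, True))
            i += 1
        else:
            j = i
            while j + 1 < n and not keep[j + 1]:
                j += 1
            out.append((j, False))  # one marker at the last index of the dropped run
            i = j + 1
    return out

# Example: tokens 2-4 and 7 are dropped.
# -> [(0,True), (1,True), (4,False), (5,True), (6,True), (7,False), (8,True)]
print(pospad([True, True, False, False, False, True, True, False, True]))
```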

Main Results

We present performance comparisons on the ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision benchmarks. We test a series of retention ratios r ∈ {100%, 50%, 30%} to characterize degradation curves and compare against dense baselines that consume all visual tokens.
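For reference, a tiny sketch of how such a sweep could be wired up, assuming per-patch saliency scores are available and that the selection rule keeps the top ceil(r·N) patches (that rule is our assumption, not a confirmed detail):

```python
import math
import torch

def select_tokens(scores: torch.Tensor, ratio: float) -> torch.Tensor:
    """Keep the top ceil(ratio * N) patches by saliency; return a boolean keep mask."""
    n = scores.numel()
    k = max(1, math.ceil(ratio * n))
    keep = torch.zeros(n, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return keep

# Sweep the retention ratios used in the tables below.
scores = torch.rand(4700)  # dummy per-patch saliency, for illustration only
for r in (1.0, 0.5, 0.3):
    print(r, int(select_tokens(scores, r).sum()))
```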

Across all four benchmarks, FOCUSUI outperforms GUI-specific baselines of the same size even at 30–50% token retention, achieving state-of-the-art grounding performance.

FOCUSUI vs. general visual token pruning methods: FOCUSUI maintains accuracy under high token reduction, while general pruning methods suffer sharp performance drops (accuracy vs. visual token reduction on ScreenSpot-V2 and ScreenSpot-Pro).
Performance on ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision

ScreenSpot-V2

Model Mobile-Text Mobile-Icon Desktop-Text Desktop-Icon Web-Text Web-Icon Avg
Operator 47.3 41.5 90.2 80.3 92.8 84.3 70.5
OS-Atlas-7B 95.2 75.8 90.7 63.6 90.6 77.3 84.1
Aguvis-7B 95.5 77.3 95.4 77.9 91.0 72.4 86.0
Tong-UI-7B 93.1 81.5 96.4 82.9 90.2 84.7 88.7
UGround-V1-7B 95.0 83.3 95.0 77.8 92.1 77.2 87.6
UI-TARS-7B 96.9 89.1 95.4 85.0 93.6 85.2 91.6
UI-TARS-72B 94.8 86.3 91.2 87.9 91.5 87.7 90.3
UI-TARS-1.5-7B - - - - - - 90.0
Qwen2.5-VL-3B 93.4 73.5 88.1 58.6 88.0 71.4 80.9
Qwen2.5-VL-7B 97.6 87.2 90.2 74.2 93.2 81.3 88.8
Qwen2.5-VL-32B 97.9 88.2 98.5 79.3 91.2 86.2 91.3
GUI-Actor-3B 97.6 83.4 96.9 83.6 94.0 85.7 91.0
GUI-Actor-7B 97.6 88.2 96.9 85.7 93.2 86.7 92.1
Jedi-3B 96.6 81.5 96.9 78.6 88.5 83.7 88.6
Jedi-7B 96.9 87.2 95.9 87.9 94.4 84.2 91.7
FOCUSUI-3B (r=100%) 99.2 85.9 96.1 87.3 95.4 81.9 91.5
FOCUSUI-3B (r=50%) 98.8 86.9 95.0 87.3 95.4 81.9 91.4
FOCUSUI-3B (r=30%) 98.5 85.3 96.1 87.3 94.3 81.9 91.0
FOCUSUI-7B (r=100%) 98.8 91.6 95.6 92.1 95.0 84.4 93.1
FOCUSUI-7B (r=50%) 98.8 92.2 93.9 87.3 95.0 85.2 92.6
FOCUSUI-7B (r=30%) 98.8 90.1 93.3 85.7 93.9 85.2 91.8

ScreenSpot-Pro

Model Dev Creative CAD Scientific Office OS Avg-Text Avg-Icon Avg
Operator 35.1 39.6 16.1 43.7 53.0 32.7 45.0 23.0 36.6
OS-Atlas-7B 17.7 17.9 10.3 24.4 27.4 16.8 28.1 4.0 18.9
Aguvis-7B 16.1 21.4 13.8 34.6 34.3 19.4 - - 22.9
Tong-UI-7B 22.7 21.1 15.3 34.3 38.3 18.4 35.1 8.0 25.7
UGround-V1-7B 28.1 31.7 14.6 39.0 49.6 24.5 - - 31.1
UI-TARS-7B 36.1 32.8 18.0 50.0 53.5 24.5 47.8 16.2 35.7
UI-TARS-72B 40.8 39.6 17.2 45.7 54.8 30.1 50.9 17.5 38.1
UI-TARS-1.5-7B 31.8 40.2 31.8 47.2 65.6 33.2 - - 42.6
Qwen2.5-VL-3B 21.4 25.8 18.4 29.5 40.9 20.4 37.8 6.6 25.9
Qwen2.5-VL-7B 29.1 24.9 13.8 31.1 45.7 22.4 39.9 7.6 27.6
Qwen2.5-VL-32B 48.5 41.1 32.6 57.1 67.4 42.3 63.2 22.5 47.6
GUI-Actor-3B 39.8 36.7 34.1 49.6 61.3 35.2 - - 42.2
GUI-Actor-7B 38.1 41.4 38.3 50.8 63.0 38.8 - - 44.6
Jedi-3B 38.1 34.6 23.0 38.6 57.0 25.0 49.8 13.7 36.1
Jedi-7B 27.4 34.0 32.2 52.4 68.7 26.0 52.6 18.2 39.5
FOCUSUI-3B (r=100%) 43.1 37.0 37.6 48.4 61.7 38.3 59.3 18.9 43.8
FOCUSUI-3B (r=50%) 42.1 37.0 36.4 46.9 58.3 35.2 56.7 19.0 42.3
FOCUSUI-3B (r=30%) 38.1 35.8 33.3 44.5 57.8 37.2 55.0 17.4 40.6
FOCUSUI-7B (r=100%) 44.5 41.1 42.9 52.0 69.6 44.4 64.7 21.9 48.3
FOCUSUI-7B (r=50%) 42.8 40.5 40.2 51.6 67.0 40.3 61.7 21.9 46.5
FOCUSUI-7B (r=30%) 38.8 39.9 42.9 49.2 64.4 38.8 60.4 20.4 45.1

UI-Vision

Model Basic Functional Spatial Avg
Claude-3.7-Sonnet 9.48 7.73 7.60 8.27
ShowUI-2B 8.07 7.67 2.07 5.94
OSAtlas-7B 12.2 11.2 3.67 9.02
UGround-7B 11.5 12.2 2.79 8.83
UGround-V1-7B 15.4 17.1 6.25 12.9
Aguvis-7B 17.8 18.3 5.06 13.7
UI-TARS-7B 20.1 24.3 8.37 17.6
UI-TARS-72B 31.4 30.5 14.7 25.5
GUI-Actor-3B 27.4 24.6 7.0 19.3
GUI-Actor-7B 30.1 28.1 7.8 21.6
Jedi-3B 22.3 25.2 9.35 18.7
Jedi-7B 32.3 30.5 12.8 24.8
FOCUSUI-3B (r=100%) 30.0 26.9 8.7 21.5
FOCUSUI-3B (r=50%) 29.7 26.0 8.2 20.9
FOCUSUI-3B (r=30%) 29.1 26.4 7.6 20.6
FOCUSUI-7B (r=100%) 33.6 31.2 11.2 24.9
FOCUSUI-7B (r=50%) 32.5 31.0 11.3 24.5
FOCUSUI-7B (r=30%) 32.3 29.2 11.0 23.8
Performance comparison on UI-Vision.

OSWorld-G

Model Text Elem Layout Manip Refuse Avg
Gemini-2.5-Pro 59.8 45.5 49.0 33.6 38.9 45.2
Operator 51.3 42.4 46.6 31.5 0.0 40.6
UGround-V1-7B 51.3 40.3 43.5 24.8 0.0 36.4
Aguvis-7B 55.9 41.2 43.9 28.2 0.0 38.7
UI-TARS-7B 60.2 51.8 54.9 35.6 0.0 47.5
UI-TARS-1.5-7B 70.1 57.9 59.7 51.7 0.0 56.0
Qwen2.5-VL-3B 41.4 28.8 34.8 13.4 0.0 27.3
Qwen2.5-VL-7B 45.6 32.7 41.9 18.1 0.0 31.4
GUI-Actor-3B 60.5 56.1 58.5 32.2 0.0 50.5
GUI-Actor-7B 60.2 54.2 58.1 30.9 0.0 49.5
Jedi-3B 67.4 53.0 53.8 44.3 7.4 50.9
Jedi-7B 65.9 55.5 57.7 46.9 7.4 54.1
FOCUSUI-3B (r=100%) 65.9 57.6 59.7 37.6 0.0 53.4
FOCUSUI-3B (r=50%) 64.8 59.4 63.6 37.6 0.0 54.6
FOCUSUI-3B (r=30%) 62.5 56.7 62.9 33.6 0.0 51.8
FOCUSUI-7B (r=100%) 63.6 61.2 63.6 34.9 0.0 54.4
FOCUSUI-7B (r=50%) 64.0 62.1 63.6 31.5 0.0 54.1
FOCUSUI-7B (r=30%) 63.6 60.9 64.4 31.5 0.0 53.9
Performance comparison on OSWorld-G.

Qwen3-VL Backbone Models

Model ScreenSpot-V2 (Avg-Text / Avg-Icon / Avg) ScreenSpot-Pro (Avg-Text / Avg-Icon / Avg)
Qwen3-VL-2B† 94.7 78.9 87.8 52.8 16.7 39.0
FOCUSUI-2B (r=100%) 95.8 85.6 91.4 51.5 20.9 39.8
FOCUSUI-2B (r=50%) 95.7 85.0 91.0 52.5 20.9 40.4
FOCUSUI-2B (r=30%) 93.5 84.3 89.5 49.7 20.2 38.5
† Evaluated from the official HuggingFace model.
Comparison to General Visual Token Pruning Methods
Model + Pruning Method (Venue) %Ret. Ratio SS-V2 Avg SS-Pro Avg OSWorld-G Avg
Qwen2.5-VL-3B 100% 81.5 26.1 27.3
   + Fast-V (ECCV'24) 30% 38.6 (-52.7%) 4.8 (-81.6%) 14.4 (-47.4%)
   + HiPrune (arXiv'25) 30% 72.0 (-11.7%) 18.0 (-30.8%) 20.4 (-25.3%)
   + Vision-Zip (CVPR'25) 30% 75.4 (-7.5%) 18.9 (-27.4%) 23.0 (-15.6%)
Jedi-3B 100% 88.9 36.1 48.8
   + Fast-V (ECCV'24) 30% 51.0 (-42.6%) 14.1 (-60.9%) 23.9 (-51.0%)
   + HiPrune (arXiv'25) 30% 80.9 (-9.0%) 26.2 (-27.3%) 40.4 (-17.1%)
   + Vision-Zip (CVPR'25) 30% 82.8 (-6.9%) 28.8 (-20.3%) 41.5 (-14.9%)
FOCUSUI-3B 100% 91.5 43.8 53.4
   + Saliency Scorer w/ POSPAD 30% 91.0 (-0.5%) 40.6 (-7.3%) 51.8 (-3.0%)
Comparison against general visual token pruning methods at 30% retention. Numbers in parentheses show drop relative to the 100% baseline.
Efficiency Analysis

Efficiency analysis on the ScreenSpot-Pro benchmark under different retention ratios and model backbones of FOCUSUI.

FOCUSUI‑7B (Base: Qwen2.5‑VL, max_pixel=6400×28×28)
%Ret. #Vis. Token Time (sec) Max GPU Mem (MB) SS‑Pro Acc
100% 5319 1.75 (1.00×) 20994 (1.00×) 48.3
70% 3989 1.67 (1.05×) 18334 (0.87×) 47.7
50% 2659 1.49 (1.18×) 17944 (0.85×) 46.5
30% 1329 1.22 (1.44×) 17392 (0.83×) 45.1
FOCUSUI‑2B (Base: Qwen3‑VL, max_pixel=6000×32×32)
%Ret. #Vis. Token Time (sec) Max GPU Mem (MB) SS‑Pro Acc
100% 4627 0.97 (1.00×) 6278 (1.00×) 39.8
70% 3470 0.90 (1.08×) 6142 (0.98×) 40.1
50% 2313 0.85 (1.14×) 5680 (0.91×) 40.4
30% 1156 0.71 (1.37×) 5170 (0.82×) 38.5
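For readers reproducing this kind of measurement, the sketch below shows one way to record per-ratio latency and peak GPU memory with standard PyTorch utilities; the actual FOCUSUI inference call is left as a placeholder callable and is not part of the released code.

```python
import time
import torch

def profile_once(run_inference, device: str = "cuda"):
    """Time one grounding call and report peak GPU memory in MB.

    run_inference: zero-argument callable performing the forward pass
                   (placeholder for the actual FOCUSUI inference call).
    """
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    run_inference()
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return elapsed, peak_mb
```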

Predicted Per-Patch Saliency Scores for Structured UI Interfaces

Saliency maps and qualitative results: query-guided visual token selection examples.
Qualitative visualization of predicted saliency heatmaps and retained patches under a retention ratio r = 30%. Black regions denote dropped visual tokens that are not consumed by the LM during decoding. Examples are taken from the ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision benchmarks, spanning web, desktop, and mobile interfaces.

BibTeX

@article{ouyang2026focusui,
  title={FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection},
  author={Ouyang, Mingyu and Lin, Kevin Qinghong and Shou, Mike Zheng and Ng, Hwee Tou},
  journal={arXiv preprint arXiv:2601.03928},
  year={2026}
}