FOCUSUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang1, Kevin Qinghong Lin2, Mike Zheng Shou1, Hwee Tou Ng1
1National University of Singapore 2University of Oxford
† Corresponding authors

TL;DR: FOCUSUI teaches VLMs where to look in UI screenshots.

Teaser: FOCUSUI selects instruction-relevant visual tokens while preserving positional continuity via POSPAD.
Comparison of vanilla UI grounding VLMs, VLMs with visual token pruning, and our FOCUSUI.

Key Motivation:
Modern user interfaces (UIs) are high-resolution and compositionally structured, with large homogeneous panes interspersed with small interactive widgets, yet humans naturally focus only on regions of interest when interacting with a UI.
Screen saliency focus: the examples below show how FOCUSUI views and prioritizes different regions of the interface.

Abstract

Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI.
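As a rough sanity check on that count (assuming a 2560×1440 screenshot and the 28×28-pixel effective patch size of a Qwen2.5-VL-style encoder after 2×2 patch merging; both figures are our illustrative assumptions, not values stated above), the number falls out of simple arithmetic:

```python
# Back-of-the-envelope visual token count for a 2K screenshot.
# Assumed: 2560x1440 pixels and 28x28-pixel effective patches
# (14x14 ViT patches + 2x2 merge, as in Qwen2.5-VL-style encoders).
width, height = 2560, 1440
patch = 28  # effective pixels per visual token along each axis

tokens = (width // patch) * (height // patch)
print(tokens)  # 4641 -- on the order of the ~4700 tokens quoted above
```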

In this work, we pioneer the task of efficient UI grounding. Guided by a practical analysis of the task's characteristics and challenges, we propose FOCUSUI, an efficient UI grounding framework that selects the patches most relevant to the instruction while preserving positional continuity for precise grounding.

(1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions, so that distinctive, instruction-relevant visual tokens are selected.

(2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer severe accuracy degradation on UI grounding tasks because they break positional information. We introduce POSPAD, a novel strategy that compresses each contiguous run of dropped visual tokens into a single special marker placed at the run's last index, preserving positional continuity.

Comprehensive experiments on four grounding benchmarks demonstrate that FOCUSUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FOCUSUI-7B improves over GUI-Actor-7B by 3.7%. Moreover, even with only 30% of visual tokens retained, FOCUSUI-7B drops by only 3.2% while achieving up to 1.44× faster inference and 17% lower peak GPU memory.

FOCUSUI: Overview of Efficient UI Grounding Framework

FOCUSUI overview: pipeline with Query-Guided selection and position-preserving transform.
Overview of our proposed FOCUSUI: (a) Illustration of how the Instruction-to-Patch saliency score is constructed. (b) Query-guided Saliency Scorer and token selection. (c) Overall UI grounding framework illustrating how POSPAD is applied to dropped sequences to preserve positional continuity.

Key Components: Instruction-to-Patch Saliency Score and POSPAD Transformation

Instruction-to-patch scoring.
Instruction-to-patch saliency scores for visual token selection: instruction-conditioned region highlights are fused with a rule-based UI graph that down-weights homogeneous regions.
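As an illustrative sketch of this fusion (not the paper's exact formulation: the min-max normalization, the additive mixing weight alpha, and the name fuse_saliency are our assumptions), the two per-patch scores can be combined as follows:

```python
import torch

def fuse_saliency(instr_score: torch.Tensor,
                  homogeneity: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Fuse an instruction-conditioned score with a rule-based UI-graph score.

    instr_score: (N,) per-patch relevance to the instruction, higher = more relevant.
    homogeneity: (N,) per-patch fraction of its UI-graph region that is visually
                 uniform; large flat panes score near 1 and get down-weighted.
    alpha:       illustrative mixing weight between the two terms.
    """
    instr = (instr_score - instr_score.min()) / (instr_score.max() - instr_score.min() + 1e-6)
    distinct = 1.0 - homogeneity  # down-weight large homogeneous regions
    return alpha * instr + (1.0 - alpha) * distinct
```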
POSPAD: position-preserving compression of dropped token runs.
POSPAD sequence transform: each contiguous sequence of dropped tokens is replaced with a single marker at the last index to preserve positional continuity.
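A minimal sketch of the POSPAD transform as described above, assuming a raster-order boolean keep mask over visual tokens (the marker id and function name are illustrative, not the released implementation):

```python
from typing import List, Tuple

PAD_MARKER = -1  # illustrative id for the special POSPAD marker token

def pospad(keep: List[bool]) -> List[Tuple[int, bool]]:
    """Return (position, is_real_token) pairs after POSPAD.

    Kept tokens retain their original positions; each contiguous run of
    dropped tokens is compressed into a single marker placed at the run's
    last index, so downstream position ids stay continuous.
    """
    out: List[Tuple[int, bool]] = []
    i, n = 0, len(keep)
    while i < n:
        if keep[i]:
            out.append((i, True))
            i += 1
        else:
            j = i
            while j + 1 < n and not keep[j + 1]:
                j += 1
            out.append((j, False))  # one marker at the last index of the dropped run
            i = j + 1
    return out

# Example: tokens 2-4 and 7 are dropped.
# -> [(0,True), (1,True), (4,False), (5,True), (6,True), (7,False), (8,True)]
print(pospad([True, True, False, False, False, True, True, False, True]))
```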

Main Results

We present performance comparisons on the ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision benchmarks. We test a series of retention ratios r ∈ {100%, 50%, 30%} to characterize degradation curves and compare against dense baselines that consume all visual tokens.
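For reference, a tiny sketch of how such a sweep could be wired up, assuming per-patch saliency scores are available and that the selection rule keeps the top ceil(r·N) patches (that rule is our assumption, not a confirmed detail):

```python
import math
import torch

def select_tokens(scores: torch.Tensor, ratio: float) -> torch.Tensor:
    """Keep the top ceil(ratio * N) patches by saliency; return a boolean keep mask."""
    n = scores.numel()
    k = max(1, math.ceil(ratio * n))
    keep = torch.zeros(n, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return keep

# Sweep the retention ratios used in the tables below.
scores = torch.rand(4700)  # dummy per-patch saliency, for illustration only
for r in (1.0, 0.5, 0.3):
    print(r, int(select_tokens(scores, r).sum()))
```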

Across all four benchmarks, FOCUSUI outperforms GUI-specific baselines of the same size even at 30–50% token retention, achieving state-of-the-art grounding performance.

FOCUSUI vs. general visual token pruning methods: FOCUSUI maintains accuracy under high token reduction, while general pruning methods suffer sharp performance drops (accuracy vs. visual token reduction on ScreenSpot-V2 and ScreenSpot-Pro).
Performance on ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision

ScreenSpot-V2

Model Mobile-Text Mobile-Icon Desktop-Text Desktop-Icon Web-Text Web-Icon Avg
Operator 47.3 41.5 90.2 80.3 92.8 84.3 70.5
OS-Atlas-7B 95.2 75.8 90.7 63.6 90.6 77.3 84.1
Aguvis-7B 95.5 77.3 95.4 77.9 91.0 72.4 86.0
Tong-UI-7B 93.1 81.5 96.4 82.9 90.2 84.7 88.7
UGround-V1-7B 95.0 83.3 95.0 77.8 92.1 77.2 87.6
UI-TARS-7B 96.9 89.1 95.4 85.0 93.6 85.2 91.6
UI-TARS-72B 94.8 86.3 91.2 87.9 91.5 87.7 90.3
UI-TARS-1.5-7B - - - - - - 90.0
Qwen2.5-VL-3B 93.4 73.5 88.1 58.6 88.0 71.4 80.9
Qwen2.5-VL-7B 97.6 87.2 90.2 74.2 93.2 81.3 88.8
Qwen2.5-VL-32B 97.9 88.2 98.5 79.3 91.2 86.2 91.3
GUI-Actor-3B 97.6 83.4 96.9 83.6 94.0 85.7 91.0
GUI-Actor-7B 97.6 88.2 96.9 85.7 93.2 86.7 92.1
Jedi-3B 96.6 81.5 96.9 78.6 88.5 83.7 88.6
Jedi-7B 96.9 87.2 95.9 87.9 94.4 84.2 91.7
FOCUSUI-3B (r=100%) 99.2 85.9 96.1 87.3 95.4 81.9 91.5
FOCUSUI-3B (r=50%) 98.8 86.9 95.0 87.3 95.4 81.9 91.4
FOCUSUI-3B (r=30%) 98.5 85.3 96.1 87.3 94.3 81.9 91.0
FOCUSUI-7B (r=100%) 98.8 91.6 95.6 92.1 95.0 84.4 93.1
FOCUSUI-7B (r=50%) 98.8 92.2 93.9 87.3 95.0 85.2 92.6
FOCUSUI-7B (r=30%) 98.8 90.1 93.3 85.7 93.9 85.2 91.8

ScreenSpot-Pro

Model Dev Creative CAD Scientific Office OS Avg-Text Avg-Icon Avg
Operator 35.1 39.6 16.1 43.7 53.0 32.7 45.0 23.0 36.6
OS-Atlas-7B 17.7 17.9 10.3 24.4 27.4 16.8 28.1 4.0 18.9
Aguvis-7B 16.1 21.4 13.8 34.6 34.3 19.4 - - 22.9
Tong-UI-7B 22.7 21.1 15.3 34.3 38.3 18.4 35.1 8.0 25.7
UGround-V1-7B 28.1 31.7 14.6 39.0 49.6 24.5 - - 31.1
UI-TARS-7B 36.1 32.8 18.0 50.0 53.5 24.5 47.8 16.2 35.7
UI-TARS-72B 40.8 39.6 17.2 45.7 54.8 30.1 50.9 17.5 38.1
UI-TARS-1.5-7B 31.8 40.2 31.8 47.2 65.6 33.2 - - 42.6
Qwen2.5-VL-3B 21.4 25.8 18.4 29.5 40.9 20.4 37.8 6.6 25.9
Qwen2.5-VL-7B 29.1 24.9 13.8 31.1 45.7 22.4 39.9 7.6 27.6
Qwen2.5-VL-32B 48.5 41.1 32.6 57.1 67.4 42.3 63.2 22.5 47.6
GUI-Actor-3B 39.8 36.7 34.1 49.6 61.3 35.2 - - 42.2
GUI-Actor-7B 38.1 41.4 38.3 50.8 63.0 38.8 - - 44.6
Jedi-3B 38.1 34.6 23.0 38.6 57.0 25.0 49.8 13.7 36.1
Jedi-7B 27.4 34.0 32.2 52.4 68.7 26.0 52.6 18.2 39.5
FOCUSUI-3B (r=100%) 43.1 37.0 37.6 48.4 61.7 38.3 59.3 18.9 43.8
FOCUSUI-3B (r=50%) 42.1 37.0 36.4 46.9 58.3 35.2 56.7 19.0 42.3
FOCUSUI-3B (r=30%) 38.1 35.8 33.3 44.5 57.8 37.2 55.0 17.4 40.6
FOCUSUI-7B (r=100%) 44.5 41.1 42.9 52.0 69.6 44.4 64.7 21.9 48.3
FOCUSUI-7B (r=50%) 42.8 40.5 40.2 51.6 67.0 40.3 61.7 21.9 46.5
FOCUSUI-7B (r=30%) 38.8 39.9 42.9 49.2 64.4 38.8 60.4 20.4 45.1

UI-Vision

Model Basic Functional Spatial Avg
Claude-3.7-Sonnet 9.48 7.73 7.60 8.27
ShowUI-2B 8.07 7.67 2.07 5.94
OSAtlas-7B 12.2 11.2 3.67 9.02
UGround-7B 11.5 12.2 2.79 8.83
UGround-V1-7B 15.4 17.1 6.25 12.9
Aguvis-7B 17.8 18.3 5.06 13.7
UI-TARS-7B 20.1 24.3 8.37 17.6
UI-TARS-72B 31.4 30.5 14.7 25.5
GUI-Actor-3B 27.4 24.6 7.0 19.3
GUI-Actor-7B 30.1 28.1 7.8 21.6
Jedi-3B 22.3 25.2 9.35 18.7
Jedi-7B 32.3 30.5 12.8 24.8
FOCUSUI-3B (r=100%) 30.0 26.9 8.7 21.5
FOCUSUI-3B (r=50%) 29.7 26.0 8.2 20.9
FOCUSUI-3B (r=30%) 29.1 26.4 7.6 20.6
FOCUSUI-7B (r=100%) 33.6 31.2 11.2 24.9
FOCUSUI-7B (r=50%) 32.5 31.0 11.3 24.5
FOCUSUI-7B (r=30%) 32.3 29.2 11.0 23.8
Performance comparison on UI-Vision.

OSWorld-G

Model Text Elem Layout Manip Refuse Avg
Gemini-2.5-Pro 59.8 45.5 49.0 33.6 38.9 45.2
Operator 51.3 42.4 46.6 31.5 0.0 40.6
UGround-V1-7B 51.3 40.3 43.5 24.8 0.0 36.4
Aguvis-7B 55.9 41.2 43.9 28.2 0.0 38.7
UI-TARS-7B 60.2 51.8 54.9 35.6 0.0 47.5
UI-TARS-1.5-7B 70.1 57.9 59.7 51.7 0.0 56.0
Qwen2.5-VL-3B 41.4 28.8 34.8 13.4 0.0 27.3
Qwen2.5-VL-7B 45.6 32.7 41.9 18.1 0.0 31.4
GUI-Actor-3B 60.5 56.1 58.5 32.2 0.0 50.5
GUI-Actor-7B 60.2 54.2 58.1 30.9 0.0 49.5
Jedi-3B 67.4 53.0 53.8 44.3 7.4 50.9
Jedi-7B 65.9 55.5 57.7 46.9 7.4 54.1
FOCUSUI-3B (r=100%) 65.9 57.6 59.7 37.6 0.0 53.4
FOCUSUI-3B (r=50%) 64.8 59.4 63.6 37.6 0.0 54.6
FOCUSUI-3B (r=30%) 62.5 56.7 62.9 33.6 0.0 51.8
FOCUSUI-7B (r=100%) 63.6 61.2 63.6 34.9 0.0 54.4
FOCUSUI-7B (r=50%) 64.0 62.1 63.6 31.5 0.0 54.1
FOCUSUI-7B (r=30%) 63.6 60.9 64.4 31.5 0.0 53.9
Performance comparison on OSWorld-G.

Qwen3-VL Backbone Models

Model ScreenSpot-V2 (Avg-Text / Avg-Icon / Avg) ScreenSpot-Pro (Avg-Text / Avg-Icon / Avg)
Qwen3-VL-2B† 94.7 78.9 87.8 52.8 16.7 39.0
FOCUSUI-2B (r=100%) 95.8 85.6 91.4 51.5 20.9 39.8
FOCUSUI-2B (r=50%) 95.7 85.0 91.0 52.5 20.9 40.4
FOCUSUI-2B (r=30%) 93.5 84.3 89.5 49.7 20.2 38.5
† Evaluated from the official HuggingFace model.
Comparison to General Visual Token Pruning Methods
Model + Pruning Method (Venue) %Ret. Ratio SS-V2 Avg SS-Pro Avg OSWorld-G Avg
Qwen2.5-VL-3B 100% 81.5 26.1 27.3
   + Fast-V (ECCV'24) 30% 38.6 (-52.7%) 4.8 (-81.6%) 14.4 (-47.4%)
   + HiPrune (arXiv'25) 30% 72.0 (-11.7%) 18.0 (-30.8%) 20.4 (-25.3%)
   + Vision-Zip (CVPR'25) 30% 75.4 (-7.5%) 18.9 (-27.4%) 23.0 (-15.6%)
Jedi-3B 100% 88.9 36.1 48.8
   + Fast-V (ECCV'24) 30% 51.0 (-42.6%) 14.1 (-60.9%) 23.9 (-51.0%)
   + HiPrune (arXiv'25) 30% 80.9 (-9.0%) 26.2 (-27.3%) 40.4 (-17.1%)
   + Vision-Zip (CVPR'25) 30% 82.8 (-6.9%) 28.8 (-20.3%) 41.5 (-14.9%)
FOCUSUI-3B 100% 91.5 43.8 53.4
   + Saliency Scorer w/ POSPAD 30% 91.0 (-0.5%) 40.6 (-7.3%) 51.8 (-3.0%)
Comparison against general visual token pruning methods at 30% retention. Numbers in parentheses show drop relative to the 100% baseline.
Efficiency Analysis

Efficiency analysis on the ScreenSpot-Pro benchmark under different retention ratios and model backbones of FOCUSUI.

FOCUSUI‑7B (Base: Qwen2.5‑VL, max_pixel=6400×28×28)
%Ret. #Vis. Token Time (sec) Max GPU Mem (MB) SS‑Pro Acc
100% 5319 1.75 (1.00×) 20994 (1.00×) 48.3
70% 3989 1.67 (1.05×) 18334 (0.87×) 47.7
50% 2659 1.49 (1.18×) 17944 (0.85×) 46.5
30% 1329 1.22 (1.44×) 17392 (0.83×) 45.1
FOCUSUI‑2B (Base: Qwen3‑VL, max_pixel=6000×32×32)
%Ret. #Vis. Token Time (sec) Max GPU Mem (MB) SS‑Pro Acc
100% 4627 0.97 (1.00×) 6278 (1.00×) 39.8
70% 3470 0.90 (1.08×) 6142 (0.98×) 40.1
50% 2313 0.85 (1.14×) 5680 (0.91×) 40.4
30% 1156 0.71 (1.37×) 5170 (0.82×) 38.5
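For readers reproducing this kind of measurement, the sketch below shows one way to record per-ratio latency and peak GPU memory with standard PyTorch utilities; the actual FOCUSUI inference call is left as a placeholder callable and is not part of the released code.

```python
import time
import torch

def profile_once(run_inference, device: str = "cuda"):
    """Time one grounding call and report peak GPU memory in MB.

    run_inference: zero-argument callable performing the forward pass
                   (placeholder for the actual FOCUSUI inference call).
    """
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    run_inference()
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return elapsed, peak_mb
```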

Predicted Per-Patch Saliency Scores for Structured UI Interfaces

Saliency maps and qualitative results: query-guided visual token selection examples.
Qualitative visualization of predicted saliency heatmaps and retained patches under a retention ratio r = 30%. Black regions denote dropped visual tokens that are not consumed by the LM during decoding. Examples are taken from the ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, and UI-Vision benchmarks, spanning web, desktop, and mobile interfaces.

BibTeX

@article{ouyang2026focusui,
  title={FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection},
  author={Ouyang, Mingyu and Lin, Kevin Qinghong and Shou, Mike Zheng and Ng, Hwee Tou},
  journal={arXiv preprint arXiv:2601.03928},
  year={2026}
}