ShowUI-Aloha — Human-Taught Computer-Use Agent

Project Video

A short overview video showcasing how ShowUI-Aloha learns from human demonstrations and executes new task variants on real desktops.

What is ShowUI-Aloha?

ShowUI-Aloha is a human-taught computer-use agent designed for real Windows and macOS desktops. Instead of relying purely on prompts, Aloha learns directly from human demonstrations: it records the screen, mouse, and keyboard while a human completes a task, then distills the demonstration into a semantic action trace.

Aloha learns through abstraction, not memorization. A single demonstration can generalize to an entire family of related tasks — such as booking different flights, editing new spreadsheets, or modifying other slide decks — as long as they share the same workflow structure.

Records human demonstrations (screen + mouse + keyboard)
Learns semantic action traces from demonstrations
Plans new tasks by reusing the learned workflow
Executes robustly with OS-level clicks, drags, typing, scrolling, and hotkeys

Aloha 4-step teaching and execution pipeline

Figure: Aloha learns from human demonstrations and reuses the abstracted trace to execute new task variants.

Comparisons with Commercial Agents

We compare ShowUI-Aloha with strong commercial agents on realistic multi-step workflows. While these agents often struggle with long-horizon UI interaction, ambiguous states, or recovering from partial progress, Aloha can leverage a single human demonstration to remain grounded and consistent.

Case 1 — GitHub Repository Update

The commercial agent cannot infer that the repository lives under Documents/GitHub.
It falls into repeated path-search loops and opens incorrect folders.
Aloha reuses the human-taught workflow to navigate to the correct directory and complete the update.

Case 2 — PowerPoint Background Color Editing

The commercial agent selects the wrong ribbon icon and applies the wrong background color (orange instead of yellow).
Small UI ambiguities in the toolbar cause cascading errors it cannot recover from.
Aloha imitates the demonstrated selection and applies the correct yellow background reliably.

Case 3 — Excel Matrix Transpose Challenge

The commercial agent fails to locate the “Transpose” option inside Excel’s paste menu.
It gets stuck exploring menus and never completes the matrix transpose.
Aloha reproduces the human-taught sequence and executes a clean transpose of the matrix.

System & Architecture

ShowUI-Aloha is built as a modular pipeline that cleanly separates data collection, learning, planning, and execution. This design makes the system easy to extend and adapt to different desktops and agents.

Figure: Overall architecture of ShowUI-Aloha.

Recorder

The Recorder captures human demonstrations on real Windows and macOS desktops, logging screenshots, mouse trajectories, button presses, and keystrokes into a project folder.

Learner

The Learner parses raw logs into semantic action traces, grouping low-level events into high-level operations such as “open browser”, “fill in form”, “resize window”, or “save edited slide”.
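
As a rough illustration of that grouping step, the sketch below merges consecutive key presses into a single "type" action and passes pointer events through as clicks. The `RawEvent` fields and the `group_events` helper are hypothetical and are not taken from the Aloha codebase; the actual Learner may use a different log schema and a model-based abstraction rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class RawEvent:
    # Hypothetical low-level log record; the real Recorder schema may differ.
    kind: str       # "key" or "click"
    key: str = ""   # character typed, used by "key" events
    x: int = 0      # pointer position, used by "click" events
    y: int = 0


def group_events(events):
    """Merge runs of key events into one 'type' action; emit other events as clicks."""
    actions = []
    buffer = []  # characters of the key run currently being collected

    def flush():
        # Close the current key run, if any, as a single typing action.
        if buffer:
            actions.append({"action": "type", "text": "".join(buffer)})
            buffer.clear()

    for ev in events:
        if ev.kind == "key":
            buffer.append(ev.key)
        else:
            flush()
            actions.append({"action": "click", "x": ev.x, "y": ev.y})
    flush()
    return actions


demo = [
    RawEvent("click", x=120, y=48),
    RawEvent("key", key="h"),
    RawEvent("key", key="i"),
    RawEvent("click", x=300, y=200),
]
print(group_events(demo))
```

The two key presses collapse into one typing action between the two clicks, which is the kind of step a semantic trace would then label at a higher level.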

Planner

Given a new natural language task, the Planner uses the human-taught trace as in-context guidance, deciding how to reuse, skip, or adapt steps from the demonstration for the new goal.
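
One way to picture "trace as in-context guidance" is a prompt that lists the demonstrated steps and then states the new goal. The sketch below is only an assumption about how such a prompt could be assembled; `build_planner_prompt` and the prompt wording are hypothetical, and the real Planner prompt format is not described here.

```python
def build_planner_prompt(task, trace_steps):
    """Format a demonstration trace as in-context guidance for a new task."""
    lines = ["Demonstrated workflow:"]
    # Number the demonstrated steps so a plan can refer back to them.
    for i, step in enumerate(trace_steps, start=1):
        lines.append(f"{i}. {step}")
    lines.append("")
    lines.append(f"New task: {task}")
    lines.append("For each demonstrated step, decide whether to reuse, skip, or adapt it.")
    return "\n".join(lines)


prompt = build_planner_prompt(
    "Book a flight to Tokyo",
    ["Open the browser", "Fill in the form", "Confirm the booking"],
)
print(prompt)
```

The resulting string would be sent to whichever vision-language model the Planner is configured to use.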

Actor & Executor

Finally, the Actor and Executor ground the plan in the actual UI: they carry out OS-level clicks, drag-and-drop operations, scrolling, and typing, while monitoring visual feedback to keep the agent on track.
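
The execution side can be thought of as a dispatch loop that maps each planned step to an OS-level primitive. The sketch below assumes a hypothetical backend interface (`click`, `type_text`, `scroll`) and uses a logging stand-in instead of real input control; it leaves out the visual-feedback monitoring described above and is not the Aloha implementation.

```python
class LoggingBackend:
    """Stand-in backend that records calls instead of touching the OS."""

    def __init__(self):
        self.calls = []

    def click(self, x, y):
        self.calls.append(("click", x, y))

    def type_text(self, text):
        self.calls.append(("type", text))

    def scroll(self, amount):
        self.calls.append(("scroll", amount))


def execute_plan(plan, backend):
    """Run each planned step through the matching backend primitive."""
    handlers = {
        "click": lambda s: backend.click(s["x"], s["y"]),
        "type": lambda s: backend.type_text(s["text"]),
        "scroll": lambda s: backend.scroll(s["amount"]),
    }
    for step in plan:
        handler = handlers.get(step["action"])
        if handler is None:
            # Fail loudly on steps this sketch does not know how to run.
            raise ValueError(f"unsupported action: {step['action']}")
        handler(step)


backend = LoggingBackend()
execute_plan(
    [
        {"action": "click", "x": 10, "y": 20},
        {"action": "type", "text": "hello"},
        {"action": "scroll", "amount": -3},
    ],
    backend,
)
print(backend.calls)
```

Swapping the stand-in for a backend that drives the real mouse and keyboard would leave `execute_plan` unchanged, which is the point of keeping planning and execution separate.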

Demo Gallery

A single demonstration teaches Aloha a workflow, which can then be reused to solve new instances of the same task family. Below are a few representative demos.

Air-ticket booking

End-to-end flight booking with form filling, date picking, and confirmation screens.

Excel: matrix transpose

Spreadsheet manipulation including range selection, copy-paste, and formula application.

PowerPoint batch background editing

Bulk editing of slide backgrounds with consistent visual style across the deck.

GitHub repository editing

Editing and updating repository files directly from the desktop without manual repetition.

OSWorld Benchmark

We evaluate ShowUI-Aloha on the full OSWorld benchmark of 361 realistic computer-use tasks spanning web, office, multimedia, and system operations. Aloha solves 217 tasks end-to-end, achieving a strict success rate of 60.1% and significantly outperforming existing baselines, especially on longer workflows.
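
The reported rate follows directly from the two counts above, as this quick check shows:

```python
solved, total = 217, 361  # counts stated in the text above
rate = solved / total
print(f"{rate:.1%}")  # prints 60.1%
```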

Figure: Category-wise success rates across the 361 OSWorld tasks.

Figure: Comparison against open and commercial agents on OSWorld.

Getting Started

The full installation and usage instructions are available in the GitHub README. Here is a high-level overview of a typical end-to-end run with ShowUI-Aloha.

Install Aloha. Clone the repository, create a virtual environment, and install dependencies: pip install -r requirements.txt.
Record a demonstration. Launch the Recorder (Windows or macOS binary from Releases), perform your workflow, and save the project under Aloha_Learn/projects/<project_name>/.
Parse into a trace. Run the parser to convert the raw recording into a semantic trace: python Aloha_Learn/parser.py <project_name>, which produces Aloha_Learn/projects/<project_name>_trace.json.
Execute via Actor & Executor. Place the trace into Aloha_Act/trace_data/ and call: python Aloha_Act/scripts/aloha_run.py --task "Your task" --trace_id "<trace_id>".
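
For scripting the parse and run steps above, a small helper can assemble the two command lines. This is a sketch only: `aloha_commands` is a hypothetical helper, the script paths and flags are copied from the steps above, and the values `my_project` and `my_trace` are placeholders. Copying the trace into Aloha_Act/trace_data/ remains a separate manual step, and how a trace ID relates to the project name is not specified here, so the two are kept as separate arguments.

```python
import shlex


def aloha_commands(project_name, task, trace_id, python="python"):
    """Build the parser and runner command lines as argument lists."""
    parse_cmd = [python, "Aloha_Learn/parser.py", project_name]
    run_cmd = [
        python, "Aloha_Act/scripts/aloha_run.py",
        "--task", task,
        "--trace_id", trace_id,
    ]
    return parse_cmd, run_cmd


parse_cmd, run_cmd = aloha_commands("my_project", "Your task", "my_trace")
# shlex.join quotes arguments containing spaces, so the output is safe to paste into a shell.
print(shlex.join(parse_cmd))
print(shlex.join(run_cmd))
```

Each list can be passed to `subprocess.run(..., check=True)` from the repository root once the recording and trace files are in place.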

For more details, including configuration of VLM APIs and advanced options, please refer to the GitHub README.

Roadmap & License

Roadmap

Better fine-grained element targeting.
More robust drag-based text editing.
Few-shot generalization to related workflows.
Linux adaptation.

License

ShowUI-Aloha is released under the MIT License. You are welcome to use, modify, and extend the code for research and practical applications.

Citation

If you find ShowUI-Aloha useful in your research or applications, please cite:

@article{showui_aloha,
  title   = {ShowUI-Aloha: Human-Taught GUI Agent},
  author  = {Yichun Zhang and Xiangwu Guo and
            Yauhong Goh and Jessica Hu and Zhiheng Chen and Xin Wang and
            Difei Gao and Mike Zheng Shou},
  journal = {arXiv:2601.07181},
  year    = {2026}
}