Recorder
The Recorder captures human demonstrations on real Windows and macOS desktops, logging screenshots,
mouse trajectories, button presses, and keystrokes into a project folder.
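As a rough illustration of what logging into a project folder could look like, the sketch below appends each event as one JSON line. The `RecordedEvent` schema, the `events.jsonl` filename, and both helper functions are assumptions made for this example, not the Recorder's actual format.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class RecordedEvent:
    """One logged input event (hypothetical schema, for illustration only)."""
    t: float                      # seconds since the recording started
    kind: str                     # e.g. "mouse_move", "mouse_button", "key", "screenshot"
    x: Optional[int] = None       # pointer position, when applicable
    y: Optional[int] = None
    detail: Optional[str] = None  # button name, key name, or screenshot filename


def append_event(project_dir: Path, event: RecordedEvent) -> Path:
    """Append one event as a JSON line to <project_dir>/events.jsonl."""
    project_dir.mkdir(parents=True, exist_ok=True)
    log_path = project_dir / "events.jsonl"
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")
    return log_path


def load_events(project_dir: Path) -> List[RecordedEvent]:
    """Read the events back in the order they were recorded."""
    log_path = project_dir / "events.jsonl"
    with log_path.open(encoding="utf-8") as f:
        return [RecordedEvent(**json.loads(line)) for line in f if line.strip()]
```

An append-only, line-per-event log keeps partial recordings readable if a session is interrupted, which is one reason to prefer it over a single large JSON document.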
Learner
The Learner parses raw logs into semantic action traces, grouping low-level events into high-level
operations such as “open browser”, “fill in form”, “resize window”, or “save edited slide”.
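The grouping idea can be sketched with a simple rule: consecutive keystrokes that arrive close together are merged into a single "type" action, while clicks stay separate. This is only a minimal stand-in, assuming dictionary-shaped events and a hypothetical `max_gap` threshold; how the Learner actually derives labels such as "open browser" is not described here.

```python
from typing import Dict, List


def group_events(events: List[Dict], max_gap: float = 1.0) -> List[Dict]:
    """Collapse low-level events into coarser actions (illustrative heuristic).

    Each event is a dict with "t" (seconds) and "kind" ("key" or "click"),
    plus "detail" for keys and "x"/"y" for clicks. Other kinds are ignored.
    """
    actions: List[Dict] = []
    for ev in events:
        last = actions[-1] if actions else None
        if ev["kind"] == "key":
            # Extend the current typing run if the pause is short enough.
            if last and last["op"] == "type" and ev["t"] - last["t_end"] <= max_gap:
                last["text"] += ev["detail"]
                last["t_end"] = ev["t"]
            else:
                actions.append({"op": "type", "text": ev["detail"], "t_end": ev["t"]})
        elif ev["kind"] == "click":
            actions.append({"op": "click", "x": ev["x"], "y": ev["y"], "t_end": ev["t"]})
    return actions
```

For example, a click followed by the keys `h` and `i` typed 0.1 seconds apart yields two actions: one click and one "type" action with the text `hi`.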
Planner
Given a new natural-language task, the Planner uses the human-taught trace as in-context guidance, deciding
which demonstrated steps to reuse, skip, or adapt for the new goal.
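One way to picture this is as two small pieces: a prompt that places the demonstrated steps next to the new task, and a function that turns per-step verdicts into a plan. Both `build_planner_prompt` and `apply_decisions` are hypothetical names, the prompt wording is invented for illustration, and no model call is shown.

```python
from typing import List, Optional, Tuple

# A verdict for one demonstrated step: ("reuse", None), ("skip", None),
# or ("adapt", "<replacement step>").
Decision = Tuple[str, Optional[str]]


def build_planner_prompt(task: str, trace: List[str]) -> str:
    """Put the demonstration in context alongside the new task."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(trace, start=1))
    return (
        f"New task: {task}\n"
        "Demonstrated steps:\n"
        f"{steps}\n"
        "For each step, answer REUSE, SKIP, or ADAPT (with the adapted step)."
    )


def apply_decisions(trace: List[str], decisions: List[Decision]) -> List[str]:
    """Build the new plan from one verdict per demonstrated step."""
    if len(trace) != len(decisions):
        raise ValueError("need exactly one decision per demonstrated step")
    plan: List[str] = []
    for step, (verdict, replacement) in zip(trace, decisions):
        if verdict == "reuse":
            plan.append(step)
        elif verdict == "adapt":
            if not replacement:
                raise ValueError(f"'adapt' needs a replacement for step: {step}")
            plan.append(replacement)
        elif verdict != "skip":
            raise ValueError(f"unknown verdict: {verdict}")
    return plan
```

Keeping the verdicts explicit makes it easy to inspect why a demonstrated step was dropped or changed before anything is executed.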
Actor & Executor
Finally, the Actor and Executor ground the plan in the actual UI: they carry out OS-level clicks,
drag-and-drop operations, scrolling, and typing, while monitoring visual feedback to keep the agent on track.
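The act-then-check pattern described above can be sketched as a loop that performs each step, verifies the result, and retries a bounded number of times. The `perform` and `verify` callables are placeholders supplied by the caller: a real system would issue OS-level input and inspect a fresh screenshot, neither of which is implemented here.

```python
from typing import Callable, Dict, List


def run_plan(
    plan: List[Dict],
    perform: Callable[[Dict], None],
    verify: Callable[[Dict], bool],
    max_retries: int = 2,
) -> List[str]:
    """Execute steps in order, retrying a step when verification fails.

    Returns a log such as ["ok:click", "failed:type"]; execution stops at
    the first step that still fails after all retries.
    """
    log: List[str] = []
    for step in plan:
        for _attempt in range(max_retries + 1):
            perform(step)        # hypothetical: click, drag, scroll, or type
            if verify(step):     # hypothetical: compare against visual feedback
                log.append(f"ok:{step['op']}")
                break
        else:
            # No attempt passed verification; stop rather than act blindly.
            log.append(f"failed:{step['op']}")
            return log
    return log
```

Stopping at the first unrecoverable step is a conservative choice for this sketch; a fuller agent might instead hand control back to the planner to revise the remaining steps.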