SynthUX
A low-compute synthetic visual computer-use dataset generator: goals expand into concrete Node Tree-style grammar trees, terminal affordances are replayed as low-level mouse and keyboard input in browser desktop simulators, and the resulting trajectories are recorded as frames, observations, videos, and alignment metadata.
SynthUX is a low-compute synthetic visual computer-use dataset generator. The current project is not a cinematic system, not a prompt-only data generator, and not a VM recording pipeline. It is a grammar-driven way to produce grounded computer-use records where the goal, the task tree, the low-level input stream, the visual frames, and the observed simulator state all stay connected.
The active loop is:
goal -> concrete Node Tree-style grammar expansion -> terminal affordance leaves -> low-level simulator input -> observed visual trajectory
The point is to generate grounded computer-use data without full virtual machines and without asking a huge model to invent every action from a flat prompt.
What It Generates
Each row is a complete visual computer-use record:
- a rendered goal;
- a concrete tree sampled from a nondeterministic, partially context-sensitive grammar;
- executable terminal affordance nodes embedded directly in that tree;
- generated artifacts such as code modules, tests, PR text, incident notes, runbooks, source tables, and research memos;
- a recorded trajectory with low-level input events, simulator observations, alignment metadata, screenshots, and optional MP4/WebM replay videos.
The tree is the target side of the record. There is no separate top-level target_trajectory, input_trajectory, or actual_trajectory. Execution filters terminal nodes from the tree in order and records what happened in the simulator.
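As a minimal sketch of that filtering step, the function below walks a tree and yields terminal nodes in document order. It assumes depth-first, left-to-right traversal and the kind/children fields shown in the row shape; the "subgoal" node kind and the sample tree are hypothetical, not taken from the SynthUX source.

```python
from typing import Any, Iterator


def iter_terminals(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield terminal nodes depth-first, left to right (assumed ordering)."""
    if node.get("kind") == "terminal":
        yield node
        return
    for child in node.get("children", []):
        yield from iter_terminals(child)


# Hypothetical tree: the "subgoal" kind is illustrative only.
tree = {
    "kind": "node_tree_expansion",
    "root": {
        "node_id": "n0001",
        "kind": "behavior_program",
        "children": [
            {
                "node_id": "n0002",
                "kind": "subgoal",
                "children": [
                    {
                        "node_id": "n0002.t0001",
                        "kind": "terminal",
                        "surface": "editor",
                        "action": "replace_buffer",
                        "target": "src/example.py",
                    }
                ],
            },
            {
                "node_id": "n0003.t0002",
                "kind": "terminal",
                "surface": "slack",
                "action": "post_message",
            },
        ],
    },
}

terminal_ids = [t["node_id"] for t in iter_terminals(tree["root"])]
```

Because the terminals stay embedded in the tree, the executable sequence is derived on demand rather than stored as a second, separately maintained list.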
How A Row Is Built
The generator starts with a domain pack. A domain defines weighted priors over task families, roles, repositories, organizations, goals, constraints, surfaces, artifact kinds, and trajectory policies. Sampling chooses a coherent archetype first, then expands a program tree.
That expansion has two jobs:
- render the human-facing task goal and instruction;
- carry the executable terminal affordances that later become simulator input.
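The archetype-first sampling described above might look roughly like the following. This is only a sketch: the pack layout, the archetype names, the role names, and the weights are all hypothetical, and the real domain packs cover many more fields (repositories, organizations, constraints, artifact kinds, trajectory policies).

```python
import random

# Hypothetical domain pack: names and weights are illustrative only.
DOMAIN_PACK = {
    "archetypes": {
        "bugfix_pr": {
            "weight": 3.0,
            "roles": {"backend engineer": 2.0, "sre": 1.0},
            "surfaces": ["editor", "terminal", "github"],
        },
        "incident_handoff": {
            "weight": 1.0,
            "roles": {"sre": 3.0, "backend engineer": 1.0},
            "surfaces": ["dashboard", "slack"],
        },
    }
}


def weighted_choice(rng: random.Random, weights: dict[str, float]) -> str:
    """Pick one key with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def sample_archetype(rng: random.Random, pack: dict) -> dict:
    """Choose the archetype first, then sample fields conditioned on it."""
    archetypes = pack["archetypes"]
    name = weighted_choice(rng, {k: v["weight"] for k, v in archetypes.items()})
    spec = archetypes[name]
    return {
        "archetype": name,
        "role": weighted_choice(rng, spec["roles"]),
        "surfaces": list(spec["surfaces"]),
    }


sample = sample_archetype(random.Random(0), DOMAIN_PACK)
```

Sampling the archetype before its dependent fields is what keeps a row coherent: the role and surfaces are drawn only from options that make sense for the chosen task family.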
Artifacts are generated before execution so terminals can refer to real paths and payloads. For example, a PR can cite the generated code module, a test can import that module, and a Slack handoff can mention the actual file path. The point is not to create a pretty transcript; it is to preserve causal relations across the record.
At execution time, SynthUX walks the concrete tree, filters terminal leaves, enriches those leaves with visible artifact content, compiles them into low-level input, and replays the input into a live browser desktop. The replay stream contains mouse moves, clicks, keyboard shortcuts, and per-character text_input events. Every replayed input event can carry a frame blob reference.
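To illustrate the per-character expansion, here is a small sketch that turns one text payload into a sequence of text_input events. The InputEvent class and its field names (payload, frame_ref) are assumptions made for this example; only the text_input event name and the idea of an optional frame blob reference come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputEvent:
    """Hypothetical event record; field names are illustrative."""
    kind: str                        # e.g. "click" or "text_input"
    node_id: str                     # terminal node that produced the event
    payload: dict
    frame_ref: Optional[str] = None  # filled in once a frame is captured


def expand_text_input(node_id: str, text: str) -> list[InputEvent]:
    """Emit one text_input event per character instead of one bulk event."""
    return [InputEvent("text_input", node_id, {"char": ch}) for ch in text]


events = expand_text_input("n0004.t0003", "x = 1")
```

Emitting one event per character means each keystroke can be paired with its own frame, which is what makes typing appear as gradual visual change in the recording.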
Why Node Trees Matter
SynthUX inherits the important idea from Node Tree: language, behavior, and execution should come from the same expanded structure. A generated instruction is not just text; it is a tree whose terminal leaves can be executed against an environment.
For computer-use data, that matters because the training record can preserve relations across levels:
- root task goal;
- arbitrary-depth grammar nodes and constraints;
- nested subgoals, branches, workflows, and affordance clusters;
- terminal affordances such as editor.replace_buffer or slack.post_message;
- low-level mouse moves, clicks, keyboard shortcuts, and per-character text input;
- visual frames and simulator state observations caused by those inputs.
This creates data where high-level task structure and low-level interaction traces are aligned without requiring full VM rollouts. The tree is not supposed to be a fixed root/medium/terminal stack. Some samples can be shallow, but the grammar should be able to expand into many layers whenever the task demands it.
Execution Substrates
SynthUX targets three lightweight web desktop simulators, each with its own SynthUX bridge:
- browser-os: generic browser-native desktop shell.
- windows-web-next: Windows 11 simulation in Svelte.
- macos-web-next: macOS desktop simulation in Svelte.
The simulator bridge is observation-only during capture with one specific exception: a synthux-launch / window.__synthuxLaunchApp shim that is invoked only by a real low-level click on the SynthUX launcher overlay button. Clicking the launcher button routes through the simulator's normal app launch event (taskbar:shortcut:clicked / wm.openApp / windowManager.openApp), so visible state changes still flow from real DOM events, not from a high-level mutation command.
That constraint is what keeps the dataset honest. Earlier versions could mutate simulator state through synthux.targetAction-style high-level commands, which produced plausible-looking state transitions without enough low-level evidence. The current direction cuts that out: high-level grammar leaves may guide input compilation, but they are not allowed to be the mutation channel during capture.
Per-(env, surface) Compiler Dispatch
Each terminal affordance node is handled by an env-specific compiler. The full registry is 21 compilers — 7 surfaces × 3 simulators — defined in src/synthux/_compilers/ and registered at import time.
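A registry of this shape could be sketched as a dictionary keyed by (env, surface) and populated by a decorator at import time. The code below is an assumption-laden illustration, not the contents of src/synthux/_compilers/: the register and dispatch names, the surface keys, and the stub compilers are hypothetical. The three env names and the 7 × 3 count come from this document.

```python
from itertools import product
from typing import Callable

ENVS = ("browser-os", "windows-web-next", "macos-web-next")
# Surface keys are assumed spellings of the seven surfaces named in this README.
SURFACES = ("github", "editor", "terminal", "slack", "browser", "notion", "dashboard")

Compiler = Callable[[dict], list[str]]
COMPILERS: dict[tuple[str, str], Compiler] = {}


def register(env: str, surface: str) -> Callable[[Compiler], Compiler]:
    """Decorator that records a compiler under its (env, surface) key."""
    def decorator(fn: Compiler) -> Compiler:
        COMPILERS[(env, surface)] = fn
        return fn
    return decorator


def _make_stub(env: str, surface: str) -> Compiler:
    """Build a placeholder compiler; a real one would drive the live page."""
    def compile_node(node: dict) -> list[str]:
        return [f"{env}:{surface}:{node['action']}"]
    return compile_node


# Runs at import time, so the registry is complete once the module loads.
for _env, _surface in product(ENVS, SURFACES):
    register(_env, _surface)(_make_stub(_env, _surface))


def dispatch(env: str, node: dict) -> list[str]:
    """Look up the compiler for this env and the node's surface, then run it."""
    key = (env, node["surface"])
    if key not in COMPILERS:
        raise LookupError(f"no compiler registered for {key}")
    return COMPILERS[key](node)


plan = dispatch("macos-web-next", {"surface": "editor", "action": "replace_buffer"})
```

Keying on the pair rather than on the surface alone lets the same terminal affordance compile to different selectors and launch paths in each simulator.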
Each compiler walks the live page through an ExecutionDriver (click, type, press, wait) instead of pre-computing coordinates. The driver re-reads the DOM before each click, so launcher clicks land on real launcher buttons and post-launch selectors target the just-opened app's controls. When a compiler's selector misses the live DOM (app didn't render, layout shifted), the executor falls back to a workbench overlay and tags the observation with state.used_fallback=true so the validator (synthux validate) can flag rows that did not exercise the native simulator surface.
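The fallback behavior can be sketched with a small driver protocol and a wrapper that tags the observation. Everything here other than the four driver verbs and the used_fallback flag is hypothetical: the SelectorMiss exception, the FakeDriver test double, the CSS selectors, and the execute_terminal function are invented for illustration.

```python
from typing import Callable, Protocol


class SelectorMiss(Exception):
    """Hypothetical error raised when a selector is absent from the live DOM."""


class ExecutionDriver(Protocol):
    def click(self, selector: str) -> None: ...
    def type(self, text: str) -> None: ...
    def press(self, keys: str) -> None: ...
    def wait(self, selector: str) -> None: ...


class FakeDriver:
    """In-memory stand-in for a browser page, used only for this sketch."""

    def __init__(self, present: set[str]) -> None:
        self.present = set(present)
        self.log: list[tuple[str, str]] = []

    def _require(self, selector: str) -> None:
        # Checked on every call, mirroring a driver that re-reads the DOM.
        if selector not in self.present:
            raise SelectorMiss(selector)

    def click(self, selector: str) -> None:
        self._require(selector)
        self.log.append(("click", selector))

    def type(self, text: str) -> None:
        self.log.append(("type", text))

    def press(self, keys: str) -> None:
        self.log.append(("press", keys))

    def wait(self, selector: str) -> None:
        self._require(selector)
        self.log.append(("wait", selector))


def native_editor_steps(driver: ExecutionDriver) -> None:
    driver.click("#launcher-editor")  # hypothetical selectors throughout
    driver.wait("#editor-buffer")
    driver.click("#editor-buffer")
    driver.press("Control+A")
    driver.type("x = 1")


def workbench_steps(driver: ExecutionDriver) -> None:
    driver.click("#workbench-input")
    driver.type("x = 1")


Steps = Callable[[ExecutionDriver], None]


def execute_terminal(driver: ExecutionDriver, native: Steps, fallback: Steps) -> dict:
    """Try the native path; on a selector miss, fall back and tag the result."""
    state = {"used_fallback": False}
    try:
        native(driver)
    except SelectorMiss:
        state["used_fallback"] = True
        fallback(driver)
    return {"state": state}


native_obs = execute_terminal(
    FakeDriver({"#launcher-editor", "#editor-buffer"}),
    native_editor_steps,
    workbench_steps,
)
fallback_obs = execute_terminal(
    FakeDriver({"#launcher-editor", "#workbench-input"}),
    native_editor_steps,
    workbench_steps,
)
```

In the second call the launcher click succeeds but the editor never renders, so the wait raises and the row is tagged. Recording the flag on the observation, rather than dropping the row, leaves the decision about fallback rows to the validator and to downstream consumers.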
The result: a tree containing GitHub, editor, terminal, Slack, browser, Notion, and dashboard terminals produces visible movement across the corresponding native apps in each simulator, not just rows appended to a generic workbench.
Dataset Shape
The active row shape is:
{
  "goal": "...",
  "tree": {
    "kind": "node_tree_expansion",
    "grammar_id": "synthux.swe.ui_task_grammar.v1",
    "root": {
      "node_id": "n0001",
      "kind": "behavior_program",
      "children": [
        {
          "node_id": "n0004.t0003",
          "kind": "terminal",
          "surface": "editor",
          "action": "replace_buffer",
          "target": "src/example.py"
        }
      ]
    }
  },
  "trajectory": {
    "input_events": [],
    "observations": [],
    "alignment": [],
    "media": {"videos": []}
  }
}
Input events are frame-level. Text input is expanded into individual character events, so typing produces a sequence of frames rather than a single instantaneous state jump.
The alignment table links simulator observations and input-event ranges back to terminal node_ids. The alignment is annotation, not the primary structure. The dataset should be read as continuous simulator recording plus a grammar tree, not as a list of artificial "steps."
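One way to picture the alignment table is as a list of half-open index ranges over the input-event stream, one per contiguous run of events from the same terminal node. The sketch below assumes each input event carries a node_id field and uses hypothetical event_start/event_end keys; the actual alignment schema may differ.

```python
from itertools import groupby


def build_alignment(input_events: list[dict]) -> list[dict]:
    """Group consecutive events by node_id into [event_start, event_end) ranges."""
    alignment = []
    index = 0
    for node_id, group in groupby(input_events, key=lambda e: e["node_id"]):
        count = sum(1 for _ in group)
        alignment.append(
            {"node_id": node_id, "event_start": index, "event_end": index + count}
        )
        index += count
    return alignment


# Hypothetical event stream: three events from one terminal, two from the next.
input_events = [
    {"node_id": "n0002.t0001", "kind": "click"},
    {"node_id": "n0002.t0001", "kind": "text_input"},
    {"node_id": "n0002.t0001", "kind": "text_input"},
    {"node_id": "n0003.t0002", "kind": "click"},
    {"node_id": "n0003.t0002", "kind": "text_input"},
]
alignment = build_alignment(input_events)
```

The event stream itself stays continuous and unsegmented; the ranges are a derived annotation that a consumer can ignore or recompute.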
Current Dataset
The first multi-simulator, app-native run contains:
- 9 trajectory rows;
- 3 domains: software engineering, ops, and research (1 episode per domain × 3 envs);
- 3 simulators driven natively: browser-os, windows-web-next, and macos-web-next;
- 111 simulator observations, of which 90 (81.1%) are app-native and 21 fell back to the workbench overlay;
- 1,949 low-level input frames + 111 application-state keyframe screenshots;
- 9 replay videos (mp4);
- 0 validator errors, 0 warnings under synthux validate.
It is published as a Hugging Face dataset:
jacob-valdez/synthux-visual-tree
The dataset layout is deliberately flat per trajectory:
data.jsonl
<trajectory_id>/
  screenshots/
    node-<node_id>-<sha1>.png
    node-<node_id>-frame-<ordinal>-<sha1>.png
  videos/
    replay.mp4
    replay.webm
Simulator identity is metadata in the row, not a top-level directory grouping. Asset refs are relative blob references, so a trajectory viewer can resolve screenshots and videos directly from the dataset root.
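A viewer could resolve such references with a few lines of path handling. The sketch below is an assumption about how resolution might work, not the SynthUX viewer's code: resolve_ref, the dataset root, and the trajectory id are hypothetical. It rejects absolute paths and parent-directory segments so that a reference cannot point outside the dataset root.

```python
from pathlib import PurePosixPath


def resolve_ref(dataset_root: str, ref: str) -> PurePosixPath:
    """Join a relative asset ref onto the dataset root, rejecting escapes."""
    rel = PurePosixPath(ref)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"asset ref must stay inside the dataset root: {ref}")
    return PurePosixPath(dataset_root) / rel


# Hypothetical root and trajectory id, following the layout shown above.
video_path = resolve_ref("/data/synthux", "traj-0001/videos/replay.mp4")
```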
Neighborhood