
SynthUX

A low-compute synthetic visual computer-use dataset generator: goals expand into concrete Node Tree-style grammar trees, terminal affordances are replayed as low-level mouse and keyboard input in browser desktop simulators, and the resulting trajectories are recorded as frames, observations, videos, and alignment metadata.

The current project is not a cinematic system, a prompt-only data generator, or a VM recording pipeline. It is a grammar-driven way to produce grounded computer-use records in which the goal, the task tree, the low-level input stream, the visual frames, and the observed simulator state all stay connected.

The active loop is:

goal -> concrete Node Tree-style grammar expansion -> terminal affordance leaves -> low-level simulator input -> observed visual trajectory

The point is to generate grounded computer-use data without full virtual machines and without asking a huge model to invent every action from a flat prompt.

Figure: SynthUX per-(env, surface) compiler dispatch

What It Generates

Each row is a complete visual computer-use record:

  • a rendered goal;
  • a concrete tree sampled from a nondeterministic, partially context-sensitive grammar;
  • executable terminal affordance nodes embedded directly in that tree;
  • generated artifacts such as code modules, tests, PR text, incident notes, runbooks, source tables, and research memos;
  • a recorded trajectory with low-level input events, simulator observations, alignment metadata, screenshots, and optional MP4/WebM replay videos.

The tree is the target side of the record. There is no separate top-level target_trajectory, input_trajectory, or actual_trajectory. Execution filters terminal nodes from the tree in order and records what happened in the simulator.
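As a rough illustration of "filtering terminal nodes from the tree in order", the sketch below walks a tree depth-first, left to right, and yields only nodes whose kind is "terminal". The traversal order, the helper name iter_terminals, and the intermediate "subgoal" node kind are assumptions made for this example; only the node_id, kind, surface, action, target, and children fields come from the row shape shown in this document.

```python
from typing import Any, Iterator


def iter_terminals(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield terminal nodes in depth-first, left-to-right order."""
    if node.get("kind") == "terminal":
        yield node
    # Non-terminal nodes may nest to arbitrary depth.
    for child in node.get("children", []):
        yield from iter_terminals(child)


# Hypothetical tree: "subgoal" is an illustrative intermediate kind.
tree = {
    "kind": "node_tree_expansion",
    "root": {
        "node_id": "n0001",
        "kind": "behavior_program",
        "children": [
            {
                "node_id": "n0002",
                "kind": "subgoal",
                "children": [
                    {
                        "node_id": "n0002.t0001",
                        "kind": "terminal",
                        "surface": "editor",
                        "action": "replace_buffer",
                        "target": "src/example.py",
                    }
                ],
            },
            {
                "node_id": "n0003.t0002",
                "kind": "terminal",
                "surface": "slack",
                "action": "post_message",
            },
        ],
    },
}

terminal_ids = [t["node_id"] for t in iter_terminals(tree["root"])]
print(terminal_ids)
```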

How A Row Is Built

The generator starts with a domain pack. A domain defines weighted priors over task families, roles, repositories, organizations, goals, constraints, surfaces, artifact kinds, and trajectory policies. Sampling chooses a coherent archetype first, then expands a program tree.

That expansion has two jobs:

  • render the human-facing task goal and instruction;
  • carry the executable terminal affordances that later become simulator input.
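The expansion step can be pictured with a toy weighted grammar. This is only a sketch under stated assumptions: the GRAMMAR table, its weights, the nonterminal names edit_code and notify, and the github.open_pr leaf are hypothetical, and the toy grammar is context-free, whereas the real one is described as partially context-sensitive. Only editor.replace_buffer and slack.post_message are affordance names taken from this document.

```python
import itertools
import random

# Hypothetical grammar: nonterminal -> list of (weight, child symbols).
# Symbols that are not keys of GRAMMAR are treated as terminal affordances.
GRAMMAR = {
    "behavior_program": [(0.6, ["edit_code", "notify"]), (0.4, ["edit_code"])],
    "edit_code": [(1.0, ["editor.replace_buffer"])],
    "notify": [
        (0.5, ["slack.post_message"]),
        (0.5, ["github.open_pr", "slack.post_message"]),
    ],
}


def expand(symbol, rng, ids):
    """Expand one symbol into a concrete tree node."""
    node_id = f"n{next(ids):04d}"
    if symbol not in GRAMMAR:
        # Terminal affordance written as "<surface>.<action>".
        surface, action = symbol.split(".", 1)
        return {"node_id": node_id, "kind": "terminal",
                "surface": surface, "action": action}
    weights = [w for w, _ in GRAMMAR[symbol]]
    options = [children for _, children in GRAMMAR[symbol]]
    chosen = rng.choices(options, weights=weights, k=1)[0]
    return {"node_id": node_id, "kind": symbol,
            "children": [expand(child, rng, ids) for child in chosen]}


def sample_tree(seed):
    """Sample one concrete tree; the same seed gives the same tree."""
    return expand("behavior_program", random.Random(seed), itertools.count(1))


def leaves(node):
    """Collect terminal leaves in left-to-right order."""
    if node["kind"] == "terminal":
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


tree = sample_tree(seed=7)
print([f'{leaf["surface"]}.{leaf["action"]}' for leaf in leaves(tree)])
```

Seeding the generator keeps a sampled tree reproducible, which is convenient when the same tree has to be replayed in more than one simulator.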

Artifacts are generated before execution so terminals can refer to real paths and payloads. For example, a PR can cite the generated code module, a test can import that module, and a Slack handoff can mention the actual file path. The point is not to create a pretty transcript; it is to preserve causal relations across the record.

At execution time, SynthUX walks the concrete tree, filters terminal leaves, enriches those leaves with visible artifact content, compiles them into low-level input, and replays the input into a live browser desktop. The replay stream contains mouse moves, clicks, keyboard shortcuts, and per-character text_input events. Every replayed input event can carry a frame blob reference.

Why Node Trees Matter

SynthUX inherits the important idea from Node Tree: language, behavior, and execution should come from the same expanded structure. A generated instruction is not just text; it is a tree whose terminal leaves can be executed against an environment.

For computer-use data, that matters because the training record can preserve relations across levels:

  • root task goal;
  • arbitrary-depth grammar nodes and constraints;
  • nested subgoals, branches, workflows, and affordance clusters;
  • terminal affordances such as editor.replace_buffer or slack.post_message;
  • low-level mouse moves, clicks, keyboard shortcuts, and per-character text input;
  • visual frames and simulator state observations caused by those inputs.

This creates data where high-level task structure and low-level interaction traces are aligned without requiring full VM rollouts. The tree is not supposed to be a fixed root/medium/terminal stack. Some samples can be shallow, but the grammar should be able to expand into many layers whenever the task demands it.

Execution Substrates

SynthUX targets three lightweight web desktop simulators, each with its own SynthUX bridge: browser-os, windows-web-next, and macos-web-next.

The simulator bridge is observation-only during capture with one specific exception: a synthux-launch / window.__synthuxLaunchApp shim that is invoked only by a real low-level click on the SynthUX launcher overlay button. Clicking the launcher button routes through the simulator's normal app launch event (taskbar:shortcut:clicked / wm.openApp / windowManager.openApp), so visible state changes still flow from real DOM events, not from a high-level mutation command.

That constraint is what keeps the dataset honest. Earlier versions could mutate simulator state through synthux.targetAction-style high-level commands, which produced plausible-looking state transitions without enough low-level evidence. The current direction cuts that out: high-level grammar leaves may guide input compilation, but they are not allowed to be the mutation channel during capture.

Figure (causal launch flow): low-level click → launcher overlay → bridge → real OS app open → low-level drive

Per-(env, surface) Compiler Dispatch

Each terminal affordance node is handled by a compiler specific to its (env, surface) pair. The full registry is 21 compilers (7 surfaces × 3 simulators), defined in src/synthux/_compilers/ and registered at import time.
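A registry keyed by (env, surface) is one plausible way to picture this dispatch. The sketch below is not the code in src/synthux/_compilers/; the register decorator, the REGISTRY dict, the placeholder compilers, the event fields, and the lowercase surface identifiers are all assumptions made for illustration. The three simulator names and the seven surfaces are taken from this document.

```python
from typing import Callable

ENVS = ("browser-os", "windows-web-next", "macos-web-next")
# Assumed lowercase identifiers for the seven surfaces named in the text.
SURFACES = ("github", "editor", "terminal", "slack",
            "browser", "notion", "dashboard")

Compiler = Callable[[dict], list]
REGISTRY: dict[tuple[str, str], Compiler] = {}


def register(env: str, surface: str):
    """Decorator that files a compiler under its (env, surface) key."""
    def decorator(fn: Compiler) -> Compiler:
        REGISTRY[(env, surface)] = fn
        return fn
    return decorator


def make_placeholder(env: str, surface: str) -> Compiler:
    """Build a stand-in compiler; a real one would drive the live page."""
    @register(env, surface)
    def compile_node(node: dict) -> list:
        return [{"type": "click", "env": env, "surface": surface,
                 "node_id": node["node_id"]}]
    return compile_node


# Registration happens at import time, once per (env, surface) pair.
for env in ENVS:
    for surface in SURFACES:
        make_placeholder(env, surface)


def dispatch(env: str, node: dict) -> list:
    """Look up the compiler for this env and the node's surface."""
    return REGISTRY[(env, node["surface"])](node)


node = {"node_id": "n0004.t0003", "kind": "terminal",
        "surface": "editor", "action": "replace_buffer"}
print(len(REGISTRY), dispatch("macos-web-next", node))
```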

Figure: Surface-to-app matrix across browser-os, windows-web-next, and macos-web-next

Each compiler walks the live page through an ExecutionDriver (click, type, press, wait) instead of pre-computing coordinates. The driver re-reads the DOM before each click, so launcher clicks land on real launcher buttons and post-launch selectors target the just-opened app's controls. When a compiler's selector misses the live DOM (for example, because the app did not render or the layout shifted), the executor falls back to a workbench overlay and tags the observation with state.used_fallback=true, so the validator (synthux validate) can flag rows that did not exercise the native simulator surface.
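The fallback behavior described above can be sketched with a fake driver. Everything here is illustrative: the SelectorMiss exception, the FakeDriver class, the selector strings, and the run_with_fallback helper are hypothetical, and the real ExecutionDriver may expose different signatures. Only the four method names (click, type, press, wait) and the state.used_fallback flag come from this document.

```python
from typing import Protocol


class SelectorMiss(Exception):
    """Raised when a selector is not present in the live DOM (hypothetical)."""


class ExecutionDriver(Protocol):
    def click(self, selector: str) -> None: ...
    def type(self, text: str) -> None: ...
    def press(self, key: str) -> None: ...
    def wait(self, seconds: float) -> None: ...


class FakeDriver:
    """In-memory stand-in that only knows which selectors exist."""

    def __init__(self, live_selectors):
        self.live = set(live_selectors)
        self.log = []

    def click(self, selector):
        # Check the "DOM" at click time rather than caching coordinates.
        if selector not in self.live:
            raise SelectorMiss(selector)
        self.log.append(("click", selector))

    def type(self, text):
        self.log.append(("type", text))

    def press(self, key):
        self.log.append(("press", key))

    def wait(self, seconds):
        self.log.append(("wait", seconds))


def run_with_fallback(driver: ExecutionDriver, native_selector: str,
                      fallback_selector: str, text: str) -> dict:
    """Try the native app control first, then the workbench overlay."""
    observation = {"state": {"used_fallback": False}}
    try:
        driver.click(native_selector)
    except SelectorMiss:
        observation["state"]["used_fallback"] = True
        driver.click(fallback_selector)
    driver.type(text)
    return observation


native = FakeDriver({"#editor-buffer", "#workbench-input"})
broken = FakeDriver({"#workbench-input"})  # native app did not render
print(run_with_fallback(native, "#editor-buffer", "#workbench-input", "x = 1"))
print(run_with_fallback(broken, "#editor-buffer", "#workbench-input", "x = 1"))
```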

The result: a tree containing GitHub, editor, terminal, Slack, browser, Notion, and dashboard terminals produces visible movement across the corresponding native apps in each simulator, not just rows appended to a generic workbench.

Dataset Shape

The active row shape is:

{
  "goal": "...",
  "tree": {
    "kind": "node_tree_expansion",
    "grammar_id": "synthux.swe.ui_task_grammar.v1",
    "root": {
      "node_id": "n0001",
      "kind": "behavior_program",
      "children": [
        {
          "node_id": "n0004.t0003",
          "kind": "terminal",
          "surface": "editor",
          "action": "replace_buffer",
          "target": "src/example.py"
        }
      ]
    }
  },
  "trajectory": {
    "input_events": [],
    "observations": [],
    "alignment": [],
    "media": {"videos": []}
  }
}

Input events are frame-level. Text input is expanded into individual character events, so typing produces a sequence of frames rather than a single instantaneous state jump.
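A minimal sketch of the per-character expansion, assuming a hypothetical event layout: the char, frame, and node_id field names are invented for this example, and only the text_input event type is named in this document.

```python
def expand_text_input(text: str, node_id: str, start_frame: int = 0) -> list:
    """Turn one string into one text_input event per character."""
    return [
        {"type": "text_input", "char": ch,
         "frame": start_frame + offset, "node_id": node_id}
        for offset, ch in enumerate(text)
    ]


events = expand_text_input("x = 1", node_id="n0004.t0003", start_frame=120)
print(len(events), events[0], events[-1])
```

Because each character gets its own event, a typed string occupies a contiguous run of frames, which is what lets alignment refer to input-event ranges rather than single points.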

The alignment table links simulator observations and input-event ranges back to terminal node_ids. The alignment is annotation, not the primary structure. The dataset should be read as continuous simulator recording plus a grammar tree, not as a list of artificial "steps."
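One way to read the alignment table is as a lookup from an input-event index to the terminal node that caused it. The entry layout below (node_id, input_start, input_end as a half-open range) is an assumption for illustration; the actual alignment schema is not specified in this document.

```python
from typing import Optional


def node_for_input_event(alignment: list, event_index: int) -> Optional[str]:
    """Return the node_id whose input-event range contains event_index."""
    for entry in alignment:
        # Half-open range: input_start is included, input_end is not.
        if entry["input_start"] <= event_index < entry["input_end"]:
            return entry["node_id"]
    return None  # Event is not attributed to any terminal node.


# Hypothetical alignment entries for two terminal nodes.
alignment = [
    {"node_id": "n0002.t0001", "input_start": 0, "input_end": 40},
    {"node_id": "n0003.t0002", "input_start": 40, "input_end": 65},
]

print(node_for_input_event(alignment, 12))
print(node_for_input_event(alignment, 40))
print(node_for_input_event(alignment, 99))
```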

Current Dataset

The first multi-simulator, app-native run contains:

  • 9 trajectory rows;
  • 3 domains: software engineering, ops, and research (1 episode per domain × 3 envs);
  • 3 simulators driven natively: browser-os, windows-web-next, and macos-web-next;
  • 111 simulator observations, of which 90 (81.1%) are app-native and 21 fell back to the workbench overlay;
  • 1,949 low-level input frames and 111 application-state keyframe screenshots;
  • 9 replay videos (mp4);
  • 0 validator errors, 0 warnings under synthux validate.
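The app-native share reported above can be recomputed from the observations themselves. This sketch assumes each observation carries the state.used_fallback flag as a nested dict entry; the native_share helper is hypothetical, and the synthetic list only reproduces the published counts (90 app-native, 21 fallback).

```python
def native_share(observations: list) -> tuple:
    """Return (native_count, fallback_count, native_percent)."""
    total = len(observations)
    fallback = sum(
        1 for obs in observations
        if obs.get("state", {}).get("used_fallback", False)
    )
    native = total - fallback
    percent = 100.0 * native / total if total else 0.0
    return native, fallback, percent


# Synthetic stand-in matching the published counts, not real rows.
observations = (
    [{"state": {"used_fallback": False}}] * 90
    + [{"state": {"used_fallback": True}}] * 21
)

native, fallback, percent = native_share(observations)
print(native, fallback, round(percent, 1))  # 90 21 81.1
```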

It is published as a Hugging Face dataset:

jacob-valdez/synthux-visual-tree

The dataset layout is deliberately flat per trajectory:

data.jsonl
<trajectory_id>/
  screenshots/
    node-<node_id>-<sha1>.png
    node-<node_id>-frame-<ordinal>-<sha1>.png
  videos/
    replay.mp4
    replay.webm

Simulator identity is metadata in the row, not a top-level directory grouping. Asset refs are relative blob references, so a trajectory viewer can resolve screenshots and videos directly from the dataset root.
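A viewer could resolve assets along these lines. This is a sketch under assumptions: the load_rows and resolve_ref helpers, the trajectory_id value, and the screenshot file name are hypothetical. Only data.jsonl, the per-trajectory screenshots/ and videos/ folders, and the idea of refs relative to the dataset root come from this document.

```python
import json
import tempfile
from pathlib import Path


def load_rows(dataset_root: Path) -> list:
    """Read one JSON object per non-empty line of data.jsonl."""
    with (dataset_root / "data.jsonl").open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def resolve_ref(dataset_root: Path, ref: str) -> Path:
    """Resolve a relative blob ref, refusing paths that escape the root."""
    root = dataset_root.resolve()
    path = (root / ref).resolve()
    if not path.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"ref escapes dataset root: {ref}")
    return path


# Demo against a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shots = root / "traj-demo" / "screenshots"
    shots.mkdir(parents=True)
    (shots / "node-n0001-abc.png").write_bytes(b"")
    row = {"goal": "...", "trajectory": {"media": {"videos": []}}}
    (root / "data.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")

    rows = load_rows(root)
    shot = resolve_ref(root, "traj-demo/screenshots/node-n0001-abc.png")
    print(len(rows), shot.exists())
```

Checking that the resolved path stays under the dataset root is a defensive choice for viewers that load datasets from untrusted sources.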
