fix: tracking UI graph view freezes on state machines with dense transitions#834

Open
msradam wants to merge 2 commits into apache:main from msradam:fix/ui-graph-perf

@msradam msradam commented Jul 3, 2026

The tracking UI graph view freezes on state machines with a lot of transitions: every click or hover on a step re-runs A* pathfinding for every edge. I first hit this on one of my own apps (17 actions, 257 conditional transitions, about a one second freeze per click or hover). On current main a 75-action synthetic graph blocks ~10s per click and a 300-action graph never finishes its initial render. Benchmarks, profiles, and root cause are in #833.

Fixes #833

Changes

All in GraphView.tsx, plus two call sites in AppView.tsx:

  • Memoize the smart-edge path per edge, keyed on edge geometry and the node set. Highlighting doesn't move nodes, so clicks and hovers in the step list stop paying for pathfinding (interactions that change the node set itself still recompute, as before). This alone takes my real app from 999ms to 32ms per interaction, with rendering unchanged.
  • Above 100 rendered nodes (SMART_EDGE_NODE_LIMIT), skip A* and use the existing getBezierPath fallback so very large graphs render at all. Below the threshold, edge routing is untouched.
  • Memoize the NodeStateProvider context value instead of rebuilding it inline every render.
  • Key the relayout effect on the rendered graph structure (node ids and types plus edge source/target/condition, derived from the same conversion the renderer uses, so the key can't disagree with what's drawn) instead of stateMachine object identity. react-query's structural sharing keeps the reference stable across most polls, so the practical per-poll cost was the context churn above, but anything that hands the graph a new identity with unchanged structure (switching focus between a parent app and a structurally identical sub-app, for example) used to trigger a full dagre relayout plus fitView and throw away the user's pan and zoom. Now relayout happens only when the graph actually changes.
  • Build a fresh dagre graph per layout. The old module-level instance kept nodes from previously viewed applications.
  • Remove a leftover label={'test'} debug prop, and pass undefined instead of a fresh [] literal for the currently unused highlightedActions prop (a new array identity each render would defeat the context memoization).
  • Node data now carries only the rendered label instead of the whole ActionModel, so a replaced model object can't leave stale data behind in nodes.
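
The per-edge memoization in the first bullet can be sketched without React as a cache keyed on edge geometry plus the node set. Everything below (`Point`, `NodeBox`, `geometryKey`, `makeMemoizedRouter`, the key format) is hypothetical and not taken from GraphView.tsx, which presumably relies on React memoization instead; the `route` callback stands in for the A* smart-edge call.

```typescript
// Hypothetical sketch: names and key format are illustrative, not from GraphView.tsx.
type Point = { x: number; y: number };
type NodeBox = { id: string; x: number; y: number; width: number; height: number };

// Serialize everything the router's output depends on: edge endpoints and node boxes.
function geometryKey(source: Point, target: Point, nodes: NodeBox[]): string {
  const nodePart = nodes
    .map((n) => `${n.id}:${n.x},${n.y},${n.width},${n.height}`)
    .join("|");
  return `${source.x},${source.y}>${target.x},${target.y}#${nodePart}`;
}

// Wrap an expensive router so each edge recomputes only when its key changes.
function makeMemoizedRouter(
  route: (source: Point, target: Point, nodes: NodeBox[]) => string,
) {
  // One cached entry per edge id, so the cache never grows past the edge count.
  const cache = new Map<string, { key: string; path: string }>();
  return (edgeId: string, source: Point, target: Point, nodes: NodeBox[]): string => {
    const key = geometryKey(source, target, nodes);
    const hit = cache.get(edgeId);
    if (hit !== undefined && hit.key === key) return hit.path;
    const path = route(source, target, nodes);
    cache.set(edgeId, { key, path });
    return path;
  };
}
```

A highlight change re-renders an edge with identical geometry, so the second call returns the cached path without invoking `route`; moving or adding a node changes the key and forces a recompute.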

How I tested this

Playwright benchmark, click dispatch to next paint, median of 12 clicks, production builds of the same source where the only variable is this diff:

| Application | Unfixed | With fix |
| --- | --- | --- |
| Synthetic, 75 actions / 221 edges | 9,669ms | 32ms |
| Synthetic, 150 actions / 449 edges | benchmark hung (>300s) | 31ms |
| Synthetic, 300 actions / 899 edges | never renders | 75ms |
| Real agent app, 17 actions / 257 transitions | 999ms | 32ms |
| Real agent apps at 9-13 nodes, and bundled demo_chatbot | 32ms | 32ms |

The 150 and 300 node synthetics are deliberate stress sizes to bound the improvement and find where rendering falls over; the 17-action agent app is the realistic case. Hover costs the same as click in every case. V8 profiles put the unfixed time in the smart-edge stack (findPath, _buildNodes, getNeighbors, plus the GC churn they cause); with the fix that stack is gone. The harness lives on spike/ui-graph-perf if you want to re-run it.
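
The reported figures are medians over repeated clicks, which keeps a single slow sample (a GC pause, for instance) from skewing the number. A minimal sketch of that aggregation follows; the `median` helper and the sample latencies are illustrative only and are not taken from the harness on spike/ui-graph-perf.

```typescript
// Illustrative helper, not from the benchmark harness.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  // Copy before sorting so the caller's array is not mutated; sort numerically.
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up latencies in ms for 12 clicks, including one outlier.
const clickLatencies = [31, 32, 32, 30, 33, 32, 31, 250, 32, 33, 31, 32];
console.log(median(clickLatencies)); // 32
```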

For visual parity I screenshotted the graph pane on both builds: byte-identical PNGs (cmp exit 0) for the bundled demo_chatbot (below) and for a 75-node synthetic graph. Below the threshold, rendering is unchanged pixel for pixel. Above it the change is deliberate: bezier edges instead of A*-routed ones, in territory where the view previously took ~6s to appear on the 0.40.2 release (worse on this source) or didn't appear at all.

demo_chatbot graph pane, identical on both builds

I also checked the highlight behavior directly rather than trusting screenshots alone: clicking a step row applies the same classes to its graph node on both builds (bg-green-500/80 ... border-dwlightblue/50 text-white border-2), hovering a row applies opacity-50 to its node, and full-page screenshots taken after 12 identical clicks and hovers are byte-identical across builds for four real applications.

After the second commit I re-ran the validation against the PR head: the dense real app benches 32ms clicks and 29ms hovers, the 150-action synthetic 32ms, the highlight and Show Inputs assertions pass, and the graph pane screenshot is still byte-identical to the unfixed build's.

I ran the CI checks locally the way the workflows define them: npm run build, npm run lint:fix (--max-warnings=0), and npm run format:fix all pass; pre-commit hooks pass on the changed files; python -m pytest tests --ignore=tests/integrations/persisters --ignore=tests/integrations/test_bip0042_bedrock.py gives 538 passed and 1 skipped, plus one Ray-startup timeout on my Mac that fails identically without this diff (there is no Python in it). npm test fails on main either way (react-syntax-highlighter ships ESM the CRA Jest config can't parse), which I assume is why ui.yml doesn't invoke it.

Notes

  • The 100-node threshold is a judgment call from the measurements: at 75 nodes smart edges still complete initial rendering (and output is unchanged by this diff), while at 150 nodes the unfixed benchmark on this source hung outright and the 0.40.2 release cost 11s per click. Cost grows with edge count times layout area, so a node-count cutoff is blunt but predictable. Happy to tune the constant or make it configurable.
  • The unfixed numbers here are worse than the 0.40.2-based numbers in #833 because #586 swapped elkjs for dagre after 0.40.2, and the dagre layout covers more area, which the pathfinding grid scales with. Same mechanism and same fix on both versions.
  • The threshold counts rendered nodes, which includes input nodes when Show Inputs is on, so toggling that checkbox can move a graph across the limit and switch edge styles. That tracks the real pathfinding cost (the grid covers everything rendered), but it's worth knowing it's mode-dependent.
  • There's one eslint-disable-next-line react-hooks/exhaustive-deps on the relayout effect: depending on the structure key instead of the object it reads is the point of that change. (Under the current lint config the rule is off, so the comment is documentation for whenever the config tightens.)
  • Initial render of very large graphs is still slow (~60s at 300 actions, dominated by dagre layout and first paint of 899 edges). That's a separate follow-up, noted in #833.
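
To make the mode dependence of the threshold concrete, here is a small sketch of the cutoff as described in these notes: the count is over rendered nodes, so input nodes count only while Show Inputs is on. `SMART_EDGE_NODE_LIMIT` and its value of 100 come from the description above; `chooseEdgeRouting`, its parameters, and the reading of "above 100" as strictly greater than 100 are assumptions.

```typescript
const SMART_EDGE_NODE_LIMIT = 100; // value stated in the PR description

type EdgeRouting = "smart" | "bezier";

// Hypothetical helper: pick the routing style from what is actually rendered.
function chooseEdgeRouting(
  actionCount: number,
  inputCount: number,
  showInputs: boolean,
): EdgeRouting {
  // Input nodes are only part of the rendered graph while Show Inputs is on.
  const renderedNodes = actionCount + (showInputs ? inputCount : 0);
  // Assumed: "above 100 rendered nodes" means strictly greater than the limit.
  return renderedNodes > SMART_EDGE_NODE_LIMIT ? "bezier" : "smart";
}

console.log(chooseEdgeRouting(95, 10, false)); // "smart"  (95 rendered nodes)
console.log(chooseEdgeRouting(95, 10, true)); // "bezier" (105 rendered nodes)
```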

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

msradam added 2 commits July 3, 2026 04:07
…ction

The graph view reran A* pathfinding for every edge on every render.
Highlight state arrives via a context whose value object was rebuilt
inline, so every click and hover re-rendered all nodes and edges, each
edge re-running pathfinding over a grid sized to the whole layout.
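
Why an inline-rebuilt context value hurts: React context consumers re-render when the provider's value changes identity, and an object literal is a new identity on every render even when its fields are equal. The framework-free sketch below imitates dependency-based memoization to show the difference; `makeMemo`, `HighlightState`, and `buildValue` are hypothetical names, and the real component presumably uses React's `useMemo`.

```typescript
// Framework-free illustration; not code from GraphView.tsx.
type HighlightState = { selected: string | null; hovered: string | null };

// Returns the previous value while every dependency is unchanged (Object.is per slot).
function makeMemo<T>() {
  let last: { deps: readonly unknown[]; value: T } | undefined;
  return (factory: () => T, deps: readonly unknown[]): T => {
    if (
      last !== undefined &&
      last.deps.length === deps.length &&
      last.deps.every((d, i) => Object.is(d, deps[i]))
    ) {
      return last.value;
    }
    last = { deps, value: factory() };
    return last.value;
  };
}

const memo = makeMemo<HighlightState>();
const buildValue = (selected: string | null, hovered: string | null): HighlightState =>
  memo(() => ({ selected, hovered }), [selected, hovered]);

// Rebuilt inline: equal fields, but a new identity each time.
const inlineA: HighlightState = { selected: "step-1", hovered: null };
const inlineB: HighlightState = { selected: "step-1", hovered: null };
console.log(Object.is(inlineA, inlineB)); // false

// Memoized: unchanged dependencies return the same object.
console.log(Object.is(buildValue("step-1", null), buildValue("step-1", null))); // true
```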

- memoize the smart-edge path per edge on geometry and node set
- skip A* above 100 nodes and fall back to bezier so large graphs render
- memoize the highlight context value
- key the relayout effect on graph structure instead of object identity,
  so identity churn without a structural change (focus switches,
  responses that are not reference-stable) no longer forces a full
  relayout and fitView
- create a fresh dagre graph per layout instead of reusing a module-level
  instance that accumulates stale nodes across applications
- remove leftover debug edge label

Measured with production builds: clicks on a 150-action application went
from 11s+ on the 0.40.2 release (the same benchmark hung outright on a
build of this source) to ~31ms; a 300-action application previously never
finished its initial render and now renders with ~75ms interactions.

Fixes apache#833
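
The stale-node problem with a module-level layout graph can be reproduced with a stand-in. `FakeGraph` below is a hypothetical minimal substitute for a layout graph (only `setNode` and `nodes` are modelled) so that the sketch runs without dependencies; it is not the dagre API, and `layoutShared` and `layoutFresh` are illustrative names.

```typescript
// Hypothetical stand-in for a layout graph; not the dagre API.
class FakeGraph {
  private ids = new Set<string>();
  setNode(id: string): void {
    this.ids.add(id);
  }
  nodes(): string[] {
    return Array.from(this.ids);
  }
}

// Old pattern: one shared instance, so nodes from earlier layouts linger.
const sharedGraph = new FakeGraph();
function layoutShared(nodeIds: string[]): string[] {
  nodeIds.forEach((id) => sharedGraph.setNode(id));
  return sharedGraph.nodes();
}

// New pattern: a fresh instance per layout only ever holds the current nodes.
function layoutFresh(nodeIds: string[]): string[] {
  const graph = new FakeGraph();
  nodeIds.forEach((id) => graph.setNode(id));
  return graph.nodes();
}

layoutShared(["app1_a", "app1_b"]);
console.log(layoutShared(["app2_x"])); // [ 'app1_a', 'app1_b', 'app2_x' ]

layoutFresh(["app1_a", "app1_b"]);
console.log(layoutFresh(["app2_x"])); // [ 'app2_x' ]
```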
…n node data

The relayout key now comes from the same conversion the renderer uses
(node ids and types plus edge source/target/condition), so it matches
rendering by construction: hidden __-prefixed inputs and non-rendered
model fields can neither trigger nor miss a relayout, and showInputs is
covered by the key because input nodes have their own ids. Node data
carries only the rendered label instead of the whole ActionModel, so
replacing the model object cannot leave stale data in nodes.
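
A structure key of this kind can be sketched as a serialization of exactly the fields named above. The `RenderedNode` and `RenderedEdge` shapes and the `structureKey` function are hypothetical; the sketch also assumes the conversion emits nodes and edges in a deterministic order, since the key is order-sensitive.

```typescript
// Hypothetical shapes: the fields mirror those named in the commit message.
type RenderedNode = { id: string; type: string };
type RenderedEdge = { source: string; target: string; condition?: string };

// Serialize only what affects layout: node ids/types and edge source/target/condition.
function structureKey(nodes: RenderedNode[], edges: RenderedEdge[]): string {
  return JSON.stringify([
    nodes.map((n) => [n.id, n.type]),
    edges.map((e) => [e.source, e.target, e.condition ?? null]),
  ]);
}

const nodesA: RenderedNode[] = [
  { id: "prompt", type: "action" },
  { id: "respond", type: "action" },
];
const edgesA: RenderedEdge[] = [{ source: "prompt", target: "respond", condition: "safe" }];

// New object identities, same structure: the key is unchanged, so no relayout is needed.
const nodesB = nodesA.map((n) => ({ ...n }));
const edgesB = edgesA.map((e) => ({ ...e }));
console.log(structureKey(nodesA, edgesA) === structureKey(nodesB, edgesB)); // true

// Showing an input node adds an id, so the key changes and a relayout is warranted.
const withInput = [...nodesA, { id: "input:query", type: "input" }];
console.log(structureKey(nodesA, edgesA) === structureKey(withInput, edgesA)); // false
```

An effect that depends on this string rather than on the model object reruns only when the rendered structure changes.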
@msradam msradam force-pushed the fix/ui-graph-perf branch from a47a1e7 to a5ec8dd Compare July 3, 2026 08:08
@msradam msradam changed the title fix: graph view freezes on state machines with dense transitions fix: tracking UI graph view freezes on state machines with dense transitions Jul 3, 2026
msradam added a commit to msradam/burr that referenced this pull request Jul 3, 2026

Labels

area/ui Burr UI (telemetry frontend)

Development

Successfully merging this pull request may close these issues.

UI graph view unusable on large state machines: smart-edge pathfinding reruns for every edge on every render
