Skip to content

UI graph view unusable on large state machines: smart-edge pathfinding reruns for every edge on every render #833

Description

@msradam

I first hit this on one of my own Burr apps: clicking through steps in the tracking UI would freeze the tab for a second or more per click, and the bigger the state machine got, the worse it was. I dug in and it turns out the graph view re-runs A* pathfinding for every edge on every render.

Current behavior

Clicking or hovering a step in the step list highlights the matching node and edges in the graph. On a small app that's instant. Past a few dozen actions it blocks the main thread for seconds per click.

I benchmarked it with Playwright against the production UI build, timing from click dispatch to the next paint, median of 12 clicks spread across the step list:

Actions Edges Initial render Click to paint (median) Hover to paint (median)
10 25 0.9s 32ms 30ms
75 221 1.1s 489ms 495ms
150 449 5.8s 11,194ms (max 23,835ms) 11,549ms
300 899 never renders, killed after 5 min n/a n/a

Hover is just as expensive as click, and hover fires continuously while the pointer moves down the step list, so on a large app the graph pane is effectively frozen the whole time you're inspecting it.

A V8 profile during the click loop on the 150-action app shows 41% garbage collector, 16% _buildNodes, 12% findPath, plus getNeighbors. All of that is the A* pathfinding stack inside @tisoap/react-flow-smart-edge and the allocation churn it causes.

Root cause

All in telemetry/ui/src/components/routes/app/GraphView.tsx (line numbers from current main):

  1. ActionActionEdge calls getSmartEdge in its render body (line 162). That builds a pathfinding grid over the bounding box of all nodes and runs A* for that one edge. Nothing memoizes it, so it re-runs for every edge on every render. Per-edge cost grows with layout area, which grows with node count, so the total is superlinear in graph size.
  2. Highlight state reaches nodes and edges through NodeStateProvider, and _Graph recreates the context value object inline on every render (lines 346-352). So every click and every hover enter/leave re-renders all N nodes and all E edges, and every edge re-runs A*.
  3. ActionActionEdge also subscribes to the full node array via useNodes() (line 152), so any node store change re-renders every edge.
  4. With auto-refresh on, the app refetches every 500ms (REFRESH_INTERVAL, AppView.tsx:38). react-query's structural sharing usually keeps the data.application reference stable across polls, but each poll still re-renders the view, and because of item 2 that re-renders every node and edge and re-runs A* for every edge, twice a second, with no interaction at all. Anything that does hand the graph a new object identity with the same structure (switching focus between a parent app and a structurally identical sub-app, for example) re-triggers the layout effect (GraphView.tsx:332-343) on top of that: full dagre relayout, new node/edge arrays, and a fitView that resets the viewport.

The 300-action hang is just item 1 paid once for all 899 edges over a very large grid at initial render.

Steps to replicate behavior

  1. Run the script below to log synthetic applications into ~/.burr, then open the UI and open ui-perf-bench/bench-150n.
  2. Click or hover step rows and feel the freeze. Open bench-300n: the graph never renders.
gen_large_app.py
import argparse

from burr.core import ApplicationBuilder, State, action, default, expr


@action(reads=["counter"], writes=["counter"])
def step(state: State) -> State:
    return state.update(counter=state["counter"] + 1)


def build(n_nodes: int, extra_edges: int):
    names = [f"step_{i:03d}" for i in range(n_nodes)]
    transitions = []
    for i in range(n_nodes - 1):
        transitions.append((names[i], names[i + 1], default))
    for i in range(n_nodes):
        for k in range(1, extra_edges + 1):
            j = (i * 7 + k * 13 + 3) % n_nodes
            if j != i and j != i + 1:
                transitions.append((names[i], names[j], expr("counter < 0")))
    return names, transitions


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=int, default=150)
    parser.add_argument("--extra-edges", type=int, default=2)
    args = parser.parse_args()

    names, transitions = build(args.nodes, args.extra_edges)
    app = (
        ApplicationBuilder()
        .with_actions(**{name: step for name in names})
        .with_transitions(*transitions)
        .with_state(counter=0)
        .with_entrypoint(names[0])
        .with_identifiers(app_id=f"bench-{args.nodes}n")
        .with_tracker("local", project="ui-perf-bench")
        .build()
    )
    app.run(halt_after=[names[-1]])
    print(f"done: {args.nodes} nodes, {len(transitions)} edges")


if __name__ == "__main__":
    main()

Library & System Information

Reproduced on the apache-burr 0.40.2 release build (macOS, Chromium headless via Playwright). The per-edge getSmartEdge call in the edge render body is in both 0.40.2 and current main, so both are affected. The file isn't identical between the two though: #586 swapped the layout engine from elkjs to dagre after 0.40.2, and when I re-ran the same benchmark on a main source build it came out much slower than 0.40.2 (9.7s vs 0.5s per click at 75 actions, numbers in the first comment below). That fits the mechanism: the pathfinding grid covers the layout's bounding box, and dagre spreads the same graph over more area, so every edge's A* run costs more.

Expected behavior

The graph should repaint within a frame or two of a click or hover regardless of graph size, and a 300-action graph should render.

Additional context

Sketch of a fix, all local to GraphView.tsx:

  1. Memoize the getSmartEdge result per edge on geometry and node positions. Selection and hover don't move nodes, so highlighting stops paying for pathfinding entirely.
  2. Above some node count, skip getSmartEdge and take the existing getBezierPath fallback so big graphs render at all (the code already falls back to bezier when getSmartEdge returns null).
  3. Memoize the NodeStateProvider value in _Graph.
  4. Key the relayout effect on graph structure (action names and transitions) instead of object identity, so identity churn without a structural change (focus switches, responses that aren't reference-stable) stops triggering full relayouts.

Happy to submit a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions