# Node Classification Audit

## Problem 1: Node Type Misclassification (CONFIRMED)

### Root Cause

The binary protocol encodes node types in the high bits of a u32 node ID:

```
Bits 31-30: Node type flags (Agent=0x80000000, Knowledge=0x40000000)
Bits 28-26: Ontology sub-type flags (Class=0x04000000, Individual=0x08000000, Property=0x10000000)
Bits 25-0:  Actual node ID (NODE_ID_MASK = 0x03FFFFFF, max value 67,108,863)
```

The `set_agent_flag()` and `set_knowledge_flag()` functions mask the ID to 26 bits before OR-ing in the flag:

```rust
pub fn set_agent_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | AGENT_NODE_FLAG // NODE_ID_MASK = 0x03FFFFFF
}
```

**The problem**: Node IDs come from Neo4j via `n.id` (a user-defined property, NOT Neo4j's internal `elementId`). The `id` field is stored as `i64` in Neo4j and cast to `u32`:

```rust
// neo4j_adapter.rs:339
let mut node = Node::new_with_id(metadata_id, Some(id as u32));
```

If the Neo4j `id` property value exceeds 67,108,863 (0x03FFFFFF), the high bits of the raw ID collide with the flag region. When `set_knowledge_flag()` applies `node_id & NODE_ID_MASK`, it **truncates** the ID: two different nodes can map to the same 26-bit ID, causing silent data corruption.
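The truncation can be reproduced in isolation. The sketch below re-implements `set_knowledge_flag()` exactly as described (mask to 26 bits, then OR in the flag) and shows two distinct IDs collapsing onto one wire value:

```rust
const NODE_ID_MASK: u32 = 0x03FF_FFFF; // low 26 bits
const KNOWLEDGE_NODE_FLAG: u32 = 0x4000_0000; // bit 30

// Mirrors the server's set_knowledge_flag(): mask, then OR in the type flag.
fn set_knowledge_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | KNOWLEDGE_NODE_FLAG
}

fn main() {
    // Two distinct Neo4j IDs that differ only above bit 25...
    let a: u32 = 42;
    let b: u32 = 42 + 0x0400_0000; // 42 + 2^26
    // ...collapse to the same flagged wire value: a silent collision.
    assert_eq!(set_knowledge_flag(a), set_knowledge_flag(b));
    println!("{:#010x}", set_knowledge_flag(b)); // 0x4000002a
}
```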

But the **misclassification** scenario described (all nodes have `node_type: None` in Neo4j, yet the client sees 452 agents) has a different mechanism:

1. Neo4j stores `node_type: None` for all 934 nodes
2. `classify_node()` correctly puts them all in `knowledge_node_ids` (the `_ => knowledge` fallback)
3. `fetch_nodes()` (lines 97-103) checks `agent_set` then `knowledge_set`; since every node is in `knowledge_set`, it calls `set_knowledge_flag(node.id)` for all of them
4. `set_knowledge_flag()` computes `(node_id & 0x03FFFFFF) | 0x40000000`

If any `node.id` values already have bit 31 set (values >= 2,147,483,648), then after masking and OR-ing in KNOWLEDGE_NODE_FLAG (bit 30), the wire value has bit 30 set but NOT bit 31. The client correctly reads these as knowledge nodes.

**However**, if node IDs are generated by `NEXT_NODE_ID.fetch_add(1)` starting at 1, they will be small and fit in 26 bits. The issue arises only if:
- The Neo4j `id` property stores values > 67,108,863
- OR the `id` is cast from `i64` in a way that wraps

The actual client-side mismatch (452 agent, 256 knowledge, 207 ontology) suggests the **client is re-interpreting** truncated IDs. When `set_knowledge_flag()` truncates a node ID to 26 bits and adds bit 30, the client's `stringToU32()` FNV hash might produce a different mapping. Or, more likely, the client has its own classification logic that doesn't match the server's.

### Investigation Needed

1. **Query the actual Neo4j ID range**: Run `MATCH (n:GraphNode) RETURN min(n.id) AS min_id, max(n.id) AS max_id` to determine whether any IDs exceed 0x03FFFFFF (67,108,863)
2. **Check client classification**: The client may have independent classification in `graph.worker.ts` that overrides the binary flags
3. **Log flag application**: Add temporary logging in `fetch_nodes()` to show raw ID vs. flagged ID for a sample of nodes
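For item 3, a minimal helper for formatting that temporary log line (the function name and fields are illustrative, not existing code):

```rust
const NODE_ID_MASK: u32 = 0x03FF_FFFF;

// Hypothetical sketch: render raw vs. flagged IDs, and whether masking
// dropped any high bits, for a temporary diagnostic log line.
fn log_flag_application(raw_id: u32, flagged_id: u32) -> String {
    format!(
        "node raw_id={:#010x} masked={:#010x} flagged={:#010x} truncated={}",
        raw_id,
        raw_id & NODE_ID_MASK,
        flagged_id,
        raw_id > NODE_ID_MASK // true means the 26-bit mask lost information
    )
}

fn main() {
    let line = log_flag_application(42, 0x4000_002A);
    assert!(line.ends_with("truncated=false"));
    println!("{line}");
}
```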

### Recommended Fix: Server-Side ID Remapping

Add a `HashMap<u32, u32>` in `GraphStateActor` that maps original Neo4j IDs to compact sequential IDs (0..N). This guarantees all wire IDs fit in 26 bits.

```rust
// In GraphStateActor
struct GraphStateActor {
    // ... existing fields ...
    neo4j_to_wire: HashMap<u32, u32>, // neo4j_id -> compact_id (0..N)
    wire_to_neo4j: HashMap<u32, u32>, // compact_id -> neo4j_id
    next_wire_id: u32,
}

impl GraphStateActor {
    fn get_or_create_wire_id(&mut self, neo4j_id: u32) -> u32 {
        // entry().or_insert_with() cannot re-borrow `self` inside its closure,
        // so look up first, then insert into both directions of the mapping.
        if let Some(&id) = self.neo4j_to_wire.get(&neo4j_id) {
            return id;
        }
        let id = self.next_wire_id;
        self.next_wire_id += 1;
        self.neo4j_to_wire.insert(neo4j_id, id);
        self.wire_to_neo4j.insert(id, neo4j_id);
        id
    }
}
```

Apply the mapping in `fetch_nodes()` before calling `set_*_flag()`, and reverse-map in any handler that receives IDs from the client.
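As a standalone illustration of the round-trip guarantee (a plain struct stands in for the actor; `WireIdMap` is a hypothetical name):

```rust
use std::collections::HashMap;

// Self-contained stand-in for the actor's two maps and counter.
struct WireIdMap {
    neo4j_to_wire: HashMap<u32, u32>,
    wire_to_neo4j: HashMap<u32, u32>,
    next_wire_id: u32,
}

impl WireIdMap {
    fn new() -> Self {
        Self {
            neo4j_to_wire: HashMap::new(),
            wire_to_neo4j: HashMap::new(),
            next_wire_id: 0,
        }
    }

    fn get_or_create_wire_id(&mut self, neo4j_id: u32) -> u32 {
        if let Some(&id) = self.neo4j_to_wire.get(&neo4j_id) {
            return id;
        }
        let id = self.next_wire_id;
        self.next_wire_id += 1;
        self.neo4j_to_wire.insert(neo4j_id, id);
        self.wire_to_neo4j.insert(id, neo4j_id);
        id
    }
}

fn main() {
    let mut map = WireIdMap::new();
    // Even a Neo4j ID far above the 26-bit limit gets a small, stable wire ID.
    let wire = map.get_or_create_wire_id(3_000_000_000);
    assert_eq!(wire, 0);
    assert_eq!(map.get_or_create_wire_id(3_000_000_000), 0); // idempotent
    assert_eq!(map.wire_to_neo4j[&wire], 3_000_000_000); // client -> server reverse map
}
```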

**Impact**: All existing code that reads `node.id` from `graph_data` would use compact IDs on the wire. The mapping must be applied consistently in:
- `position_updates.rs:fetch_nodes()` (lines 97-103)
- `position_updates.rs:handle_request_full_snapshot()` (lines 186-190)
- `types.rs` (lines 389-391)
- `delta_encoding.rs` (lines 126-140)
- Any handler receiving node IDs from client WebSocket messages

---

## Problem 2: Settings Changes Not Syncing Across Clients

### Current State

Settings updates flow through:
1. Client A sends an HTTP PUT to the `/api/settings/*` endpoints
2. `settings_handler` updates the `OptimizedSettingsActor` (or `SettingsActor`)
3. A `DomainEvent::PhysicsSettingsUpdated` event is emitted (`src/application/events.rs:25`)
4. **No WebSocket broadcast** of the new settings is sent to other clients

There is **no** `BroadcastSettings` message type. The domain event `PhysicsSettingsUpdated` is defined, but there is no subscriber that pushes it to WebSocket clients.

The `SocketFlowServer` handles position broadcasts but has no settings broadcast path. Each client reads settings on initial connection via `GetGraphData` or the HTTP API, then stores them locally in Zustand.

### Recommended Fix

**Option A: Event-driven WebSocket push (preferred)**

1. Subscribe to `DomainEvent::PhysicsSettingsUpdated` in the WebSocket broadcast actor
2. When received, serialize the current settings and send a JSON text message to all connected clients:
   ```json
   {"type": "settingsUpdated", "settings": {...}}
   ```
3. Client-side: listen for the `settingsUpdated` message type and merge it into the Zustand store

**Option B: Piggyback on position updates**

Include a settings hash in the binary position protocol header. When the hash changes, clients request fresh settings via HTTP. This avoids adding a new message type but adds latency.
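A sketch of Option B's hash, assuming the server hashes the serialized settings and embeds the 64-bit value in the header (the hasher choice is an assumption; any hash that stays stable for the lifetime of the server process works, since clients only compare successive values):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a version hash from the serialized settings JSON. Clients keep the
// last value seen in the header and re-fetch settings whenever it changes.
fn settings_hash(settings_json: &str) -> u64 {
    let mut h = DefaultHasher::new();
    settings_json.hash(&mut h);
    h.finish()
}

fn main() {
    let v1 = settings_hash(r#"{"gravity":-1.2}"#);
    let v2 = settings_hash(r#"{"gravity":-2.0}"#);
    assert_ne!(v1, v2); // a changed setting changes the header hash
    assert_eq!(v1, settings_hash(r#"{"gravity":-1.2}"#)); // stable when unchanged
}
```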

**Option A implementation sketch:**

In `SocketFlowServer` (or the `ClientCoordinatorActor`):
```rust
// Add a new actor message (actix messages need a Message impl; derive shown)
#[derive(actix::Message)]
#[rtype(result = "()")]
pub struct BroadcastSettings {
    pub settings_json: String,
}

// In the settings update handler, after persisting:
if let Some(coordinator) = &app_state.client_coordinator_addr {
    coordinator.do_send(BroadcastSettings {
        settings_json: serde_json::to_string(&updated_settings)
            .expect("settings serialize to JSON"),
    });
}
```

Files to modify:
- `src/actors/messages/settings_messages.rs` -- add the `BroadcastSettings` message
- `src/actors/client_coordinator_actor.rs` -- handle the broadcast to all connected WS sessions
- `src/handlers/settings_handler/mod.rs` -- emit the broadcast after a settings update
- `client/` -- add a handler for the `settingsUpdated` WS message type

---

## Problem 3: Live Claude Session Detection (Design)

### Architecture

A new Rust actor, `ClaudeSessionMonitorActor`, that:
1. Runs on a 10-second `IntervalFunc` timer
2. Shells out to `tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}"`
3. Parses the output for Claude processes (command contains "claude")
4. Creates/updates/removes nodes via the existing `AddNode` CQRS command
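Step 3's parsing can be sketched with plain string splitting; `splitn(3, ' ')` keeps window names containing spaces intact because `window_name` is the final field (`parse_pane_line` and `PaneInfo` are hypothetical names):

```rust
#[derive(Debug, PartialEq)]
struct PaneInfo {
    pid: u32,
    command: String,
    window_name: String,
}

// Parse one line of `tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}"`.
fn parse_pane_line(line: &str) -> Option<PaneInfo> {
    let mut parts = line.splitn(3, ' ');
    let pid = parts.next()?.parse().ok()?;
    let command = parts.next()?.to_string();
    let window_name = parts.next()?.to_string();
    Some(PaneInfo { pid, command, window_name })
}

fn main() {
    let pane = parse_pane_line("41523 claude my project window").unwrap();
    assert_eq!(pane.pid, 41523);
    assert!(pane.command.contains("claude")); // the step-3 filter
    assert_eq!(pane.window_name, "my project window"); // spaces preserved
}
```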

### Node Schema

```json
{
  "id": <auto-generated>,
  "metadata_id": "claude-session-<pid>",
  "label": "Claude: <window_name>",
  "node_type": "claude_session",
  "metadata": {
    "pid": "<pane_pid>",
    "status": "active",
    "window_name": "<window_name>",
    "detected_at": "<iso8601>"
  }
}
```

### API Surface

The existing REST API at `/api/graph-state/nodes` (POST) already supports adding nodes with arbitrary `node_type`. No new endpoints needed.

The `classify_node()` function needs a new match arm:
```rust
Some("claude_session") => {
    self.agent_node_ids.insert(node_id); // Treat as agent for visual purposes
}
```

### Implementation Plan

1. **New actor**: `src/actors/claude_session_monitor_actor.rs`
   - Timer-based polling of tmux
   - Maintains a `HashSet<u32>` of known session PIDs
   - On each tick: detect new sessions (add node), detect departed sessions (mark inactive or remove)

2. **Registration**: Start the actor in `main.rs` alongside the other actors

3. **Client visual treatment**: The client can check `node_type === "claude_session"` in the node renderer to apply a pulsing animation and a distinct color (e.g., a green glow)

4. **Alternative (simpler)**: A shell script cron job that POSTs to `/api/graph-state/nodes`:
   ```bash
   #!/bin/bash
   # NOTE: window names containing double quotes would break the hand-built
   # JSON payload below; pipe through jq for robust escaping if that matters.
   tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}" | \
     grep -i claude | while read -r pid cmd wname; do
       curl -s -X POST http://localhost:4000/api/graph-state/nodes \
         -H "Content-Type: application/json" \
         -d "{\"node\":{\"metadata_id\":\"claude-session-$pid\",\"label\":\"Claude: $wname\",\"node_type\":\"claude_session\"}}"
   done
   ```
   This is simpler but doesn't handle session removal.
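In the actor variant, the per-tick reconciliation that the script cannot do reduces to two set differences over PIDs (`diff_sessions` is a hypothetical helper):

```rust
use std::collections::HashSet;

// Given last tick's known session PIDs and this tick's observed PIDs,
// return (appeared, departed).
fn diff_sessions(known: &HashSet<u32>, current: &HashSet<u32>) -> (Vec<u32>, Vec<u32>) {
    let appeared = current.difference(known).copied().collect();
    let departed = known.difference(current).copied().collect();
    (appeared, departed)
}

fn main() {
    let known: HashSet<u32> = [100, 200].into_iter().collect();
    let current: HashSet<u32> = [200, 300].into_iter().collect();
    let (appeared, departed) = diff_sessions(&known, &current);
    assert_eq!(appeared, vec![300]); // new session -> AddNode
    assert_eq!(departed, vec![100]); // gone -> mark inactive or remove
}
```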

### Recommendation

Start with the shell script approach for rapid validation, then promote it to a Rust actor for production use. The Rust actor approach is more robust (it handles removal, integrates with the actor lifecycle, and has no external process dependency).

---

## Deliverable Summary

| # | Item | Status | Finding |
|---|------|--------|---------|
| 1 | Node ID overflow identification | CONFIRMED | IDs > 0x03FFFFFF (67M) will have bits truncated by NODE_ID_MASK, causing ID collisions and wrong type flags |
| 2 | Minimal fix for node IDs | DESIGNED | Server-side `HashMap<u32,u32>` remapping to sequential compact IDs in GraphStateActor |
| 3 | WebSocket settings broadcast | ABSENT | No broadcast mechanism exists; `PhysicsSettingsUpdated` event has no WS subscriber |
| 4 | Claude session detection | DESIGNED | Actor-based tmux polling with AddNode CQRS integration, or a simpler shell script approach |

## Priority Order

1. **P0**: Fix the node ID overflow (causes data corruption on the wire)
2. **P1**: Add the settings broadcast (improves multi-client UX)
3. **P2**: Claude session detection (new feature, no existing breakage)