Commit fd2f393

Roo Code authored and ruvnet committed
Reduce log noise, mount clustering routes, add node classification audit
- docker-compose: change RUST_LOG default from debug to info, with warn for noisy modules (neo4j_ontology_repository, graph_state_actor, actix_web)
- main.rs: mount clustering_handler routes at /api/clustering/*
- Add node classification audit documenting the 26-bit ID overflow bug where Neo4j IDs exceed NODE_ID_MASK, causing phantom agent/ontology nodes

Co-Authored-By: claude-flow <ruv@ruv.net>
1 parent 6ce1b05 commit fd2f393

4 files changed

Lines changed: 228 additions & 2 deletions

docker-compose.unified-with-neo4j.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -111,7 +111,7 @@ services:
       VITE_DEV_SERVER_PORT: ${VITE_DEV_SERVER_PORT:-5173}
       VITE_API_PORT: ${VITE_API_PORT:-4000}
       VITE_HMR_PORT: ${VITE_HMR_PORT:-24678}
-      RUST_LOG: ${RUST_LOG:-debug}
+      RUST_LOG: ${RUST_LOG:-info,webxr::adapters::neo4j_ontology_repository=warn,webxr::actors::graph_state_actor=warn,actix_web=warn}
       RUST_LOG_REDIRECT: ${RUST_LOG_REDIRECT:-true}
       DOCKER_ENV: ${DOCKER_ENV:-true}
     volumes:
```

docker-compose.unified.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -135,7 +135,7 @@ services:
       VITE_DEV_SERVER_PORT: ${VITE_DEV_SERVER_PORT:-5173}
       VITE_API_PORT: ${VITE_API_PORT:-4000}
       VITE_HMR_PORT: ${VITE_HMR_PORT:-24678}
-      RUST_LOG: ${RUST_LOG:-debug}
+      RUST_LOG: ${RUST_LOG:-info,webxr::adapters::neo4j_ontology_repository=warn,webxr::actors::graph_state_actor=warn,actix_web=warn}
       RUST_LOG_REDIRECT: ${RUST_LOG_REDIRECT:-true}
       DOCKER_ENV: ${DOCKER_ENV:-true}
       # Dev-only: bypass Nostr auth for settings writes (never set in production)
```
Lines changed: 223 additions & 0 deletions
# Node Classification Audit

## Problem 1: Node Type Misclassification (CONFIRMED)

### Root Cause

The binary protocol encodes node types in the high bits of a u32 node ID:

```
Bits 31-30: Node type flags (Agent=0x80000000, Knowledge=0x40000000)
Bits 28-26: Ontology sub-type flags (Class=0x04000000, Individual=0x08000000, Property=0x10000000)
Bits 25-0:  Actual node ID (NODE_ID_MASK = 0x03FFFFFF, max value 67,108,863)
```
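For orientation, here is a minimal sketch of how a wire value decomposes under this layout; the constants match the table above, but the `decode` helper is illustrative rather than the project's actual API (the ontology sub-type bits are omitted for brevity):

```rust
const AGENT_NODE_FLAG: u32 = 0x8000_0000;     // bit 31
const KNOWLEDGE_NODE_FLAG: u32 = 0x4000_0000; // bit 30
const NODE_ID_MASK: u32 = 0x03FF_FFFF;        // bits 25-0

/// Illustrative decode: split a wire value into (is_agent, is_knowledge, raw_id).
fn decode(wire: u32) -> (bool, bool, u32) {
    (
        wire & AGENT_NODE_FLAG != 0,
        wire & KNOWLEDGE_NODE_FLAG != 0,
        wire & NODE_ID_MASK,
    )
}
```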
The `set_agent_flag()` and `set_knowledge_flag()` functions mask the ID to 26 bits before OR-ing the flag:

```rust
pub fn set_agent_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | AGENT_NODE_FLAG // NODE_ID_MASK = 0x03FFFFFF
}
```

**The problem**: Node IDs come from Neo4j via `n.id` (a user-defined property, NOT Neo4j's internal `elementId`). The `id` field is stored as `i64` in Neo4j and cast to `u32`:

```rust
// neo4j_adapter.rs:339
let mut node = Node::new_with_id(metadata_id, Some(id as u32));
```

If the Neo4j `id` property value exceeds 67,108,863 (0x03FFFFFF), the high bits of the raw ID will collide with the flag region. When `set_knowledge_flag()` applies `node_id & NODE_ID_MASK`, it **truncates** the ID -- two different nodes can map to the same 26-bit ID, causing silent data corruption.
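The collision is easy to demonstrate in isolation. A self-contained sketch using the constants above (the ID values are chosen for illustration):

```rust
const NODE_ID_MASK: u32 = 0x03FF_FFFF;
const KNOWLEDGE_NODE_FLAG: u32 = 0x4000_0000;

fn set_knowledge_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | KNOWLEDGE_NODE_FLAG
}

fn main() {
    let a: u32 = 42;               // small ID, fits in 26 bits
    let b: u32 = 42 + 0x0400_0000; // 42 + 2^26 = 67_108_906, exceeds the mask
    // Both encode to the same wire value: a silent collision.
    assert_eq!(set_knowledge_flag(a), set_knowledge_flag(b));
}
```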
But the **misclassification** scenario described (all nodes are `node_type: None` in Neo4j, yet the client sees 452 agents) has a different mechanism:

1. Neo4j stores `node_type: None` for all 934 nodes
2. `classify_node()` correctly puts them all in `knowledge_node_ids` (the `_ => knowledge` fallback)
3. `fetch_nodes()` (lines 97-103) checks `agent_set` then `knowledge_set`, and since all nodes are in `knowledge_set`, it calls `set_knowledge_flag(node.id)` for all of them
4. `set_knowledge_flag()` does `(node_id & 0x03FFFFFF) | 0x40000000`

If any `node.id` values already have bit 31 set (values >= 2,147,483,648), then after masking and OR-ing with KNOWLEDGE_NODE_FLAG (bit 30), the wire value has bit 30 set but NOT bit 31. The client correctly reads these as knowledge nodes.

**However**, if node IDs are generated by `NEXT_NODE_ID.fetch_add(1)` starting at 1, they will be small and fit in 26 bits. The issue arises only if:
- the Neo4j `id` property stores values > 67,108,863
- OR the `id` is cast from `i64` in a way that wraps

The actual client-side mismatch (452 agent, 256 knowledge, 207 ontology) suggests the **client is re-interpreting** truncated IDs. When `set_knowledge_flag()` truncates a node ID to 26 bits and adds bit 30, the client's `stringToU32()` FNV hash might produce a different mapping. Or, more likely, the client has its own classification logic that doesn't match the server's.
### Investigation Needed

1. **Query the actual Neo4j ID range**: Run `MATCH (n:GraphNode) RETURN min(n.id) AS min_id, max(n.id) AS max_id` to determine whether any IDs exceed 0x03FFFFFF (67,108,863)
2. **Check client classification**: The client may have independent classification in `graph.worker.ts` that overrides the binary flags
3. **Log flag application**: Add temporary logging in `fetch_nodes()` to show raw ID vs flagged ID for a sample of nodes (a sketch follows this list)
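A hypothetical version of that temporary logging; the loop shape and field names are assumptions about `fetch_nodes()`, not verified signatures. Logging at `warn` keeps the output visible under the new `info` default:

```rust
// Hypothetical audit instrumentation for item 3 above.
for node in nodes.iter().take(20) {
    let flagged = set_knowledge_flag(node.id);
    log::warn!(
        "audit: raw_id={:#010x} flagged={:#010x} overflows_mask={}",
        node.id,
        flagged,
        node.id > NODE_ID_MASK
    );
}
```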
### Recommended Fix: Server-Side ID Remapping

Add a `HashMap<u32, u32>` in `GraphStateActor` that maps original Neo4j IDs to compact sequential IDs (0..N). This guarantees all wire IDs fit in 26 bits.

```rust
use std::collections::HashMap;

// In GraphStateActor
struct GraphStateActor {
    // ... existing fields ...
    neo4j_to_wire: HashMap<u32, u32>, // neo4j_id -> compact_id (0..N)
    wire_to_neo4j: HashMap<u32, u32>, // compact_id -> neo4j_id
    next_wire_id: u32,
}

impl GraphStateActor {
    fn get_or_create_wire_id(&mut self, neo4j_id: u32) -> u32 {
        // Look up first, then insert, so the two maps and the counter
        // are borrowed disjointly on every Rust edition.
        if let Some(&id) = self.neo4j_to_wire.get(&neo4j_id) {
            return id;
        }
        let id = self.next_wire_id;
        self.next_wire_id += 1;
        self.neo4j_to_wire.insert(neo4j_id, id);
        self.wire_to_neo4j.insert(id, neo4j_id);
        id
    }
}
```

Apply the mapping in `fetch_nodes()` before calling `set_*_flag()`, and reverse-map in any handler that receives IDs from the client; a sketch of both directions follows.
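A hedged sketch of both directions, reusing `NODE_ID_MASK` and the flag helper from Problem 1; the surrounding loop, `node`, and `incoming_id` are assumed context rather than actual code:

```rust
// Outbound, inside the fetch_nodes() node loop: remap before flagging,
// so the value OR-ed with the flag is guaranteed to fit in 26 bits.
let wire_id = self.get_or_create_wire_id(node.id);
let encoded = set_knowledge_flag(wire_id);

// Inbound, for an ID received from a client: strip the flag bits,
// then reverse-map back to the original Neo4j ID.
let compact = incoming_id & NODE_ID_MASK;
if let Some(&neo4j_id) = self.wire_to_neo4j.get(&compact) {
    // ... apply the client's update to the node identified by neo4j_id ...
}
```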
**Impact**: All existing code that reads `node.id` from `graph_data` would use compact IDs on the wire. The mapping must be applied consistently in:
- `position_updates.rs:fetch_nodes()` (lines 97-103)
- `position_updates.rs:handle_request_full_snapshot()` (lines 186-190)
- `types.rs` (lines 389-391)
- `delta_encoding.rs` (lines 126-140)
- any handler receiving node IDs from client WebSocket messages

---
## Problem 2: Settings Changes Not Syncing Across Clients

### Current State

Settings updates flow through:
1. Client A sends an HTTP PUT to the `/api/settings/*` endpoints
2. `settings_handler` updates the `OptimizedSettingsActor` (or `SettingsActor`)
3. A `DomainEvent::PhysicsSettingsUpdated` event is emitted (`src/application/events.rs:25`)
4. **No WebSocket broadcast** of the new settings goes to other clients

There is **no** `BroadcastSettings` message type. The domain event `PhysicsSettingsUpdated` is defined, but there is no subscriber that pushes it to WebSocket clients.

The `SocketFlowServer` handles position broadcasts but has no settings broadcast path. Each client reads settings on initial connection via `GetGraphData` or the HTTP API, then stores them locally in Zustand.
### Recommended Fix

**Option A: Event-driven WebSocket push (preferred)**

1. Subscribe to `DomainEvent::PhysicsSettingsUpdated` in the WebSocket broadcast actor
2. When received, serialize the current settings and send a JSON text message to all connected clients:
   ```json
   {"type": "settingsUpdated", "settings": {...}}
   ```
3. Client-side: listen for the `settingsUpdated` message type and merge it into the Zustand store

**Option B: Piggyback on position updates**

Include a settings hash in the binary position protocol header. When the hash changes, clients request fresh settings via HTTP. This avoids adding a new message type but adds latency; a sketch of the hash follows.
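A hedged sketch of the Option B fingerprint; `DefaultHasher` stands in for whatever fixed hash the protocol would actually pin down (it is not stable across Rust releases, so a real implementation would use FNV, xxHash, or similar):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint the serialized settings; clients compare this value
/// from the position-update header against their last-seen hash.
fn settings_hash(settings_json: &str) -> u64 {
    let mut h = DefaultHasher::new();
    settings_json.hash(&mut h);
    h.finish()
}
```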
**Option A implementation sketch:**

In `SocketFlowServer` (or the `ClientCoordinatorActor`):

```rust
// Add a new actor message
pub struct BroadcastSettings {
    pub settings_json: String,
}

// In the settings update handler, after persisting:
if let Some(coordinator) = &app_state.client_coordinator_addr {
    coordinator.do_send(BroadcastSettings {
        settings_json: serde_json::to_string(&updated_settings).unwrap(),
    });
}
```
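On the receiving side, a sketch of the coordinator's broadcast handler. The `sessions` registry and the per-session `SendText` message are hypothetical; the real `ClientCoordinatorActor` internals may differ:

```rust
use actix::prelude::*;
use std::collections::HashMap;

#[derive(Message)]
#[rtype(result = "()")]
pub struct SendText(pub String); // hypothetical per-session text frame

#[derive(Message)]
#[rtype(result = "()")]
pub struct BroadcastSettings {
    pub settings_json: String,
}

pub struct ClientCoordinatorActor {
    sessions: HashMap<usize, Recipient<SendText>>, // hypothetical session registry
}

impl Actor for ClientCoordinatorActor {
    type Context = Context<Self>;
}

impl Handler<BroadcastSettings> for ClientCoordinatorActor {
    type Result = ();

    fn handle(&mut self, msg: BroadcastSettings, _ctx: &mut Context<Self>) {
        // Fan the settings JSON out to every connected WebSocket session.
        for session in self.sessions.values() {
            session.do_send(SendText(msg.settings_json.clone()));
        }
    }
}
```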
Files to modify:
- `src/actors/messages/settings_messages.rs` -- add `BroadcastSettings` message
- `src/actors/client_coordinator_actor.rs` -- handle broadcast to all connected WS sessions
- `src/handlers/settings_handler/mod.rs` -- emit broadcast after settings update
- `client/` -- add handler for `settingsUpdated` WS message type

---
## Problem 3: Live Claude Session Detection (Design)

### Architecture

A new Rust actor `ClaudeSessionMonitorActor` that:
1. Runs on a 10-second `IntervalFunc` timer
2. Shells out to `tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}"`
3. Parses output for Claude processes (command contains "claude")
4. Creates/updates/removes nodes via the existing `AddNode` CQRS command
### Node Schema

```json
{
  "id": "<auto-generated>",
  "metadata_id": "claude-session-<pid>",
  "label": "Claude: <window_name>",
  "node_type": "claude_session",
  "metadata": {
    "pid": "<pane_pid>",
    "status": "active",
    "window_name": "<window_name>",
    "detected_at": "<iso8601>"
  }
}
```
### API Surface

The existing REST API at `/api/graph-state/nodes` (POST) already supports adding nodes with arbitrary `node_type`. No new endpoints needed.

The `classify_node()` function needs a new match arm:

```rust
Some("claude_session") => {
    self.agent_node_ids.insert(node_id); // Treat as agent for visual purposes
}
```
### Implementation Plan

1. **New actor**: `src/actors/claude_session_monitor_actor.rs`
   - Timer-based polling of tmux
   - Maintains a `HashSet<u32>` of known session PIDs
   - On each tick: detect new sessions (add node), detect departed sessions (mark inactive or remove); a poll-tick sketch follows this plan

2. **Registration**: Start the actor in `main.rs` alongside other actors

3. **Client visual treatment**: The client can check `node_type === "claude_session"` in the node renderer to apply pulsing animation and a distinct color (e.g., green glow)

4. **Alternative (simpler)**: A shell script cron job that POSTs to `/api/graph-state/nodes`:

```bash
#!/bin/bash
tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}" | \
  grep -i claude | while read -r pid cmd wname; do
    curl -s -X POST http://localhost:4000/api/graph-state/nodes \
      -H "Content-Type: application/json" \
      -d "{\"node\":{\"metadata_id\":\"claude-session-$pid\",\"label\":\"Claude: $wname\",\"node_type\":\"claude_session\"}}"
done
```

This is simpler but doesn't handle session removal.
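A sketch of the actor's poll tick under the plan above. The diffing shape is illustrative; wiring the returned sets into `AddNode` (and the inactive/remove path) is left as described in item 1:

```rust
use std::collections::HashSet;
use std::process::Command;

/// One poll tick: list tmux panes, keep those whose current command
/// mentions "claude", and diff the PIDs against the previous tick.
/// Returns (newly seen PIDs, departed PIDs).
fn poll_claude_sessions(known: &mut HashSet<u32>) -> (Vec<u32>, Vec<u32>) {
    let output = Command::new("tmux")
        .args([
            "list-panes", "-a", "-F",
            "#{pane_pid} #{pane_current_command} #{window_name}",
        ])
        .output()
        .expect("failed to run tmux");

    let mut seen = HashSet::new();
    for line in String::from_utf8_lossy(&output.stdout).lines() {
        // Format: "<pid> <command> <window name>"; the window name may
        // contain spaces, hence splitn(3, ..).
        let mut parts = line.splitn(3, ' ');
        if let (Some(pid), Some(cmd)) = (parts.next(), parts.next()) {
            if cmd.to_lowercase().contains("claude") {
                if let Ok(pid) = pid.parse::<u32>() {
                    seen.insert(pid);
                }
            }
        }
    }

    let added = seen.difference(known).copied().collect();
    let removed = known.difference(&seen).copied().collect();
    *known = seen;
    (added, removed)
}
```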
### Recommendation

Start with the shell script approach for rapid validation, then promote to a Rust actor for production use. The Rust actor approach is more robust: it handles removal, integrates with the actor lifecycle, and has no external process dependency.

---
## Deliverable Summary

| # | Item | Status | Finding |
|---|------|--------|---------|
| 1 | Node ID overflow identification | CONFIRMED | IDs > 0x03FFFFFF (67M) will have bits truncated by NODE_ID_MASK, causing ID collisions and wrong type flags |
| 2 | Minimal fix for node IDs | DESIGNED | Server-side `HashMap<u32,u32>` remapping to sequential compact IDs in `GraphStateActor` |
| 3 | WebSocket settings broadcast | ABSENT | No broadcast mechanism exists; `PhysicsSettingsUpdated` event has no WS subscriber |
| 4 | Claude session detection | DESIGNED | Actor-based tmux polling with `AddNode` CQRS integration, or simpler shell script approach |
## Priority Order

1. **P0**: Fix node ID overflow (causes data corruption on the wire)
2. **P1**: Add settings broadcast (improves multi-client UX)
3. **P2**: Claude session detection (new feature, no existing breakage)

src/main.rs

Lines changed: 3 additions & 0 deletions
```diff
@@ -573,6 +573,9 @@ async fn main() -> std::io::Result<()> {
             .configure(bots_visualization_handler::configure_routes)
             .configure(graph_export_handler::configure_routes)
 
+            // GPU analytics: clustering, anomaly detection, stress optimisation
+            .configure(webxr::handlers::clustering_handler::config)
+
             // Ontology agent tools (MCP surface)
             .configure(webxr::handlers::configure_ontology_agent_routes)
```