Commit fd2f393

Roo Code authored and ruvnet committed
Reduce log noise, mount clustering routes, add node classification audit
- docker-compose: change RUST_LOG default from debug to info, with warn for noisy modules (neo4j_ontology_repository, graph_state_actor, actix_web)
- main.rs: mount clustering_handler routes at /api/clustering/*
- Add node classification audit documenting the 26-bit ID overflow bug where Neo4j IDs exceed NODE_ID_MASK, causing phantom agent/ontology nodes

Co-Authored-By: claude-flow <ruv@ruv.net>
1 parent 6ce1b05 commit fd2f393

4 files changed

Lines changed: 228 additions & 2 deletions

docker-compose.unified-with-neo4j.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -111,7 +111,7 @@ services:
       VITE_DEV_SERVER_PORT: ${VITE_DEV_SERVER_PORT:-5173}
       VITE_API_PORT: ${VITE_API_PORT:-4000}
       VITE_HMR_PORT: ${VITE_HMR_PORT:-24678}
-      RUST_LOG: ${RUST_LOG:-debug}
+      RUST_LOG: ${RUST_LOG:-info,webxr::adapters::neo4j_ontology_repository=warn,webxr::actors::graph_state_actor=warn,actix_web=warn}
       RUST_LOG_REDIRECT: ${RUST_LOG_REDIRECT:-true}
       DOCKER_ENV: ${DOCKER_ENV:-true}
     volumes:
```

docker-compose.unified.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -135,7 +135,7 @@ services:
       VITE_DEV_SERVER_PORT: ${VITE_DEV_SERVER_PORT:-5173}
       VITE_API_PORT: ${VITE_API_PORT:-4000}
       VITE_HMR_PORT: ${VITE_HMR_PORT:-24678}
-      RUST_LOG: ${RUST_LOG:-debug}
+      RUST_LOG: ${RUST_LOG:-info,webxr::adapters::neo4j_ontology_repository=warn,webxr::actors::graph_state_actor=warn,actix_web=warn}
       RUST_LOG_REDIRECT: ${RUST_LOG_REDIRECT:-true}
       DOCKER_ENV: ${DOCKER_ENV:-true}
       # Dev-only: bypass Nostr auth for settings writes (never set in production)
```
Lines changed: 223 additions & 0 deletions
# Node Classification Audit

## Problem 1: Node Type Misclassification (CONFIRMED)

### Root Cause

The binary protocol encodes node types in the high bits of a u32 node ID:

```
Bits 31-30: Node type flags (Agent=0x80000000, Knowledge=0x40000000)
Bits 28-26: Ontology sub-type flags (Class=0x04000000, Individual=0x08000000, Property=0x10000000)
Bits 25-0:  Actual node ID (NODE_ID_MASK = 0x03FFFFFF, max value 67,108,863)
```
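For orientation, here is a minimal sketch of how a wire value decomposes under this layout; the constants match the table above, but the `decode` helper is illustrative rather than the project's actual API (the ontology sub-type bits are omitted for brevity):

```rust
const AGENT_NODE_FLAG: u32 = 0x8000_0000;     // bit 31
const KNOWLEDGE_NODE_FLAG: u32 = 0x4000_0000; // bit 30
const NODE_ID_MASK: u32 = 0x03FF_FFFF;        // bits 25-0

/// Illustrative decode: split a wire value into (is_agent, is_knowledge, raw_id).
fn decode(wire: u32) -> (bool, bool, u32) {
    (
        wire & AGENT_NODE_FLAG != 0,
        wire & KNOWLEDGE_NODE_FLAG != 0,
        wire & NODE_ID_MASK,
    )
}
```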
The `set_agent_flag()` and `set_knowledge_flag()` functions mask the ID to 26 bits before OR-ing the flag:

```rust
pub fn set_agent_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | AGENT_NODE_FLAG // NODE_ID_MASK = 0x03FFFFFF
}
```

**The problem**: Node IDs come from Neo4j via `n.id` (a user-defined property, NOT Neo4j's internal `elementId`). The `id` field is stored as `i64` in Neo4j and cast to `u32`:

```rust
// neo4j_adapter.rs:339
let mut node = Node::new_with_id(metadata_id, Some(id as u32));
```

If the Neo4j `id` property value exceeds 67,108,863 (0x03FFFFFF), the high bits of the raw ID will collide with the flag region. When `set_knowledge_flag()` applies `node_id & NODE_ID_MASK`, it **truncates** the ID -- two different nodes can map to the same 26-bit ID, causing silent data corruption.
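The collision is easy to demonstrate in isolation. A self-contained sketch using the constants above (the ID values are chosen for illustration):

```rust
const NODE_ID_MASK: u32 = 0x03FF_FFFF;
const KNOWLEDGE_NODE_FLAG: u32 = 0x4000_0000;

fn set_knowledge_flag(node_id: u32) -> u32 {
    (node_id & NODE_ID_MASK) | KNOWLEDGE_NODE_FLAG
}

fn main() {
    let a: u32 = 42;               // small ID, fits in 26 bits
    let b: u32 = 42 + 0x0400_0000; // 42 + 2^26 = 67_108_906, exceeds the mask
    // Both encode to the same wire value: a silent collision.
    assert_eq!(set_knowledge_flag(a), set_knowledge_flag(b));
}
```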
But the **misclassification** scenario described (all nodes are `node_type: None` in Neo4j, yet the client sees 452 agents) has a different mechanism:

1. Neo4j stores `node_type: None` for all 934 nodes
2. `classify_node()` correctly puts them all in `knowledge_node_ids` (the `_ => knowledge` fallback)
3. `fetch_nodes()` (lines 97-103) checks `agent_set` then `knowledge_set`, and since all nodes are in `knowledge_set`, it calls `set_knowledge_flag(node.id)` for all of them
4. `set_knowledge_flag()` does `(node_id & 0x03FFFFFF) | 0x40000000`

If any `node.id` values already have bit 31 set (values >= 2,147,483,648), then after masking and OR-ing with KNOWLEDGE_NODE_FLAG (bit 30), the wire value has bit 30 set but NOT bit 31. The client correctly reads these as knowledge nodes.

**However**, if node IDs are generated by `NEXT_NODE_ID.fetch_add(1)` starting at 1, they will be small and fit in 26 bits. The issue arises only if:
- the Neo4j `id` property stores values > 67,108,863
- OR the `id` is cast from `i64` in a way that wraps

The actual client-side mismatch (452 agent, 256 knowledge, 207 ontology) suggests the **client is re-interpreting** truncated IDs. When `set_knowledge_flag()` truncates a node ID to 26 bits and adds bit 30, the client's `stringToU32()` FNV hash might produce a different mapping. Or, more likely, the client has its own classification logic that doesn't match the server's.
### Investigation Needed

1. **Query the actual Neo4j ID range**: Run `MATCH (n:GraphNode) RETURN min(n.id) AS min_id, max(n.id) AS max_id` to determine whether any IDs exceed 0x03FFFFFF (67,108,863)
2. **Check client classification**: The client may have independent classification in `graph.worker.ts` that overrides the binary flags
3. **Log flag application**: Add temporary logging in `fetch_nodes()` to show raw ID vs flagged ID for a sample of nodes (a sketch follows this list)
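A hypothetical version of that temporary logging; the loop shape and field names are assumptions about `fetch_nodes()`, not verified signatures. Logging at `warn` keeps the output visible under the new `info` default:

```rust
// Hypothetical audit instrumentation for item 3 above.
for node in nodes.iter().take(20) {
    let flagged = set_knowledge_flag(node.id);
    log::warn!(
        "audit: raw_id={:#010x} flagged={:#010x} overflows_mask={}",
        node.id,
        flagged,
        node.id > NODE_ID_MASK
    );
}
```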
### Recommended Fix: Server-Side ID Remapping

Add a `HashMap<u32, u32>` in `GraphStateActor` that maps original Neo4j IDs to compact sequential IDs (0..N). This guarantees all wire IDs fit in 26 bits.

```rust
use std::collections::HashMap;

// In GraphStateActor
struct GraphStateActor {
    // ... existing fields ...
    neo4j_to_wire: HashMap<u32, u32>, // neo4j_id -> compact_id (0..N)
    wire_to_neo4j: HashMap<u32, u32>, // compact_id -> neo4j_id
    next_wire_id: u32,
}

impl GraphStateActor {
    fn get_or_create_wire_id(&mut self, neo4j_id: u32) -> u32 {
        // Look up first, then insert, so the two maps and the counter
        // are borrowed disjointly on every Rust edition.
        if let Some(&id) = self.neo4j_to_wire.get(&neo4j_id) {
            return id;
        }
        let id = self.next_wire_id;
        self.next_wire_id += 1;
        self.neo4j_to_wire.insert(neo4j_id, id);
        self.wire_to_neo4j.insert(id, neo4j_id);
        id
    }
}
```

Apply the mapping in `fetch_nodes()` before calling `set_*_flag()`, and reverse-map in any handler that receives IDs from the client; a sketch of both directions follows.
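A hedged sketch of both directions, reusing `NODE_ID_MASK` and the flag helper from Problem 1; the surrounding loop, `node`, and `incoming_id` are assumed context rather than actual code:

```rust
// Outbound, inside the fetch_nodes() node loop: remap before flagging,
// so the value OR-ed with the flag is guaranteed to fit in 26 bits.
let wire_id = self.get_or_create_wire_id(node.id);
let encoded = set_knowledge_flag(wire_id);

// Inbound, for an ID received from a client: strip the flag bits,
// then reverse-map back to the original Neo4j ID.
let compact = incoming_id & NODE_ID_MASK;
if let Some(&neo4j_id) = self.wire_to_neo4j.get(&compact) {
    // ... apply the client's update to the node identified by neo4j_id ...
}
```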
**Impact**: All existing code that reads `node.id` from `graph_data` would use compact IDs on the wire. The mapping must be applied consistently in:
- `position_updates.rs:fetch_nodes()` (lines 97-103)
- `position_updates.rs:handle_request_full_snapshot()` (lines 186-190)
- `types.rs` (lines 389-391)
- `delta_encoding.rs` (lines 126-140)
- any handler receiving node IDs from client WebSocket messages

---
## Problem 2: Settings Changes Not Syncing Across Clients

### Current State

Settings updates flow through:
1. Client A sends an HTTP PUT to the `/api/settings/*` endpoints
2. `settings_handler` updates the `OptimizedSettingsActor` (or `SettingsActor`)
3. A `DomainEvent::PhysicsSettingsUpdated` event is emitted (`src/application/events.rs:25`)
4. **No WebSocket broadcast** of the new settings goes to other clients

There is **no** `BroadcastSettings` message type. The domain event `PhysicsSettingsUpdated` is defined, but there is no subscriber that pushes it to WebSocket clients.

The `SocketFlowServer` handles position broadcasts but has no settings broadcast path. Each client reads settings on initial connection via `GetGraphData` or the HTTP API, then stores them locally in Zustand.
### Recommended Fix

**Option A: Event-driven WebSocket push (preferred)**

1. Subscribe to `DomainEvent::PhysicsSettingsUpdated` in the WebSocket broadcast actor
2. When received, serialize the current settings and send a JSON text message to all connected clients:
   ```json
   {"type": "settingsUpdated", "settings": {...}}
   ```
3. Client-side: listen for the `settingsUpdated` message type and merge it into the Zustand store

**Option B: Piggyback on position updates**

Include a settings hash in the binary position protocol header. When the hash changes, clients request fresh settings via HTTP. This avoids adding a new message type but adds latency; a sketch of the hash follows.
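A hedged sketch of the Option B fingerprint; `DefaultHasher` stands in for whatever fixed hash the protocol would actually pin down (it is not stable across Rust releases, so a real implementation would use FNV, xxHash, or similar):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint the serialized settings; clients compare this value
/// from the position-update header against their last-seen hash.
fn settings_hash(settings_json: &str) -> u64 {
    let mut h = DefaultHasher::new();
    settings_json.hash(&mut h);
    h.finish()
}
```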
**Option A implementation sketch:**

In `SocketFlowServer` (or the `ClientCoordinatorActor`):

```rust
// Add a new actor message
pub struct BroadcastSettings {
    pub settings_json: String,
}

// In the settings update handler, after persisting:
if let Some(coordinator) = &app_state.client_coordinator_addr {
    coordinator.do_send(BroadcastSettings {
        settings_json: serde_json::to_string(&updated_settings).unwrap(),
    });
}
```
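On the receiving side, a sketch of the coordinator's broadcast handler. The `sessions` registry and the per-session `SendText` message are hypothetical; the real `ClientCoordinatorActor` internals may differ:

```rust
use actix::prelude::*;
use std::collections::HashMap;

#[derive(Message)]
#[rtype(result = "()")]
pub struct SendText(pub String); // hypothetical per-session text frame

#[derive(Message)]
#[rtype(result = "()")]
pub struct BroadcastSettings {
    pub settings_json: String,
}

pub struct ClientCoordinatorActor {
    sessions: HashMap<usize, Recipient<SendText>>, // hypothetical session registry
}

impl Actor for ClientCoordinatorActor {
    type Context = Context<Self>;
}

impl Handler<BroadcastSettings> for ClientCoordinatorActor {
    type Result = ();

    fn handle(&mut self, msg: BroadcastSettings, _ctx: &mut Context<Self>) {
        // Fan the settings JSON out to every connected WebSocket session.
        for session in self.sessions.values() {
            session.do_send(SendText(msg.settings_json.clone()));
        }
    }
}
```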
Files to modify:
- `src/actors/messages/settings_messages.rs` -- add `BroadcastSettings` message
- `src/actors/client_coordinator_actor.rs` -- handle broadcast to all connected WS sessions
- `src/handlers/settings_handler/mod.rs` -- emit broadcast after settings update
- `client/` -- add handler for `settingsUpdated` WS message type

---
## Problem 3: Live Claude Session Detection (Design)

### Architecture

A new Rust actor `ClaudeSessionMonitorActor` that:
1. Runs on a 10-second `IntervalFunc` timer
2. Shells out to `tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}"`
3. Parses output for Claude processes (command contains "claude")
4. Creates/updates/removes nodes via the existing `AddNode` CQRS command
### Node Schema

```json
{
  "id": "<auto-generated>",
  "metadata_id": "claude-session-<pid>",
  "label": "Claude: <window_name>",
  "node_type": "claude_session",
  "metadata": {
    "pid": "<pane_pid>",
    "status": "active",
    "window_name": "<window_name>",
    "detected_at": "<iso8601>"
  }
}
```
### API Surface

The existing REST API at `/api/graph-state/nodes` (POST) already supports adding nodes with arbitrary `node_type`. No new endpoints needed.

The `classify_node()` function needs a new match arm:

```rust
Some("claude_session") => {
    self.agent_node_ids.insert(node_id); // Treat as agent for visual purposes
}
```
### Implementation Plan

1. **New actor**: `src/actors/claude_session_monitor_actor.rs`
   - Timer-based polling of tmux
   - Maintains a `HashSet<u32>` of known session PIDs
   - On each tick: detect new sessions (add node), detect departed sessions (mark inactive or remove); a poll-tick sketch follows this plan

2. **Registration**: Start the actor in `main.rs` alongside other actors

3. **Client visual treatment**: The client can check `node_type === "claude_session"` in the node renderer to apply pulsing animation and a distinct color (e.g., green glow)

4. **Alternative (simpler)**: A shell script cron job that POSTs to `/api/graph-state/nodes`:

```bash
#!/bin/bash
tmux list-panes -a -F "#{pane_pid} #{pane_current_command} #{window_name}" | \
  grep -i claude | while read -r pid cmd wname; do
    curl -s -X POST http://localhost:4000/api/graph-state/nodes \
      -H "Content-Type: application/json" \
      -d "{\"node\":{\"metadata_id\":\"claude-session-$pid\",\"label\":\"Claude: $wname\",\"node_type\":\"claude_session\"}}"
done
```

This is simpler but doesn't handle session removal.
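A sketch of the actor's poll tick under the plan above. The diffing shape is illustrative; wiring the returned sets into `AddNode` (and the inactive/remove path) is left as described in item 1:

```rust
use std::collections::HashSet;
use std::process::Command;

/// One poll tick: list tmux panes, keep those whose current command
/// mentions "claude", and diff the PIDs against the previous tick.
/// Returns (newly seen PIDs, departed PIDs).
fn poll_claude_sessions(known: &mut HashSet<u32>) -> (Vec<u32>, Vec<u32>) {
    let output = Command::new("tmux")
        .args([
            "list-panes", "-a", "-F",
            "#{pane_pid} #{pane_current_command} #{window_name}",
        ])
        .output()
        .expect("failed to run tmux");

    let mut seen = HashSet::new();
    for line in String::from_utf8_lossy(&output.stdout).lines() {
        // Format: "<pid> <command> <window name>"; the window name may
        // contain spaces, hence splitn(3, ..).
        let mut parts = line.splitn(3, ' ');
        if let (Some(pid), Some(cmd)) = (parts.next(), parts.next()) {
            if cmd.to_lowercase().contains("claude") {
                if let Ok(pid) = pid.parse::<u32>() {
                    seen.insert(pid);
                }
            }
        }
    }

    let added = seen.difference(known).copied().collect();
    let removed = known.difference(&seen).copied().collect();
    *known = seen;
    (added, removed)
}
```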
### Recommendation

Start with the shell script approach for rapid validation, then promote to a Rust actor for production use. The Rust actor approach is more robust: it handles removal, integrates with the actor lifecycle, and has no external process dependency.

---
## Deliverable Summary

| # | Item | Status | Finding |
|---|------|--------|---------|
| 1 | Node ID overflow identification | CONFIRMED | IDs > 0x03FFFFFF (67M) will have bits truncated by NODE_ID_MASK, causing ID collisions and wrong type flags |
| 2 | Minimal fix for node IDs | DESIGNED | Server-side `HashMap<u32,u32>` remapping to sequential compact IDs in `GraphStateActor` |
| 3 | WebSocket settings broadcast | ABSENT | No broadcast mechanism exists; `PhysicsSettingsUpdated` event has no WS subscriber |
| 4 | Claude session detection | DESIGNED | Actor-based tmux polling with `AddNode` CQRS integration, or simpler shell script approach |
## Priority Order

1. **P0**: Fix node ID overflow (causes data corruption on the wire)
2. **P1**: Add settings broadcast (improves multi-client UX)
3. **P2**: Claude session detection (new feature, no existing breakage)

src/main.rs

Lines changed: 3 additions & 0 deletions
```diff
@@ -573,6 +573,9 @@ async fn main() -> std::io::Result<()> {
             .configure(bots_visualization_handler::configure_routes)
             .configure(graph_export_handler::configure_routes)
 
+            // GPU analytics: clustering, anomaly detection, stress optimisation
+            .configure(webxr::handlers::clustering_handler::config)
+
             // Ontology agent tools (MCP surface)
             .configure(webxr::handlers::configure_ontology_agent_routes)
```