The Fleet Management application includes a built-in observability system for monitoring, debugging, and identifying performance bottlenecks in production. The system uses tiered debug levels to control overhead, allowing safe use even on large instances with thousands of devices.
For Docker container-level logs, the deploy script supports an optional Dozzle integration:
```bash
# Public deploy
./deploy/deploy-public.sh up --logging

# Internal deploy
./deploy/deploy.sh up --env local --with logging
```

Dozzle runs on port 9999 and provides a browser-based, real-time view of all container logs. It is read-only and zero-config: it needs no configuration files or persistent storage and reads logs directly from the Docker socket.
This complements the application-level observability system described below, which provides structured metrics, counters, and timings from inside Fleet Manager.
| Level | Name | Overhead | What It Shows | Production Safe? |
|---|---|---|---|---|
| 0 | OFF | Zero | Logs only | Yes |
| 1 | Light | Negligible | System vitals + module gauges | Yes |
| 2 | Medium | Low | Light + counters + RPC/DB timings | Yes |
| 3 | Full | Moderate | Medium + WS msg/s + pending RPCs + client ring buffer | Use with caution |
Key design principle: each recording function starts with `if (level < requiredTier) return;`, a single integer comparison, so disabled tiers have effectively zero overhead. Module stat getters only run when `/health` is polled, not on every message.
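For illustration, a minimal sketch of that guard (the internals here are hypothetical, not the actual `Observability.ts` implementation):

```ts
// Minimal sketch of the tier-gating pattern; internals are hypothetical.
let level = 0; // current observability tier (0 = OFF ... 3 = Full)

const counters = new Map<string, number>();

function incrementCounter(name: string): void {
  if (level < 2) return; // one integer comparison; a no-op below Tier 2
  counters.set(name, (counters.get(name) ?? 0) + 1);
}
```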
Click the level button in the toolbar. It cycles: OFF → Light → Medium → Full → OFF. The button changes color to indicate the current tier:
- OFF: gray
- Light: blue
- Medium: yellow
- Full: red
```js
window.fmObservability(2) // Set to Medium
window.fmObservability(0) // Turn off
```

Set `FM_OBSERVABILITY=true` in the deployment config to start with Level 2 (Medium) enabled by default.
```
fm_obs_level = '0' | '1' | '2' | '3'
```
```bash
# Set level
curl -X POST /health/observability -H 'Content-Type: application/json' -d '{"level": 2}'

# Legacy boolean (backward compatible)
curl -X POST /health/observability -H 'Content-Type: application/json' -d '{"enabled": true}'

# Reset all timings and counters
curl -X POST /health/observability/reset

# Get current metrics
curl /health
```

The backend module is located at `backend/src/modules/Observability.ts`. It provides central metrics collection with three mechanisms:
- **Module Getters (Tier 1+):** Modules register lightweight stat-getter functions that read `.size` from existing Maps: O(1), no iteration. They are only called when `/health` is polled.
- **Incremental Counters (Tier 2+):** One-line `incrementCounter('name')` calls at existing code points. Cumulative values; they never iterate device lists.
- **Timing Maps (Tier 2+):** Per-RPC-method and per-DB-method timing stats (count, avgMs, maxMs); see the sketch after this list.
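A hedged sketch of per-method timing aggregation (names are hypothetical, not the actual module internals): a running average keeps count/avgMs/maxMs without storing individual samples.

```ts
// Hypothetical sketch of per-method timing aggregation (count, avgMs, maxMs).
interface TimingStat { count: number; avgMs: number; maxMs: number }

let level = 2; // current observability tier, as in the gating sketch above

const rpcTimings = new Map<string, TimingStat>();

function recordRpcTiming(method: string, durationMs: number): void {
  if (level < 2) return; // Tier 2+ only
  const s = rpcTimings.get(method) ?? { count: 0, avgMs: 0, maxMs: 0 };
  s.count += 1;
  s.avgMs += (durationMs - s.avgMs) / s.count; // incremental mean, no sample history
  s.maxMs = Math.max(s.maxMs, durationMs);
  rpcTimings.set(method, s);
}
```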
| Module | Stats | Source |
|---|---|---|
| `devices` | `total` (device count) | `DeviceCollector.ts` |
| `events` | `listeners`, `eventTypes`, `groupCacheSize`, `groupVersion` | `EventDistributor.ts` |
| `statusQueue` | `pending`, `queueSize`, `flushing` | `ShellyMessageHandler.ts` |
| `audit` | `queueLength` | `AuditLogger.ts` |
| `deviceInit` | `active`, `queued` | `ShellyWebsocketHandler.ts` |
| `commander` | `registered` (component count) | `Commander.ts` |
| `dbPool` | `total`, `idle`, `waiting` (connection pool stats) | Database pool (if the pg pool is accessible) |
| Counter | Incremented In | Tier |
|---|---|---|
| `devices_connected` | `DeviceCollector.register()` | 2 |
| `devices_disconnected` | `DeviceCollector.deleteDevice()` | 2 |
| `status_messages` | `ShellyMessageHandler.statusSelectivePush()` | 2 |
| `status_flushes` | `ShellyMessageHandler` flush interval | 2 |
| `audit_entries` | `AuditLogger.log()` | 2 |
| `audit_flushes` | `AuditLogger.flushAuditLogQueue()` | 2 |
| `ws_connections` | `ClientWebsocketHandler` on connect | 2 |
| `ws_disconnections` | `ClientWebsocketHandler` on close | 2 |
| `device_inits_started` | `ShellyWebsocketHandler.acquireInitSlot()` | 2 |
| `device_inits_completed` | `ShellyWebsocketHandler.releaseInitSlot()` | 2 |
| `device_inits_failed` | `ShellyWebsocketHandler` error catch | 2 |
| `events_broadcast` | `EventDistributor.notifyAll()` | 2 |
| `rpc_success` | `Commander.exec()` success | 2 |
| `rpc_errors` | `Commander.exec()` error | 2 |
`GET /health` returns metrics based on the current observability level:
```json
{
  "online": true,
  "version": "x.y.z",
  "metrics": {
    "level": 2,
    "uptimeS": 3600,
    "eventLoopLagMs": 2,
    "memory": { "rssM": 150, "heapUsedM": 80, "heapTotalM": 128, "heapTrend": "stable" },
    "wsClients": 5,
    "modules": {
      "devices": { "total": 500 },
      "events": { "listeners": 45, "eventTypes": 8, "groupCacheSize": 450 },
      "statusQueue": { "pending": 0, "queueSize": 15, "flushing": false },
      "deviceInit": { "active": 2, "queued": 0 },
      "audit": { "queueLength": 3 },
      "rpcCommands": { "registered": 28 },
      "dbPool": { "total": 10, "idle": 8, "waiting": 0 }
    },
    "counters": { "devices_connected": 512, "rpc_success": 890 },
    "rpcTimings": { "Device.List": { "count": 42, "avgMs": 12, "maxMs": 45 } },
    "dbTimings": { "fn_fetch_devices": { "count": 100, "avgMs": 3, "maxMs": 15 } },
    "rpcErrors": [
      { "method": "Device.GetConfig", "error": "timeout after 10000ms", "ts": 1700000000000 }
    ],
    "initFailures": [
      { "shellyID": "shellyplus1-AABBCC", "error": "connection refused", "ts": 1700000000000 }
    ]
  }
}
```

Tier gating:
- Level 0: `metrics` is `null`
- Level 1: `level`, `uptimeS`, `eventLoopLagMs`, `memory` (with `heapTrend`), `wsClients`, `modules`
- Level 2+: Level 1 plus `counters`, `rpcTimings`, `dbTimings`, `rpcErrors`, `initFailures` (a sketch of the gating follows this list)
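A hedged sketch of how that gating could be assembled (helper names such as `collectModuleStats` are assumptions, not the actual `Observability.ts` API):

```ts
// Hypothetical sketch of tier-gated /health metrics assembly.
const counters = new Map<string, number>();
const rpcTimings = new Map<string, { count: number; avgMs: number; maxMs: number }>();
const moduleGetters = new Map<string, () => Record<string, number | boolean>>();

function collectModuleStats(): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [name, getter] of moduleGetters) out[name] = getter(); // Tier 1+ stat getters
  return out;
}

function buildMetrics(level: number): Record<string, unknown> | null {
  if (level === 0) return null; // Level 0: metrics is null

  const metrics: Record<string, unknown> = {
    level,
    uptimeS: Math.round(process.uptime()),
    wsClients: 0, // placeholder; read from the WS handler in practice
    modules: collectModuleStats(),
  };
  if (level >= 2) {
    metrics.counters = Object.fromEntries(counters);
    metrics.rpcTimings = Object.fromEntries(rpcTimings);
    // dbTimings, rpcErrors, and initFailures are gated the same way
  }
  return metrics;
}
```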
Located at `frontend/src/tools/observability.ts`. Mirrors the backend tier system:
- RPC Timing Ring Buffer (Tier 2+): Records the last 200 RPC calls with method, duration, and timestamp (see the sketch after this list).
- Counter Rate-of-Change Tracking (Tier 2+): Computes per-minute rate of change for each counter between poll intervals.
- WS Message Rate (Tier 3): Counts WebSocket messages per second via a 1-second interval.
- Pending RPC Count (Tier 3): Tracks how many RPCs are awaiting responses.
- Backend Metrics Cache: Polls `/health` and caches the response for UI display.
- Debug Report Export: Fetches `/health/debug-report` from the backend and combines it with frontend state for a comprehensive JSON dump.
The log page adapts its UI based on the current debug tier:
Always visible:
- Log output with level-based filtering (ALL, ERROR, WARN, INFO, DEBUG)
- Log level border colors (red = ERROR/FATAL, yellow = WARN, subtle gray = DEBUG)
- Log pinning (hover to reveal pin button, pinned logs appear at top)
- Text search across log messages (Ctrl+K / Cmd+K to focus)
- Copy, Download, Export Report, Clear buttons
- Auto-scroll toggle
- Keyboard shortcuts (Ctrl+K search, Ctrl+L clear, Escape dismiss)
Tier 1 (Light) adds:
- System vitals bar (uptime, event loop lag, memory with heap trend indicator, WS clients)
- Collapsible module performance grid with color-coded stat cards
Tier 2 (Medium) adds:
- Grouped counter display with rate-of-change (+N/min) indicators
- Sortable RPC timings table (click headers to sort)
- Sortable DB timings table
- RPC Errors panel (last 50 errors with timestamp, method, and error message)
- Init Failures panel (last 50 device init failures with shellyID and error)
- Reset Timings button
Tier 3 (Full) adds:
- Frontend metrics panel (WS msg/s, pending RPCs, client-side RPC ring buffer)
- Faster polling (2s instead of 5s)
Module stat cards are color-coded to highlight bottlenecks:
- Green (default): Normal operation
- Yellow: Warning thresholds exceeded (e.g., >1000 devices, >50 active inits)
- Red: Critical thresholds exceeded (e.g., >5000 devices, >100 active inits, stuck flush)
When the observability level is 2 or higher, the Device Web GUI modal shows a collapsible section with device-specific RPC timings, filtered from the ring buffer by `shellyID`.
Each log entry displays a subtle left border color based on its severity level:
- Red (`border-red-600`): ERROR and FATAL level logs
- Yellow (`border-yellow-600`): WARN level logs
- Gray (`border-neutral-700`): DEBUG level logs
- Transparent: INFO and other levels (no visible border)
This provides an at-a-glance visual scan of log severity without reading each entry.
Important log entries can be pinned for quick reference. Hover over any log entry to reveal a diamond-shaped pin button on the left side. Clicking it toggles the pin state:
- Filled diamond: Log is pinned
- Empty diamond: Log is not pinned
Pinned logs appear in a dedicated "Pinned" section at the top of the log area, above the scrolling log output. The pinned section shows the count of pinned logs and a "Clear pins" button to unpin all at once. Pins use the log's timestamp (`ts`) as identifier and are stored in the Pinia console store; a sketch of such a store follows.
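A minimal sketch of what that Pinia store could look like (the store id and field names are assumptions, not the app's actual store shape):

```ts
// Hypothetical sketch of pin state in a Pinia console store.
import { defineStore } from 'pinia';

export const useConsoleStore = defineStore('console', {
  state: () => ({
    pinnedTs: new Set<number>(), // pins keyed by log timestamp (ts)
  }),
  actions: {
    togglePin(ts: number) {
      if (this.pinnedTs.has(ts)) this.pinnedTs.delete(ts);
      else this.pinnedTs.add(ts);
    },
    clearPins() {
      this.pinnedTs.clear(); // "Clear pins" button
    },
  },
});
```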
The log page supports the following keyboard shortcuts:
| Shortcut | Action |
|---|---|
| `Ctrl+K` / `Cmd+K` | Focus the search input |
| `Ctrl+L` / `Cmd+L` | Clear all logs |
| `Escape` | Clear search query and blur search input |
Shortcuts are registered on mount and cleaned up on unmount to avoid leaks, as in the sketch below.
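A sketch of that lifecycle pattern in a Vue setup context (handler bodies are placeholders, not the actual log page code):

```ts
// Hypothetical sketch of shortcut registration with cleanup on unmount.
// Must run inside a component's setup() / <script setup> context.
import { onMounted, onUnmounted } from 'vue';

const focusSearch = () => { /* focus the search input ref */ };
const clearLogs = () => { /* clear the console store */ };
const clearSearchAndBlur = () => { /* reset query and blur the input */ };

function onKeydown(e: KeyboardEvent): void {
  const mod = e.ctrlKey || e.metaKey; // Ctrl on Windows/Linux, Cmd on macOS
  if (mod && e.key === 'k') { e.preventDefault(); focusSearch(); }
  else if (mod && e.key === 'l') { e.preventDefault(); clearLogs(); }
  else if (e.key === 'Escape') { clearSearchAndBlur(); }
}

onMounted(() => window.addEventListener('keydown', onKeydown));
onUnmounted(() => window.removeEventListener('keydown', onKeydown)); // avoid leaks
```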
The System Vitals Bar (Tier 1+) displays a heap trend arrow next to the heap usage:
- Red up arrow: Heap is growing (potential memory leak)
- Green down arrow: Heap is shrinking (GC is reclaiming memory)
- Gray right arrow: Heap is stable
The `heapTrend` field is computed by the backend by comparing heap snapshots over time and is included in the `memory` object of the `/health` response.
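A hedged sketch of how such a trend could be derived (the window size and 5% thresholds are assumptions, not the backend's actual values):

```ts
// Hypothetical heap trend classification from periodic snapshots.
const heapSamples: number[] = [];

// Call on an interval, e.g. setInterval(sampleHeap, 30_000).
function sampleHeap(): void {
  heapSamples.push(process.memoryUsage().heapUsed);
  if (heapSamples.length > 10) heapSamples.shift(); // keep a short window
}

function heapTrend(): 'growing' | 'shrinking' | 'stable' {
  if (heapSamples.length < 2) return 'stable';
  const first = heapSamples[0];
  const last = heapSamples[heapSamples.length - 1];
  const delta = (last - first) / first;
  if (delta > 0.05) return 'growing';    // red up arrow
  if (delta < -0.05) return 'shrinking'; // green down arrow
  return 'stable';                       // gray right arrow
}
```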
At Tier 2+, counters are grouped by their prefix (the part before the first underscore). For example, `devices_connected` and `devices_disconnected` appear under a "devices" group header, while `audit_entries` and `audit_flushes` appear under "audit". This makes it easier to find related counters in large deployments with many counter types.
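A minimal sketch of the grouping step (illustrative, not the actual frontend code):

```ts
// Hypothetical sketch: group counters by the prefix before the first underscore.
function groupCounters(counters: Record<string, number>): Map<string, [string, number][]> {
  const groups = new Map<string, [string, number][]>();
  for (const [name, value] of Object.entries(counters)) {
    const prefix = name.split('_')[0]; // e.g. 'devices' from 'devices_connected'
    const entries = groups.get(prefix) ?? [];
    entries.push([name, value]);
    groups.set(prefix, entries);
  }
  return groups;
}
```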
Each counter in the grouped counter panel shows a rate-of-change indicator in the format `(+N/min)` or `(-N/min)`. The rate is computed on the frontend by comparing counter values between consecutive `/health` poll responses and normalizing to a per-minute rate. A positive rate indicates the counter is actively incrementing; zero-rate counters show no indicator.
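A sketch of that rate computation (hypothetical names; the normalization matches the description above):

```ts
// Hypothetical per-minute rate computation between /health polls.
let prevCounters: Record<string, number> = {};
let prevTs = 0;

function counterRates(counters: Record<string, number>): Record<string, number> {
  const now = Date.now();
  const rates: Record<string, number> = {};
  if (prevTs > 0) {
    const minutes = (now - prevTs) / 60_000;
    for (const [name, value] of Object.entries(counters)) {
      const delta = value - (prevCounters[name] ?? value);
      rates[name] = Math.round(delta / minutes); // rendered as (+N/min) or (-N/min)
    }
  }
  prevCounters = counters; // first poll yields no rates, only a baseline
  prevTs = now;
  return rates;
}
```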
At Tier 2+, the backend maintains a ring buffer of the last 50 RPC errors. Each entry contains:
- `ts`: Timestamp of the error
- `method`: The RPC method that failed
- `error`: The error message
These are displayed in a collapsible "RPC Errors" panel in the log page, sorted by time with the most recent first. This helps identify recurring RPC failures without having to search through log output.
At Tier 2+, the backend maintains a ring buffer of the last 50 device initialization failures. Each entry contains:
- `ts`: Timestamp of the failure
- `shellyID`: The Shelly device ID that failed to initialize
- `error`: The error message
These are displayed in a collapsible "Init Failures" panel. This is particularly useful for diagnosing devices that repeatedly fail to connect or initialize.
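Both panels can be fed by the same bounded "last 50" buffer pattern; a hedged sketch for the RPC error case (names are illustrative, not the actual backend code):

```ts
// Hypothetical bounded buffer for the last 50 RPC errors; the init failure
// buffer follows the same pattern with shellyID instead of method.
interface RpcError { ts: number; method: string; error: string }

let level = 2; // current observability tier
const MAX_ERRORS = 50;
const rpcErrors: RpcError[] = [];

function recordRpcError(method: string, error: string): void {
  if (level < 2) return; // Tier 2+ only
  rpcErrors.push({ ts: Date.now(), method, error });
  if (rpcErrors.length > MAX_ERRORS) rpcErrors.shift(); // drop the oldest entry
}
```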
The "Export Report" button in the toolbar generates a comprehensive JSON debug dump that combines:
- Backend debug report: Fetched from `/health/debug-report`; includes server-side metrics, configuration, and state
- Frontend metrics: Current RPC timings ring buffer, WS message rate, pending RPC count, and current OBS level
- Filtered logs: All currently displayed logs as text (respecting active filters and search)
- Browser info: User agent, current URL, and ISO timestamp
The report is downloaded as `debug-report-<timestamp>.json`. This is useful for filing bug reports or sharing diagnostic information with support.
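A sketch of how such an export could be assembled in the browser (helper names and the exact report shape are assumptions; the endpoint is the one documented above):

```ts
// Hypothetical sketch of the Export Report flow.
const collectFrontendMetrics = () => ({ obsLevel: 2, pendingRpcs: 0 }); // placeholder
const getFilteredLogsAsText = () => ''; // placeholder: currently displayed logs

async function exportDebugReport(): Promise<void> {
  const backend = await fetch('/health/debug-report').then(r => r.json());
  const report = {
    backend,                              // server-side metrics, config, state
    frontend: collectFrontendMetrics(),   // ring buffer, WS rate, pending RPCs
    logs: getFilteredLogsAsText(),        // respects active filters and search
    browser: {
      userAgent: navigator.userAgent,
      url: location.href,
      ts: new Date().toISOString(),
    },
  };
  const blob = new Blob([JSON.stringify(report, null, 2)], { type: 'application/json' });
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = `debug-report-${Date.now()}.json`; // debug-report-<timestamp>.json
  a.click();
  URL.revokeObjectURL(a.href);
}
```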
When the PostgreSQL connection pool is accessible, the `dbPool` module is registered in the observability system and reports:

- `total`: Total number of connections in the pool
- `idle`: Number of idle (available) connections
- `waiting`: Number of queued requests waiting for a connection
These stats appear automatically in the Modules grid at Tier 1+. High waiting counts indicate the pool is saturated and queries are being delayed.
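Assuming the pool is a node-postgres `Pool`, its built-in `totalCount`, `idleCount`, and `waitingCount` gauges map directly onto these stats. A sketch using the `registerModule` API shown later in this document:

```ts
// Sketch of a dbPool stat getter, assuming a node-postgres Pool instance.
import { Pool } from 'pg';
import * as Observability from './Observability';

const pool = new Pool(); // configured elsewhere in practice

Observability.registerModule('dbPool', () => ({
  total: pool.totalCount,     // all connections in the pool
  idle: pool.idleCount,       // available connections
  waiting: pool.waitingCount, // queued requests waiting for a connection
}));
```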
The Audit Log tab is visible when `devMode` is enabled or when the observability level is greater than 0 (`obsLevel > 0`). This provides access to the audit trail without requiring full observability to be enabled.
- Enable Medium tier
- Open the Modules collapse panel
- Look for yellow/red cards — these indicate modules under stress
- Check RPC Timings for slow methods (>200ms yellow, >1s red)
- Check DB Timings for slow database queries
- Check RPC Errors panel for recurring method failures and their error messages
- Use counter rate-of-change indicators to spot rapidly incrementing error counters
- EL Lag (event loop lag) > 50ms (yellow) or > 100ms (red) indicates the Node.js event loop is blocked
- Check `statusQueue.queueSize`: large queues mean status updates are backing up
- Check `deviceInit.active`: many concurrent device initializations can block the loop
- Monitor RSS and Heap in the vitals bar
- Watch the heap trend arrow: a persistent red up-arrow indicates a potential memory leak
- If Heap Used approaches Heap Total, garbage collection pressure is high
- Check `events.groupCacheSize`: large group caches consume memory (24h TTL)
- Use Export Report to capture a full snapshot for offline analysis
- Check `devices.total` vs the expected count
- Compare the `devices_connected` and `devices_disconnected` counters
- Check `device_inits_failed` for initialization errors
- Open the Init Failures panel to see the last 50 failures with device IDs and error messages
- Use Device GUI modal diagnostics + RPC timings for per-device debugging
- Click Export Report in the toolbar to download a JSON file
- The report includes backend state, frontend metrics, current logs, and browser info
- Share this file when reporting bugs or requesting support
- Pin important log entries before exporting — they will be clearly visible in the pinned section
- Press `Ctrl+K` / `Cmd+K` to quickly search logs for error patterns
- Press `Ctrl+L` / `Cmd+L` to clear logs when starting a fresh investigation
- Press `Escape` to clear the search and return to the full log view
- Pin relevant error logs as you find them for easy reference
To add observability to a new module:
```ts
import * as Observability from './Observability';

// 1. Register a stat getter (called on /health poll, Tier 1+)
Observability.registerModule('myModule', () => ({
  total: myMap.size,
  active: activeCount,
}));

// 2. Add counters at key code points (Tier 2+)
Observability.incrementCounter('my_module_processed');
Observability.incrementCounter('my_module_errors');
```

No frontend changes needed: the module grid and counters automatically pick up new entries from the backend `/health` response.