fix: ensure data published to MQTT is not dropped #305

Draft
paulstuart wants to merge 26 commits into develop from fix/OBS-2208-mqtt-queue

Conversation

@paulstuart
Contributor

Because the MQTT credentials rotate frequently, the agent regularly hits windows where the connection is invalid and data cannot be sent.

This change uses the handler's queuing capability so that normal credential rotation neither drops data nor pollutes the logs with errors that are "part of doing business".

@github-actions

github-actions bot commented Apr 2, 2026

Vulnerability Scan: Passed

Image: orb-agent:scan

| Source | Library | CVE | Severity | Installed | Fixed | Title |
|---|---|---|---|---|---|---|
| Python | pip | CVE-2026-1703 | ⚪ LOW | 25.3 | 26.0 | pip: pip: Information disclosure via path traversal when installing crafted whee |

Commit: d913d7f

@github-actions

github-actions bot commented Apr 2, 2026

Go test coverage

| STATUS | ELAPSED | PACKAGE | COVER | PASS | FAIL | SKIP |
|---|---|---|---|---|---|---|
| 🟢 PASS | 1.05s | github.com/netboxlabs/orb-agent/agent | 45.2% | 6 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend | 75.2% | 42 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/devicediscovery | 66.5% | 4 | 0 | 0 |
| 🟢 PASS | 0.05s | github.com/netboxlabs/orb-agent/agent/backend/mocks | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/networkdiscovery | 58.3% | 4 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/opentelemetryinfinity | 45.2% | 2 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/pktvisor | 66.5% | 2 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/snmpdiscovery | 58.3% | 4 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/worker | 67.8% | 7 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/config | 100.0% | 6 | 0 | 0 |
| 🟢 PASS | 3.55s | github.com/netboxlabs/orb-agent/agent/configmgr | 50.7% | 48 | 0 | 0 |
| 🟢 PASS | 3.50s | github.com/netboxlabs/orb-agent/agent/configmgr/fleet | 65.3% | 161 | 0 | 0 |
| 🟢 PASS | 1.03s | github.com/netboxlabs/orb-agent/agent/otlpbridge | 53.8% | 15 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/policies | 99.1% | 21 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/policymgr | 71.6% | 11 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/redact | 81.6% | 84 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/secretsmgr | 47.1% | 65 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/orb-agent/agent/telemetry | 81.7% | 19 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/version | 62.5% | 5 | 0 | 0 |

Total coverage: 59.9%

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to prevent telemetry/OTLP data from being dropped during MQTT disconnects caused by frequent credential rotation, by queueing publishes and refreshing JWT credentials automatically on MQTT auto-reconnect. It also adds a debug-only signal trigger for forcing rotation/inspecting token status and wires build tags through local builds and Docker builds.

Changes:

  • Switch OTLP bridge publishing to PublishViaQueue to buffer messages across MQTT disconnects.
  • Add an autopaho ConnectPacketBuilder that refreshes the JWT on auto-reconnect, and wire it from FleetConfigManager.
  • Add a debug-only SIGUSR1/SIGUSR2 trigger for credential rotation/status logging, plus build-tag plumbing (BUILD_TAGS) for make/Docker and a small golangci-lint config tweak.
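The reconnect-time JWT refresh in the second bullet can be illustrated with a minimal sketch. It deliberately uses stand-in types instead of the real autopaho/paho ones: `connectPacket`, `tokenSource`, `newConnectBuilder`, and `countingSource` are hypothetical names invented for this example, and the PR's actual hook and refresher may be shaped differently.

```go
package main

import (
	"errors"
	"fmt"
)

// connectPacket is a stand-in for the MQTT CONNECT packet. The PR itself uses
// the autopaho/paho types, which are not reproduced here.
type connectPacket struct {
	Username string
	Password []byte
}

// tokenSource is a hypothetical interface that returns a fresh JWT on demand.
type tokenSource interface {
	Token() (string, error)
}

// newConnectBuilder returns a hook intended to run before every (re)connect
// attempt. It overwrites the password with a freshly fetched token so that a
// reconnect never reuses credentials that have rotated in the meantime.
func newConnectBuilder(src tokenSource) func(*connectPacket) (*connectPacket, error) {
	return func(cp *connectPacket) (*connectPacket, error) {
		tok, err := src.Token()
		if err != nil {
			// Surface the failure so the caller can retry later.
			return nil, fmt.Errorf("refresh token: %w", err)
		}
		cp.Password = []byte(tok)
		return cp, nil
	}
}

// countingSource is a toy token source that issues a new token on each call.
type countingSource struct {
	n    int
	fail bool
}

func (c *countingSource) Token() (string, error) {
	if c.fail {
		return "", errors.New("token endpoint unavailable")
	}
	c.n++
	return fmt.Sprintf("jwt-%d", c.n), nil
}

func main() {
	build := newConnectBuilder(&countingSource{})
	cp := &connectPacket{Username: "agent"}
	// Two simulated connection attempts: each one gets its own token.
	for i := 0; i < 2; i++ {
		if _, err := build(cp); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(string(cp.Password))
	}
}
```

Fetching the token inside the hook, rather than once at startup, is what ties credential freshness to the reconnect itself; whether the refresh should be skipped while the current token is still valid is left open here.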

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds BUILD_TAGS plumbing into go build/go test and passes build args into Docker builds. |
| agent/otlpbridge/publisher_adapter.go | Uses PublishViaQueue to buffer OTLP publishes during disconnects. |
| agent/otlpbridge/mqtt.go | Uses PublishViaQueue for the OTLP bridge MQTT publisher implementation. |
| agent/docker/Dockerfile | Accepts/passes BUILD_TAGS into the make agent_bin build step. |
| agent/configmgr/fleet/token_refresh_test.go | Adds unit tests for the reconnect-time JWT refresh ConnectPacketBuilder behavior. |
| agent/configmgr/fleet/debug.go | Introduces a small interface used by debug-only trigger code (avoids package dependency). |
| agent/configmgr/fleet/debug_trigger.go | Adds debug-tag-only OS signal trigger to rotate/log credentials. |
| agent/configmgr/fleet/debug_trigger_test.go | Adds debug-tag-only tests for signal-trigger behavior. |
| agent/configmgr/fleet/debug_trigger_off.go | Provides a no-op StartDebugTrigger when not built with -tags debug. |
| agent/configmgr/fleet/connection.go | Adds token refresher plumbing, ConnectPacketBuilder for auto-reconnect JWT refresh, and dispatch shutdown race handling changes. |
| agent/configmgr/fleet.go | Wires token refresher into the MQTT connection; starts debug triggers; implements debug credential methods. |
| .github/golangci.yaml | Enables relative-path-mode: gomod for lint output paths. |


paulstuart and others added 9 commits April 2, 2026 10:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Somewhat moot as the debug code will go away once code is guaranteed solid

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…d require refresh. An 'expected' error that is recoverable and part of the process should not be treated as a real error. The goal is that the agent never logs an issue as an error unless it is a real problem, so that we don't become blind to 'normal errors'.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.



@paulstuart requested a review from Copilot April 3, 2026 17:56
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.



Comment on lines +119 to +123
s.pendingMu.Lock()
if s.ready {
	s.pendingMu.Unlock()
	return
}
Copilot AI Apr 3, 2026

drainPending can run concurrently (e.g., if multiple goroutines call SetPublisher/Set*Topic when all fields are already set). Because s.ready is only set to true at the end of the drain, a second drainPending invocation can observe ready == false, do no work, and set ready = true while the first drain is still publishing. That allows new Enqueue calls to publish directly and overtake queued messages (breaking FIFO ordering). Consider adding a separate draining/drainInProgress flag under pendingMu (or a sync.Once + a draining state) so only one drain can proceed and Enqueue keeps queueing until the active drain completes.
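A minimal sketch of the suggested `draining` flag, assuming a trimmed-down stand-in for `BridgeServer`: the `bridge` type, its fields, and the `publish` callback are hypothetical and only mirror the names mentioned in this comment. It is one possible shape for the fix, not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// bridge is a hypothetical, reduced stand-in for BridgeServer.
type bridge struct {
	mu       sync.Mutex
	ready    bool         // true only after the queue has been fully drained
	draining bool         // true while exactly one drain is in progress
	pending  [][]byte     // messages queued before the bridge became ready
	publish  func([]byte) // stand-in for the real MQTT publish call
}

// enqueue keeps queueing until ready is set. That covers the whole drain
// window, so a new message cannot overtake one that is still queued.
func (b *bridge) enqueue(p []byte) {
	b.mu.Lock()
	if !b.ready {
		b.pending = append(b.pending, p)
		b.mu.Unlock()
		return
	}
	b.mu.Unlock()
	b.publish(p)
}

// drainPending lets a single caller drain the queue. Concurrent callers see
// draining == true and return without touching ready.
func (b *bridge) drainPending() {
	b.mu.Lock()
	if b.ready || b.draining {
		b.mu.Unlock()
		return
	}
	b.draining = true
	for len(b.pending) > 0 {
		batch := b.pending
		b.pending = nil
		b.mu.Unlock() // publish without holding the lock
		for _, p := range batch {
			b.publish(p)
		}
		b.mu.Lock() // re-check: more messages may have been queued meanwhile
	}
	// The queue is empty and the lock is held, so flipping ready here cannot
	// race with an enqueue that is about to append.
	b.ready = true
	b.draining = false
	b.mu.Unlock()
}

func main() {
	b := &bridge{}
	b.publish = func(p []byte) { fmt.Println(string(p)) }
	b.enqueue([]byte("first"))
	b.enqueue([]byte("second"))
	b.drainPending()
	b.enqueue([]byte("third")) // published directly, after the queued ones
}
```

The loop re-checks `pending` under the lock after each batch, which is what allows `ready` to be set only once nothing is left to overtake.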

Comment on lines +178 to +185
// Enqueue marshaled OTLP data for publishing. Before the MQTT connection is
// ready the payload is queued in memory (up to maxPending messages; oldest
// messages are dropped when the queue is full). Once ready, publishes directly.
func (s *BridgeServer) Enqueue(ctx context.Context, isIngest bool, payload []byte) error {
	s.pendingMu.Lock()
	if !s.ready {
		if s.maxPending > 0 && len(s.pending) >= s.maxPending {
			s.pendingDropped++
Copilot AI Apr 3, 2026

The doc comment for Enqueue says “oldest messages are dropped when the queue is full”, but the implementation actually rejects the new message and preserves the existing queue (see the len(s.pending) >= s.maxPending branch returning ResourceExhausted). Please either update the comment to match the behavior (reject-new), or change the implementation to evict the oldest entry when at capacity.
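The difference between the two policies can be sketched as follows. `boundedQueue`, its fields, and `errQueueFull` are hypothetical names for illustration; the comment above says the PR returns ResourceExhausted, which is replaced here by a plain sentinel error to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
)

// errQueueFull is a hypothetical sentinel standing in for the
// ResourceExhausted error mentioned in the review comment.
var errQueueFull = errors.New("pending queue full")

// boundedQueue is an illustrative queue supporting both overflow policies.
type boundedQueue struct {
	items       [][]byte
	max         int  // 0 disables the cap, mirroring the maxPending > 0 guard
	evictOldest bool // false: reject-new, true: drop-oldest
	dropped     int  // messages lost to overflow under either policy
}

// push appends p, applying the configured policy when the queue is full.
func (q *boundedQueue) push(p []byte) error {
	if q.max > 0 && len(q.items) >= q.max {
		q.dropped++
		if !q.evictOldest {
			// Reject-new: the existing queue is preserved, the caller is told.
			return errQueueFull
		}
		// Drop-oldest: make room by discarding the head of the queue.
		q.items = q.items[1:]
	}
	q.items = append(q.items, p)
	return nil
}

func main() {
	reject := &boundedQueue{max: 2}
	evict := &boundedQueue{max: 2, evictOldest: true}
	for _, m := range []string{"a", "b", "c"} {
		fmt.Println("reject-new:", m, reject.push([]byte(m)))
		fmt.Println("drop-oldest:", m, evict.push([]byte(m)))
	}
}
```

Reject-new gives the producer a signal it can act on, while drop-oldest favors the most recent data; which one suits telemetry here is a judgment call for the PR author, and the doc comment should simply describe whichever is chosen.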


func TestBridge_Enqueue_QueuesDrainsOnReady(t *testing.T) {
	fp := &fakePublisher{}
	bridge := &BridgeServer{enc: ProtobufEncoder{}}
Copilot AI Apr 3, 2026

This test constructs BridgeServer via a struct literal (&BridgeServer{...}), which bypasses NewBridgeServer initialization (notably maxPending defaulting). That can make tests diverge from real runtime behavior (e.g., maxPending stays 0 → effectively unbounded). Prefer constructing the bridge via NewBridgeServer(...) (or explicitly set maxPending in the literal) so queue semantics match production.

Suggested change
- bridge := &BridgeServer{enc: ProtobufEncoder{}}
+ bridge := NewBridgeServer(ProtobufEncoder{})
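Why the struct literal matters can be seen in a small sketch. The `server` type, `newServer`, and `defaultMaxPending` (including its value) are hypothetical; the review does not state what default `NewBridgeServer` actually applies.

```go
package main

import "fmt"

// defaultMaxPending is a hypothetical default, chosen only for illustration.
const defaultMaxPending = 1000

// server is a reduced stand-in for BridgeServer.
type server struct {
	maxPending int
	pending    [][]byte
}

// newServer applies the default that a bare struct literal would skip.
func newServer() *server {
	return &server{maxPending: defaultMaxPending}
}

// full reports whether the cap has been reached. A zero maxPending disables
// the check entirely, so a zero-value server never reports full.
func (s *server) full() bool {
	return s.maxPending > 0 && len(s.pending) >= s.maxPending
}

func main() {
	lit := &server{}    // maxPending stays 0
	ctor := newServer() // maxPending is defaulted
	for i := 0; i < defaultMaxPending; i++ {
		lit.pending = append(lit.pending, nil)
		ctor.pending = append(ctor.pending, nil)
	}
	fmt.Println("literal full:", lit.full())
	fmt.Println("constructor full:", ctor.full())
}
```

With the same number of queued messages, only the constructor-built value hits its cap, which is the divergence between test and production behavior that the comment warns about.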
