fix: manage MQTT connections by cycling credentials before they expire #302

Open
paulstuart wants to merge 20 commits into develop from fix/OBS-2208-mqtt-cycling

Contributor

@paulstuart paulstuart commented Apr 1, 2026

Rather than waiting for a broken connection, this updates the connection credentials before they expire.

This also adds an optional debug service for inspecting the MQTT token status and forcing a token rotation.
The debug functionality is gated by the debug build tag, which is passed in via the Makefile's BUILD_TAGS argument, e.g.:
make agent_bin BUILD_TAGS=debug

To use it, publish the default port (6166) or exec into the container and run:

curl "http://127.0.0.1:6166/debug/token-status"
curl -X POST "http://127.0.0.1:6166/debug/force-token-rotation"

github-actions bot commented Apr 1, 2026

Vulnerability Scan: Passed

Image: orb-agent:scan

| Source | Library | CVE | Severity | Installed | Fixed | Title |
| --- | --- | --- | --- | --- | --- | --- |
| Python | pip | CVE-2026-1703 | ⚪ LOW | 25.3 | 26.0 | pip: pip: Information disclosure via path traversal when installing crafted whee |

Commit: dfc4c90

github-actions bot commented Apr 1, 2026

Go test coverage

| STATUS | ELAPSED | PACKAGE | COVER | PASS | FAIL | SKIP |
| --- | --- | --- | --- | --- | --- | --- |
| 🟢 PASS | 1.03s | github.com/netboxlabs/orb-agent/agent | 45.2% | 6 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend | 75.2% | 42 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/devicediscovery | 66.5% | 4 | 0 | 0 |
| 🟢 PASS | 0.06s | github.com/netboxlabs/orb-agent/agent/backend/mocks | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/networkdiscovery | 58.3% | 4 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/opentelemetryinfinity | 45.2% | 2 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/pktvisor | 66.5% | 2 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/snmpdiscovery | 58.3% | 4 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/backend/worker | 67.8% | 7 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/config | 100.0% | 6 | 0 | 0 |
| 🟢 PASS | 3.51s | github.com/netboxlabs/orb-agent/agent/configmgr | 53.2% | 39 | 0 | 0 |
| 🟢 PASS | 3.48s | github.com/netboxlabs/orb-agent/agent/configmgr/fleet | 66.2% | 158 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/orb-agent/agent/otlpbridge | 48.3% | 10 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/policies | 99.1% | 21 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/policymgr | 71.6% | 11 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/redact | 81.6% | 84 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/secretsmgr | 47.1% | 65 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/telemetry | 81.7% | 19 | 0 | 0 |
| 🟢 PASS | (cached) | github.com/netboxlabs/orb-agent/agent/version | 62.5% | 5 | 0 | 0 |

Total coverage: 60.2%

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to improve the OTLP→MQTT pipeline by proactively refreshing MQTT credentials before expiry, buffering outbound OTLP publishes to tolerate reconnect windows, and adding an optional debug HTTP server (behind a debug build tag) to inspect/trigger token cycling and reconnects.

Changes:

  • Added a token refresher hook to MQTT reconnect logic (via ConnectPacketBuilder) to avoid reconnects using stale JWTs.
  • Added a debug-tagged HTTP debug server plus build plumbing (BUILD_TAGS) for local probing/forcing reconnects and token rotation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.

Show a summary per file
| File | Description |
| --- | --- |
| Makefile | Adds BUILD_TAGS support and changes version/tag computation for builds/images. |
| agent/docker/Dockerfile | Plumbs BUILD_TAGS into container builds. |
| agent/otlpbridge/server.go | Adds an internal publish queue and drain goroutine, and stops it on shutdown. |
| agent/otlpbridge/handlers.go | Switches OTLP handlers from direct publish to enqueue-based publishing. |
| agent/otlpbridge/handlers_test.go | Updates tests to wait for async publish via the new queue. |
| agent/otlpbridge/mqtt.go | Extends publisher construction to support token refresh before CONNECT. |
| agent/configmgr/fleet/connection.go | Adds token refresher support via ConnectPacketBuilder for autopaho reconnects. |
| agent/configmgr/fleet/token_refresh_test.go | Adds tests validating token refresher behavior for CONNECT packet building. |
| agent/configmgr/fleet.go | Wires token refresher into the fleet manager connection and starts optional debug server. |
| agent/configmgr/debug_types.go | Adds shared debug endpoint DTOs and debug server callback options. |
| agent/configmgr/debug_server.go | Implements the debug-tagged HTTP debug server and endpoints. |
| agent/configmgr/debug_server_test.go | Adds debug-tagged tests for the debug server endpoints. |
| agent/configmgr/debug_server_off.go | Provides a no-op stub when not built with -tags debug. |
Comments suppressed due to low confidence (1)

agent/otlpbridge/handlers_test.go:67

  • NewBridgeServer starts a background drainQueue goroutine, but tests never stop the bridge. This can leak goroutines across tests. Consider checking the NewBridgeServer error and registering t.Cleanup(func(){ _ = bridge.Stop(context.Background()) }) (or similar) inside newBridgeWithTopics.
func newBridgeWithTopics(enc Encoder) (*BridgeServer, *fakePublisher) {
	fp := newFakePublisher()
	logger := slog.Default()
	bridge, _ := NewBridgeServer(BridgeConfig{Encoding: "protobuf"}, nil, logger)
	bridge.enc = enc
	bridge.SetPublisher(fp)
	bridge.SetIngestTopic("ingest")
	bridge.SetTelemetryTopic("telemetry")
	return bridge, fp

@paulstuart paulstuart marked this pull request as ready for review April 1, 2026 22:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab0d9351ac

Comment on lines +365 to +366
if builder := buildConnectPacketBuilder(connection); builder != nil {
	cfg.ConnectPacketBuilder = builder

[P1] Avoid double-refreshing JWT during reconnect

refreshAndReconnect already fetches a new token and derives fresh connection details (topics/zone/url) before calling Reconnect, but Connect now always installs a ConnectPacketBuilder that calls tokenRefresher again. That means a managed reconnect can use token A to compute topics and token B for the CONNECT password; if claims differ between refreshes (e.g., zone or topic scope), the broker auth and subscribed topics diverge, causing failed subscriptions/publishes or reconnect loops.
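
The hazard can be reproduced in a few lines. The sketch below is illustrative only: `issuedToken`, `connDetails`, the `zones/<zone>/ingest` topic layout, and the issuer whose zone claim changes on every call are all assumptions made for the example. It contrasts deriving topics and password from two separate refreshes with deriving both from a single one.

```go
package main

import "fmt"

// issuedToken is a hypothetical stand-in for a JWT plus the one claim
// that matters for this example.
type issuedToken struct {
	Raw  string
	Zone string
}

// connDetails holds what a reconnect needs: CONNECT password and a topic.
type connDetails struct {
	Password string
	Topic    string
}

// topicFor uses an assumed topic layout derived from the zone claim.
func topicFor(t issuedToken) string {
	return "zones/" + t.Zone + "/ingest"
}

// doubleRefresh reproduces the hazard: topics come from the first token,
// the password from a second, independently fetched token.
func doubleRefresh(fetch func() issuedToken) (connDetails, issuedToken) {
	a := fetch()
	b := fetch()
	return connDetails{Password: b.Raw, Topic: topicFor(a)}, b
}

// singleRefresh derives both values from the same token.
func singleRefresh(fetch func() issuedToken) (connDetails, issuedToken) {
	t := fetch()
	return connDetails{Password: t.Raw, Topic: topicFor(t)}, t
}

// consistent reports whether the topic matches the claims of the token
// that is actually presented to the broker.
func consistent(d connDetails, auth issuedToken) bool {
	return d.Password == auth.Raw && d.Topic == topicFor(auth)
}

// rotatingIssuer simulates claims that differ between refreshes.
func rotatingIssuer() func() issuedToken {
	n := 0
	return func() issuedToken {
		n++
		return issuedToken{
			Raw:  fmt.Sprintf("jwt-%d", n),
			Zone: fmt.Sprintf("zone-%d", n),
		}
	}
}

func main() {
	d, auth := doubleRefresh(rotatingIssuer())
	fmt.Println("double refresh consistent:", consistent(d, auth)) // false
	d, auth = singleRefresh(rotatingIssuer())
	fmt.Println("single refresh consistent:", consistent(d, auth)) // true
}
```

One way to get the single-refresh behavior would be for the managed reconnect path to hand its already-fetched token to the CONNECT hook instead of letting the hook fetch again; whether that fits the PR's structure is a design decision for the authors.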

Contributor

@paulstuart I think we already have a mechanism for refreshing tokens here: https://github.com/netboxlabs/orb-agent/blob/develop/agent/configmgr/fleet.go#L229. If it's not working or there are changes, then I think we should modify that rather than create a new way to do so.

//go:build debug

package configmgr

Contributor

If the debug server is related to fleet only, I don't think it should be part of the config manager.

Contributor

Also, how often do we plan to use it?

Contributor Author

The intent was to make it opt-in, for debug-tagged builds only, and to minimize changes to normal code. I believe the functionality itself is valuable (the ability to trigger a key rotation on demand), but I am fine taking it out and rethinking it. It definitely makes testing easier.

I'll pull it out in the morning so it's just the pre-rotation fix.

Contributor

My question is whether it is important for dev only or whether it would also be important for investigating a prod issue. If it is important for prod, maybe we should generate an orb-agent:debug image as well. If not, maybe the code could live in a branch.

@leoparente leoparente self-requested a review April 2, 2026 19:58

4 participants