fix: manage MQTT connections by cycling credentials before they expire#302
fix: manage MQTT connections by cycling credentials before they expire#302paulstuart wants to merge 20 commits intodevelopfrom
Conversation
…ding to an invalid connection; add channel buffer to mqtt writes so nothing gets dropped; update debug endpoint to check/force rotation
…ld tag in to makefile
Vulnerability Scan: PassedImage:
Commit: dfc4c90 |
|
Go test coverage
Total coverage: 60.2% |
There was a problem hiding this comment.
Pull request overview
This PR aims to improve the OTLP→MQTT pipeline by proactively refreshing MQTT credentials before expiry, buffering outbound OTLP publishes to tolerate reconnect windows, and adding an optional debug HTTP server (behind a debug build tag) to inspect/trigger token cycling and reconnects.
Changes:
- Added a token refresher hook to MQTT reconnect logic (via
ConnectPacketBuilder) to avoid reconnects using stale JWTs. - Added a
debug-tagged HTTP debug server plus build plumbing (BUILD_TAGS) for local probing/forcing reconnects and token rotation.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds BUILD_TAGS support and changes version/tag computation for builds/images. |
| agent/docker/Dockerfile | Plumbs BUILD_TAGS into container builds. |
| agent/otlpbridge/server.go | Adds an internal publish queue and drain goroutine, and stops it on shutdown. |
| agent/otlpbridge/handlers.go | Switches OTLP handlers from direct publish to enqueue-based publishing. |
| agent/otlpbridge/handlers_test.go | Updates tests to wait for async publish via the new queue. |
| agent/otlpbridge/mqtt.go | Extends publisher construction to support token refresh before CONNECT. |
| agent/configmgr/fleet/connection.go | Adds token refresher support via ConnectPacketBuilder for autopaho reconnects. |
| agent/configmgr/fleet/token_refresh_test.go | Adds tests validating token refresher behavior for CONNECT packet building. |
| agent/configmgr/fleet.go | Wires token refresher into the fleet manager connection and starts optional debug server. |
| agent/configmgr/debug_types.go | Adds shared debug endpoint DTOs and debug server callback options. |
| agent/configmgr/debug_server.go | Implements the debug-tagged HTTP debug server and endpoints. |
| agent/configmgr/debug_server_test.go | Adds debug-tagged tests for the debug server endpoints. |
| agent/configmgr/debug_server_off.go | Provides a no-op stub when not built with -tags debug. |
Comments suppressed due to low confidence (1)
agent/otlpbridge/handlers_test.go:67
NewBridgeServerstarts a backgrounddrainQueuegoroutine, but tests never stop the bridge. This can leak goroutines across tests. Consider checking theNewBridgeServererror and registeringt.Cleanup(func(){ _ = bridge.Stop(context.Background()) })(or similar) insidenewBridgeWithTopics.
func newBridgeWithTopics(enc Encoder) (*BridgeServer, *fakePublisher) {
fp := newFakePublisher()
logger := slog.Default()
bridge, _ := NewBridgeServer(BridgeConfig{Encoding: "protobuf"}, nil, logger)
bridge.enc = enc
bridge.SetPublisher(fp)
bridge.SetIngestTopic("ingest")
bridge.SetTelemetryTopic("telemetry")
return bridge, fp
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab0d9351ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
agent/configmgr/fleet/connection.go
Outdated
| if builder := buildConnectPacketBuilder(connection); builder != nil { | ||
| cfg.ConnectPacketBuilder = builder |
There was a problem hiding this comment.
Avoid double-refreshing JWT during reconnect
refreshAndReconnect already fetches a new token and derives fresh connection details (topics/zone/url) before calling Reconnect, but Connect now always installs a ConnectPacketBuilder that calls tokenRefresher again. That means a managed reconnect can use token A to compute topics and token B for the CONNECT password; if claims differ between refreshes (e.g., zone or topic scope), the broker auth and subscribed topics diverge, causing failed subscriptions/publishes or reconnect loops.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
@paulstuart I think we already have a mechanism for refreshing tokens here https://github.com/netboxlabs/orb-agent/blob/develop/agent/configmgr/fleet.go#L229. If it's not working of there are changes then I think we should modify that rather than create a new way to do so.
agent/configmgr/debug_server.go
Outdated
| //go:build debug | ||
|
|
||
| package configmgr | ||
|
|
There was a problem hiding this comment.
If debug server is related to fleet only, I don't think it should be part of config manager
There was a problem hiding this comment.
Also how often do we plan to use it?
There was a problem hiding this comment.
The intent was to have it opt in for tagged debug only and to minimize changes to normal code. I believe the functionality itself is valuable (the ability to trigger a key rotation on demand) but am fine taking it out and rethinking it -- it definitely makes testing it easier.
I'll pull it out in the morning so it's just the pre-rotation fix.
There was a problem hiding this comment.
My question is it if is important for dev only or should be important do investigate a prod issue. If it is important for prod, maybe we should generated a orb-agent:debug image as well. If not, maybe the code could live in a branch.
Rather than wait for a broken connection, this updates the connection credentials before it expires.
This adds an optional debug service that allows probing and cycling the MQTT token status/connection.
The debug functionality is gated by a
debugbuild flag, and is passed in via a MakefileBUILD_TAGSarg, e.g.,make agent_bin BUILD_TAGS=debugOne can publish the default port (6166) or exec in and run: