feat(relay): wire loki tracing init and graceful shutdown#419
Conversation
- Build the relay TracingConfig in try_into so --log-color and --loki-addresses / --loki-service flags reach the tracing layer. - Move pluto_tracing::init into relay::run, spawn the returned Loki BackgroundTask, and bound shutdown with a 3s flush window so buffered logs reach Loki on SIGTERM. - Tolerate TryInitError in run so multiple in-process tests can share an already-installed global subscriber.
varex83agent
left a comment
There was a problem hiding this comment.
Solid wiring of the previously-unused --loki-addresses / --loki-service flags into the relay's TracingConfig, plus a clean enough shutdown sequence to flush buffered batches before the runtime tears down. Three findings push to request-changes:
info!(config = ?config)now leaks Loki URL credentials — operators routinely embed basic-auth in the Loki push URL, and this PR is what newly populates that field, so the previously-harmless log line is now the credential-exfiltration vector.Err(InitError(_)) => Nonesilently disables Loki in production — same code path serves the test-isolation case and the production case where any other crate already installed the global subscriber. Operators see no warning that their flags were ignored.- Only the first
--loki-addressesentry is used — the flag isVec<String>with plural help text, but.first()drops the rest. Charon supports multi-endpoint Loki fan-out. Either honour the contract or shrink the CLI surface.
Minor: duplicates build_console_tracing_config instead of extending it; the 3-second flush is an abort rather than a graceful shutdown (use tracing_loki::BackgroundTaskController::shutdown); --log-format / --log-output-path flags are still dead; no test exercises the Loki path.
Nothing here is unsafe-Rust or correctness-of-protocol — these are operability / secrets-handling concerns. Verdict: REQUEST_CHANGES on the three majors; the rest are minor/nit.
| let loki_task = match pluto_tracing::init(&config.log_config) { | ||
| Ok(Some(task)) => Some(tokio::spawn(task)), | ||
| Ok(None) => None, | ||
| Err(pluto_tracing::init::Error::InitError(_)) => { |
There was a problem hiding this comment.
Swallowing TryInitError silently disables all of this PR's behaviour in production.
The comment frames this as a test-isolation accommodation, but the same arm fires in the real binary if any other component (a panic handler, a build-time ctor, an inadvertently-pulled-in logger crate) won the install race. In that case:
- The relay continues without warning.
- All flags wired by this PR —
--log-level,--log-color,--loki-addresses,--loki-service— are silently discarded. loki_taskreturnsNone, so the operator's configured Loki shipping is silently off.- The credential-bearing
info!(config = ?config)at line 302 still emits to whatever subscriber was already installed — an even worse exfiltration vector than the in-our-own-Loki case.
Recommend one of:
- Gate the swallow behind
#[cfg(test)]so the production binary fails fast. - Emit a
tracing::warn!(oreprintln!, since we don't own the subscriber) noting that init was bypassed, and which config fields were discarded. - Use a
OnceCellin the test harness to install the subscriber exactly once, and keep production strict.
Charon's cmd/relay.go:24-26 propagates the equivalent error unconditionally.
| ConsoleColor::Disable => builder.console_with_ansi(false), | ||
| }; | ||
|
|
||
| if let Some(loki_url) = self.loki.loki_addresses.first() { |
There was a problem hiding this comment.
--loki-addresses silently drops all but the first entry.
RelayLokiArgs.loki_addresses is declared Vec<String> with value_delimiter = ',' (line 269-274) and the help text says "these Loki log aggregation server addresses" (plural). The new code uses only .first() and discards the rest with no warning.
This is both a parity gap (Charon app/log/config.go:227-242 iterates the slice and tee's logs to every endpoint via a multi-sink logger) and a configuration-validation gap (an operator listing a primary + failover endpoint sees their failover dropped on the floor).
Options, ranked:
- Honour the contract: extend
pluto_tracing::LokiConfig/pluto_tracing::initto accept multiple URLs (onetracing_loki::Layerper URL stacked in the registry). Matches Charon and keeps the CLI surface honest. - Shrink the contract: change the flag to
--loki-address: Option<String>, drop thevalue_delimiter, and update help text + env var name. Easy and explicit.
What we have today — accept N, use 1 — is the worst combination: looks like it works, silently doesn't.
| }; | ||
|
|
||
| let log_config = build_console_tracing_config(self.log.level.clone(), &self.log.color); | ||
| let log_config = { |
There was a problem hiding this comment.
Lines 63-84 are a copy of commands::common::build_console_tracing_config (common.rs:27-44) with the Loki block appended — the existing // TODO: Handle loki config (common.rs:39) was the natural extension point. Today dkg.rs:127 still calls the helper, so any future change to console handling (the still-dead --log-format and --log-output-path flags being the obvious next two) needs to be applied in two places. Recommend extending the shared helper to take an optional LokiConfig (or accept the RelayLokiArgs directly) so both call sites stay synchronised.
| // subscriber and never completes on its own, so the wait is bounded. | ||
| if let Some(handle) = loki_task { | ||
| let abort_handle = handle.abort_handle(); | ||
| let _ = tokio::time::timeout(std::time::Duration::from_secs(3), handle).await; |
There was a problem hiding this comment.
The 3-second "flush window" is actually a fixed wait followed by an abort, not a graceful shutdown.
tracing_loki::BackgroundTask only resolves when the channel sender drops AND there is no in-flight send_task (tracing-loki-0.2.6/src/lib.rs). The sender lives inside the Layer clone held by the global Registry, which lives for the entire process. So tokio::time::timeout(3s, handle) always returns Err(Elapsed), and abort_handle.abort() then force-cancels the task in whatever state it's in — including mid-HTTP-POST, which drops the in-flight batch.
tracing_loki exposes the right API for this case: Builder::build_controller_url returns a (Layer, BackgroundTaskController, BackgroundTask) triple, and BackgroundTaskController::shutdown().await sends the sentinel that flips the task into draining mode — it then finishes any in-flight request and resolves naturally. Plumbing the controller through pluto_tracing::init (return both controller and task) would let the relay do:
if let Some(controller) = loki_controller {
controller.shutdown().await;
let _ = tokio::time::timeout(Duration::from_secs(3), loki_task).await;
abort_handle.abort(); // last-resort fallback
}As written, the bounded wait helps a previously-buffered request finish, but anything queued after the layer stops receiving events is lost because the task never re-enters its prepare_sending loop.
| }; | ||
|
|
||
| if let Some(loki_url) = self.loki.loki_addresses.first() { | ||
| let mut labels = HashMap::new(); |
There was a problem hiding this comment.
nit: Single-entry map can be built in one expression:
let labels = HashMap::from([("service".to_string(), self.loki.loki_service.clone())]);Drops the mut and one line.
|
|
||
| if let Some(loki_url) = self.loki.loki_addresses.first() { | ||
| let mut labels = HashMap::new(); | ||
| labels.insert("service".to_string(), self.loki.loki_service.clone()); |
There was a problem hiding this comment.
nit: try_into takes self by value, so self.loki.loki_service and self.loki.loki_addresses can be moved instead of cloned. Pull let RelayLokiArgs { loki_addresses, loki_service } = self.loki; and loki_addresses.into_iter().next() for the URL. Same for self.log.level.clone() at line 83. Saves three string allocations on startup.
| ct: CancellationToken, | ||
| ) -> Result<(), CliError> { | ||
| let loki_task = match pluto_tracing::init(&config.log_config) { | ||
| Ok(Some(task)) => Some(tokio::spawn(task)), |
There was a problem hiding this comment.
nit: The four-arm match can be reduced to three:
let loki_task = match pluto_tracing::init(&config.log_config) {
Ok(maybe_task) => maybe_task.map(tokio::spawn),
Err(pluto_tracing::init::Error::InitError(_)) => None,
Err(err) => return Err(err.into()),
};Option::map(tokio::spawn) collapses the Ok(Some) / Ok(None) pair. Also consider use pluto_tracing::init::Error as TracingInitError; at the top so the verbose path on line 292 wraps less awkwardly.
| // runtime aborts it. The task holds a `Layer` clone through the global | ||
| // subscriber and never completes on its own, so the wait is bounded. | ||
| if let Some(handle) = loki_task { | ||
| let abort_handle = handle.abort_handle(); |
There was a problem hiding this comment.
nit: tokio::time::timeout(d, handle) consumes the JoinHandle, which is why abort_handle has to be captured first. The pattern is correct but non-obvious — a one-liner comment would help:
// `timeout` consumes the JoinHandle; grab an abort handle first so we can
// still cancel the task after the timer elapses (dropping a JoinHandle does
// not abort).
let abort_handle = handle.abort_handle();| // subscriber and never completes on its own, so the wait is bounded. | ||
| if let Some(handle) = loki_task { | ||
| let abort_handle = handle.abort_handle(); | ||
| let _ = tokio::time::timeout(std::time::Duration::from_secs(3), handle).await; |
There was a problem hiding this comment.
nit: Duration::from_secs(3) is a magic number sitting inline. Lift to a const LOKI_FLUSH_TIMEOUT: Duration = Duration::from_secs(3); near the top of the module (or — better — make it part of LokiConfig so operators can tune it). Either way greppable beats inline.
Loki push URLs commonly embed `user:password@` basic-auth credentials. `info!(config = ?config)` in relay::run runs AFTER the Loki layer is wired, so the credentials would be forwarded to Loki itself (and any stderr collector) on every startup. Replace the derived Debug with a manual impl that strips userinfo from `loki_url` before formatting.
The match arm that swallowed `TryInitError` was added so multiple in-process `super::run` invocations in the relay unit tests could share the already-installed global subscriber. In a production binary the same arm would mask a real init failure (e.g. another component unexpectedly installed a subscriber first), silently dropping Loki forwarding without any signal. Restrict the tolerance to `#[cfg(test)]` so production builds surface the error.
Charon's `InitLogger` iterates every entry in `LokiAddresses` and attaches one Loki client per URL. Pluto's `TracingConfig` currently exposes only a single `LokiConfig`, so the CLI was silently using only the first address. Operators who deliberately configure multiple Loki sinks for redundancy would lose forwarding to all but one without any signal. Emit a `tracing::warn!` listing the dropped count so the behaviour gap is visible until `pluto-tracing` is extended to accept multiple Loki layers.
The previous shutdown path spawned the Loki BackgroundTask and waited on its JoinHandle with a 3s timeout, then aborted. Because the global subscriber keeps the Layer's mpsc sender alive for the lifetime of the process, the task never resolves on its own — the timeout always elapsed and the abort dropped any buffered events that were in flight. Switch `pluto_tracing::init` to `tracing_loki::builder().build_controller_url(..)` so it returns the `BackgroundTaskController` alongside the task. In the relay shutdown path, signal `controller.shutdown()` first (which sends the explicit-close marker the worker observes to drain its queue and exit), then await the JoinHandle within the same bounded window, then abort as a hard fallback for the case where the Loki endpoint is unreachable. Path-stripping is preserved by joining the parsed URL to "/" before handing it to the builder, matching what `tracing_loki::layer` used to do internally. Also emit a `tracing::error!` for a failed `run_relay_p2p_node` before the flush window so the shutdown reason reaches Loki, and lift the 3s grace period into a named `LOKI_FLUSH_TIMEOUT` constant.
The relay command had inlined the console-builder block from `build_console_tracing_config` so it could splice in the Loki layer. Extend the helper to accept an optional `LokiConfig` and route both `relay` and `dkg` through it, removing the duplicated `ConsoleColor` match and `override_env_filter(...).build()` call. The `// TODO: Handle loki config` note that pointed at this gap can now go.
Two coupled issues with the iteration-1 shutdown wiring: * Early `return Err(..)` / `?` paths between `tokio::spawn(loki.task)` and the cleanup block (e.g. missing `charon-enr-private-key` with `--auto-p2pkey` disabled) dropped `loki_shutdown` without signalling shutdown, so the `error!` lines just emitted on those paths were buffered in the mpsc and lost when the runtime tore down. The new `error!(error = %err, "relay exited with error")` block was also skipped on those paths — exactly the diagnostic this PR set out to ship to Loki. * `controller.shutdown().await` is `mpsc::Sender::send(None).await`, which parks unbounded if the channel is full. The `LOKI_FLUSH_TIMEOUT` only covered the subsequent `handle.await`, so a hung Loki endpoint (no `reqwest::Client::timeout()` is set inside `tracing_loki`) could wedge process exit forever. Restructure: extract the relay body into `serve_relay`, capture its result, and wrap the whole `controller.shutdown().await` + `handle.await` sequence in a single `tokio::time::timeout` so the budget applies to the full drain. The hard `abort_handle.abort()` fallback stays as a safety net.
The iteration-1 `tracing::warn!` fired from `RelayArgs::try_into`, which runs in `main.rs` *before* `commands::relay::run` calls `pluto_tracing::init`. No global subscriber is installed yet, so the warning was dispatched to the no-op default and dropped — silently re-introducing the very gap the warning was meant to surface. Switch to `eprintln!` so the message reaches stderr regardless of tracing state. Also reshape the `match` to make the dead-extra-addresses branch explicit rather than hiding it inside `Option::map`.
Iteration 1 added a redacting Debug for `LokiConfig` so the application
no longer prints embedded basic-auth credentials when it logs the full
config. That fix only covers Pluto-side logging — `tracing-loki`'s own
send-failure logging (`tracing::error!(error = %e, ..)` in `BackgroundTask`,
where `e` is a `reqwest::Error`) formats the failing request URL via
`reqwest::Error`'s Display, which writes the URL verbatim including
userinfo. Any 5xx/TLS hiccup against a Grafana Cloud-style endpoint
therefore still leaks the credentials to stderr — and back to Loki itself.
When the user supplies a Loki URL with embedded userinfo, parse it,
base64-encode `user:password`, attach it via `Builder::http_header("Authorization",
"Basic …")`, and hand `tracing-loki` a userinfo-free URL. Now no
credential-bearing string ever lands inside `reqwest::Url`, so the
send-failure log path cannot reprint them. Add `base64` to the
`pluto-tracing` deps and cover `extract_basic_auth` / `strip_userinfo`
with five unit tests.
- Re-export `LokiInit` from the crate root so callers can name the
return type as `pluto_tracing::LokiInit` instead of
`pluto_tracing::init::LokiInit`.
- Add `#[must_use]` to `LokiInit` so silently dropping the value
(which would leave the Layer's mpsc accumulating events with no
consumer) is at least a compiler hint.
- Drop the redundant `Error` suffix on `Error::{InitError, ParseError,
CreateLayerError}` — the variants already live inside an `Error` enum,
and the only matcher (`relay.rs`) reads `pluto_tracing::init::Error::Init(_)`
more cleanly than `...::InitError(_)`.
- Document on `LokiInit` that the controller is optional for short-lived
programs.
| let user = url.username(); | ||
| let pass = url.password().unwrap_or(""); |
There was a problem hiding this comment.
Do we need to handle %40, %2F, %3A ?
`Url::username` / `Url::password` return the percent-encoded form, so credentials containing `@`, `:`, `/`, etc. (e.g. `bob%40corp:p%3A%2Fss`) would otherwise be sent as the literal `%xx` escapes and fail auth. Co-Authored-By: varex83 <vladzhitenev2003@gmail.com>
Co-Authored-By: varex83 <vladzhitenev2003@gmail.com>
Summary
TracingConfigdirectly inRelayArgs::try_intoso the existing--log-color,--loki-addresses, and--loki-serviceflags actually flow into the tracing layer (previously parsed but ignored —common.rshad aTODO: Handle loki config).pluto_tracing::initintorelay::runand spawn the returnedBackgroundTaskso thetracing-lokiworker actually ships records (the previousmain.rscall discarded it).TryInitErrorinrelay::runso multiple in-process integration tests can share an already-installed global subscriber instead of panicking on re-init.Test plan
cargo clippy -p pluto-cli --all-targets --all-features -- -D warnings(passes locally)cargo test -p pluto-cli --all-features— confirm the relay server tests still pass with the new init pathpluto relaywithPLUTO_LOKI_ADDRESSES=http://localhost:3100against a local Loki, verify logs arrive