Skip to content

feat(relay): wire loki tracing init and graceful shutdown#419

Merged
varex83 merged 15 commits into
mainfrom
bohdan/relay-loki-init
May 20, 2026
Merged

feat(relay): wire loki tracing init and graceful shutdown#419
varex83 merged 15 commits into
mainfrom
bohdan/relay-loki-init

Conversation

@varex83
Copy link
Copy Markdown
Collaborator

@varex83 varex83 commented May 18, 2026

Summary

  • Build the relay's TracingConfig directly in RelayArgs::try_into so the existing --log-color, --loki-addresses, and --loki-service flags actually flow into the tracing layer (previously parsed but ignored — common.rs had a TODO: Handle loki config).
  • Move pluto_tracing::init into relay::run and spawn the returned BackgroundTask so the tracing-loki worker actually ships records (the previous main.rs call discarded it).
  • Add a 3-second bounded flush window on shutdown so buffered Loki batches drain before the runtime tears the worker down on SIGTERM.
  • Tolerate TryInitError in relay::run so multiple in-process integration tests can share an already-installed global subscriber instead of panicking on re-init.

Test plan

  • cargo clippy -p pluto-cli --all-targets --all-features -- -D warnings (passes locally)
  • cargo test -p pluto-cli --all-features — confirm the relay server tests still pass with the new init path
  • Manual: run pluto relay with PLUTO_LOKI_ADDRESSES=http://localhost:3100 against a local Loki, verify logs arrive
  • Manual: SIGTERM mid-run, confirm the last few seconds of logs reach Loki and shutdown completes within ~3s

- Build the relay TracingConfig in try_into so --log-color and
  --loki-addresses / --loki-service flags reach the tracing layer.
- Move pluto_tracing::init into relay::run, spawn the returned Loki
  BackgroundTask, and bound shutdown with a 3s flush window so
  buffered logs reach Loki on SIGTERM.
- Tolerate TryInitError in run so multiple in-process tests can
  share an already-installed global subscriber.
@varex83 varex83 added this to the S1 milestone May 18, 2026
@varex83 varex83 self-assigned this May 18, 2026
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid wiring of the previously-unused --loki-addresses / --loki-service flags into the relay's TracingConfig, plus a clean enough shutdown sequence to flush buffered batches before the runtime tears down. Three findings push to request-changes:

  1. info!(config = ?config) now leaks Loki URL credentials — operators routinely embed basic-auth in the Loki push URL, and this PR is what newly populates that field, so the previously-harmless log line is now the credential-exfiltration vector.
  2. Err(InitError(_)) => None silently disables Loki in production — same code path serves the test-isolation case and the production case where any other crate already installed the global subscriber. Operators see no warning that their flags were ignored.
  3. Only the first --loki-addresses entry is used — the flag is Vec<String> with plural help text, but .first() drops the rest. Charon supports multi-endpoint Loki fan-out. Either honour the contract or shrink the CLI surface.

Minor: duplicates build_console_tracing_config instead of extending it; the 3-second flush is an abort rather than a graceful shutdown (use tracing_loki::BackgroundTaskController::shutdown); --log-format / --log-output-path flags are still dead; no test exercises the Loki path.

Nothing here is unsafe-Rust or correctness-of-protocol — these are operability / secrets-handling concerns. Verdict: REQUEST_CHANGES on the three majors; the rest are minor/nit.

Comment thread crates/cli/src/commands/relay.rs
Comment thread crates/cli/src/commands/relay.rs Outdated
let loki_task = match pluto_tracing::init(&config.log_config) {
Ok(Some(task)) => Some(tokio::spawn(task)),
Ok(None) => None,
Err(pluto_tracing::init::Error::InitError(_)) => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swallowing TryInitError silently disables all of this PR's behaviour in production.

The comment frames this as a test-isolation accommodation, but the same arm fires in the real binary if any other component (a panic handler, a build-time ctor, an inadvertently-pulled-in logger crate) won the install race. In that case:

  • The relay continues without warning.
  • All flags wired by this PR — --log-level, --log-color, --loki-addresses, --loki-service — are silently discarded.
  • loki_task returns None, so the operator's configured Loki shipping is silently off.
  • The credential-bearing info!(config = ?config) at line 302 still emits to whatever subscriber was already installed — an even worse exfiltration vector than the in-our-own-Loki case.

Recommend one of:

  1. Gate the swallow behind #[cfg(test)] so the production binary fails fast.
  2. Emit a tracing::warn! (or eprintln!, since we don't own the subscriber) noting that init was bypassed, and which config fields were discarded.
  3. Use a OnceCell in the test harness to install the subscriber exactly once, and keep production strict.

Charon's cmd/relay.go:24-26 propagates the equivalent error unconditionally.

Comment thread crates/cli/src/commands/relay.rs Outdated
ConsoleColor::Disable => builder.console_with_ansi(false),
};

if let Some(loki_url) = self.loki.loki_addresses.first() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--loki-addresses silently drops all but the first entry.

RelayLokiArgs.loki_addresses is declared Vec<String> with value_delimiter = ',' (line 269-274) and the help text says "these Loki log aggregation server addresses" (plural). The new code uses only .first() and discards the rest with no warning.

This is both a parity gap (Charon app/log/config.go:227-242 iterates the slice and tee's logs to every endpoint via a multi-sink logger) and a configuration-validation gap (an operator listing a primary + failover endpoint sees their failover dropped on the floor).

Options, ranked:

  1. Honour the contract: extend pluto_tracing::LokiConfig / pluto_tracing::init to accept multiple URLs (one tracing_loki::Layer per URL stacked in the registry). Matches Charon and keeps the CLI surface honest.
  2. Shrink the contract: change the flag to --loki-address: Option<String>, drop the value_delimiter, and update help text + env var name. Easy and explicit.

What we have today — accept N, use 1 — is the worst combination: looks like it works, silently doesn't.

Comment thread crates/cli/src/commands/relay.rs Outdated
};

let log_config = build_console_tracing_config(self.log.level.clone(), &self.log.color);
let log_config = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 63-84 are a copy of commands::common::build_console_tracing_config (common.rs:27-44) with the Loki block appended — the existing // TODO: Handle loki config (common.rs:39) was the natural extension point. Today dkg.rs:127 still calls the helper, so any future change to console handling (the still-dead --log-format and --log-output-path flags being the obvious next two) needs to be applied in two places. Recommend extending the shared helper to take an optional LokiConfig (or accept the RelayLokiArgs directly) so both call sites stay synchronised.

Comment thread crates/cli/src/commands/relay.rs Outdated
// subscriber and never completes on its own, so the wait is bounded.
if let Some(handle) = loki_task {
let abort_handle = handle.abort_handle();
let _ = tokio::time::timeout(std::time::Duration::from_secs(3), handle).await;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 3-second "flush window" is actually a fixed wait followed by an abort, not a graceful shutdown.

tracing_loki::BackgroundTask only resolves when the channel sender drops AND there is no in-flight send_task (tracing-loki-0.2.6/src/lib.rs). The sender lives inside the Layer clone held by the global Registry, which lives for the entire process. So tokio::time::timeout(3s, handle) always returns Err(Elapsed), and abort_handle.abort() then force-cancels the task in whatever state it's in — including mid-HTTP-POST, which drops the in-flight batch.

tracing_loki exposes the right API for this case: Builder::build_controller_url returns a (Layer, BackgroundTaskController, BackgroundTask) triple, and BackgroundTaskController::shutdown().await sends the sentinel that flips the task into draining mode — it then finishes any in-flight request and resolves naturally. Plumbing the controller through pluto_tracing::init (return both controller and task) would let the relay do:

if let Some(controller) = loki_controller {
    controller.shutdown().await;
    let _ = tokio::time::timeout(Duration::from_secs(3), loki_task).await;
    abort_handle.abort(); // last-resort fallback
}

As written, the bounded wait helps a previously-buffered request finish, but anything queued after the layer stops receiving events is lost because the task never re-enters its prepare_sending loop.

Comment thread crates/cli/src/commands/relay.rs Outdated
};

if let Some(loki_url) = self.loki.loki_addresses.first() {
let mut labels = HashMap::new();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Single-entry map can be built in one expression:

let labels = HashMap::from([("service".to_string(), self.loki.loki_service.clone())]);

Drops the mut and one line.

Comment thread crates/cli/src/commands/relay.rs Outdated

if let Some(loki_url) = self.loki.loki_addresses.first() {
let mut labels = HashMap::new();
labels.insert("service".to_string(), self.loki.loki_service.clone());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: try_into takes self by value, so self.loki.loki_service and self.loki.loki_addresses can be moved instead of cloned. Pull let RelayLokiArgs { loki_addresses, loki_service } = self.loki; and loki_addresses.into_iter().next() for the URL. Same for self.log.level.clone() at line 83. Saves three string allocations on startup.

Comment thread crates/cli/src/commands/relay.rs Outdated
ct: CancellationToken,
) -> Result<(), CliError> {
let loki_task = match pluto_tracing::init(&config.log_config) {
Ok(Some(task)) => Some(tokio::spawn(task)),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The four-arm match can be reduced to three:

let loki_task = match pluto_tracing::init(&config.log_config) {
    Ok(maybe_task) => maybe_task.map(tokio::spawn),
    Err(pluto_tracing::init::Error::InitError(_)) => None,
    Err(err) => return Err(err.into()),
};

Option::map(tokio::spawn) collapses the Ok(Some) / Ok(None) pair. Also consider use pluto_tracing::init::Error as TracingInitError; at the top so the verbose path on line 292 wraps less awkwardly.

Comment thread crates/cli/src/commands/relay.rs Outdated
// runtime aborts it. The task holds a `Layer` clone through the global
// subscriber and never completes on its own, so the wait is bounded.
if let Some(handle) = loki_task {
let abort_handle = handle.abort_handle();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: tokio::time::timeout(d, handle) consumes the JoinHandle, which is why abort_handle has to be captured first. The pattern is correct but non-obvious — a one-liner comment would help:

// `timeout` consumes the JoinHandle; grab an abort handle first so we can
// still cancel the task after the timer elapses (dropping a JoinHandle does
// not abort).
let abort_handle = handle.abort_handle();

Comment thread crates/cli/src/commands/relay.rs Outdated
// subscriber and never completes on its own, so the wait is bounded.
if let Some(handle) = loki_task {
let abort_handle = handle.abort_handle();
let _ = tokio::time::timeout(std::time::Duration::from_secs(3), handle).await;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Duration::from_secs(3) is a magic number sitting inline. Lift to a const LOKI_FLUSH_TIMEOUT: Duration = Duration::from_secs(3); near the top of the module (or — better — make it part of LokiConfig so operators can tune it). Either way greppable beats inline.

Loki push URLs commonly embed `user:password@` basic-auth credentials.
`info!(config = ?config)` in relay::run runs AFTER the Loki layer is
wired, so the credentials would be forwarded to Loki itself (and any
stderr collector) on every startup. Replace the derived Debug with a
manual impl that strips userinfo from `loki_url` before formatting.
The match arm that swallowed `TryInitError` was added so multiple
in-process `super::run` invocations in the relay unit tests could share
the already-installed global subscriber. In a production binary the
same arm would mask a real init failure (e.g. another component
unexpectedly installed a subscriber first), silently dropping Loki
forwarding without any signal. Restrict the tolerance to `#[cfg(test)]`
so production builds surface the error.
Charon's `InitLogger` iterates every entry in `LokiAddresses` and
attaches one Loki client per URL. Pluto's `TracingConfig` currently
exposes only a single `LokiConfig`, so the CLI was silently using only
the first address. Operators who deliberately configure multiple Loki
sinks for redundancy would lose forwarding to all but one without any
signal. Emit a `tracing::warn!` listing the dropped count so the
behaviour gap is visible until `pluto-tracing` is extended to accept
multiple Loki layers.
The previous shutdown path spawned the Loki BackgroundTask and waited
on its JoinHandle with a 3s timeout, then aborted. Because the global
subscriber keeps the Layer's mpsc sender alive for the lifetime of the
process, the task never resolves on its own — the timeout always
elapsed and the abort dropped any buffered events that were in flight.

Switch `pluto_tracing::init` to `tracing_loki::builder().build_controller_url(..)`
so it returns the `BackgroundTaskController` alongside the task. In the
relay shutdown path, signal `controller.shutdown()` first (which sends
the explicit-close marker the worker observes to drain its queue and
exit), then await the JoinHandle within the same bounded window, then
abort as a hard fallback for the case where the Loki endpoint is
unreachable. Path-stripping is preserved by joining the parsed URL to
"/" before handing it to the builder, matching what `tracing_loki::layer`
used to do internally.

Also emit a `tracing::error!` for a failed `run_relay_p2p_node` before
the flush window so the shutdown reason reaches Loki, and lift the 3s
grace period into a named `LOKI_FLUSH_TIMEOUT` constant.
The relay command had inlined the console-builder block from
`build_console_tracing_config` so it could splice in the Loki layer.
Extend the helper to accept an optional `LokiConfig` and route both
`relay` and `dkg` through it, removing the duplicated `ConsoleColor`
match and `override_env_filter(...).build()` call. The `// TODO: Handle
loki config` note that pointed at this gap can now go.
Two coupled issues with the iteration-1 shutdown wiring:

* Early `return Err(..)` / `?` paths between `tokio::spawn(loki.task)`
  and the cleanup block (e.g. missing `charon-enr-private-key` with
  `--auto-p2pkey` disabled) dropped `loki_shutdown` without signalling
  shutdown, so the `error!` lines just emitted on those paths were
  buffered in the mpsc and lost when the runtime tore down. The new
  `error!(error = %err, "relay exited with error")` block was also
  skipped on those paths — exactly the diagnostic this PR set out to
  ship to Loki.
* `controller.shutdown().await` is `mpsc::Sender::send(None).await`,
  which parks unbounded if the channel is full. The `LOKI_FLUSH_TIMEOUT`
  only covered the subsequent `handle.await`, so a hung Loki endpoint
  (no `reqwest::Client::timeout()` is set inside `tracing_loki`) could
  wedge process exit forever.

Restructure: extract the relay body into `serve_relay`, capture its
result, and wrap the whole `controller.shutdown().await` +
`handle.await` sequence in a single `tokio::time::timeout` so the
budget applies to the full drain. The hard `abort_handle.abort()`
fallback stays as a safety net.
The iteration-1 `tracing::warn!` fired from `RelayArgs::try_into`,
which runs in `main.rs` *before* `commands::relay::run` calls
`pluto_tracing::init`. No global subscriber is installed yet, so the
warning was dispatched to the no-op default and dropped — silently
re-introducing the very gap the warning was meant to surface.

Switch to `eprintln!` so the message reaches stderr regardless of
tracing state. Also reshape the `match` to make the dead-extra-addresses
branch explicit rather than hiding it inside `Option::map`.
Iteration 1 added a redacting Debug for `LokiConfig` so the application
no longer prints embedded basic-auth credentials when it logs the full
config. That fix only covers Pluto-side logging — `tracing-loki`'s own
send-failure logging (`tracing::error!(error = %e, ..)` in `BackgroundTask`,
where `e` is a `reqwest::Error`) formats the failing request URL via
`reqwest::Error`'s Display, which writes the URL verbatim including
userinfo. Any 5xx/TLS hiccup against a Grafana Cloud-style endpoint
therefore still leaks the credentials to stderr — and back to Loki itself.

When the user supplies a Loki URL with embedded userinfo, parse it,
base64-encode `user:password`, attach it via `Builder::http_header("Authorization",
"Basic …")`, and hand `tracing-loki` a userinfo-free URL. Now no
credential-bearing string ever lands inside `reqwest::Url`, so the
send-failure log path cannot reprint them. Add `base64` to the
`pluto-tracing` deps and cover `extract_basic_auth` / `strip_userinfo`
with five unit tests.
- Re-export `LokiInit` from the crate root so callers can name the
  return type as `pluto_tracing::LokiInit` instead of
  `pluto_tracing::init::LokiInit`.
- Add `#[must_use]` to `LokiInit` so silently dropping the value
  (which would leave the Layer's mpsc accumulating events with no
  consumer) is at least a compiler hint.
- Drop the redundant `Error` suffix on `Error::{InitError, ParseError,
  CreateLayerError}` — the variants already live inside an `Error` enum,
  and the only matcher (`relay.rs`) reads `pluto_tracing::init::Error::Init(_)`
  more cleanly than `...::InitError(_)`.
- Document on `LokiInit` that the controller is optional for short-lived
  programs.
@emlautarom1 emlautarom1 self-requested a review May 18, 2026 21:30
Comment thread crates/tracing/src/init.rs Outdated
Comment on lines +104 to +105
let user = url.username();
let pass = url.password().unwrap_or("");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle %40, %2F, %3A ?

varex83agent and others added 3 commits May 20, 2026 14:54
`Url::username` / `Url::password` return the percent-encoded form, so
credentials containing `@`, `:`, `/`, etc. (e.g. `bob%40corp:p%3A%2Fss`)
would otherwise be sent as the literal `%xx` escapes and fail auth.

Co-Authored-By: varex83 <vladzhitenev2003@gmail.com>
Co-Authored-By: varex83 <vladzhitenev2003@gmail.com>
@varex83 varex83 merged commit e8b60cc into main May 20, 2026
12 of 13 checks passed
@varex83 varex83 deleted the bohdan/relay-loki-init branch May 20, 2026 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Loki and tracing initialization for relay server Deploy relay server to production

4 participants