Edge telemetry: vector-only (drop otelcol, Loki, Tempo) #20
Merged
Consolidates edge telemetry onto Vector and removes otelcol-contrib:
- vector docker_logs -> OTLP /v1/logs with clean per-container metadata (container.name/image + network/testnet/instance/ethereum_cl/ethereum_el/forwarder), reduce-batched (~500 records/envelope). docker_logs reads the Docker API, so every container is covered with no duplication, no missed port-less containers, and no empty-metadata race (the problems otelcol's filelog + docker_observer hit).
- vector opentelemetry source (use_otlp_decoding) on :4317/:4318 -> OTLP /v1/traces; the client's OTLP batching is preserved (1 request in = 1 out).
- otelcol_contrib_cleanup=true removes the otelcol container.
- clients re-pointed: --telemetry-collector-url / --rpc.telemetry.endpoint otelcol -> vector.
- vector 0.55.0 -> 0.56.0.
- Loki and Tempo sinks dropped (no longer needed).

Verified live on lighthouse-geth-super-1: logs 0% empty-metadata, 0 dups, all containers incl. validator; traces flowing to platform batched.
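The reduce-batching mentioned above (roughly 500 records per envelope) can be pictured with a small Python sketch. This is only an illustration of the idea, not Vector's `reduce` transform or this PR's config: the `EnvelopeBatcher` class, the envelope shape, and grouping by container name alone are all assumptions. A real pipeline would presumably also flush on a timeout so quiet containers are not held back.

```python
from collections import defaultdict


class EnvelopeBatcher:
    """Hypothetical batcher: groups log records per container and emits
    one envelope once a group reaches max_records."""

    def __init__(self, max_records=500):
        self.max_records = max_records
        self._pending = defaultdict(list)

    def add(self, container_name, record):
        """Buffer a record; return a full envelope, or None if still filling."""
        group = self._pending[container_name]
        group.append(record)
        if len(group) >= self.max_records:
            return self._flush(container_name)
        return None

    def _flush(self, container_name):
        # Assumed envelope shape: shared metadata once, records as a list.
        records = self._pending.pop(container_name)
        return {"resource": {"container.name": container_name}, "records": records}

    def flush_all(self):
        """Emit whatever is buffered, e.g. on shutdown or a timer tick."""
        return [self._flush(name) for name in list(self._pending)]
```

The point of the grouping is that the per-container metadata is sent once per envelope instead of once per log line.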
The shaper extracts the source log level (logfmt level=xxx, else a level token near the start of the line) and sets SeverityNumber on the OTel scale (TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17, CRIT=18, FATAL=21). SeverityText keeps the source's exact text; only the number is normalised, and unrecognised lines are left unset rather than guessed (previously every line was hardcoded INFO).
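As a rough illustration of the mapping described above, here is a minimal Python sketch. It is not the shaper's actual code. The severity numbers come from the commit message; the regexes, the alias table (`WARNING`, `ERR`, `CRITICAL`), and the 64-character window standing in for "near the start of the line" are assumptions made for the example.

```python
import re

# SeverityNumber values as listed in the commit message.
SEVERITY_NUMBER = {
    "TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13,
    "ERROR": 17, "CRIT": 18, "FATAL": 21,
}

# Assumption: common spellings folded onto the listed levels.
ALIASES = {"WARNING": "WARN", "ERR": "ERROR", "CRITICAL": "CRIT"}

# logfmt style: level=info or level="info"
LOGFMT_LEVEL = re.compile(r'\blevel="?([A-Za-z]+)"?')
# Bare level token; only the first 64 characters are searched (assumption).
LEADING_TOKEN = re.compile(
    r"\b(TRACE|DEBUG|INFO|WARN(?:ING)?|ERROR|ERR|CRIT(?:ICAL)?|FATAL)\b",
    re.IGNORECASE,
)


def extract_severity(line):
    """Return (severity_text, severity_number), or (None, None) if unrecognised."""
    for match in (LOGFMT_LEVEL.search(line), LEADING_TOKEN.search(line[:64])):
        if match is None:
            continue
        text = match.group(1)  # keep the source's exact text
        key = ALIASES.get(text.upper(), text.upper())
        if key in SEVERITY_NUMBER:
            return text, SEVERITY_NUMBER[key]
    return None, None  # left unset rather than guessed
```

Note that the returned text keeps the source's casing (`warn` stays `warn`); only the number is normalised.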
JSON loggers (e.g. xatu-sentry: {"level":"info",...}) weren't caught by the
logfmt/text level matchers, so they shipped with severity unset. Parse the line
as JSON when it starts with "{" and read .level / .severity before falling back
to the text matchers. The raw line is still the log body; only severity is
derived.
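The JSON-first fallback order can be sketched in Python as follows. This is a hedged illustration rather than the shaper's implementation: `json_level`, `derive_level`, and the `text_fallback` parameter are hypothetical names, and the fallback stands in for the logfmt/text matchers.

```python
import json


def json_level(line):
    """Return the raw level string from a JSON log line, or None."""
    if not line.startswith("{"):
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # looked like JSON but was not; let the text matchers try
    if not isinstance(record, dict):
        return None
    for key in ("level", "severity"):  # .level first, then .severity
        value = record.get(key)
        if isinstance(value, str) and value:
            return value
    return None


def derive_level(line, text_fallback=lambda _line: None):
    """Try JSON first, then fall back to the text matchers (passed in)."""
    level = json_level(line)
    return level if level is not None else text_fallback(line)
```

The line itself is never rewritten here, which matches the note that the raw line remains the log body and only severity is derived.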
…ol config

Ingress identity (OTLP auth user + ingress_user tag) now derives from ethereum_network_name instead of secret_loki.username, so log/trace attribution stays correct when the sops username isn't bumped between devnet iterations. Remove the dead otelcol_contrib_config block (the container is already removed via otelcol_contrib_cleanup); keep only the Vector-based config.
Consolidates edge telemetry onto Vector, removing otelcol-contrib, Loki, and Tempo.
Why: otelcol's `filelog` can't enrich Docker logs with per-container metadata, and the `docker_observer` + `receiver_creator` workaround is port-oriented: it duplicates multi-port containers, misses port-less ones (e.g. `validator`), and races on metadata (~27% of logs landed with no `container.name`). This is a known dead end (contrib #44555, closed stale; community consensus is "use a sidecar like Vector"). Vector's `docker_logs` reads the Docker API directly, so metadata is inline and clean for every container.

Changes:
- `docker_logs` → OTLP `/v1/logs`, full metadata (`container.name`/`container.image.name` + `network`/`testnet`/`instance`/`ethereum_cl`/`ethereum_el`/`forwarder`), `reduce`-batched (~500 records/envelope).
- `opentelemetry` source (`use_otlp_decoding: true`) on 4317/4318 → OTLP `/v1/traces`; client batching preserved (1 request in = 1 out). The trace-batching issue from the earlier Vector attempt was a pre-`use_otlp_decoding` artifact and is resolved.
- otelcol container removed (`otelcol_contrib_cleanup=true`); clients re-pointed `otelcol` → `vector`; Vector 0.55.0 → 0.56.0; Loki + Tempo dropped.

Verified live on `lighthouse-geth-super-1`: logs 0% empty-metadata, 0 dups, every container covered (incl. port-less `validator`); traces flowing to platform batched; otelcol gone; vector healthchecks pass.

Rollout note: applying this fleet-wide re-points and recreates the client containers (telemetry endpoint change), i.e. it rolls the clients.