From 23c53dbbd7d3aa7b977e5dc5539884ff17746730 Mon Sep 17 00:00:00 2001 From: sametd Date: Wed, 11 Feb 2026 15:07:09 +0100 Subject: [PATCH 1/7] Add ECMWF observability guidelines for logging and metrics --- Observability/observability-guidelines.md | 471 ++++++++++++++++++++++ README.md | 1 + 2 files changed, 472 insertions(+) create mode 100644 Observability/observability-guidelines.md diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md new file mode 100644 index 0000000..b1b581c --- /dev/null +++ b/Observability/observability-guidelines.md @@ -0,0 +1,471 @@ +# ECMWF Observability Guidelines + +## Table of Contents + +- [1. Purpose and Scope](#1-purpose-and-scope) +- [2. Core Principles](#2-core-principles) + - [2.1 Normative Language](#21-normative-language) +- [3. Platform Context](#3-platform-context) +- [4. Logging Standard](#4-logging-standard) + - [4.1 Log Event Model](#41-log-event-model) + - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract) + - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality) + - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging) + - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines) + - [4.6 Severity and Event Design](#46-severity-and-event-design) + - [4.7 Exception and Error Logging](#47-exception-and-error-logging) + - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules) + - [4.9 Common Anti-Patterns](#49-common-anti-patterns) + - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership) +- [5. 
Metrics Standard](#5-metrics-standard) + - [5.1 Scope and Standard](#51-scope-and-standard) + - [5.2 References](#52-references) + - [5.3 Metric Types and Usage](#53-metric-types-and-usage) + - [5.4 Naming Conventions](#54-naming-conventions) + - [5.5 Labels and Cardinality](#55-labels-and-cardinality) + - [5.6 Required Baseline Metrics](#56-required-baseline-metrics) + - [5.7 Histogram Guidance](#57-histogram-guidance) + - [5.8 Good and Bad Metric Examples](#58-good-and-bad-metric-examples) + - [5.9 Validation Checklist and Ownership](#59-validation-checklist-and-ownership) + +## 1. Purpose and Scope + +This document defines the ECMWF baseline for observability across software and services. + +Current scope: + +- Defines common expectations for observability signals. +- Defines logging and metrics standardisation. +- Covers all deployment contexts at a principle level: + - Kubernetes + - Virtual machines (VMs) + - HPC + +Out of scope in this version: + +- Detailed environment-specific collection pipelines and agent deployment patterns. +- Full tracing specification (to be defined in a later revision). + +## 2. Core Principles + +- Use consistent observability conventions across all ECMWF software. +- Prefer machine-parseable telemetry over free-form text. +- Keep telemetry actionable and low-noise. +- Correlate signals where possible (for example, include trace/span identifiers in logs when available). +- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces). + +### 2.1 Normative Language + +The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels: + +- `MUST`: mandatory requirement for compliance. +- `SHOULD`: recommended default; deviations should be justified. +- `MAY`: optional behavior. + +## 3. 
Platform Context + +ECMWF software runs in multiple environments: + +- Kubernetes clusters +- Virtual machines +- HPC systems + +This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later. + +## 4. Logging Standard + +ECMWF software should emit structured logs aligned with the OpenTelemetry log data model. + +Useful references: + +- OpenTelemetry logs data model: +- OpenTelemetry semantic conventions: + +### 4.1 Log Event Model + +Each log record should contain: + +- A clear event message (`body` / message). +- Severity (`severity_text`, `severity_number`). +- Timestamp in UTC. +- Stable resource attributes (service and environment metadata). +- Context attributes for debugging and operations. + +Canonical structure (OpenTelemetry-aligned): + +```json +{ + "timestamp": "2026-02-11T12:20:43Z", + "severity_text": "INFO", + "severity_number": 9, + "body": "Operation completed", + "resource": { + "service.name": "example-service", + "service.version": "1.0.0", + "deployment.environment": "prod", + "k8s.namespace.name": "default", + "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" + }, + "attributes": { + "event.name": "operation.completed", + "request.id": "req-8f31c9", + "job.id": "job-42a7", + "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "span_id": "f9c3a29d03ef154f" + } +} +``` + +### 4.2 Required Fields (Minimum Contract) + +All production logs MUST include the following fields. 
+ +Application-emitted fields: + +| Field | Requirement | Notes | +| --- | --- | --- | +| `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format | +| `severity_text` | MUST | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` | +| `severity_number` | MUST | Numeric OTel-compatible severity | +| `body` | MUST | Human-readable message describing one event | +| `service.name` | MUST | Logical service/application name | +| `service.version` | MUST | Deployed version/build identifier | +| `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` | +| `trace_id` | MUST when available | Enables log-trace correlation | +| `span_id` | MUST when available | Enables log-trace correlation | + +Collector-enriched or infrastructure fields: + +| Field | Requirement | Notes | +| --- | --- | --- | +| `host.name` | MUST (VM/HPC context) | May be emitted by app or added by collector/resource detection | +| `k8s.namespace.name` | MUST (K8s context) | May be added at collection layer | +| `k8s.pod.name` | MUST (K8s context) | May be added at collection layer | + +Recommended additional fields: + +- `event.name` (stable event type) +- `event.domain` (component/domain group) +- `error.type` and `error.message` for failures +- Request/work item identifiers (for example `request.id`, `job.id`) + +### 4.3 Event Naming and Attribute Cardinality + +Event naming convention: + +- Use `event.name` in the form `domain.action.result`. +- Use lowercase with `.` separators. +- Keep names stable over time. +- If an event meaning changes materially, create a new event name. + +Examples: + +- `operation.completed` +- `operation.failed` + +Attribute cardinality guidance: + +- Low to medium cardinality fields are preferred for repeated events. +- Request/job identifiers are allowed for correlation. +- Do not create dynamic field names. +- Do not move arbitrary payloads into attributes. +- Large free-text content SHOULD stay in `body` only when necessary. 
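
As an illustration of the minimum contract in Section 4.2, the following Python sketch renders standard-library log records as JSON shaped like the canonical structure in Section 4.1. The class name `OtelJsonFormatter`, the `RESOURCE` values, and the convention of passing context through `extra={"attributes": ...}` are assumptions made for this example, not part of the guideline. Trace fields are omitted because no tracing context exists in the sketch.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical resource values; a real service would read these from its
# build and deployment configuration.
RESOURCE = {
    "service.name": "example-service",
    "service.version": "1.0.0",
    "deployment.environment": "dev",
}

# Python level -> (severity_text, severity_number). The numbers are the
# lowest value of each OpenTelemetry severity range.
SEVERITY = {
    logging.DEBUG: ("DEBUG", 5),
    logging.INFO: ("INFO", 9),
    logging.WARNING: ("WARN", 13),
    logging.ERROR: ("ERROR", 17),
    logging.CRITICAL: ("FATAL", 21),
}


class OtelJsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Custom levels fall back to INFO in this sketch.
        text, number = SEVERITY.get(record.levelno, ("INFO", 9))
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "timestamp": timestamp.isoformat().replace("+00:00", "Z"),
            "severity_text": text,
            "severity_number": number,
            "body": record.getMessage(),
            "resource": RESOURCE,
            # Callers pass context through extra={"attributes": {...}}.
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(event)


def build_logger(name: str) -> logging.Logger:
    """Attach the JSON formatter to a stream handler on a named logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(OtelJsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("example").info("Operation completed", extra={"attributes": {"event.name": "operation.completed", "job.id": "job-42a7"}})` would then emit a single JSON line. Keeping the attribute keys fixed in code, rather than building them at runtime, is one way to respect the "no dynamic field names" rule.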
+ +### 4.4 Library vs Binary Application Logging + +#### Libraries + +- MUST NOT configure global logging policy. +- MUST use the application-provided logger interface. +- MUST emit structured fields, not only formatted strings. +- MUST NOT log secrets or large payloads. +- SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths. +- SHOULD include stable event names for reusable log points: + - Example: `event.name="library.decode.failed"` + - Avoid changing field keys between library versions without migration notes. + +#### Binary Applications / Services + +- MUST own logger initialisation and runtime configuration. +- MUST enforce structured JSON output compatible with OTel pipelines. +- MUST add resource context at startup (`service.*`, environment, runtime metadata). +- MUST define log level policy by environment. +- SHOULD control repetitive low-value log volume. +- MUST implement redaction/masking filters before emission. +- SHOULD ensure resource attributes are complete: + - `service.name`, `service.version`, `deployment.environment` + - Runtime and infrastructure attributes when available + +### 4.5 Good and Bad Log Lines + +Good log line characteristics: + +- Structured key/value format. +- One clear event per line. +- Includes identifiers and outcome. +- Uses stable field names. +- Supports correlation: + - Include `trace_id` and `span_id` when context exists. + - Include request/job identifiers when available. + +Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. + +Bad log line characteristics: + +- Free-form text without structure. +- Missing context or identifiers. +- Ambiguous message content. +- Includes sensitive information. +- Breaks schema consistency: + - Changes field names for the same event type. + - Encodes structured data only inside a message string. 
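
The library rules in Section 4.4 and the characteristics listed above can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical library named `examplelib`, a hypothetical attribute key `payload.size_bytes`, and the convention that structured fields travel in `extra={"attributes": ...}`. The library attaches only a `NullHandler` and leaves handlers, levels, and output format to the application.

```python
import logging

# Library side: obtain a named logger and attach only a NullHandler, so the
# library never configures handlers, levels, or formats globally.
log = logging.getLogger("examplelib.decoder")  # hypothetical library name
log.addHandler(logging.NullHandler())


def decode(payload: bytes) -> str:
    """Decode a payload as UTF-8, falling back to lossy decoding on error."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        # The failure is handled here, so it is logged once, at this boundary.
        # Structured fields go in `extra`, not inside the message string,
        # and the payload itself is never logged.
        log.warning(
            "Decode failed",
            extra={
                "attributes": {
                    "event.name": "library.decode.failed",
                    "error.type": type(exc).__name__,
                    "payload.size_bytes": len(payload),  # hypothetical key
                }
            },
        )
        return payload.decode("utf-8", errors="replace")
```

Because the event is recoverable in this sketch, it is logged at `WARN` rather than `ERROR`. The application decides whether and how the record is rendered by installing its own handler and formatter on the root logger.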
+ +Good example: + +```json +{ + "timestamp": "2026-02-11T12:20:43Z", + "severity_text": "INFO", + "severity_number": 9, + "body": "Operation completed", + "resource": { + "service.name": "example-service", + "service.version": "1.0.0", + "deployment.environment": "prod", + "k8s.namespace.name": "default", + "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" + }, + "attributes": { + "event.name": "operation.completed", + "request.id": "req-8f31c9", + "job.id": "job-42a7", + "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "span_id": "f9c3a29d03ef154f" + } +} +``` + +Bad example: + +```text +done request ok +``` + +Bad example (sensitive data leak): + +```text +Login failed for user alice password=PlainTextSecret token=eyJhbGci... +``` + +### 4.6 Severity and Event Design + +- `DEBUG`: development diagnostics and verbose internals. +- `INFO`: normal lifecycle and business-relevant state changes. +- `WARN`: unexpected but recoverable conditions. +- `ERROR`: failed operation requiring attention. +- `FATAL`: unrecoverable condition before shutdown. + +Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. + +For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference. + +### 4.7 Exception and Error Logging + +- Log an exception once at the handling boundary. +- Avoid duplicate logging of the same error in multiple layers. +- Include stack traces when they materially improve diagnosis. +- Sanitize stack traces and exception messages before emission. +- Include `error.type` and `error.message` for failed operations. + +### 4.8 Safety and Compliance Rules + +- MUST never log secrets, credentials, session tokens, private keys, or personal data. +- MUST redact sensitive substrings before writing log output. +- SHOULD avoid full object dumps unless explicitly sanitized. +- SHOULD include stack traces for errors only when useful and sanitized. 
+- SHOULD define deny-lists and redaction rules centrally: + - Authentication headers and bearer tokens + - Passwords, API keys, secrets + - User personal data fields + +### 4.9 Common Anti-Patterns + +| Anti-pattern | Why it is harmful | Preferred pattern | +| --- | --- | --- | +| Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys | +| Dynamic field names | Breaks queries and dashboards | Stable schema and key names | +| Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes | +| Duplicate exception logs across layers | Inflates incident noise | Log once at handling boundary | +| Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists | + +### 4.10 Validation Checklist and Ownership + +Before release, teams should verify: + +- Required fields are present in production logs. +- Log output is valid structured JSON. +- Secrets and sensitive data are redacted. +- Library and binary responsibilities are correctly separated. +- Severity levels are used consistently. +- Correlation fields (`trace_id`, `span_id`) are present when tracing context exists. + +Ownership split for compliance: + +| Control | App Team | Platform Team | +| --- | --- | --- | +| Structured JSON emitted by app | MUST | N/A | +| Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only | +| Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline | +| `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it | +| Log transport to backend (for example Splunk) | N/A | MUST | +| Parsing/schema validation in collector | N/A | SHOULD | +| Log noise and volume control | SHOULD at source | SHOULD as safety net | + +## 5. Metrics Standard + +Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. 
+ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. +Metrics defined in this section are the source for alerting rules defined in the Alerting section. + +### 5.1 Scope and Standard + +- This section defines instrumentation expectations, metric schema, and quality requirements. +- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately. + +### 5.2 References + +- Prometheus metric types: +- Prometheus naming best practices: +- OpenMetrics specification: + +### 5.3 Metric Types and Usage + +- `Counter`: + - MUST be monotonic. + - MUST use `_total` suffix. + - Use for counts of events and outcomes. +- `Gauge`: + - Use for values that increase and decrease (for example in-flight operations). +- `Histogram`: + - SHOULD be used for latency and size distributions. + - MUST have stable bucket boundaries for the same metric across instances. +- `Summary`: + - SHOULD be avoided for cross-instance aggregation use cases. + - MAY be used only with clear justification. + +### 5.4 Naming Conventions + +- Metric names MUST be lowercase `snake_case`. +- Metric names MUST include base units where applicable: + - `_seconds` for duration + - `_bytes` for size + - `_total` for counters +- Metric names SHOULD be stable over time. +- If a name must change, introduce the new metric and deprecate the old one before removal. + +Good naming examples: + +- `http_server_requests_total` +- `http_server_request_duration_seconds` +- `job_execution_duration_seconds` +- `process_resident_memory_bytes` + +Bad naming examples: + +- `HttpRequests` +- `requestDurationMs` +- `errors` + +### 5.5 Labels and Cardinality + +Labels add dimensionality to metrics but increase cardinality. + +- Labels MUST use stable keys and bounded value sets. 
+- Labels SHOULD describe dimensions such as: + - `service` + - `environment` + - `operation` + - `status` +- Labels MUST NOT include unbounded identifiers such as: + - `request_id` + - `user_id` + - Raw URLs with path parameters + - UUIDs or timestamps +- Label values SHOULD be normalized: + - Prefer route templates (for example `/api/v1/items/{id}`) over raw paths. + - Prefer status classes (`2xx`, `4xx`, `5xx`) when detail is not required. + +### 5.6 Required Baseline Metrics + +Application and service metrics: + +- Request/operation throughput counter + - Example: `service_requests_total` +- Request/operation failure counter + - Example: `service_request_failures_total` +- Request/operation duration histogram + - Example: `service_request_duration_seconds` +- In-flight operation gauge (if applicable) + - Example: `service_requests_in_flight` + +Runtime/process metrics (where runtime supports them): + +- CPU usage +- Memory usage +- Uptime/start time +- Runtime-specific health metrics (for example GC metrics) + +Batch/HPC job metrics (where applicable): + +- Job execution count by outcome +- Job execution duration +- Queue/wait duration + +### 5.7 Histogram Guidance + +- Histogram bucket boundaries SHOULD align with SLO/SLA objectives. +- Bucket sets MUST remain consistent for the same metric across services and versions. +- Bucket count SHOULD be limited to a practical set to control cost and query complexity. 
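
To make the bucket guidance concrete, the following Python sketch shows how a histogram with fixed boundaries accumulates observations and renders cumulative `le` buckets. The `Histogram` class and the `BUCKETS` values are assumptions chosen for illustration; a real service would normally rely on a Prometheus client library rather than hand-rolled code.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; a real service would derive
# them from its SLO thresholds and keep them fixed across versions.
BUCKETS = (0.1, 0.5, 1.0, 5.0)


class Histogram:
    """Minimal Prometheus-style histogram with fixed bucket boundaries."""

    def __init__(self, name: str, buckets: tuple = BUCKETS) -> None:
        self.name = name
        self.buckets = tuple(sorted(buckets))
        # One slot per finite bucket plus a final slot for +Inf.
        self.counts = [0] * (len(self.buckets) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose bound is >= value,
        # which matches the "less than or equal" (le) semantics.
        self.counts[bisect_left(self.buckets, value)] += 1
        self.total += value

    def render(self) -> list:
        """Return exposition lines with cumulative bucket counts."""
        lines, running = [], 0
        bounds = self.buckets + (float("inf"),)
        for bound, count in zip(bounds, self.counts):
            running += count
            le = "+Inf" if bound == float("inf") else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return lines
```

Because each bucket count is cumulative, changing the boundaries between versions makes old and new series incomparable, which is why the boundaries are held constant in the sketch.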
+ +Example bucket set for service latency metric: + +- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds + +### 5.8 Good and Bad Metric Examples + +Good examples: + +```text +service_requests_total{service="example-service",environment="prod",operation="create",status="2xx"} 12842 +service_request_duration_seconds_bucket{service="example-service",environment="prod",operation="create",le="0.5"} 12011 +service_request_duration_seconds_sum{service="example-service",environment="prod",operation="create"} 3184.22 +service_request_duration_seconds_count{service="example-service",environment="prod",operation="create"} 12842 +``` + +Bad examples: + +```text +requests{request_id="d9fd0f7a-3d8e-4c17-9d8b-9b57f43dc40e",user_id="483992"} 1 +requestDurationMs{path="/api/v1/items/123456"} 187 +``` + +### 5.9 Validation Checklist and Ownership + +Before release, teams should verify: + +- Metric names, units, and suffixes are compliant. +- Required baseline metrics are present. +- Label keys and values are bounded and normalized. +- No high-cardinality identifiers are emitted as labels. +- Histogram buckets are defined and justified. 
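
Parts of this checklist could be automated. The sketch below is one possible pre-release check in Python, assuming a hypothetical `check_metric` helper and a deny-list limited to the two label keys named in Section 5.5. It covers only name casing, the counter suffix, and forbidden label keys, and is not a complete validator.

```python
import re

# Lowercase snake_case: letters and digits separated by single underscores.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Label keys listed in Section 5.5 as unbounded identifiers.
FORBIDDEN_LABELS = {"request_id", "user_id"}


def check_metric(name: str, metric_type: str, labels: set) -> list:
    """Return a list of violations; an empty list means no issue was found."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name is not lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter name lacks _total suffix")
    for label in sorted(labels & FORBIDDEN_LABELS):
        problems.append(f"label {label} is unbounded")
    return problems
```

Run against the examples in Section 5.8, `service_requests_total` passes, while `requests` with `request_id` and `user_id` labels is reported for the missing suffix and for both labels. Checks on label values (for example raw paths) need runtime data and are outside this sketch.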
+ +Ownership split for compliance: + +| Control | App Team | Platform Team | +| --- | --- | --- | +| Instrument required baseline metrics | MUST | N/A | +| Naming and unit compliance | MUST | SHOULD validate | +| Label cardinality discipline | MUST | SHOULD enforce guardrails | +| Scrape/discovery pipeline configuration | N/A | MUST | +| Central metric relabeling and hygiene checks | N/A | SHOULD | +| Cost and cardinality monitoring at platform level | N/A | SHOULD | diff --git a/README.md b/README.md index 5cb47c9..03798a2 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM - [Project Maturity](./Project%20Maturity) - [Containerisation](./Containerisation) - [Testing](./Testing) +- [Observability](./Observability) - [ECMWF Software EnginE (ESEE)](./ESEE) - [Contributing to External Projects](./Contributing%20Externally/) - [Incoming External Contributions](./External%20Contributions/) From 7ff4198d166e74d204b2e5ac612c17bd2c683399 Mon Sep 17 00:00:00 2001 From: sametd Date: Wed, 11 Feb 2026 15:33:24 +0100 Subject: [PATCH 2/7] linting and wrapping --- Observability/observability-guidelines.md | 50 ++++++++++++++++------- 1 file changed, 35 insertions(+), 15 deletions(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index b1b581c..3ecaa61 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -1,5 +1,9 @@ # ECMWF Observability Guidelines + + ## Table of Contents - [1. Purpose and Scope](#1-purpose-and-scope) @@ -51,8 +55,10 @@ Out of scope in this version: - Use consistent observability conventions across all ECMWF software. - Prefer machine-parseable telemetry over free-form text. - Keep telemetry actionable and low-noise. -- Correlate signals where possible (for example, include trace/span identifiers in logs when available). 
-- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces). +- Correlate signals where possible (for example, include trace/span + identifiers in logs when available). +- Protect sensitive data by design (no credentials, tokens, or personal data + in logs/metrics/traces). ### 2.1 Normative Language @@ -70,11 +76,14 @@ ECMWF software runs in multiple environments: - Virtual machines - HPC systems -This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later. +This document focuses on common logs and metrics structure plus application +emission rules. Environment-specific collection design for Kubernetes, VMs, +and HPC will be specified later. ## 4. Logging Standard -ECMWF software should emit structured logs aligned with the OpenTelemetry log data model. +ECMWF software should emit structured logs aligned with the OpenTelemetry +log data model. Useful references: @@ -208,7 +217,8 @@ Good log line characteristics: - Include `trace_id` and `span_id` when context exists. - Include request/job identifiers when available. -Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. +Examples below use the same canonical structure as Section 4.1 (`resource` +and `attributes`) for consistency. Bad log line characteristics: @@ -265,9 +275,11 @@ Login failed for user alice password=PlainTextSecret token=eyJhbGci... - `ERROR`: failed operation requiring attention. - `FATAL`: unrecoverable condition before shutdown. -Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. +Use stable event names (`event.name`) where possible, and make messages +explicit about outcome, target, and reason. -For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference. 
+For severity mapping guidance, follow OpenTelemetry severity concepts in the +logs data model reference. ### 4.7 Exception and Error Logging @@ -279,7 +291,8 @@ For severity mapping guidance, follow OpenTelemetry severity concepts in the log ### 4.8 Safety and Compliance Rules -- MUST never log secrets, credentials, session tokens, private keys, or personal data. +- MUST never log secrets, credentials, session tokens, private keys, or + personal data. - MUST redact sensitive substrings before writing log output. - SHOULD avoid full object dumps unless explicitly sanitized. - SHOULD include stack traces for errors only when useful and sanitized. @@ -324,19 +337,24 @@ Ownership split for compliance: ## 5. Metrics Standard Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. -ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. -Metrics defined in this section are the source for alerting rules defined in the Alerting section. +ECMWF services MUST use Prometheus metric types and naming conventions, and +MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. +Metrics defined in this section are the source for alerting rules defined in +the Alerting section. ### 5.1 Scope and Standard -- This section defines instrumentation expectations, metric schema, and quality requirements. -- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately. +- This section defines instrumentation expectations, metric schema, and + quality requirements. +- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC + are specified separately. 
### 5.2 References - Prometheus metric types: - Prometheus naming best practices: -- OpenMetrics specification: +- OpenMetrics specification: + ### 5.3 Metric Types and Usage @@ -361,7 +379,8 @@ Metrics defined in this section are the source for alerting rules defined in the - `_bytes` for size - `_total` for counters - Metric names SHOULD be stable over time. -- If a name must change, introduce the new metric and deprecate the old one before removal. +- If a name must change, introduce the new metric and deprecate the old one + before removal. Good naming examples: @@ -429,7 +448,8 @@ Batch/HPC job metrics (where applicable): Example bucket set for service latency metric: -- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds +- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, + `10` seconds ### 5.8 Good and Bad Metric Examples From d665791f03b3044b376aead625d4901d5c7e88da Mon Sep 17 00:00:00 2001 From: sametd Date: Mon, 16 Feb 2026 20:23:40 +0100 Subject: [PATCH 3/7] Clarify observability model across environments and trace correlation --- Observability/observability-guidelines.md | 50 ++++++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index 3ecaa61..fea8a43 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -10,12 +10,14 @@ - [2. Core Principles](#2-core-principles) - [2.1 Normative Language](#21-normative-language) - [3. Platform Context](#3-platform-context) + - [3.1 High-Level Collection Strategy](#31-high-level-collection-strategy) - [4. 
Logging Standard](#4-logging-standard) - [4.1 Log Event Model](#41-log-event-model) - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract) - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality) - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging) - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines) + - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id) - [4.6 Severity and Event Design](#46-severity-and-event-design) - [4.7 Exception and Error Logging](#47-exception-and-error-logging) - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules) @@ -80,6 +82,37 @@ This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later. +### 3.1 High-Level Collection Strategy + +The collection pipeline is part of the deployment environment and MUST be +considered in service design. + +- Kubernetes workloads: + - Platform Engineering Team deploys and operates OpenTelemetry collectors. + - Workloads emit logs/metrics in the agreed formats. +- VM and HPC workloads: + - A collector/forwarder SHOULD run alongside the application or on the host. + - Workloads emit logs/metrics in the agreed formats. +- Central ingestion: + - Logs are forwarded to the central ECMWF logging backend. + - Metrics are collected into the central Prometheus-compatible metrics stack. + +```mermaid +flowchart TB + subgraph K["Kubernetes"] + A1["Workloads"] --> A2["Collector
(DaemonSet/Sidecar)"] + end + + subgraph V["VM / HPC"] + B1["Applications / Jobs"] --> B2["Collector Agent
(Host-local)"] + end + + A2 --> C["Central Ingestion"] + B2 --> C + C -->|logs| D["Logs Backend"] + C -->|metrics| E["Prometheus Metrics
Stack"] +``` + ## 4. Logging Standard ECMWF software should emit structured logs aligned with the OpenTelemetry @@ -220,6 +253,15 @@ Good log line characteristics: Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. +#### 4.5.1 Trace Correlation Fields (`trace_id` and `span_id`) + +- `trace_id` identifies the full end-to-end request/workflow across services. +- `span_id` identifies one operation within that trace in a single service. +- Multiple log records in one service operation typically share a `span_id`. +- A single `trace_id` usually contains multiple spans across components. +- When tracing context is unavailable (for example offline batch steps), + these fields MAY be absent. + Bad log line characteristics: - Free-form text without structure. @@ -279,7 +321,8 @@ Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. For severity mapping guidance, follow OpenTelemetry severity concepts in the -logs data model reference. +logs data model: + ### 4.7 Exception and Error Logging @@ -348,6 +391,11 @@ the Alerting section. quality requirements. - Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately. +- Metrics exposure and collection at a high level: + - HTTP services SHOULD expose a `/metrics` endpoint owned by the service. + - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible + metrics, typically via a local collector/forwarder integration. + - Platform Engineering Team owns central scrape and ingestion configuration. 
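
As a rough illustration of a service-owned `/metrics` endpoint, the following Python sketch serves a single counter in the Prometheus text format using only the standard library. The `REQUESTS_TOTAL` store, `render_metrics`, and `MetricsHandler` are hypothetical names for this example; a production service would more likely use a Prometheus client library than render the format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter values keyed by status class.
REQUESTS_TOTAL = {"2xx": 0, "5xx": 0}


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = ["# TYPE service_requests_total counter"]
    for status, value in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'service_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve GET /metrics and reject every other path."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header(
            "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
        )
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Silence the default per-request access log on stderr.
        pass


def serve(port: int = 8000) -> None:
    """Block forever, serving metrics on localhost (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` from the service's startup code would expose the endpoint for scraping. The scrape configuration itself stays with the Platform Engineering Team, as stated above.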
### 5.2 References From b0db1736ad461f864a015199ffe32bad0436e896 Mon Sep 17 00:00:00 2001 From: sametd Date: Mon, 16 Feb 2026 20:23:58 +0100 Subject: [PATCH 4/7] Refine ownership model with explicit team roles --- Observability/observability-guidelines.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index fea8a43..9794461 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -367,7 +367,7 @@ Before release, teams should verify: Ownership split for compliance: -| Control | App Team | Platform Team | +| Control | Development Team | Platform Engineering Team | | --- | --- | --- | | Structured JSON emitted by app | MUST | N/A | | Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only | @@ -529,11 +529,11 @@ Before release, teams should verify: Ownership split for compliance: -| Control | App Team | Platform Team | -| --- | --- | --- | -| Instrument required baseline metrics | MUST | N/A | -| Naming and unit compliance | MUST | SHOULD validate | -| Label cardinality discipline | MUST | SHOULD enforce guardrails | -| Scrape/discovery pipeline configuration | N/A | MUST | -| Central metric relabeling and hygiene checks | N/A | SHOULD | -| Cost and cardinality monitoring at platform level | N/A | SHOULD | +| Control | Development Team | Platform Engineering Team | Production Team | +| --- | --- | --- | --- | +| Instrument required baseline metrics | MUST | N/A | SHOULD review service-level usefulness | +| Naming and unit compliance | MUST | SHOULD validate | SHOULD validate monitoring readiness | +| Label cardinality discipline | MUST | SHOULD enforce guardrails | SHOULD flag operational risks | +| Scrape/discovery pipeline configuration | N/A | MUST | SHOULD validate production coverage | +| Central metric relabeling and hygiene 
checks | N/A | SHOULD | N/A | +| Cost and cardinality monitoring at platform level | N/A | SHOULD | SHOULD provide operational feedback | From 4641fba68fa2b273eb5369fcd1546d3a5d81aefc Mon Sep 17 00:00:00 2001 From: sametd Date: Tue, 3 Mar 2026 01:54:07 +0100 Subject: [PATCH 5/7] docs(observability): clarify operational guidance and legacy migration --- Observability/observability-guidelines.md | 244 +++++++++++++++++++--- 1 file changed, 214 insertions(+), 30 deletions(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index 9794461..545759c 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -11,6 +11,9 @@ - [2.1 Normative Language](#21-normative-language) - [3. Platform Context](#3-platform-context) - [3.1 High-Level Collection Strategy](#31-high-level-collection-strategy) + - [3.2 Log Access and Ownership](#32-log-access-and-ownership) + - [3.3 Log Retention and Archival](#33-log-retention-and-archival) + - [3.4 Telemetry Outage and Recovery](#34-telemetry-outage-and-recovery) - [4. 
Logging Standard](#4-logging-standard) - [4.1 Log Event Model](#41-log-event-model) - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract) @@ -18,11 +21,13 @@ - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging) - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines) - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id) + - [4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)](#452-correlation-identifiers-trace_id-requestid-jobid) - [4.6 Severity and Event Design](#46-severity-and-event-design) - [4.7 Exception and Error Logging](#47-exception-and-error-logging) - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules) - [4.9 Common Anti-Patterns](#49-common-anti-patterns) - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership) + - [4.11 Legacy Compatibility and Migration](#411-legacy-compatibility-and-migration) - [5. Metrics Standard](#5-metrics-standard) - [5.1 Scope and Standard](#51-scope-and-standard) - [5.2 References](#52-references) @@ -46,6 +51,8 @@ Current scope: - Kubernetes - Virtual machines (VMs) - HPC + - Bare metal servers + - Remote data-mover hosts Out of scope in this version: @@ -77,10 +84,12 @@ ECMWF software runs in multiple environments: - Kubernetes clusters - Virtual machines - HPC systems +- Bare metal servers +- Remote data-mover hosts This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, -and HPC will be specified later. +HPC, bare metal, and remote data-mover hosts will be specified later. ### 3.1 High-Level Collection Strategy @@ -89,30 +98,116 @@ considered in service design. - Kubernetes workloads: - Platform Engineering Team deploys and operates OpenTelemetry collectors. - - Workloads emit logs/metrics in the agreed formats. 
-- VM and HPC workloads:
-  - A collector/forwarder SHOULD run alongside the application or on the host.
-  - Workloads emit logs/metrics in the agreed formats.
-- Central ingestion:
-  - Logs are forwarded to the central ECMWF logging backend.
-  - Metrics are collected into the central Prometheus-compatible metrics stack.
+  - Application stdout/stderr is captured by the container runtime into node
+    log files (or equivalent runtime log sources), and collectors read/tail
+    those sources.
+  - Metrics/traces are collected via SDK/exporter endpoints or local agents,
+    depending on service design.
+- VM, HPC, and bare-metal workloads (including remote data-mover hosts):
+  - Applications/jobs write logs to host-local logging sources (for example
+    files, journald, or syslog), and host-local or scheduler-integrated
+    collectors read from those sources.
+  - Metrics/traces are collected via local endpoints/agents where enabled.
+- Central ingestion (common stage for all environments):
+  - Receives telemetry from Kubernetes and VM/HPC/bare-metal collection paths.
+  - Routes logs to the central ECMWF logging backend.
+  - Routes metrics to the central Prometheus-compatible metrics stack.

 ```mermaid
 flowchart TB
   subgraph K["Kubernetes"]
-    A1["Workloads"] --> A2["Collector<br/>(DaemonSet/Sidecar)"]
+    K1["Workloads"] --> K2["Container Runtime Log Sources"]
+    K2 --> K3["Collector<br/>(DaemonSet/Sidecar)"]
   end

-  subgraph V["VM / HPC"]
-    B1["Applications / Jobs"] --> B2["Collector Agent<br/>(Host-local)"]
+  subgraph H["VM / HPC / Bare Metal"]
+    H1["Applications / Jobs"] --> H2["Host-local Log Sources<br/>(files/journald/syslog)"]
+    H2 --> H3["Collector Agent<br/>(Host-local / Scheduler-integrated)"]
   end

-  A2 --> C["Central Ingestion"]
-  B2 --> C
+  K3 --> C["Central Ingestion<br/>(Common Stage)"]
+  H3 --> C
   C -->|logs| D["Logs Backend"]
   C -->|metrics| E["Prometheus Metrics<br/>Stack"]
 ```

+### 3.2 Log Access and Ownership
+
+Access to production logs is a service onboarding requirement and MUST be
+defined explicitly for each service and environment (Kubernetes, VM, HPC,
+bare metal, and remote data-mover hosts).
+
+Minimum governance requirements:
+
+- Access path MUST be documented (for example central logging UI/API and,
+  where required for resilience, approved host-local access method).
+- Access roles MUST be defined and approved by service owner and operations.
+- Access MUST be granted through managed team groups.
+- Access provisioning and access changes MUST follow the standard IAM
+  approval and logging process.
+- Emergency access procedure MUST be documented, including allowed methods
+  during central logging outage (for Kubernetes, controlled use of
+  `kubectl logs`; for VM/HPC/bare metal, approved host-local log access).
+  Emergency access MUST be time-limited and linked to an active incident.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Define service log access requirements | MUST | SHOULD review feasibility | MUST review operational fit |
+| Implement central access controls (RBAC/SSO/groups) | N/A | MUST | SHOULD validate |
+| Approve and periodically review access lists | MUST | SHOULD support automation | MUST |
+| Maintain emergency access runbook | SHOULD contribute service context | SHOULD provide platform procedure | MUST own operational procedure |
+
+### 3.3 Log Retention and Archival
+
+Log retention requirements MUST be defined for each service and environment
+at onboarding and reviewed during major service changes.
+
+Retention model:
+
+- A default retention period MUST be provided by the central logging service.
+- Service-specific retention overrides MAY be requested with justification.
+- Long-term archival requirements (beyond central retention) MUST be declared
+  by the service owner and approved by operations.
+ +Ownership model: + +| Control | Development Team | Platform Engineering Team | Production Team | +| --- | --- | --- | --- | +| Declare required retention and archival period | MUST | SHOULD review feasibility | MUST review operational fit | +| Implement retention in central logging platform | N/A | MUST | SHOULD validate production coverage | +| Implement and operate long-term archival pipeline | N/A | SHOULD support platform integration | MUST | +| Periodically review retention settings and costs | SHOULD | MUST provide platform metrics | MUST | + +### 3.4 Telemetry Outage and Recovery + +Observability design MUST include degraded-mode behavior for periods when +central telemetry ingestion is unavailable. + +Minimum requirements: + +- Applications MUST continue emitting telemetry signals locally during central + outage: + - logs to a local, accessible sink (stdout/stderr, file, or system logger); + - metrics via a local scrape/export endpoint or host-local collector path; + - traces to a local collector/agent where tracing is enabled. +- Collection/forwarding components SHOULD buffer locally and retry delivery + when connectivity or backend availability is restored. +- Backfill after recovery MUST be supported where buffering exists. +- Known telemetry coverage gaps MUST be detectable and reported to operations. +- Services and runbooks MUST define how urgent logs are retrieved during + outage using documented emergency access methods. 
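The buffering, retry, and backfill requirements above can be illustrated with a minimal sketch. It is not part of the guideline and does not describe any ECMWF collector: `BufferingForwarder`, its `send` callable, and the `max_buffered` limit are hypothetical names, and the sketch assumes that a failed delivery surfaces as `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque


class BufferingForwarder:
    """Hypothetical forwarder: buffers records locally and retries delivery."""

    def __init__(self, send: Callable[[dict], None], max_buffered: int = 10_000) -> None:
        self._send = send
        self._buffer: Deque[dict] = deque()
        self._max_buffered = max_buffered
        # Count of records lost to buffer overflow, so the coverage gap
        # stays detectable and can be reported to operations.
        self.dropped = 0

    def emit(self, record: dict) -> None:
        """Queue a record, then try to deliver everything that is pending."""
        if len(self._buffer) >= self._max_buffered:
            self._buffer.popleft()  # drop the oldest record when full
            self.dropped += 1
        self._buffer.append(record)
        self.flush()

    def flush(self) -> int:
        """Deliver buffered records in order; stop at the first failure."""
        delivered = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # backend unavailable: keep the record for the next retry
            self._buffer.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        """Number of records waiting for delivery (backfill size)."""
        return len(self._buffer)
```

Calling `flush()` again once the backend is reachable performs the backfill, and a non-zero `dropped` counter is one possible health signal for a known coverage gap.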
+ +Ownership model: + +| Control | Development Team | Platform Engineering Team | Production Team | +| --- | --- | --- | --- | +| Ensure service emits local telemetry in degraded mode | MUST | SHOULD provide guidance | SHOULD validate in production | +| Provide buffering/retry/backfill capability in pipeline | N/A | SHOULD | MUST validate operational readiness | +| Detect and report ingestion coverage gaps | SHOULD emit health signals | MUST provide platform-level detection | MUST monitor and escalate | +| Maintain outage and recovery runbook | SHOULD contribute service behavior | SHOULD contribute platform behavior | MUST own incident operation | + ## 4. Logging Standard ECMWF software should emit structured logs aligned with the OpenTelemetry @@ -125,7 +220,8 @@ Useful references: ### 4.1 Log Event Model -Each log record should contain: +Each log event MUST provide the following information, either directly in the +record or via stable resource/context enrichment in the pipeline: - A clear event message (`body` / message). - Severity (`severity_text`, `severity_number`). @@ -162,6 +258,10 @@ Canonical structure (OpenTelemetry-aligned): All production logs MUST include the following fields. +The minimum contract applies to the effective log event at query/analysis +time. Fields MAY be set directly by the application or added by approved +collector/pipeline enrichment, provided values are stable and correct. + Application-emitted fields: | Field | Requirement | Notes | @@ -173,7 +273,7 @@ Application-emitted fields: | `service.name` | MUST | Logical service/application name | | `service.version` | MUST | Deployed version/build identifier | | `deployment.environment` | MUST | e.g. 
`dev`, `test`, `staging`, `prod` | -| `trace_id` | MUST when available | Enables log-trace correlation | +| `trace_id` | SHOULD when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events | | `span_id` | MUST when available | Enables log-trace correlation | Collector-enriched or infrastructure fields: @@ -205,27 +305,49 @@ Examples: - `operation.completed` - `operation.failed` -Attribute cardinality guidance: +When defining log attributes, teams MUST consider attribute cardinality. +Cardinality is the number of distinct values an attribute has across events. + +High cardinality reduces observability quality because each distinct value +creates its own group, which fragments aggregates and increases storage/query +cost. + +Attribute guidance: -- Low to medium cardinality fields are preferred for repeated events. -- Request/job identifiers are allowed for correlation. +- Prefer low to medium cardinality attributes for repeated events. +- Use request/job identifiers only for correlation and troubleshooting. - Do not create dynamic field names. - Do not move arbitrary payloads into attributes. -- Large free-text content SHOULD stay in `body` only when necessary. +- Keep large free-text content in `body` when necessary. ### 4.4 Library vs Binary Application Logging #### Libraries -- MUST not configure global logging policy. -- MUST use the application-provided logger interface. -- MUST emit structured fields, not only formatted strings. +- MUST not configure global logging policy (sinks, format, or global levels). +- MUST use logger/context provided by the application, or a documented + adapter/interface supplied by the application. +- MUST expose structured key/value fields in logging calls, not only + pre-formatted message strings. - MUST not log secrets or large payloads. - SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths. 
- SHOULD include stable event names for reusable log points: - Example: `event.name="library.decode.failed"` - Avoid changing field keys between library versions without migration notes. +Library API expectation: + +- Library entry points SHOULD accept logging context from the caller + (logger handle/interface plus correlation fields when available). +- If a logger is not passed explicitly, the library SHOULD accept a context + object that carries logger and correlation metadata. +- Library code SHOULD propagate the received logger/context unchanged to lower + library layers. +- Libraries MUST NOT silently create independent global logger configuration as + a fallback. +- Libraries SHOULD document the expected logger/context contract in their public + API (what is required, optional, and how correlation fields are passed). + #### Binary Applications / Services - MUST own logger initialisation and runtime configuration. @@ -262,6 +384,29 @@ and `attributes`) for consistency. - When tracing context is unavailable (for example offline batch steps), these fields MAY be absent. +#### 4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`) + +These identifiers represent different scopes of correlation and MAY appear +together in a single log event. + +- `trace_id`: identifies one end-to-end distributed trace across services. + It is created by tracing instrumentation and used to follow cross-service + call chains. +- `request.id`: identifies one application-level request or unit of API/user + work at the service boundary. It is generated by the application or + middleware handling that request and propagated through service logs. +- `job.id`: identifies one batch or workflow execution (for example scheduler + submission, worker run, or pipeline task instance). It is used to correlate + logs across the full lifecycle of that job. + +Guidance: + +- These identifiers are complementary and not interchangeable. 
+- Include all identifiers that exist in the current execution context. +- In request/response flows, `trace_id` and `request.id` often coexist. +- In batch/HPC flows, `job.id` is usually primary; `trace_id` MAY be absent + unless tracing is enabled for that workflow. + Bad log line characteristics: - Free-form text without structure. @@ -326,11 +471,23 @@ logs data model: ### 4.7 Exception and Error Logging -- Log an exception once at the handling boundary. -- Avoid duplicate logging of the same error in multiple layers. +- Emit one primary error log at the handling boundary that determines outcome + (for example request failure, job failure, retry exhaustion). +- Intermediate layers MAY log additional context, but SHOULD avoid duplicating + full stack traces/messages for the same failure path. +- Preserve failure context for cascaded errors by recording: + - the high-level operation that failed (for example request decoding, + workflow step execution, data transfer); + - the immediate reason at that layer; + - the underlying cause summary when the error was wrapped/propagated from a + lower layer. +- Use the language/runtime error-chain mechanism where available so operators + can reconstruct the sequence of failure causes from boundary logs. - Include stack traces when they materially improve diagnosis. - Sanitize stack traces and exception messages before emission. -- Include `error.type` and `error.message` for failed operations. + +Goal: preserve the failure chain (for example I/O error -> decode error -> +request failure) without logging the same full stack trace at every layer. 
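The failure chain in the goal statement above (I/O error -> decode error -> request failure) can be sketched in Python with `raise ... from`, which records the lower-level error in `__cause__`. This is an illustration only: the function names, the event name, and the `error.cause_chain` attribute are hypothetical, and the record is returned rather than written to a log sink to keep the example short.

```python
import json
from typing import List


def error_chain(exc: BaseException) -> List[str]:
    """Summarise an exception and its explicit causes, outermost first."""
    chain = []
    current = exc
    while current is not None:
        chain.append(f"{type(current).__name__}: {current}")
        current = current.__cause__
    return chain


def read_payload(path: str) -> bytes:
    # Lowest layer: the underlying I/O failure.
    raise OSError(f"cannot read {path}")


def decode_request(path: str) -> dict:
    try:
        return json.loads(read_payload(path))
    except OSError as exc:
        # Intermediate layer: add context and wrap, without logging a stack trace.
        raise ValueError("request decoding failed") from exc


def handle_request(path: str) -> dict:
    """Handling boundary: builds the single primary error record."""
    try:
        decode_request(path)
    except ValueError as exc:
        return {
            "severityText": "ERROR",
            "severityNumber": 17,
            "body": "Request failed",
            "attributes": {
                "event.name": "request.handle.failed",
                "exception.type": type(exc).__name__,
                "exception.message": str(exc),
                # Hypothetical attribute name, used here for illustration only.
                "error.cause_chain": error_chain(exc),
            },
        }
    return {"severityText": "INFO", "severityNumber": 9, "body": "Request completed"}
```

Only `handle_request` produces an error record; the intermediate layer contributes context through the exception chain instead of a second log line.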
### 4.8 Safety and Compliance Rules @@ -351,7 +508,7 @@ logs data model: | Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys | | Dynamic field names | Breaks queries and dashboards | Stable schema and key names | | Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes | -| Duplicate exception logs across layers | Inflates incident noise | Log once at handling boundary | +| Duplicate exception logs across layers | Inflates incident noise | One primary error log at handling boundary; keep intermediate logs contextual and avoid duplicate full stacks | | Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists | ### 4.10 Validation Checklist and Ownership @@ -359,7 +516,8 @@ logs data model: Before release, teams should verify: - Required fields are present in production logs. -- Log output is valid structured JSON. +- Log output is valid structured JSON, or legacy format logs are mapped to + the common schema via approved pipeline parsing/enrichment. - Secrets and sensitive data are redacted. - Library and binary responsibilities are correctly separated. - Severity levels are used consistently. 
@@ -369,7 +527,7 @@ Ownership split for compliance: | Control | Development Team | Platform Engineering Team | | --- | --- | --- | -| Structured JSON emitted by app | MUST | N/A | +| Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A | | Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only | | Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline | | `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it | @@ -377,6 +535,28 @@ Ownership split for compliance: | Parsing/schema validation in collector | N/A | SHOULD | | Log noise and volume control | SHOULD at source | SHOULD as safety net | +### 4.11 Legacy Compatibility and Migration + +These guidelines define the target logging model, but do not require immediate +JSON-only migration for existing stable services. + +Compatibility requirements: + +- Existing log formats MAY continue where they are operationally established. +- Service teams MUST document the current format and downstream consumers + before changing log structure. +- New services MUST emit structured logs by default. +- For existing services, collector/pipeline parsing and enrichment MAY be used + to map legacy logs into the common schema. +- Migration SHOULD be incremental and service-specific, with no disruption to + existing operational workflows. + +Target-state requirement: + +- Services that can adopt structured JSON logging without operational risk + SHOULD do so; services with high migration cost MAY follow a phased plan + agreed with platform and operations teams. + ## 5. Metrics Standard Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. @@ -389,8 +569,8 @@ the Alerting section. - This section defines instrumentation expectations, metric schema, and quality requirements. 
-- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC - are specified separately. +- Environment-specific scrape/discovery designs for Kubernetes, VMs, HPC, + bare metal, and remote data-mover hosts are specified separately. - Metrics exposure and collection at a high level: - HTTP services SHOULD expose a `/metrics` endpoint owned by the service. - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible @@ -446,6 +626,10 @@ Bad naming examples: ### 5.5 Labels and Cardinality Labels add dimensionality to metrics but increase cardinality. +Cardinality is the number of distinct label combinations a metric produces. + +High cardinality reduces metric usefulness because it creates too many +series, increasing storage/query cost and weakening dashboard/alert signal. - Labels MUST use stable keys and bounded value sets. - Labels SHOULD describe dimensions such as: From bfab84f74561c7769ecd3c900aa9db11ce42d917 Mon Sep 17 00:00:00 2001 From: sametd Date: Fri, 6 Mar 2026 12:04:25 +0100 Subject: [PATCH 6/7] fix(observability): align logging guidelines with OTel log data model - Rename severity_text/severity_number to severityText/severityNumber - Rename trace_id/span_id to traceId/spanId and move to top-level LogRecord fields - Upgrade traceId requirement from SHOULD to MUST when available - Split required fields table into LogRecord fields and Resource attributes - Downgrade deployment.environment from MUST to SHOULD - Add TRACE severity level and severityText-to-severityNumber mapping table - Add OTel exception attributes: exception.type, exception.message, exception.stacktrace - Remove deprecated event.domain attribute - Replace error.message with exception.message per OTel semantic conventions - Fix MUST not -> MUST NOT in library logging rules - Fix lowercase normative keywords (should -> SHOULD/MUST NOT) - Update event.name examples to follow three-part domain.action.result format - Align 4.10 ownership table with 
deployment.environment SHOULD requirement --- Observability/observability-guidelines.md | 110 +++++++++++++--------- 1 file changed, 64 insertions(+), 46 deletions(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index 545759c..0eeb1e6 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -20,8 +20,8 @@ - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality) - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging) - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines) - - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id) - - [4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)](#452-correlation-identifiers-trace_id-requestid-jobid) + - [4.5.1 Trace Correlation Fields (`traceId` and `spanId`)](#451-trace-correlation-fields-traceid-and-spanid) + - [4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`)](#452-correlation-identifiers-traceid-requestid-jobid) - [4.6 Severity and Event Design](#46-severity-and-event-design) - [4.7 Exception and Error Logging](#47-exception-and-error-logging) - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules) @@ -74,7 +74,7 @@ Out of scope in this version: The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels: - `MUST`: mandatory requirement for compliance. -- `SHOULD`: recommended default; deviations should be justified. +- `SHOULD`: recommended default; deviations require justification. - `MAY`: optional behavior. ## 3. Platform Context @@ -210,7 +210,7 @@ Ownership model: ## 4. Logging Standard -ECMWF software should emit structured logs aligned with the OpenTelemetry +ECMWF software SHOULD emit structured logs aligned with the OpenTelemetry log data model. 
Useful references: @@ -224,7 +224,7 @@ Each log event MUST provide the following information, either directly in the record or via stable resource/context enrichment in the pipeline: - A clear event message (`body` / message). -- Severity (`severity_text`, `severity_number`). +- Severity (`severityText`, `severityNumber`). - Timestamp in UTC. - Stable resource attributes (service and environment metadata). - Context attributes for debugging and operations. @@ -234,8 +234,10 @@ Canonical structure (OpenTelemetry-aligned): ```json { "timestamp": "2026-02-11T12:20:43Z", - "severity_text": "INFO", - "severity_number": 9, + "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "spanId": "f9c3a29d03ef154f", + "severityText": "INFO", + "severityNumber": 9, "body": "Operation completed", "resource": { "service.name": "example-service", @@ -245,11 +247,9 @@ Canonical structure (OpenTelemetry-aligned): "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" }, "attributes": { - "event.name": "operation.completed", + "event.name": "data.transfer.completed", "request.id": "req-8f31c9", - "job.id": "job-42a7", - "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", - "span_id": "f9c3a29d03ef154f" + "job.id": "job-42a7" } } ``` @@ -262,19 +262,24 @@ The minimum contract applies to the effective log event at query/analysis time. Fields MAY be set directly by the application or added by approved collector/pipeline enrichment, provided values are stable and correct. 
-Application-emitted fields: +LogRecord fields (top-level in the log record): | Field | Requirement | Notes | | --- | --- | --- | | `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format | -| `severity_text` | MUST | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` | -| `severity_number` | MUST | Numeric OTel-compatible severity | +| `severityText` | MUST | `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` | +| `severityNumber` | MUST | Numeric OTel-compatible severity | | `body` | MUST | Human-readable message describing one event | +| `traceId` | MUST when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events | +| `spanId` | MUST when available | Enables log-trace correlation | + +Resource attributes (nested inside the `resource` block): + +| Field | Requirement | Notes | +| --- | --- | --- | | `service.name` | MUST | Logical service/application name | | `service.version` | MUST | Deployed version/build identifier | -| `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` | -| `trace_id` | SHOULD when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events | -| `span_id` | MUST when available | Enables log-trace correlation | +| `deployment.environment` | SHOULD | e.g. 
`dev`, `test`, `staging`, `prod`; may not be known by the application at runtime | Collector-enriched or infrastructure fields: @@ -287,8 +292,7 @@ Collector-enriched or infrastructure fields: Recommended additional fields: - `event.name` (stable event type) -- `event.domain` (component/domain group) -- `error.type` and `error.message` for failures +- `error.type` for error classification; `exception.type`, `exception.message`, and `exception.stacktrace` for exception details - Request/work item identifiers (for example `request.id`, `job.id`) ### 4.3 Event Naming and Attribute Cardinality @@ -302,8 +306,8 @@ Event naming convention: Examples: -- `operation.completed` -- `operation.failed` +- `data.transfer.completed` +- `data.transfer.failed` When defining log attributes, teams MUST consider attribute cardinality. Cardinality is the number of distinct values an attribute has across events. @@ -324,12 +328,12 @@ Attribute guidance: #### Libraries -- MUST not configure global logging policy (sinks, format, or global levels). +- MUST NOT configure global logging policy (sinks, format, or global levels). - MUST use logger/context provided by the application, or a documented adapter/interface supplied by the application. - MUST expose structured key/value fields in logging calls, not only pre-formatted message strings. -- MUST not log secrets or large payloads. +- MUST NOT log secrets or large payloads. - SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths. - SHOULD include stable event names for reusable log points: - Example: `event.name="library.decode.failed"` @@ -369,27 +373,27 @@ Good log line characteristics: - Includes identifiers and outcome. - Uses stable field names. - Supports correlation: - - Include `trace_id` and `span_id` when context exists. + - Include `traceId` and `spanId` when context exists. - Include request/job identifiers when available. 
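One way an application could produce log lines with the characteristics listed above is a custom formatter for Python's standard `logging` module. This is a sketch under assumptions, not a prescribed implementation: `OtelJsonFormatter` is a hypothetical name, the resource values are illustrative, and passing structured fields through `extra={"attributes": {...}}` is a convention chosen for this example.

```python
import json
import logging
from datetime import datetime, timezone

# Python level name -> (severityText, severityNumber), lowest value of each range.
_SEVERITY = {
    "DEBUG": ("DEBUG", 5),
    "INFO": ("INFO", 9),
    "WARNING": ("WARN", 13),
    "ERROR": ("ERROR", 17),
    "CRITICAL": ("FATAL", 21),
}


class OtelJsonFormatter(logging.Formatter):
    """Hypothetical formatter producing the canonical record layout."""

    def __init__(self, resource: dict) -> None:
        super().__init__()
        self._resource = dict(resource)

    def format(self, record: logging.LogRecord) -> str:
        text, number = _SEVERITY.get(record.levelname, ("INFO", 9))
        event = {
            # UTC timestamp in RFC 3339 / ISO-8601 form.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severityText": text,
            "severityNumber": number,
            "body": record.getMessage(),
            "resource": self._resource,
            # Structured fields arrive via `extra={"attributes": {...}}`.
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(event)


# Usage: the binary application owns handler and formatter configuration.
logger = logging.getLogger("example-service")
handler = logging.StreamHandler()
handler.setFormatter(OtelJsonFormatter({"service.name": "example-service"}))
logger.addHandler(handler)
logger.warning(
    "Transfer retried",
    extra={"attributes": {"event.name": "data.transfer.retried", "job.id": "job-42a7"}},
)
```

The sketch omits `traceId` and `spanId`; where a tracing context exists, they would be added as top-level fields in the same way.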
Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. -#### 4.5.1 Trace Correlation Fields (`trace_id` and `span_id`) +#### 4.5.1 Trace Correlation Fields (`traceId` and `spanId`) -- `trace_id` identifies the full end-to-end request/workflow across services. -- `span_id` identifies one operation within that trace in a single service. -- Multiple log records in one service operation typically share a `span_id`. -- A single `trace_id` usually contains multiple spans across components. +- `traceId` identifies the full end-to-end request/workflow across services. +- `spanId` identifies one operation within that trace in a single service. +- Multiple log records in one service operation typically share a `spanId`. +- A single `traceId` usually contains multiple spans across components. - When tracing context is unavailable (for example offline batch steps), these fields MAY be absent. -#### 4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`) +#### 4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`) These identifiers represent different scopes of correlation and MAY appear together in a single log event. -- `trace_id`: identifies one end-to-end distributed trace across services. +- `traceId`: identifies one end-to-end distributed trace across services. It is created by tracing instrumentation and used to follow cross-service call chains. - `request.id`: identifies one application-level request or unit of API/user @@ -403,8 +407,8 @@ Guidance: - These identifiers are complementary and not interchangeable. - Include all identifiers that exist in the current execution context. -- In request/response flows, `trace_id` and `request.id` often coexist. -- In batch/HPC flows, `job.id` is usually primary; `trace_id` MAY be absent +- In request/response flows, `traceId` and `request.id` often coexist. 
+- In batch/HPC flows, `job.id` is usually primary; `traceId` MAY be absent unless tracing is enabled for that workflow. Bad log line characteristics: @@ -422,8 +426,10 @@ Good example: ```json { "timestamp": "2026-02-11T12:20:43Z", - "severity_text": "INFO", - "severity_number": 9, + "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "spanId": "f9c3a29d03ef154f", + "severityText": "INFO", + "severityNumber": 9, "body": "Operation completed", "resource": { "service.name": "example-service", @@ -433,11 +439,9 @@ Good example: "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" }, "attributes": { - "event.name": "operation.completed", + "event.name": "data.transfer.completed", "request.id": "req-8f31c9", - "job.id": "job-42a7", - "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", - "span_id": "f9c3a29d03ef154f" + "job.id": "job-42a7" } } ``` @@ -456,17 +460,29 @@ Login failed for user alice password=PlainTextSecret token=eyJhbGci... ### 4.6 Severity and Event Design +- `TRACE`: fine-grained diagnostics; more verbose than `DEBUG`. - `DEBUG`: development diagnostics and verbose internals. - `INFO`: normal lifecycle and business-relevant state changes. - `WARN`: unexpected but recoverable conditions. - `ERROR`: failed operation requiring attention. - `FATAL`: unrecoverable condition before shutdown. +`severityText` to `severityNumber` mapping (use the lowest value in the range +unless a finer distinction is needed): + +| `severityText` | `severityNumber` range | +| --- | --- | +| `TRACE` | 1–4 | +| `DEBUG` | 5–8 | +| `INFO` | 9–12 | +| `WARN` | 13–16 | +| `ERROR` | 17–20 | +| `FATAL` | 21–24 | + Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. -For severity mapping guidance, follow OpenTelemetry severity concepts in the -logs data model: +For the full severity number specification including sub-levels, see: ### 4.7 Exception and Error Logging @@ -483,6 +499,8 @@ logs data model: lower layer. 
- Use the language/runtime error-chain mechanism where available so operators can reconstruct the sequence of failure causes from boundary logs. +- When recording exception details, use the OTel semantic convention attributes: + `exception.type`, `exception.message`, and `exception.stacktrace`. - Include stack traces when they materially improve diagnosis. - Sanitize stack traces and exception messages before emission. @@ -491,7 +509,7 @@ request failure) without logging the same full stack trace at every layer. ### 4.8 Safety and Compliance Rules -- MUST never log secrets, credentials, session tokens, private keys, or +- MUST NOT log secrets, credentials, session tokens, private keys, or personal data. - MUST redact sensitive substrings before writing log output. - SHOULD avoid full object dumps unless explicitly sanitized. @@ -513,7 +531,7 @@ request failure) without logging the same full stack trace at every layer. ### 4.10 Validation Checklist and Ownership -Before release, teams should verify: +Before release, teams SHOULD verify: - Required fields are present in production logs. - Log output is valid structured JSON, or legacy format logs are mapped to @@ -521,14 +539,14 @@ Before release, teams should verify: - Secrets and sensitive data are redacted. - Library and binary responsibilities are correctly separated. - Severity levels are used consistently. -- Correlation fields (`trace_id`, `span_id`) are present when tracing context exists. +- Correlation fields (`traceId`, `spanId`) are present when tracing context exists. 
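The checklist items above that concern record shape (required fields, consistent severity, correlation fields) could be checked automatically before release. The following is a minimal sketch, assuming log events are available as parsed JSON objects; `check_log_event` and its `tracing_active` flag are hypothetical names, and the severity ranges are copied from the mapping table in Section 4.6.

```python
REQUIRED_TOP_LEVEL = ("timestamp", "severityText", "severityNumber", "body")
REQUIRED_RESOURCE = ("service.name", "service.version")
SEVERITY_RANGES = {
    "TRACE": range(1, 5),
    "DEBUG": range(5, 9),
    "INFO": range(9, 13),
    "WARN": range(13, 17),
    "ERROR": range(17, 21),
    "FATAL": range(21, 25),
}


def check_log_event(event: dict, tracing_active: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in event:
            problems.append(f"missing field: {field}")
    resource = event.get("resource", {})
    for field in REQUIRED_RESOURCE:
        if field not in resource:
            problems.append(f"missing resource attribute: {field}")
    text = event.get("severityText")
    number = event.get("severityNumber")
    if text is not None and text not in SEVERITY_RANGES:
        problems.append(f"unknown severityText: {text}")
    elif text is not None and number is not None and number not in SEVERITY_RANGES[text]:
        problems.append(f"severityNumber {number} outside range for {text}")
    # Correlation fields are only required when a tracing context exists.
    if tracing_active:
        for field in ("traceId", "spanId"):
            if field not in event:
                problems.append(f"missing correlation field: {field}")
    return problems
```

A check like this covers structure only; redaction of secrets and the library/binary separation still need review by other means.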
Ownership split for compliance: | Control | Development Team | Platform Engineering Team | | --- | --- | --- | | Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A | -| Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only | +| Required app fields (`service.name`, `service.version`, `body`, severity); `deployment.environment` where known | MUST; `deployment.environment` SHOULD | Validate only | | Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline | | `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it | | Log transport to backend (for example Splunk) | N/A | MUST | @@ -703,7 +721,7 @@ requestDurationMs{path="/api/v1/items/123456"} 187 ### 5.9 Validation Checklist and Ownership -Before release, teams should verify: +Before release, teams SHOULD verify: - Metric names, units, and suffixes are compliant. - Required baseline metrics are present. 
From 1bf0aa15c14cc6c46ad0bc991951596019b4570e Mon Sep 17 00:00:00 2001
From: sametd
Date: Sat, 14 Mar 2026 00:36:05 +0100
Subject: [PATCH 7/7] docs(observability): move guidelines to Development Practices/Observability.md

---
 .../Observability.md | 0
 README.md | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename Observability/observability-guidelines.md => Development Practices/Observability.md (100%)

diff --git a/Observability/observability-guidelines.md b/Development Practices/Observability.md
similarity index 100%
rename from Observability/observability-guidelines.md
rename to Development Practices/Observability.md
diff --git a/README.md b/README.md
index 03798a2..f644184 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM
 - [Project Maturity](./Project%20Maturity)
 - [Containerisation](./Containerisation)
 - [Testing](./Testing)
-- [Observability](./Observability)
+- [Observability](./Development%20Practices/Observability.md)
 - [ECMWF Software EnginE (ESEE)](./ESEE)
 - [Contributing to External Projects](./Contributing%20Externally/)
 - [Incoming External Contributions](./External%20Contributions/)