From 23c53dbbd7d3aa7b977e5dc5539884ff17746730 Mon Sep 17 00:00:00 2001
From: sametd <samet.demir@ecmwf.int>
Date: Wed, 11 Feb 2026 15:07:09 +0100
Subject: [PATCH 1/7] Add ECMWF observability guidelines for logging and
 metrics

---
 Observability/observability-guidelines.md | 471 ++++++++++++++++++++++
 README.md                                 |   1 +
 2 files changed, 472 insertions(+)
 create mode 100644 Observability/observability-guidelines.md

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
new file mode 100644
index 0000000..b1b581c
--- /dev/null
+++ b/Observability/observability-guidelines.md
@@ -0,0 +1,471 @@
+# ECMWF Observability Guidelines
+
+## Table of Contents
+
+- [1. Purpose and Scope](#1-purpose-and-scope)
+- [2. Core Principles](#2-core-principles)
+  - [2.1 Normative Language](#21-normative-language)
+- [3. Platform Context](#3-platform-context)
+- [4. Logging Standard](#4-logging-standard)
+  - [4.1 Log Event Model](#41-log-event-model)
+  - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract)
+  - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality)
+  - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging)
+  - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines)
+  - [4.6 Severity and Event Design](#46-severity-and-event-design)
+  - [4.7 Exception and Error Logging](#47-exception-and-error-logging)
+  - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules)
+  - [4.9 Common Anti-Patterns](#49-common-anti-patterns)
+  - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership)
+- [5. Metrics Standard](#5-metrics-standard)
+  - [5.1 Scope and Standard](#51-scope-and-standard)
+  - [5.2 References](#52-references)
+  - [5.3 Metric Types and Usage](#53-metric-types-and-usage)
+  - [5.4 Naming Conventions](#54-naming-conventions)
+  - [5.5 Labels and Cardinality](#55-labels-and-cardinality)
+  - [5.6 Required Baseline Metrics](#56-required-baseline-metrics)
+  - [5.7 Histogram Guidance](#57-histogram-guidance)
+  - [5.8 Good and Bad Metric Examples](#58-good-and-bad-metric-examples)
+  - [5.9 Validation Checklist and Ownership](#59-validation-checklist-and-ownership)
+
+## 1. Purpose and Scope
+
+This document defines the ECMWF baseline for observability across software and services.
+
+Current scope:
+
+- Defines common expectations for observability signals.
+- Defines logging and metrics standardisation.
+- Covers all deployment contexts at a principle level:
+  - Kubernetes
+  - Virtual machines (VMs)
+  - HPC
+
+Out of scope in this version:
+
+- Detailed environment-specific collection pipelines and agent deployment patterns.
+- Full tracing specification (to be defined in a later revision).
+
+## 2. Core Principles
+
+- Use consistent observability conventions across all ECMWF software.
+- Prefer machine-parseable telemetry over free-form text.
+- Keep telemetry actionable and low-noise.
+- Correlate signals where possible (for example, include trace/span identifiers in logs when available).
+- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces).
+
+### 2.1 Normative Language
+
+The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels:
+
+- `MUST`: mandatory requirement for compliance.
+- `SHOULD`: recommended default; deviations should be justified.
+- `MAY`: optional behavior.
+
+## 3. Platform Context
+
+ECMWF software runs in multiple environments:
+
+- Kubernetes clusters
+- Virtual machines
+- HPC systems
+
+This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later.
+
+## 4. Logging Standard
+
+ECMWF software should emit structured logs aligned with the OpenTelemetry log data model.
+
+Useful references:
+
+- OpenTelemetry logs data model: <https://opentelemetry.io/docs/specs/otel/logs/data-model/>
+- OpenTelemetry semantic conventions: <https://opentelemetry.io/docs/specs/semconv/>
+
+### 4.1 Log Event Model
+
+Each log record should contain:
+
+- A clear event message (`body` / message).
+- Severity (`severity_text`, `severity_number`).
+- Timestamp in UTC.
+- Stable resource attributes (service and environment metadata).
+- Context attributes for debugging and operations.
+
+Canonical structure (OpenTelemetry-aligned):
+
+```json
+{
+  "timestamp": "2026-02-11T12:20:43Z",
+  "severity_text": "INFO",
+  "severity_number": 9,
+  "body": "Operation completed",
+  "resource": {
+    "service.name": "example-service",
+    "service.version": "1.0.0",
+    "deployment.environment": "prod",
+    "k8s.namespace.name": "default",
+    "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
+  },
+  "attributes": {
+    "event.name": "operation.completed",
+    "request.id": "req-8f31c9",
+    "job.id": "job-42a7",
+    "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+    "span_id": "f9c3a29d03ef154f"
+  }
+}
+```
+
+### 4.2 Required Fields (Minimum Contract)
+
+All production logs MUST include the following fields.
+
+Application-emitted fields:
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
+| `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format |
+| `severity_text` | MUST | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` |
+| `severity_number` | MUST | Numeric OTel-compatible severity |
+| `body` | MUST | Human-readable message describing one event |
+| `service.name` | MUST | Logical service/application name |
+| `service.version` | MUST | Deployed version/build identifier |
+| `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` |
+| `trace_id` | MUST when available | Enables log-trace correlation |
+| `span_id` | MUST when available | Enables log-trace correlation |
+
+Collector-enriched or infrastructure fields:
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
+| `host.name` | MUST (VM/HPC context) | May be emitted by app or added by collector/resource detection |
+| `k8s.namespace.name` | MUST (K8s context) | May be added at collection layer |
+| `k8s.pod.name` | MUST (K8s context) | May be added at collection layer |
+
+Recommended additional fields:
+
+- `event.name` (stable event type)
+- `event.domain` (component/domain group)
+- `error.type` and `error.message` for failures
+- Request/work item identifiers (for example `request.id`, `job.id`)
+
+### 4.3 Event Naming and Attribute Cardinality
+
+Event naming convention:
+
+- Use `event.name` in the form `domain.action.result`.
+- Use lowercase with `.` separators.
+- Keep names stable over time.
+- If an event meaning changes materially, create a new event name.
+
+Examples:
+
+- `operation.completed`
+- `operation.failed`
+
+Attribute cardinality guidance:
+
+- Low to medium cardinality fields are preferred for repeated events.
+- Request/job identifiers are allowed for correlation.
+- Do not create dynamic field names.
+- Do not move arbitrary payloads into attributes.
+- Large free-text content SHOULD stay in `body` only when necessary.
+
+### 4.4 Library vs Binary Application Logging
+
+#### Libraries
+
+- MUST not configure global logging policy.
+- MUST use the application-provided logger interface.
+- MUST emit structured fields, not only formatted strings.
+- MUST not log secrets or large payloads.
+- SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths.
+- SHOULD include stable event names for reusable log points:
+  - Example: `event.name="library.decode.failed"`
+  - Avoid changing field keys between library versions without migration notes.
+
+#### Binary Applications / Services
+
+- MUST own logger initialisation and runtime configuration.
+- MUST enforce structured JSON output compatible with OTel pipelines.
+- MUST add resource context at startup (`service.*`, environment, runtime metadata).
+- MUST define log level policy by environment.
+- SHOULD control repetitive low-value log volume.
+- MUST implement redaction/masking filters before emission.
+- SHOULD ensure resource attributes are complete:
+  - `service.name`, `service.version`, `deployment.environment`
+  - Runtime and infrastructure attributes when available
+
+### 4.5 Good and Bad Log Lines
+
+Good log line characteristics:
+
+- Structured key/value format.
+- One clear event per line.
+- Includes identifiers and outcome.
+- Uses stable field names.
+- Supports correlation:
+  - Include `trace_id` and `span_id` when context exists.
+  - Include request/job identifiers when available.
+
+Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency.
+
+Bad log line characteristics:
+
+- Free-form text without structure.
+- Missing context or identifiers.
+- Ambiguous message content.
+- Includes sensitive information.
+- Breaks schema consistency:
+  - Changes field names for the same event type.
+  - Encodes structured data only inside a message string.
+
+Good example:
+
+```json
+{
+  "timestamp": "2026-02-11T12:20:43Z",
+  "severity_text": "INFO",
+  "severity_number": 9,
+  "body": "Operation completed",
+  "resource": {
+    "service.name": "example-service",
+    "service.version": "1.0.0",
+    "deployment.environment": "prod",
+    "k8s.namespace.name": "default",
+    "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
+  },
+  "attributes": {
+    "event.name": "operation.completed",
+    "request.id": "req-8f31c9",
+    "job.id": "job-42a7",
+    "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+    "span_id": "f9c3a29d03ef154f"
+  }
+}
+```
+
+Bad example:
+
+```text
+done request ok
+```
+
+Bad example (sensitive data leak):
+
+```text
+Login failed for user alice password=PlainTextSecret token=eyJhbGci...
+```
+
+### 4.6 Severity and Event Design
+
+- `DEBUG`: development diagnostics and verbose internals.
+- `INFO`: normal lifecycle and business-relevant state changes.
+- `WARN`: unexpected but recoverable conditions.
+- `ERROR`: failed operation requiring attention.
+- `FATAL`: unrecoverable condition before shutdown.
+
+Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason.
+
+For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference.
+
+### 4.7 Exception and Error Logging
+
+- Log an exception once at the handling boundary.
+- Avoid duplicate logging of the same error in multiple layers.
+- Include stack traces when they materially improve diagnosis.
+- Sanitize stack traces and exception messages before emission.
+- Include `error.type` and `error.message` for failed operations.
+
+### 4.8 Safety and Compliance Rules
+
+- MUST never log secrets, credentials, session tokens, private keys, or personal data.
+- MUST redact sensitive substrings before writing log output.
+- SHOULD avoid full object dumps unless explicitly sanitized.
+- SHOULD include stack traces for errors only when useful and sanitized.
+- SHOULD define deny-lists and redaction rules centrally:
+  - Authentication headers and bearer tokens
+  - Passwords, API keys, secrets
+  - User personal data fields
+
+### 4.9 Common Anti-Patterns
+
+| Anti-pattern | Why it is harmful | Preferred pattern |
+| --- | --- | --- |
+| Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys |
+| Dynamic field names | Breaks queries and dashboards | Stable schema and key names |
+| Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes |
+| Duplicate exception logs across layers | Inflates incident noise | Log once at handling boundary |
+| Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists |
+
+### 4.10 Validation Checklist and Ownership
+
+Before release, teams should verify:
+
+- Required fields are present in production logs.
+- Log output is valid structured JSON.
+- Secrets and sensitive data are redacted.
+- Library and binary responsibilities are correctly separated.
+- Severity levels are used consistently.
+- Correlation fields (`trace_id`, `span_id`) are present when tracing context exists.
+
+Ownership split for compliance:
+
+| Control | App Team | Platform Team |
+| --- | --- | --- |
+| Structured JSON emitted by app | MUST | N/A |
+| Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only |
+| Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline |
+| `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it |
+| Log transport to backend (for example Splunk) | N/A | MUST |
+| Parsing/schema validation in collector | N/A | SHOULD |
+| Log noise and volume control | SHOULD at source | SHOULD as safety net |
+
+## 5. Metrics Standard
+
+Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format.
+ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
+Metrics defined in this section are the source for alerting rules defined in the Alerting section.
+
+### 5.1 Scope and Standard
+
+- This section defines instrumentation expectations, metric schema, and quality requirements.
+- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately.
+
+### 5.2 References
+
+- Prometheus metric types: <https://prometheus.io/docs/concepts/metric_types/>
+- Prometheus naming best practices: <https://prometheus.io/docs/practices/naming/>
+- OpenMetrics specification: <https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md>
+
+### 5.3 Metric Types and Usage
+
+- `Counter`:
+  - MUST be monotonic.
+  - MUST use `_total` suffix.
+  - Use for counts of events and outcomes.
+- `Gauge`:
+  - Use for values that increase and decrease (for example in-flight operations).
+- `Histogram`:
+  - SHOULD be used for latency and size distributions.
+  - MUST have stable bucket boundaries for the same metric across instances.
+- `Summary`:
+  - SHOULD be avoided for cross-instance aggregation use cases.
+  - MAY be used only with clear justification.
+
+### 5.4 Naming Conventions
+
+- Metric names MUST be lowercase `snake_case`.
+- Metric names MUST include base units where applicable:
+  - `_seconds` for duration
+  - `_bytes` for size
+  - `_total` for counters
+- Metric names SHOULD be stable over time.
+- If a name must change, introduce the new metric and deprecate the old one before removal.
+
+Good naming examples:
+
+- `http_server_requests_total`
+- `http_server_request_duration_seconds`
+- `job_execution_duration_seconds`
+- `process_resident_memory_bytes`
+
+Bad naming examples:
+
+- `HttpRequests`
+- `requestDurationMs`
+- `errors`
+
+### 5.5 Labels and Cardinality
+
+Labels add dimensionality to metrics but increase cardinality.
+
+- Labels MUST use stable keys and bounded value sets.
+- Labels SHOULD describe dimensions such as:
+  - `service`
+  - `environment`
+  - `operation`
+  - `status`
+- Labels MUST NOT include unbounded identifiers such as:
+  - `request_id`
+  - `user_id`
+  - Raw URLs with path parameters
+  - UUIDs or timestamps
+- Label values SHOULD be normalized:
+  - Prefer route templates (for example `/api/v1/items/{id}`) over raw paths.
+  - Prefer status classes (`2xx`, `4xx`, `5xx`) when detail is not required.
+
+### 5.6 Required Baseline Metrics
+
+Application and service metrics:
+
+- Request/operation throughput counter
+  - Example: `service_requests_total`
+- Request/operation failure counter
+  - Example: `service_request_failures_total`
+- Request/operation duration histogram
+  - Example: `service_request_duration_seconds`
+- In-flight operation gauge (if applicable)
+  - Example: `service_requests_in_flight`
+
+Runtime/process metrics (where runtime supports them):
+
+- CPU usage
+- Memory usage
+- Uptime/start time
+- Runtime-specific health metrics (for example GC metrics)
+
+Batch/HPC job metrics (where applicable):
+
+- Job execution count by outcome
+- Job execution duration
+- Queue/wait duration
+
+### 5.7 Histogram Guidance
+
+- Histogram bucket boundaries SHOULD align with SLO/SLA objectives.
+- Bucket sets MUST remain consistent for the same metric across services and versions.
+- Bucket count SHOULD be limited to a practical set to control cost and query complexity.
+
+Example bucket set for service latency metric:
+
+- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds
+
+### 5.8 Good and Bad Metric Examples
+
+Good examples:
+
+```text
+service_requests_total{service="example-service",environment="prod",operation="create",status="2xx"} 12842
+service_request_duration_seconds_bucket{service="example-service",environment="prod",operation="create",le="0.5"} 12011
+service_request_duration_seconds_sum{service="example-service",environment="prod",operation="create"} 3184.22
+service_request_duration_seconds_count{service="example-service",environment="prod",operation="create"} 12842
+```
+
+Bad examples:
+
+```text
+requests{request_id="d9fd0f7a-3d8e-4c17-9d8b-9b57f43dc40e",user_id="483992"} 1
+requestDurationMs{path="/api/v1/items/123456"} 187
+```
+
+### 5.9 Validation Checklist and Ownership
+
+Before release, teams should verify:
+
+- Metric names, units, and suffixes are compliant.
+- Required baseline metrics are present.
+- Label keys and values are bounded and normalized.
+- No high-cardinality identifiers are emitted as labels.
+- Histogram buckets are defined and justified.
+
+Ownership split for compliance:
+
+| Control | App Team | Platform Team |
+| --- | --- | --- |
+| Instrument required baseline metrics | MUST | N/A |
+| Naming and unit compliance | MUST | SHOULD validate |
+| Label cardinality discipline | MUST | SHOULD enforce guardrails |
+| Scrape/discovery pipeline configuration | N/A | MUST |
+| Central metric relabeling and hygiene checks | N/A | SHOULD |
+| Cost and cardinality monitoring at platform level | N/A | SHOULD |
diff --git a/README.md b/README.md
index 5cb47c9..03798a2 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM
 - [Project Maturity](./Project%20Maturity)
 - [Containerisation](./Containerisation)
 - [Testing](./Testing)
+- [Observability](./Observability)
 - [ECMWF Software EnginE (ESEE)](./ESEE)
 - [Contributing to External Projects](./Contributing%20Externally/)
 - [Incoming External Contributions](./External%20Contributions/)

From 7ff4198d166e74d204b2e5ac612c17bd2c683399 Mon Sep 17 00:00:00 2001
From: sametd <samet.demir@ecmwf.int>
Date: Wed, 11 Feb 2026 15:33:24 +0100
Subject: [PATCH 2/7] linting and wrapping

---
 Observability/observability-guidelines.md | 50 ++++++++++++++++-------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
index b1b581c..3ecaa61 100644
--- a/Observability/observability-guidelines.md
+++ b/Observability/observability-guidelines.md
@@ -1,5 +1,9 @@
 # ECMWF Observability Guidelines
 
+<!-- markdownlint-configure-file
+{"MD013":{"code_blocks":false,"tables":false}}
+-->
+
 ## Table of Contents
 
 - [1. Purpose and Scope](#1-purpose-and-scope)
@@ -51,8 +55,10 @@ Out of scope in this version:
 - Use consistent observability conventions across all ECMWF software.
 - Prefer machine-parseable telemetry over free-form text.
 - Keep telemetry actionable and low-noise.
-- Correlate signals where possible (for example, include trace/span identifiers in logs when available).
-- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces).
+- Correlate signals where possible (for example, include trace/span
+  identifiers in logs when available).
+- Protect sensitive data by design (no credentials, tokens, or personal data
+  in logs/metrics/traces).
 
 ### 2.1 Normative Language
 
@@ -70,11 +76,14 @@ ECMWF software runs in multiple environments:
 - Virtual machines
 - HPC systems
 
-This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later.
+This document focuses on common logs and metrics structure plus application
+emission rules. Environment-specific collection design for Kubernetes, VMs,
+and HPC will be specified later.
 
 ## 4. Logging Standard
 
-ECMWF software should emit structured logs aligned with the OpenTelemetry log data model.
+ECMWF software should emit structured logs aligned with the OpenTelemetry
+log data model.
 
 Useful references:
 
@@ -208,7 +217,8 @@ Good log line characteristics:
   - Include `trace_id` and `span_id` when context exists.
   - Include request/job identifiers when available.
 
-Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency.
+Examples below use the same canonical structure as Section 4.1 (`resource`
+and `attributes`) for consistency.
 
 Bad log line characteristics:
 
@@ -265,9 +275,11 @@ Login failed for user alice password=PlainTextSecret token=eyJhbGci...
 - `ERROR`: failed operation requiring attention.
 - `FATAL`: unrecoverable condition before shutdown.
 
-Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason.
+Use stable event names (`event.name`) where possible, and make messages
+explicit about outcome, target, and reason.
 
-For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference.
+For severity mapping guidance, follow OpenTelemetry severity concepts in the
+logs data model reference.
 
 ### 4.7 Exception and Error Logging
 
@@ -279,7 +291,8 @@ For severity mapping guidance, follow OpenTelemetry severity concepts in the log
 
 ### 4.8 Safety and Compliance Rules
 
-- MUST never log secrets, credentials, session tokens, private keys, or personal data.
+- MUST never log secrets, credentials, session tokens, private keys, or
+  personal data.
 - MUST redact sensitive substrings before writing log output.
 - SHOULD avoid full object dumps unless explicitly sanitized.
 - SHOULD include stack traces for errors only when useful and sanitized.
@@ -324,19 +337,24 @@ Ownership split for compliance:
 ## 5. Metrics Standard
 
 Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format.
-ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
-Metrics defined in this section are the source for alerting rules defined in the Alerting section.
+ECMWF services MUST use Prometheus metric types and naming conventions, and
+MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
+Metrics defined in this section are the source for alerting rules defined in
+the Alerting section.
 
 ### 5.1 Scope and Standard
 
-- This section defines instrumentation expectations, metric schema, and quality requirements.
-- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately.
+- This section defines instrumentation expectations, metric schema, and
+  quality requirements.
+- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC
+  are specified separately.
 
 ### 5.2 References
 
 - Prometheus metric types: <https://prometheus.io/docs/concepts/metric_types/>
 - Prometheus naming best practices: <https://prometheus.io/docs/practices/naming/>
-- OpenMetrics specification: <https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md>
+- OpenMetrics specification:
+  <https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md>
 
 ### 5.3 Metric Types and Usage
 
@@ -361,7 +379,8 @@ Metrics defined in this section are the source for alerting rules defined in the
   - `_bytes` for size
   - `_total` for counters
 - Metric names SHOULD be stable over time.
-- If a name must change, introduce the new metric and deprecate the old one before removal.
+- If a name must change, introduce the new metric and deprecate the old one
+  before removal.
 
 Good naming examples:
 
@@ -429,7 +448,8 @@ Batch/HPC job metrics (where applicable):
 
 Example bucket set for service latency metric:
 
-- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds
+- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`,
+  `10` seconds
 
 ### 5.8 Good and Bad Metric Examples
 

From d665791f03b3044b376aead625d4901d5c7e88da Mon Sep 17 00:00:00 2001
From: sametd <samet.demir@ecmwf.int>
Date: Mon, 16 Feb 2026 20:23:40 +0100
Subject: [PATCH 3/7] Clarify observability model across environments and trace
 correlation

---
 Observability/observability-guidelines.md | 50 ++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
index 3ecaa61..fea8a43 100644
--- a/Observability/observability-guidelines.md
+++ b/Observability/observability-guidelines.md
@@ -10,12 +10,14 @@
 - [2. Core Principles](#2-core-principles)
   - [2.1 Normative Language](#21-normative-language)
 - [3. Platform Context](#3-platform-context)
+  - [3.1 High-Level Collection Strategy](#31-high-level-collection-strategy)
 - [4. Logging Standard](#4-logging-standard)
   - [4.1 Log Event Model](#41-log-event-model)
   - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract)
   - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality)
   - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging)
   - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines)
+    - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id)
   - [4.6 Severity and Event Design](#46-severity-and-event-design)
   - [4.7 Exception and Error Logging](#47-exception-and-error-logging)
   - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules)
@@ -80,6 +82,37 @@ This document focuses on common logs and metrics structure plus application
 emission rules. Environment-specific collection design for Kubernetes, VMs,
 and HPC will be specified later.
 
+### 3.1 High-Level Collection Strategy
+
+The collection pipeline is part of the deployment environment and MUST be
+considered in service design.
+
+- Kubernetes workloads:
+  - Platform Engineering Team deploys and operates OpenTelemetry collectors.
+  - Workloads emit logs/metrics in the agreed formats.
+- VM and HPC workloads:
+  - A collector/forwarder SHOULD run alongside the application or on the host.
+  - Workloads emit logs/metrics in the agreed formats.
+- Central ingestion:
+  - Logs are forwarded to the central ECMWF logging backend.
+  - Metrics are collected into the central Prometheus-compatible metrics stack.
+
+```mermaid
+flowchart TB
+  subgraph K["Kubernetes"]
+    A1["Workloads"] --> A2["Collector<br/>(DaemonSet/Sidecar)"]
+  end
+
+  subgraph V["VM / HPC"]
+    B1["Applications / Jobs"] --> B2["Collector Agent<br/>(Host-local)"]
+  end
+
+  A2 --> C["Central Ingestion"]
+  B2 --> C
+  C -->|logs| D["Logs Backend"]
+  C -->|metrics| E["Prometheus Metrics<br/>Stack"]
+```
+
 ## 4. Logging Standard
 
 ECMWF software should emit structured logs aligned with the OpenTelemetry
@@ -220,6 +253,15 @@ Good log line characteristics:
 Examples below use the same canonical structure as Section 4.1 (`resource`
 and `attributes`) for consistency.
 
+#### 4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)
+
+- `trace_id` identifies the full end-to-end request/workflow across services.
+- `span_id` identifies one operation within that trace in a single service.
+- Multiple log records in one service operation typically share a `span_id`.
+- A single `trace_id` usually contains multiple spans across components.
+- When tracing context is unavailable (for example offline batch steps),
+  these fields MAY be absent.
+
 Bad log line characteristics:
 
 - Free-form text without structure.
@@ -279,7 +321,8 @@ Use stable event names (`event.name`) where possible, and make messages
 explicit about outcome, target, and reason.
 
 For severity mapping guidance, follow OpenTelemetry severity concepts in the
-logs data model reference.
+logs data model:
+<https://opentelemetry.io/docs/specs/otel/logs/data-model/#field-severitynumber>
 
 ### 4.7 Exception and Error Logging
 
@@ -348,6 +391,11 @@ the Alerting section.
   quality requirements.
 - Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC
   are specified separately.
+- Metrics exposure and collection at a high level:
+  - HTTP services SHOULD expose a `/metrics` endpoint owned by the service.
+  - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible
+    metrics, typically via a local collector/forwarder integration.
+  - Platform Engineering Team owns central scrape and ingestion configuration.
 
 ### 5.2 References
 

From b0db1736ad461f864a015199ffe32bad0436e896 Mon Sep 17 00:00:00 2001
From: sametd <samet.demir@ecmwf.int>
Date: Mon, 16 Feb 2026 20:23:58 +0100
Subject: [PATCH 4/7] Refine ownership model with explicit team roles

---
 Observability/observability-guidelines.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
index fea8a43..9794461 100644
--- a/Observability/observability-guidelines.md
+++ b/Observability/observability-guidelines.md
@@ -367,7 +367,7 @@ Before release, teams should verify:
 
 Ownership split for compliance:
 
-| Control | App Team | Platform Team |
+| Control | Development Team | Platform Engineering Team |
 | --- | --- | --- |
 | Structured JSON emitted by app | MUST | N/A |
 | Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only |
@@ -529,11 +529,11 @@ Before release, teams should verify:
 
 Ownership split for compliance:
 
-| Control | App Team | Platform Team |
-| --- | --- | --- |
-| Instrument required baseline metrics | MUST | N/A |
-| Naming and unit compliance | MUST | SHOULD validate |
-| Label cardinality discipline | MUST | SHOULD enforce guardrails |
-| Scrape/discovery pipeline configuration | N/A | MUST |
-| Central metric relabeling and hygiene checks | N/A | SHOULD |
-| Cost and cardinality monitoring at platform level | N/A | SHOULD |
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Instrument required baseline metrics | MUST | N/A | SHOULD review service-level usefulness |
+| Naming and unit compliance | MUST | SHOULD validate | SHOULD validate monitoring readiness |
+| Label cardinality discipline | MUST | SHOULD enforce guardrails | SHOULD flag operational risks |
+| Scrape/discovery pipeline configuration | N/A | MUST | SHOULD validate production coverage |
+| Central metric relabeling and hygiene checks | N/A | SHOULD | N/A |
+| Cost and cardinality monitoring at platform level | N/A | SHOULD | SHOULD provide operational feedback |

From 4641fba68fa2b273eb5369fcd1546d3a5d81aefc Mon Sep 17 00:00:00 2001
From: sametd <sametdemir@gmail.com>
Date: Tue, 3 Mar 2026 01:54:07 +0100
Subject: [PATCH 5/7] docs(observability): clarify operational guidance and
 legacy migration

---
 Observability/observability-guidelines.md | 244 +++++++++++++++++++---
 1 file changed, 214 insertions(+), 30 deletions(-)

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
index 9794461..545759c 100644
--- a/Observability/observability-guidelines.md
+++ b/Observability/observability-guidelines.md
@@ -11,6 +11,9 @@
   - [2.1 Normative Language](#21-normative-language)
 - [3. Platform Context](#3-platform-context)
   - [3.1 High-Level Collection Strategy](#31-high-level-collection-strategy)
+  - [3.2 Log Access and Ownership](#32-log-access-and-ownership)
+  - [3.3 Log Retention and Archival](#33-log-retention-and-archival)
+  - [3.4 Telemetry Outage and Recovery](#34-telemetry-outage-and-recovery)
 - [4. Logging Standard](#4-logging-standard)
   - [4.1 Log Event Model](#41-log-event-model)
   - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract)
@@ -18,11 +21,13 @@
   - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging)
   - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines)
     - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id)
+    - [4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)](#452-correlation-identifiers-trace_id-requestid-jobid)
   - [4.6 Severity and Event Design](#46-severity-and-event-design)
   - [4.7 Exception and Error Logging](#47-exception-and-error-logging)
   - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules)
   - [4.9 Common Anti-Patterns](#49-common-anti-patterns)
   - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership)
+  - [4.11 Legacy Compatibility and Migration](#411-legacy-compatibility-and-migration)
 - [5. Metrics Standard](#5-metrics-standard)
   - [5.1 Scope and Standard](#51-scope-and-standard)
   - [5.2 References](#52-references)
@@ -46,6 +51,8 @@ Current scope:
   - Kubernetes
   - Virtual machines (VMs)
   - HPC
+  - Bare metal servers
+  - Remote data-mover hosts
 
 Out of scope in this version:
 
@@ -77,10 +84,12 @@ ECMWF software runs in multiple environments:
 - Kubernetes clusters
 - Virtual machines
 - HPC systems
+- Bare metal servers
+- Remote data-mover hosts
 
 This document focuses on common logs and metrics structure plus application
 emission rules. Environment-specific collection design for Kubernetes, VMs,
-and HPC will be specified later.
+HPC, bare metal, and remote data-mover hosts will be specified later.
 
 ### 3.1 High-Level Collection Strategy
 
@@ -89,30 +98,116 @@ considered in service design.
 
 - Kubernetes workloads:
   - Platform Engineering Team deploys and operates OpenTelemetry collectors.
-  - Workloads emit logs/metrics in the agreed formats.
-- VM and HPC workloads:
-  - A collector/forwarder SHOULD run alongside the application or on the host.
-  - Workloads emit logs/metrics in the agreed formats.
-- Central ingestion:
-  - Logs are forwarded to the central ECMWF logging backend.
-  - Metrics are collected into the central Prometheus-compatible metrics stack.
+  - Application stdout/stderr is captured by the container runtime into node
+    log files (or equivalent runtime log sources), and collectors read/tail
+    those sources.
+  - Metrics/traces are collected via SDK/exporter endpoints or local agents,
+    depending on service design.
+- VM, HPC, and bare-metal workloads (including remote data-mover hosts):
+  - Applications/jobs write logs to host-local logging sources (for example
+    files, journald, or syslog), and host-local or scheduler-integrated
+    collectors read from those sources.
+  - Metrics/traces are collected via local endpoints/agents where enabled.
+- Central ingestion (common stage for all environments):
+  - Receives telemetry from Kubernetes and VM/HPC/bare-metal collection paths.
+  - Routes logs to the central ECMWF logging backend.
+  - Routes metrics to the central Prometheus-compatible metrics stack.
 
 ```mermaid
 flowchart TB
   subgraph K["Kubernetes"]
-    A1["Workloads"] --> A2["Collector<br/>(DaemonSet/Sidecar)"]
+    K1["Workloads"] --> K2["Container Runtime Log Sources"]
+    K2 --> K3["Collector<br/>(DaemonSet/Sidecar)"]
   end
 
-  subgraph V["VM / HPC"]
-    B1["Applications / Jobs"] --> B2["Collector Agent<br/>(Host-local)"]
+  subgraph H["VM / HPC / Bare Metal"]
+    H1["Applications / Jobs"] --> H2["Host-local Log Sources<br/>(files/journald/syslog)"]
+    H2 --> H3["Collector Agent<br/>(Host-local / Scheduler-integrated)"]
   end
 
-  A2 --> C["Central Ingestion"]
-  B2 --> C
+  K3 --> C["Central Ingestion<br/>(Common Stage)"]
+  H3 --> C
   C -->|logs| D["Logs Backend"]
   C -->|metrics| E["Prometheus Metrics<br/>Stack"]
 ```
 
+### 3.2 Log Access and Ownership
+
+Access to production logs is a service onboarding requirement and MUST be
+defined explicitly for each service and environment (Kubernetes, VM, HPC,
+bare metal, and remote data-mover hosts).
+
+Minimum governance requirements:
+
+- Access path MUST be documented (for example central logging UI/API and,
+  where required for resilience, approved host-local access method).
+- Access roles MUST be defined and approved by service owner and operations.
+- Access MUST be granted through managed team groups.
+- Access provisioning and access changes MUST follow the standard IAM
+  approval and logging process.
+- Emergency access procedure MUST be documented, including allowed methods
+  during central logging outage (for Kubernetes, controlled use of
+  `kubectl logs`; for VM/HPC/bare metal, approved host-local log access).
+  Emergency access MUST be time-limited and linked to an active incident.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Define service log access requirements | MUST | SHOULD review feasibility | MUST review operational fit |
+| Implement central access controls (RBAC/SSO/groups) | N/A | MUST | SHOULD validate |
+| Approve and periodically review access lists | MUST | SHOULD support automation | MUST |
+| Maintain emergency access runbook | SHOULD contribute service context | SHOULD provide platform procedure | MUST own operational procedure |
+
+### 3.3 Log Retention and Archival
+
+Log retention requirements MUST be defined for each service and environment
+at onboarding and reviewed during major service changes.
+
+Retention model:
+
+- A default retention period MUST be provided by the central logging service.
+- Service-specific retention overrides MAY be requested with justification.
+- Long-term archival requirements (beyond central retention) MUST be declared
+  by the service owner and approved by operations.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Declare required retention and archival period | MUST | SHOULD review feasibility | MUST review operational fit |
+| Implement retention in central logging platform | N/A | MUST | SHOULD validate production coverage |
+| Implement and operate long-term archival pipeline | N/A | SHOULD support platform integration | MUST |
+| Periodically review retention settings and costs | SHOULD | MUST provide platform metrics | MUST |
+
+### 3.4 Telemetry Outage and Recovery
+
+Observability design MUST include degraded-mode behavior for periods when
+central telemetry ingestion is unavailable.
+
+Minimum requirements:
+
+- Applications MUST continue emitting telemetry signals locally during central
+  outage:
+  - logs to a local, accessible sink (stdout/stderr, file, or system logger);
+  - metrics via a local scrape/export endpoint or host-local collector path;
+  - traces to a local collector/agent where tracing is enabled.
+- Collection/forwarding components SHOULD buffer locally and retry delivery
+  when connectivity or backend availability is restored.
+- Backfill after recovery MUST be supported where buffering exists.
+- Known telemetry coverage gaps MUST be detectable and reported to operations.
+- Services and runbooks MUST define how urgent logs are retrieved during
+  outage using documented emergency access methods.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Ensure service emits local telemetry in degraded mode | MUST | SHOULD provide guidance | SHOULD validate in production |
+| Provide buffering/retry/backfill capability in pipeline | N/A | SHOULD | MUST validate operational readiness |
+| Detect and report ingestion coverage gaps | SHOULD emit health signals | MUST provide platform-level detection | MUST monitor and escalate |
+| Maintain outage and recovery runbook | SHOULD contribute service behavior | SHOULD contribute platform behavior | MUST own incident operation |
+
 ## 4. Logging Standard
 
 ECMWF software should emit structured logs aligned with the OpenTelemetry
@@ -125,7 +220,8 @@ Useful references:
 
 ### 4.1 Log Event Model
 
-Each log record should contain:
+Each log event MUST provide the following information, either directly in the
+record or via stable resource/context enrichment in the pipeline:
 
 - A clear event message (`body` / message).
 - Severity (`severity_text`, `severity_number`).
@@ -162,6 +258,10 @@ Canonical structure (OpenTelemetry-aligned):
 
 All production logs MUST include the following fields.
 
+The minimum contract applies to the effective log event at query/analysis
+time. Fields MAY be set directly by the application or added by approved
+collector/pipeline enrichment, provided values are stable and correct.
+
 Application-emitted fields:
 
 | Field | Requirement | Notes |
@@ -173,7 +273,7 @@ Application-emitted fields:
 | `service.name` | MUST | Logical service/application name |
 | `service.version` | MUST | Deployed version/build identifier |
 | `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` |
-| `trace_id` | MUST when available | Enables log-trace correlation |
+| `trace_id` | SHOULD when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events |
 | `span_id` | MUST when available | Enables log-trace correlation |
 
 Collector-enriched or infrastructure fields:
@@ -205,27 +305,49 @@ Examples:
 - `operation.completed`
 - `operation.failed`
 
-Attribute cardinality guidance:
+When defining log attributes, teams MUST consider attribute cardinality.
+Cardinality is the number of distinct values an attribute has across events.
+
+High cardinality reduces observability quality because each distinct value
+creates its own group, which fragments aggregates and increases storage/query
+cost.
+
+Attribute guidance:
 
-- Low to medium cardinality fields are preferred for repeated events.
-- Request/job identifiers are allowed for correlation.
+- Prefer low to medium cardinality attributes for repeated events.
+- Use request/job identifiers only for correlation and troubleshooting.
 - Do not create dynamic field names.
 - Do not move arbitrary payloads into attributes.
-- Large free-text content SHOULD stay in `body` only when necessary.
+- Keep large free-text content in `body` when necessary.
 
 ### 4.4 Library vs Binary Application Logging
 
 #### Libraries
 
-- MUST not configure global logging policy.
-- MUST use the application-provided logger interface.
-- MUST emit structured fields, not only formatted strings.
+- MUST not configure global logging policy (sinks, format, or global levels).
+- MUST use logger/context provided by the application, or a documented
+  adapter/interface supplied by the application.
+- MUST expose structured key/value fields in logging calls, not only
+  pre-formatted message strings.
 - MUST not log secrets or large payloads.
 - SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths.
 - SHOULD include stable event names for reusable log points:
   - Example: `event.name="library.decode.failed"`
   - Avoid changing field keys between library versions without migration notes.
 
+Library API expectation:
+
+- Library entry points SHOULD accept logging context from the caller
+  (logger handle/interface plus correlation fields when available).
+- If a logger is not passed explicitly, the library SHOULD accept a context
+  object that carries logger and correlation metadata.
+- Library code SHOULD propagate the received logger/context unchanged to lower
+  library layers.
+- Libraries MUST NOT silently create independent global logger configuration as
+  a fallback.
+- Libraries SHOULD document the expected logger/context contract in their public
+  API (what is required, optional, and how correlation fields are passed).
+
 #### Binary Applications / Services
 
 - MUST own logger initialisation and runtime configuration.
@@ -262,6 +384,29 @@ and `attributes`) for consistency.
 - When tracing context is unavailable (for example offline batch steps),
   these fields MAY be absent.
 
+#### 4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)
+
+These identifiers represent different scopes of correlation and MAY appear
+together in a single log event.
+
+- `trace_id`: identifies one end-to-end distributed trace across services.
+  It is created by tracing instrumentation and used to follow cross-service
+  call chains.
+- `request.id`: identifies one application-level request or unit of API/user
+  work at the service boundary. It is generated by the application or
+  middleware handling that request and propagated through service logs.
+- `job.id`: identifies one batch or workflow execution (for example scheduler
+  submission, worker run, or pipeline task instance). It is used to correlate
+  logs across the full lifecycle of that job.
+
+Guidance:
+
+- These identifiers are complementary and not interchangeable.
+- Include all identifiers that exist in the current execution context.
+- In request/response flows, `trace_id` and `request.id` often coexist.
+- In batch/HPC flows, `job.id` is usually primary; `trace_id` MAY be absent
+  unless tracing is enabled for that workflow.
+
 Bad log line characteristics:
 
 - Free-form text without structure.
@@ -326,11 +471,23 @@ logs data model:
 
 ### 4.7 Exception and Error Logging
 
-- Log an exception once at the handling boundary.
-- Avoid duplicate logging of the same error in multiple layers.
+- Emit one primary error log at the handling boundary that determines outcome
+  (for example request failure, job failure, retry exhaustion).
+- Intermediate layers MAY log additional context, but SHOULD avoid duplicating
+  full stack traces/messages for the same failure path.
+- Preserve failure context for cascaded errors by recording:
+  - the high-level operation that failed (for example request decoding,
+    workflow step execution, data transfer);
+  - the immediate reason at that layer;
+  - the underlying cause summary when the error was wrapped/propagated from a
+    lower layer.
+- Use the language/runtime error-chain mechanism where available so operators
+  can reconstruct the sequence of failure causes from boundary logs.
 - Include stack traces when they materially improve diagnosis.
 - Sanitize stack traces and exception messages before emission.
-- Include `error.type` and `error.message` for failed operations.
+
+Goal: preserve the failure chain (for example I/O error -> decode error ->
+request failure) without logging the same full stack trace at every layer.
 
 ### 4.8 Safety and Compliance Rules
 
@@ -351,7 +508,7 @@ logs data model:
 | Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys |
 | Dynamic field names | Breaks queries and dashboards | Stable schema and key names |
 | Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes |
-| Duplicate exception logs across layers | Inflates incident noise | Log once at handling boundary |
+| Duplicate exception logs across layers | Inflates incident noise | One primary error log at handling boundary; keep intermediate logs contextual and avoid duplicate full stacks |
 | Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists |
 
 ### 4.10 Validation Checklist and Ownership
@@ -359,7 +516,8 @@ logs data model:
 Before release, teams should verify:
 
 - Required fields are present in production logs.
-- Log output is valid structured JSON.
+- Log output is valid structured JSON, or legacy format logs are mapped to
+  the common schema via approved pipeline parsing/enrichment.
 - Secrets and sensitive data are redacted.
 - Library and binary responsibilities are correctly separated.
 - Severity levels are used consistently.
@@ -369,7 +527,7 @@ Ownership split for compliance:
 
 | Control | Development Team | Platform Engineering Team |
 | --- | --- | --- |
-| Structured JSON emitted by app | MUST | N/A |
+| Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A |
 | Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only |
 | Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline |
 | `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it |
@@ -377,6 +535,28 @@ Ownership split for compliance:
 | Parsing/schema validation in collector | N/A | SHOULD |
 | Log noise and volume control | SHOULD at source | SHOULD as safety net |
 
+### 4.11 Legacy Compatibility and Migration
+
+These guidelines define the target logging model, but do not require immediate
+JSON-only migration for existing stable services.
+
+Compatibility requirements:
+
+- Existing log formats MAY continue where they are operationally established.
+- Service teams MUST document the current format and downstream consumers
+  before changing log structure.
+- New services MUST emit structured logs by default.
+- For existing services, collector/pipeline parsing and enrichment MAY be used
+  to map legacy logs into the common schema.
+- Migration SHOULD be incremental and service-specific, with no disruption to
+  existing operational workflows.
+
+Target-state requirement:
+
+- Services that can adopt structured JSON logging without operational risk
+  SHOULD do so; services with high migration cost MAY follow a phased plan
+  agreed with platform and operations teams.
+
 ## 5. Metrics Standard
 
 Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format.
@@ -389,8 +569,8 @@ the Alerting section.
 
 - This section defines instrumentation expectations, metric schema, and
   quality requirements.
-- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC
-  are specified separately.
+- Environment-specific scrape/discovery designs for Kubernetes, VMs, HPC,
+  bare metal, and remote data-mover hosts are specified separately.
 - Metrics exposure and collection at a high level:
   - HTTP services SHOULD expose a `/metrics` endpoint owned by the service.
   - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible
@@ -446,6 +626,10 @@ Bad naming examples:
 ### 5.5 Labels and Cardinality
 
 Labels add dimensionality to metrics but increase cardinality.
+Cardinality is the number of distinct label combinations a metric produces.
+
+High cardinality reduces metric usefulness because it creates too many
+series, increasing storage/query cost and weakening dashboard/alert signal.
 
 - Labels MUST use stable keys and bounded value sets.
 - Labels SHOULD describe dimensions such as:

From bfab84f74561c7769ecd3c900aa9db11ce42d917 Mon Sep 17 00:00:00 2001
From: sametd <sametdemir@gmail.com>
Date: Fri, 6 Mar 2026 12:04:25 +0100
Subject: [PATCH 6/7] fix(observability): align logging guidelines with OTel
 log data model

- Rename severity_text/severity_number to severityText/severityNumber
- Rename trace_id/span_id to traceId/spanId and move to top-level LogRecord fields
- Upgrade traceId requirement from SHOULD to MUST when available
- Split required fields table into LogRecord fields and Resource attributes
- Downgrade deployment.environment from MUST to SHOULD
- Add TRACE severity level and severityText-to-severityNumber mapping table
- Add OTel exception attributes: exception.type, exception.message, exception.stacktrace
- Remove deprecated event.domain attribute
- Replace error.message with exception.message per OTel semantic conventions
- Fix MUST not -> MUST NOT in library logging rules
- Fix lowercase normative keywords (should -> SHOULD/MUST NOT)
- Update event.name examples to follow three-part domain.action.result format
- Align 4.10 ownership table with deployment.environment SHOULD requirement
---
 Observability/observability-guidelines.md | 110 +++++++++++++---------
 1 file changed, 64 insertions(+), 46 deletions(-)

diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md
index 545759c..0eeb1e6 100644
--- a/Observability/observability-guidelines.md
+++ b/Observability/observability-guidelines.md
@@ -20,8 +20,8 @@
   - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality)
   - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging)
   - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines)
-    - [4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)](#451-trace-correlation-fields-trace_id-and-span_id)
-    - [4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)](#452-correlation-identifiers-trace_id-requestid-jobid)
+    - [4.5.1 Trace Correlation Fields (`traceId` and `spanId`)](#451-trace-correlation-fields-traceid-and-spanid)
+    - [4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`)](#452-correlation-identifiers-traceid-requestid-jobid)
   - [4.6 Severity and Event Design](#46-severity-and-event-design)
   - [4.7 Exception and Error Logging](#47-exception-and-error-logging)
   - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules)
@@ -74,7 +74,7 @@ Out of scope in this version:
 The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels:
 
 - `MUST`: mandatory requirement for compliance.
-- `SHOULD`: recommended default; deviations should be justified.
+- `SHOULD`: recommended default; deviations require justification.
 - `MAY`: optional behavior.
 
 ## 3. Platform Context
@@ -210,7 +210,7 @@ Ownership model:
 
 ## 4. Logging Standard
 
-ECMWF software should emit structured logs aligned with the OpenTelemetry
+ECMWF software SHOULD emit structured logs aligned with the OpenTelemetry
 log data model.
 
 Useful references:
@@ -224,7 +224,7 @@ Each log event MUST provide the following information, either directly in the
 record or via stable resource/context enrichment in the pipeline:
 
 - A clear event message (`body` / message).
-- Severity (`severity_text`, `severity_number`).
+- Severity (`severityText`, `severityNumber`).
 - Timestamp in UTC.
 - Stable resource attributes (service and environment metadata).
 - Context attributes for debugging and operations.
@@ -234,8 +234,10 @@ Canonical structure (OpenTelemetry-aligned):
 ```json
 {
   "timestamp": "2026-02-11T12:20:43Z",
-  "severity_text": "INFO",
-  "severity_number": 9,
+  "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+  "spanId": "f9c3a29d03ef154f",
+  "severityText": "INFO",
+  "severityNumber": 9,
   "body": "Operation completed",
   "resource": {
     "service.name": "example-service",
@@ -245,11 +247,9 @@ Canonical structure (OpenTelemetry-aligned):
     "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
   },
   "attributes": {
-    "event.name": "operation.completed",
+    "event.name": "data.transfer.completed",
     "request.id": "req-8f31c9",
-    "job.id": "job-42a7",
-    "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
-    "span_id": "f9c3a29d03ef154f"
+    "job.id": "job-42a7"
   }
 }
 ```
@@ -262,19 +262,24 @@ The minimum contract applies to the effective log event at query/analysis
 time. Fields MAY be set directly by the application or added by approved
 collector/pipeline enrichment, provided values are stable and correct.
 
-Application-emitted fields:
+LogRecord fields (top-level in the log record):
 
 | Field | Requirement | Notes |
 | --- | --- | --- |
 | `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format |
-| `severity_text` | MUST | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` |
-| `severity_number` | MUST | Numeric OTel-compatible severity |
+| `severityText` | MUST | `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` |
+| `severityNumber` | MUST | Numeric OTel-compatible severity |
 | `body` | MUST | Human-readable message describing one event |
+| `traceId` | MUST when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events |
+| `spanId` | MUST when available | Enables log-trace correlation |
+
+Resource attributes (nested inside the `resource` block):
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
 | `service.name` | MUST | Logical service/application name |
 | `service.version` | MUST | Deployed version/build identifier |
-| `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` |
-| `trace_id` | SHOULD when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events |
-| `span_id` | MUST when available | Enables log-trace correlation |
+| `deployment.environment` | SHOULD | e.g. `dev`, `test`, `staging`, `prod`; may not be known by the application at runtime |
 
 Collector-enriched or infrastructure fields:
 
@@ -287,8 +292,7 @@ Collector-enriched or infrastructure fields:
 Recommended additional fields:
 
 - `event.name` (stable event type)
-- `event.domain` (component/domain group)
-- `error.type` and `error.message` for failures
+- `error.type` for error classification; `exception.type`, `exception.message`, and `exception.stacktrace` for exception details
 - Request/work item identifiers (for example `request.id`, `job.id`)
 
 ### 4.3 Event Naming and Attribute Cardinality
@@ -302,8 +306,8 @@ Event naming convention:
 
 Examples:
 
-- `operation.completed`
-- `operation.failed`
+- `data.transfer.completed`
+- `data.transfer.failed`
 
 When defining log attributes, teams MUST consider attribute cardinality.
 Cardinality is the number of distinct values an attribute has across events.
@@ -324,12 +328,12 @@ Attribute guidance:
 
 #### Libraries
 
-- MUST not configure global logging policy (sinks, format, or global levels).
+- MUST NOT configure global logging policy (sinks, format, or global levels).
 - MUST use logger/context provided by the application, or a documented
   adapter/interface supplied by the application.
 - MUST expose structured key/value fields in logging calls, not only
   pre-formatted message strings.
-- MUST not log secrets or large payloads.
+- MUST NOT log secrets or large payloads.
 - SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths.
 - SHOULD include stable event names for reusable log points:
   - Example: `event.name="library.decode.failed"`
@@ -369,27 +373,27 @@ Good log line characteristics:
 - Includes identifiers and outcome.
 - Uses stable field names.
 - Supports correlation:
-  - Include `trace_id` and `span_id` when context exists.
+  - Include `traceId` and `spanId` when context exists.
   - Include request/job identifiers when available.
 
 Examples below use the same canonical structure as Section 4.1 (`resource`
 and `attributes`) for consistency.
 
-#### 4.5.1 Trace Correlation Fields (`trace_id` and `span_id`)
+#### 4.5.1 Trace Correlation Fields (`traceId` and `spanId`)
 
-- `trace_id` identifies the full end-to-end request/workflow across services.
-- `span_id` identifies one operation within that trace in a single service.
-- Multiple log records in one service operation typically share a `span_id`.
-- A single `trace_id` usually contains multiple spans across components.
+- `traceId` identifies the full end-to-end request/workflow across services.
+- `spanId` identifies one operation within that trace in a single service.
+- Multiple log records in one service operation typically share a `spanId`.
+- A single `traceId` usually contains multiple spans across components.
 - When tracing context is unavailable (for example offline batch steps),
   these fields MAY be absent.
 
-#### 4.5.2 Correlation Identifiers (`trace_id`, `request.id`, `job.id`)
+#### 4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`)
 
 These identifiers represent different scopes of correlation and MAY appear
 together in a single log event.
 
-- `trace_id`: identifies one end-to-end distributed trace across services.
+- `traceId`: identifies one end-to-end distributed trace across services.
   It is created by tracing instrumentation and used to follow cross-service
   call chains.
 - `request.id`: identifies one application-level request or unit of API/user
@@ -403,8 +407,8 @@ Guidance:
 
 - These identifiers are complementary and not interchangeable.
 - Include all identifiers that exist in the current execution context.
-- In request/response flows, `trace_id` and `request.id` often coexist.
-- In batch/HPC flows, `job.id` is usually primary; `trace_id` MAY be absent
+- In request/response flows, `traceId` and `request.id` often coexist.
+- In batch/HPC flows, `job.id` is usually primary; `traceId` MAY be absent
   unless tracing is enabled for that workflow.
 
 Bad log line characteristics:
@@ -422,8 +426,10 @@ Good example:
 ```json
 {
   "timestamp": "2026-02-11T12:20:43Z",
-  "severity_text": "INFO",
-  "severity_number": 9,
+  "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+  "spanId": "f9c3a29d03ef154f",
+  "severityText": "INFO",
+  "severityNumber": 9,
   "body": "Operation completed",
   "resource": {
     "service.name": "example-service",
@@ -433,11 +439,9 @@ Good example:
     "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
   },
   "attributes": {
-    "event.name": "operation.completed",
+    "event.name": "data.transfer.completed",
     "request.id": "req-8f31c9",
-    "job.id": "job-42a7",
-    "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
-    "span_id": "f9c3a29d03ef154f"
+    "job.id": "job-42a7"
   }
 }
 ```
@@ -456,17 +460,29 @@ Login failed for user alice password=PlainTextSecret token=eyJhbGci...
 
 ### 4.6 Severity and Event Design
 
+- `TRACE`: fine-grained diagnostics; more verbose than `DEBUG`.
 - `DEBUG`: development diagnostics and verbose internals.
 - `INFO`: normal lifecycle and business-relevant state changes.
 - `WARN`: unexpected but recoverable conditions.
 - `ERROR`: failed operation requiring attention.
 - `FATAL`: unrecoverable condition before shutdown.
 
+`severityText` to `severityNumber` mapping (use the lowest value in the range
+unless a finer distinction is needed):
+
+| `severityText` | `severityNumber` range |
+| --- | --- |
+| `TRACE` | 1–4 |
+| `DEBUG` | 5–8 |
+| `INFO` | 9–12 |
+| `WARN` | 13–16 |
+| `ERROR` | 17–20 |
+| `FATAL` | 21–24 |
+
 Use stable event names (`event.name`) where possible, and make messages
 explicit about outcome, target, and reason.
 
-For severity mapping guidance, follow OpenTelemetry severity concepts in the
-logs data model:
+For the full severity number specification including sub-levels, see:
 <https://opentelemetry.io/docs/specs/otel/logs/data-model/#field-severitynumber>
 
 ### 4.7 Exception and Error Logging
@@ -483,6 +499,8 @@ logs data model:
     lower layer.
 - Use the language/runtime error-chain mechanism where available so operators
   can reconstruct the sequence of failure causes from boundary logs.
+- When recording exception details, use the OTel semantic convention attributes:
+  `exception.type`, `exception.message`, and `exception.stacktrace`.
 - Include stack traces when they materially improve diagnosis.
 - Sanitize stack traces and exception messages before emission.
 
@@ -491,7 +509,7 @@ request failure) without logging the same full stack trace at every layer.
 
 ### 4.8 Safety and Compliance Rules
 
-- MUST never log secrets, credentials, session tokens, private keys, or
+- MUST NOT log secrets, credentials, session tokens, private keys, or
   personal data.
 - MUST redact sensitive substrings before writing log output.
 - SHOULD avoid full object dumps unless explicitly sanitized.
@@ -513,7 +531,7 @@ request failure) without logging the same full stack trace at every layer.
 
 ### 4.10 Validation Checklist and Ownership
 
-Before release, teams should verify:
+Before release, teams SHOULD verify:
 
 - Required fields are present in production logs.
 - Log output is valid structured JSON, or legacy format logs are mapped to
@@ -521,14 +539,14 @@ Before release, teams should verify:
 - Secrets and sensitive data are redacted.
 - Library and binary responsibilities are correctly separated.
 - Severity levels are used consistently.
-- Correlation fields (`trace_id`, `span_id`) are present when tracing context exists.
+- Correlation fields (`traceId`, `spanId`) are present when tracing context exists.
 
 Ownership split for compliance:
 
 | Control | Development Team | Platform Engineering Team |
 | --- | --- | --- |
 | Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A |
-| Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only |
+| Required app fields (`service.name`, `service.version`, `body`, severity); `deployment.environment` where known | MUST; `deployment.environment` SHOULD | Validate only |
 | Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline |
 | `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it |
 | Log transport to backend (for example Splunk) | N/A | MUST |
@@ -703,7 +721,7 @@ requestDurationMs{path="/api/v1/items/123456"} 187
 
 ### 5.9 Validation Checklist and Ownership
 
-Before release, teams should verify:
+Before release, teams SHOULD verify:
 
 - Metric names, units, and suffixes are compliant.
 - Required baseline metrics are present.

From 1bf0aa15c14cc6c46ad0bc991951596019b4570e Mon Sep 17 00:00:00 2001
From: sametd <sametdemir@gmail.com>
Date: Sat, 14 Mar 2026 00:36:05 +0100
Subject: [PATCH 7/7] docs(observability): move guidelines to Development
 Practices/Observability.md

---
 .../Observability.md                                            | 0
 README.md                                                       | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename Observability/observability-guidelines.md => Development Practices/Observability.md (100%)

diff --git a/Observability/observability-guidelines.md b/Development Practices/Observability.md
similarity index 100%
rename from Observability/observability-guidelines.md
rename to Development Practices/Observability.md
diff --git a/README.md b/README.md
index 03798a2..f644184 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM
 - [Project Maturity](./Project%20Maturity)
 - [Containerisation](./Containerisation)
 - [Testing](./Testing)
-- [Observability](./Observability)
+- [Observability](./Development%20Practices/Observability.md)
 - [ECMWF Software EnginE (ESEE)](./ESEE)
 - [Contributing to External Projects](./Contributing%20Externally/)
 - [Incoming External Contributions](./External%20Contributions/)