diff --git a/Development Practices/Observability.md b/Development Practices/Observability.md
new file mode 100644
index 0000000..0eeb1e6
--- /dev/null
+++ b/Development Practices/Observability.md
@@ -0,0 +1,741 @@
+# ECMWF Observability Guidelines
+
+## Table of Contents
+
+- [1. Purpose and Scope](#1-purpose-and-scope)
+- [2. Core Principles](#2-core-principles)
+ - [2.1 Normative Language](#21-normative-language)
+- [3. Platform Context](#3-platform-context)
+ - [3.1 High-Level Collection Strategy](#31-high-level-collection-strategy)
+ - [3.2 Log Access and Ownership](#32-log-access-and-ownership)
+ - [3.3 Log Retention and Archival](#33-log-retention-and-archival)
+ - [3.4 Telemetry Outage and Recovery](#34-telemetry-outage-and-recovery)
+- [4. Logging Standard](#4-logging-standard)
+ - [4.1 Log Event Model](#41-log-event-model)
+ - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract)
+ - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality)
+ - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging)
+ - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines)
+ - [4.5.1 Trace Correlation Fields (`traceId` and `spanId`)](#451-trace-correlation-fields-traceid-and-spanid)
+ - [4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`)](#452-correlation-identifiers-traceid-requestid-jobid)
+ - [4.6 Severity and Event Design](#46-severity-and-event-design)
+ - [4.7 Exception and Error Logging](#47-exception-and-error-logging)
+ - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules)
+ - [4.9 Common Anti-Patterns](#49-common-anti-patterns)
+ - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership)
+ - [4.11 Legacy Compatibility and Migration](#411-legacy-compatibility-and-migration)
+- [5. Metrics Standard](#5-metrics-standard)
+ - [5.1 Scope and Standard](#51-scope-and-standard)
+ - [5.2 References](#52-references)
+ - [5.3 Metric Types and Usage](#53-metric-types-and-usage)
+ - [5.4 Naming Conventions](#54-naming-conventions)
+ - [5.5 Labels and Cardinality](#55-labels-and-cardinality)
+ - [5.6 Required Baseline Metrics](#56-required-baseline-metrics)
+ - [5.7 Histogram Guidance](#57-histogram-guidance)
+ - [5.8 Good and Bad Metric Examples](#58-good-and-bad-metric-examples)
+ - [5.9 Validation Checklist and Ownership](#59-validation-checklist-and-ownership)
+
+## 1. Purpose and Scope
+
+This document defines the ECMWF baseline for observability across software and services.
+
+Current scope:
+
+- Defines common expectations for observability signals.
+- Defines logging and metrics standardisation.
+- Covers all deployment contexts at a principle level:
+ - Kubernetes
+ - Virtual machines (VMs)
+ - HPC
+ - Bare metal servers
+ - Remote data-mover hosts
+
+Out of scope in this version:
+
+- Detailed environment-specific collection pipelines and agent deployment patterns.
+- Full tracing specification (to be defined in a later revision).
+
+## 2. Core Principles
+
+- Use consistent observability conventions across all ECMWF software.
+- Prefer machine-parseable telemetry over free-form text.
+- Keep telemetry actionable and low-noise.
+- Correlate signals where possible (for example, include trace/span
+ identifiers in logs when available).
+- Protect sensitive data by design (no credentials, tokens, or personal data
+ in logs/metrics/traces).
+
+### 2.1 Normative Language
+
+The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels:
+
+- `MUST`: mandatory requirement for compliance.
+- `SHOULD`: recommended default; deviations require justification.
+- `MAY`: optional behavior.
+
+## 3. Platform Context
+
+ECMWF software runs in multiple environments:
+
+- Kubernetes clusters
+- Virtual machines
+- HPC systems
+- Bare metal servers
+- Remote data-mover hosts
+
+This document focuses on the common structure of logs and metrics, plus
+application-level emission rules. Environment-specific collection design for
+Kubernetes, VMs, HPC, bare metal, and remote data-mover hosts will be
+specified later.
+
+### 3.1 High-Level Collection Strategy
+
+The collection pipeline is part of the deployment environment and MUST be
+considered in service design.
+
+- Kubernetes workloads:
+ - Platform Engineering Team deploys and operates OpenTelemetry collectors.
+ - Application stdout/stderr is captured by the container runtime into node
+ log files (or equivalent runtime log sources), and collectors read/tail
+ those sources.
+ - Metrics/traces are collected via SDK/exporter endpoints or local agents,
+ depending on service design.
+- VM, HPC, and bare-metal workloads (including remote data-mover hosts):
+ - Applications/jobs write logs to host-local logging sources (for example
+ files, journald, or syslog), and host-local or scheduler-integrated
+ collectors read from those sources.
+ - Metrics/traces are collected via local endpoints/agents where enabled.
+- Central ingestion (common stage for all environments):
+ - Receives telemetry from Kubernetes and VM/HPC/bare-metal collection paths.
+ - Routes logs to the central ECMWF logging backend.
+ - Routes metrics to the central Prometheus-compatible metrics stack.
+
+```mermaid
+flowchart TB
+    subgraph K["Kubernetes"]
+        K1["Workloads"] --> K2["Container Runtime Log Sources"]
+        K2 --> K3["Collector<br/>(DaemonSet/Sidecar)"]
+    end
+
+    subgraph H["VM / HPC / Bare Metal"]
+        H1["Applications / Jobs"] --> H2["Host-local Log Sources<br/>(files/journald/syslog)"]
+        H2 --> H3["Collector Agent<br/>(Host-local / Scheduler-integrated)"]
+    end
+
+    K3 --> C["Central Ingestion<br/>(Common Stage)"]
+    H3 --> C
+    C -->|logs| D["Logs Backend"]
+    C -->|metrics| E["Prometheus Metrics<br/>Stack"]
+```
+
+### 3.2 Log Access and Ownership
+
+Access to production logs is a service onboarding requirement and MUST be
+defined explicitly for each service and environment (Kubernetes, VM, HPC,
+bare metal, and remote data-mover hosts).
+
+Minimum governance requirements:
+
+- Access path MUST be documented (for example central logging UI/API and,
+ where required for resilience, approved host-local access method).
+- Access roles MUST be defined and approved by service owner and operations.
+- Access MUST be granted through managed team groups.
+- Access provisioning and access changes MUST follow the standard IAM
+ approval and logging process.
+- Emergency access procedure MUST be documented, including allowed methods
+ during central logging outage (for Kubernetes, controlled use of
+ `kubectl logs`; for VM/HPC/bare metal, approved host-local log access).
+ Emergency access MUST be time-limited and linked to an active incident.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Define service log access requirements | MUST | SHOULD review feasibility | MUST review operational fit |
+| Implement central access controls (RBAC/SSO/groups) | N/A | MUST | SHOULD validate |
+| Approve and periodically review access lists | MUST | SHOULD support automation | MUST |
+| Maintain emergency access runbook | SHOULD contribute service context | SHOULD provide platform procedure | MUST own operational procedure |
+
+### 3.3 Log Retention and Archival
+
+Log retention requirements MUST be defined for each service and environment
+at onboarding and reviewed during major service changes.
+
+Retention model:
+
+- A default retention period MUST be provided by the central logging service.
+- Service-specific retention overrides MAY be requested with justification.
+- Long-term archival requirements (beyond central retention) MUST be declared
+ by the service owner and approved by operations.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Declare required retention and archival period | MUST | SHOULD review feasibility | MUST review operational fit |
+| Implement retention in central logging platform | N/A | MUST | SHOULD validate production coverage |
+| Implement and operate long-term archival pipeline | N/A | SHOULD support platform integration | MUST |
+| Periodically review retention settings and costs | SHOULD | MUST provide platform metrics | MUST |
+
+### 3.4 Telemetry Outage and Recovery
+
+Observability design MUST include degraded-mode behavior for periods when
+central telemetry ingestion is unavailable.
+
+Minimum requirements:
+
+- Applications MUST continue emitting telemetry signals locally during central
+ outage:
+ - logs to a local, accessible sink (stdout/stderr, file, or system logger);
+ - metrics via a local scrape/export endpoint or host-local collector path;
+ - traces to a local collector/agent where tracing is enabled.
+- Collection/forwarding components SHOULD buffer locally and retry delivery
+ when connectivity or backend availability is restored.
+- Backfill after recovery MUST be supported where buffering exists.
+- Known telemetry coverage gaps MUST be detectable and reported to operations.
+- Services and runbooks MUST define how urgent logs are retrieved during
+ outage using documented emergency access methods.
+
+Ownership model:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Ensure service emits local telemetry in degraded mode | MUST | SHOULD provide guidance | SHOULD validate in production |
+| Provide buffering/retry/backfill capability in pipeline | N/A | SHOULD | MUST validate operational readiness |
+| Detect and report ingestion coverage gaps | SHOULD emit health signals | MUST provide platform-level detection | MUST monitor and escalate |
+| Maintain outage and recovery runbook | SHOULD contribute service behavior | SHOULD contribute platform behavior | MUST own incident operation |
+
+## 4. Logging Standard
+
+ECMWF software SHOULD emit structured logs aligned with the OpenTelemetry
+log data model.
+
+Useful references:
+
+- OpenTelemetry logs data model:
+- OpenTelemetry semantic conventions:
+
+### 4.1 Log Event Model
+
+Each log event MUST provide the following information, either directly in the
+record or via stable resource/context enrichment in the pipeline:
+
+- A clear event message (`body` / message).
+- Severity (`severityText`, `severityNumber`).
+- Timestamp in UTC.
+- Stable resource attributes (service and environment metadata).
+- Context attributes for debugging and operations.
+
+Canonical structure (OpenTelemetry-aligned):
+
+```json
+{
+ "timestamp": "2026-02-11T12:20:43Z",
+ "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+ "spanId": "f9c3a29d03ef154f",
+ "severityText": "INFO",
+ "severityNumber": 9,
+ "body": "Operation completed",
+ "resource": {
+ "service.name": "example-service",
+ "service.version": "1.0.0",
+ "deployment.environment": "prod",
+ "k8s.namespace.name": "default",
+ "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
+ },
+ "attributes": {
+ "event.name": "data.transfer.completed",
+ "request.id": "req-8f31c9",
+ "job.id": "job-42a7"
+ }
+}
+```
+
+### 4.2 Required Fields (Minimum Contract)
+
+All production logs MUST include the following fields.
+
+The minimum contract applies to the effective log event at query/analysis
+time. Fields MAY be set directly by the application or added by approved
+collector/pipeline enrichment, provided values are stable and correct.
+
+LogRecord fields (top-level in the log record):
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
+| `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format |
+| `severityText` | MUST | `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` |
+| `severityNumber` | MUST | Numeric OTel-compatible severity |
+| `body` | MUST | Human-readable message describing one event |
+| `traceId` | MUST when available | Enables log-trace correlation; not required for startup, housekeeping, or other non-request events |
+| `spanId` | MUST when available | Enables log-trace correlation |
+
+Resource attributes (nested inside the `resource` block):
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
+| `service.name` | MUST | Logical service/application name |
+| `service.version` | MUST | Deployed version/build identifier |
+| `deployment.environment` | SHOULD | e.g. `dev`, `test`, `staging`, `prod`; may not be known by the application at runtime |
+
+Collector-enriched or infrastructure fields:
+
+| Field | Requirement | Notes |
+| --- | --- | --- |
+| `host.name` | MUST (VM/HPC context) | May be emitted by app or added by collector/resource detection |
+| `k8s.namespace.name` | MUST (K8s context) | May be added at collection layer |
+| `k8s.pod.name` | MUST (K8s context) | May be added at collection layer |
+
+Recommended additional fields:
+
+- `event.name` (stable event type)
+- `error.type` for error classification; `exception.type`, `exception.message`, and `exception.stacktrace` for exception details
+- Request/work item identifiers (for example `request.id`, `job.id`)
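The minimum contract can be sketched as follows. This is an illustrative example only, not a prescribed API: `make_log_record`, the service metadata values, and the attribute values are all hypothetical placeholders.

```python
import json
from datetime import datetime, timezone

def make_log_record(body, severity_text, severity_number, attributes=None):
    """Build a log event satisfying the Section 4.2 minimum contract."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "severityText": severity_text,
        "severityNumber": severity_number,
        "body": body,
        "resource": {
            # Placeholder resource attributes; real values come from deployment
            # metadata or collector enrichment.
            "service.name": "example-service",
            "service.version": "1.0.0",
            "deployment.environment": "prod",
        },
        "attributes": attributes or {},
    }

record = make_log_record(
    "Operation completed", "INFO", 9,
    {"event.name": "data.transfer.completed", "request.id": "req-8f31c9"},
)
print(json.dumps(record))
```

Trace correlation fields (`traceId`, `spanId`) would be merged in from the active tracing context when one exists.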
+
+### 4.3 Event Naming and Attribute Cardinality
+
+Event naming convention:
+
+- Use `event.name` in the form `domain.action.result`.
+- Use lowercase with `.` separators.
+- Keep names stable over time.
+- If an event meaning changes materially, create a new event name.
+
+Examples:
+
+- `data.transfer.completed`
+- `data.transfer.failed`
+
+When defining log attributes, teams MUST consider attribute cardinality.
+Cardinality is the number of distinct values an attribute has across events.
+
+High cardinality reduces observability quality because each distinct value
+creates its own group, which fragments aggregates and increases storage/query
+cost.
+
+Attribute guidance:
+
+- Prefer low to medium cardinality attributes for repeated events.
+- Use request/job identifiers only for correlation and troubleshooting.
+- Do not create dynamic field names.
+- Do not move arbitrary payloads into attributes.
+- Keep large free-text content in `body` when necessary.
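The `domain.action.result` naming convention can be checked mechanically. The sketch below is a hypothetical validator, assuming lowercase segments (optionally with digits and underscores) joined by `.`, with at least three segments:

```python
import re

# Hypothetical check for the Section 4.3 event.name convention:
# lowercase '.'-separated segments in the shape domain.action.result.
EVENT_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}$")

def is_valid_event_name(name: str) -> bool:
    # Rejects camelCase, uppercase, and names with fewer than three segments.
    return bool(EVENT_NAME_RE.fullmatch(name))
```

For example, `data.transfer.completed` passes, while `DataTransferCompleted` and the two-segment `data.transfer` do not.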
+
+### 4.4 Library vs Binary Application Logging
+
+#### Libraries
+
+- MUST NOT configure global logging policy (sinks, format, or global levels).
+- MUST use logger/context provided by the application, or a documented
+ adapter/interface supplied by the application.
+- MUST expose structured key/value fields in logging calls, not only
+ pre-formatted message strings.
+- MUST NOT log secrets or large payloads.
+- SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths.
+- SHOULD include stable event names for reusable log points:
+ - Example: `event.name="library.decode.failed"`
+ - Avoid changing field keys between library versions without migration notes.
+
+Library API expectation:
+
+- Library entry points SHOULD accept logging context from the caller
+ (logger handle/interface plus correlation fields when available).
+- If a logger is not passed explicitly, the library SHOULD accept a context
+ object that carries logger and correlation metadata.
+- Library code SHOULD propagate the received logger/context unchanged to lower
+ library layers.
+- Libraries MUST NOT silently create independent global logger configuration as
+ a fallback.
+- Libraries SHOULD document the expected logger/context contract in their public
+ API (what is required, optional, and how correlation fields are passed).
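A minimal sketch of the injected-logger pattern, using Python's stdlib `logging`. The function `decode_message` and the event names are illustrative, not part of any real ECMWF library; the key point is that the library accepts a logger from the caller and never installs handlers or changes global configuration:

```python
import logging
from typing import Optional

def decode_message(payload: bytes, logger: Optional[logging.Logger] = None) -> str:
    """Hypothetical library entry point with caller-supplied logger."""
    # Fall back to a module-level logger WITHOUT configuring it: no handlers,
    # no format, no level changes. Records simply propagate to whatever the
    # application has configured.
    log = logger or logging.getLogger(__name__)
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # Structured key/value fields and a stable event name, not a
        # pre-formatted message string.
        log.error("decode failed", extra={
            "event.name": "library.decode.failed",
            "payload.size": len(payload),
        })
        raise
    log.debug("decode succeeded", extra={"event.name": "library.decode.succeeded"})
    return text
```

A caller would pass its own configured logger: `decode_message(data, logger=app_logger)`.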
+
+#### Binary Applications / Services
+
+- MUST own logger initialisation and runtime configuration.
+- MUST enforce structured JSON output compatible with OTel pipelines.
+- MUST add resource context at startup (`service.*`, environment, runtime metadata).
+- MUST define log level policy by environment.
+- SHOULD control repetitive low-value log volume.
+- MUST implement redaction/masking filters before emission.
+- SHOULD ensure resource attributes are complete:
+ - `service.name`, `service.version`, `deployment.environment`
+ - Runtime and infrastructure attributes when available
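Application-owned logger initialisation might look like the following stdlib sketch. `OtelJsonFormatter`, `init_logging`, and the `otel_attributes` extra key are hypothetical names, and the severity mapping uses the lowest value of each OTel range; a production service would likely use an established JSON logging or OTel SDK integration instead:

```python
import json
import logging
import sys
import time

# Python level names mapped to OTel severityNumber (lowest value per range).
_SEVERITY_NUMBER = {"DEBUG": 5, "INFO": 9, "WARNING": 13, "ERROR": 17, "CRITICAL": 21}

class OtelJsonFormatter(logging.Formatter):
    """Emit one OTel-aligned JSON object per log record."""

    def __init__(self, resource):
        super().__init__()
        self.resource = resource  # stable resource attributes, set at startup

    def format(self, record):
        sev = {"WARNING": "WARN", "CRITICAL": "FATAL"}.get(record.levelname,
                                                           record.levelname)
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "severityText": sev,
            "severityNumber": _SEVERITY_NUMBER.get(record.levelname, 9),
            "body": record.getMessage(),
            "resource": self.resource,
            "attributes": getattr(record, "otel_attributes", {}),
        })

def init_logging(service_name, service_version, environment):
    """Application-owned setup: sink, format, and level policy in one place."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(OtelJsonFormatter({
        "service.name": service_name,
        "service.version": service_version,
        "deployment.environment": environment,
    }))
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(logging.INFO)
    return root
```

Usage: `logger.info("Operation completed", extra={"otel_attributes": {"event.name": "data.transfer.completed"}})`.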
+
+### 4.5 Good and Bad Log Lines
+
+Good log line characteristics:
+
+- Structured key/value format.
+- One clear event per line.
+- Includes identifiers and outcome.
+- Uses stable field names.
+- Supports correlation:
+ - Include `traceId` and `spanId` when context exists.
+ - Include request/job identifiers when available.
+
+Examples below use the same canonical structure as Section 4.1 (`resource`
+and `attributes`) for consistency.
+
+#### 4.5.1 Trace Correlation Fields (`traceId` and `spanId`)
+
+- `traceId` identifies the full end-to-end request/workflow across services.
+- `spanId` identifies one operation within that trace in a single service.
+- Multiple log records in one service operation typically share a `spanId`.
+- A single `traceId` usually contains multiple spans across components.
+- When tracing context is unavailable (for example offline batch steps),
+ these fields MAY be absent.
+
+#### 4.5.2 Correlation Identifiers (`traceId`, `request.id`, `job.id`)
+
+These identifiers represent different scopes of correlation and MAY appear
+together in a single log event.
+
+- `traceId`: identifies one end-to-end distributed trace across services.
+ It is created by tracing instrumentation and used to follow cross-service
+ call chains.
+- `request.id`: identifies one application-level request or unit of API/user
+ work at the service boundary. It is generated by the application or
+ middleware handling that request and propagated through service logs.
+- `job.id`: identifies one batch or workflow execution (for example scheduler
+ submission, worker run, or pipeline task instance). It is used to correlate
+ logs across the full lifecycle of that job.
+
+Guidance:
+
+- These identifiers are complementary and not interchangeable.
+- Include all identifiers that exist in the current execution context.
+- In request/response flows, `traceId` and `request.id` often coexist.
+- In batch/HPC flows, `job.id` is usually primary; `traceId` MAY be absent
+ unless tracing is enabled for that workflow.
+
+Bad log line characteristics:
+
+- Free-form text without structure.
+- Missing context or identifiers.
+- Ambiguous message content.
+- Includes sensitive information.
+- Breaks schema consistency:
+ - Changes field names for the same event type.
+ - Encodes structured data only inside a message string.
+
+Good example:
+
+```json
+{
+ "timestamp": "2026-02-11T12:20:43Z",
+ "traceId": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+ "spanId": "f9c3a29d03ef154f",
+ "severityText": "INFO",
+ "severityNumber": 9,
+ "body": "Operation completed",
+ "resource": {
+ "service.name": "example-service",
+ "service.version": "1.0.0",
+ "deployment.environment": "prod",
+ "k8s.namespace.name": "default",
+ "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd"
+ },
+ "attributes": {
+ "event.name": "data.transfer.completed",
+ "request.id": "req-8f31c9",
+ "job.id": "job-42a7"
+ }
+}
+```
+
+Bad example:
+
+```text
+done request ok
+```
+
+Bad example (sensitive data leak):
+
+```text
+Login failed for user alice password=PlainTextSecret token=eyJhbGci...
+```
+
+### 4.6 Severity and Event Design
+
+- `TRACE`: fine-grained diagnostics; more verbose than `DEBUG`.
+- `DEBUG`: development diagnostics and verbose internals.
+- `INFO`: normal lifecycle and business-relevant state changes.
+- `WARN`: unexpected but recoverable conditions.
+- `ERROR`: failed operation requiring attention.
+- `FATAL`: unrecoverable condition before shutdown.
+
+`severityText` to `severityNumber` mapping (use the lowest value in the range
+unless a finer distinction is needed):
+
+| `severityText` | `severityNumber` range |
+| --- | --- |
+| `TRACE` | 1–4 |
+| `DEBUG` | 5–8 |
+| `INFO` | 9–12 |
+| `WARN` | 13–16 |
+| `ERROR` | 17–20 |
+| `FATAL` | 21–24 |
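The table can be captured in a small helper. This is an illustrative sketch, not a required API; it uses the lowest value per range, and the reverse lookup relies on each range spanning exactly four numbers:

```python
SEVERITY_NUMBER = {
    "TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21,
}

def severity_text(number: int) -> str:
    """Map an OTel severityNumber (1-24) back to its severityText bucket."""
    if not 1 <= number <= 24:
        raise ValueError(f"severityNumber out of range: {number}")
    # Each severityText owns a contiguous range of four numbers.
    return ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"][(number - 1) // 4]
```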
+
+Use stable event names (`event.name`) where possible, and make messages
+explicit about outcome, target, and reason.
+
+For the full severity number specification including sub-levels, see:
+
+
+### 4.7 Exception and Error Logging
+
+- Emit one primary error log at the handling boundary that determines outcome
+ (for example request failure, job failure, retry exhaustion).
+- Intermediate layers MAY log additional context, but SHOULD avoid duplicating
+ full stack traces/messages for the same failure path.
+- Preserve failure context for cascaded errors by recording:
+ - the high-level operation that failed (for example request decoding,
+ workflow step execution, data transfer);
+ - the immediate reason at that layer;
+ - the underlying cause summary when the error was wrapped/propagated from a
+ lower layer.
+- Use the language/runtime error-chain mechanism where available so operators
+ can reconstruct the sequence of failure causes from boundary logs.
+- When recording exception details, use the OTel semantic convention attributes:
+ `exception.type`, `exception.message`, and `exception.stacktrace`.
+- Include stack traces when they materially improve diagnosis.
+- Sanitize stack traces and exception messages before emission.
+
+Goal: preserve the failure chain (for example I/O error -> decode error ->
+request failure) without logging the same full stack trace at every layer.
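A Python sketch of this pattern, assuming hypothetical `read_raw`/`decode`/`handle_request` layers: the intermediate layer wraps with `raise ... from ...` to preserve the cause chain, and only the handling boundary emits the primary error log with the OTel `exception.*` attributes:

```python
import logging
import traceback

logger = logging.getLogger("example-service")  # illustrative logger name

def read_raw(path: str) -> bytes:
    raise OSError(f"cannot read {path}")  # simulated lower-layer I/O failure

def decode(path: str) -> bytes:
    try:
        return read_raw(path)
    except OSError as exc:
        # Wrap with `raise ... from ...` so the cause chain survives without
        # logging a full stack trace at this intermediate layer.
        raise ValueError(f"decode failed for {path}") from exc

def handle_request(path: str) -> str:
    try:
        decode(path)
        return "ok"
    except ValueError as exc:
        # One primary error log at the boundary that decides the outcome.
        logger.error("request failed", extra={"otel_attributes": {
            "event.name": "data.transfer.failed",
            "error.type": type(exc).__name__,
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
            "exception.stacktrace": traceback.format_exc(),
        }})
        return "failed"
```

The boundary's stack trace includes the chained `OSError`, so the I/O error -> decode error -> request failure sequence is reconstructable from a single log record.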
+
+### 4.8 Safety and Compliance Rules
+
+- MUST NOT log secrets, credentials, session tokens, private keys, or
+ personal data.
+- MUST redact sensitive substrings before writing log output.
+- SHOULD avoid full object dumps unless explicitly sanitized.
+- SHOULD include stack traces for errors only when useful and sanitized.
+- SHOULD define deny-lists and redaction rules centrally:
+ - Authentication headers and bearer tokens
+ - Passwords, API keys, secrets
+ - User personal data fields
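A deny-list redaction filter could be sketched as follows. The patterns shown are illustrative examples only, not a complete or approved deny-list; real rules must come from the centrally defined set:

```python
import re

# Example deny-list patterns (illustrative, deliberately incomplete).
_REDACTIONS = [
    # key=value / key: value style secrets
    (re.compile(r"(password|passwd|secret|api[_-]?key|token)\s*[=:]\s*\S+",
                re.IGNORECASE),
     r"\1=[REDACTED]"),
    # Authorization bearer tokens
    (re.compile(r"(authorization:\s*bearer\s+)\S+", re.IGNORECASE),
     r"\1[REDACTED]"),
]

def redact(message: str) -> str:
    """Apply all deny-list patterns before the message is emitted."""
    for pattern, replacement in _REDACTIONS:
        message = pattern.sub(replacement, message)
    return message
```

Such a filter belongs in the application's emission path (Section 4.4), with the pipeline applying defensive redaction as a safety net.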
+
+### 4.9 Common Anti-Patterns
+
+| Anti-pattern | Why it is harmful | Preferred pattern |
+| --- | --- | --- |
+| Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys |
+| Dynamic field names | Breaks queries and dashboards | Stable schema and key names |
+| Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes |
+| Duplicate exception logs across layers | Inflates incident noise | One primary error log at handling boundary; keep intermediate logs contextual and avoid duplicate full stacks |
+| Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists |
+
+### 4.10 Validation Checklist and Ownership
+
+Before release, teams SHOULD verify:
+
+- Required fields are present in production logs.
+- Log output is valid structured JSON, or legacy format logs are mapped to
+ the common schema via approved pipeline parsing/enrichment.
+- Secrets and sensitive data are redacted.
+- Library and binary responsibilities are correctly separated.
+- Severity levels are used consistently.
+- Correlation fields (`traceId`, `spanId`) are present when tracing context exists.
+
+Ownership split for compliance:
+
+| Control | Development Team | Platform Engineering Team |
+| --- | --- | --- |
+| Structured JSON emitted by app | MUST for new services; phased plan allowed for approved legacy services | N/A |
+| Required app fields (`service.name`, `service.version`, `body`, severity); `deployment.environment` where known | MUST; `deployment.environment` SHOULD | Validate only |
+| Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline |
+| `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it |
+| Log transport to backend (for example Splunk) | N/A | MUST |
+| Parsing/schema validation in collector | N/A | SHOULD |
+| Log noise and volume control | SHOULD at source | SHOULD as safety net |
+
+### 4.11 Legacy Compatibility and Migration
+
+These guidelines define the target logging model, but do not require immediate
+JSON-only migration for existing stable services.
+
+Compatibility requirements:
+
+- Existing log formats MAY continue where they are operationally established.
+- Service teams MUST document the current format and downstream consumers
+ before changing log structure.
+- New services MUST emit structured logs by default.
+- For existing services, collector/pipeline parsing and enrichment MAY be used
+ to map legacy logs into the common schema.
+- Migration SHOULD be incremental and service-specific, with no disruption to
+ existing operational workflows.
+
+Target-state requirement:
+
+- Services that can adopt structured JSON logging without operational risk
+ SHOULD do so; services with high migration cost MAY follow a phased plan
+ agreed with platform and operations teams.
+
+## 5. Metrics Standard
+
+ECMWF services MUST use Prometheus metric types and naming conventions, and
+MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
+Metrics defined in this section are the source for alerting rules defined in
+the Alerting section.
+
+### 5.1 Scope and Standard
+
+- This section defines instrumentation expectations, metric schema, and
+ quality requirements.
+- Environment-specific scrape/discovery designs for Kubernetes, VMs, HPC,
+ bare metal, and remote data-mover hosts are specified separately.
+- Metrics exposure and collection at a high level:
+ - HTTP services SHOULD expose a `/metrics` endpoint owned by the service.
+ - Non-HTTP and batch/HPC workloads MUST still expose Prometheus-compatible
+ metrics, typically via a local collector/forwarder integration.
+ - Platform Engineering Team owns central scrape and ingestion configuration.
+
+### 5.2 References
+
+- Prometheus metric types:
+- Prometheus naming best practices:
+- OpenMetrics specification:
+
+
+### 5.3 Metric Types and Usage
+
+- `Counter`:
+ - MUST be monotonic.
+ - MUST use `_total` suffix.
+ - Use for counts of events and outcomes.
+- `Gauge`:
+ - Use for values that increase and decrease (for example in-flight operations).
+- `Histogram`:
+ - SHOULD be used for latency and size distributions.
+ - MUST have stable bucket boundaries for the same metric across instances.
+- `Summary`:
+ - SHOULD be avoided for cross-instance aggregation use cases.
+ - MAY be used only with clear justification.
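In practice a client library such as `prometheus_client` handles registration and exposition. Purely to illustrate the counter type, the `_total` suffix, and the text exposition format, here is a minimal stdlib sketch; `render_counter` is a hypothetical helper, and the metric/label names follow the baseline examples in Section 5.6:

```python
def render_counter(name, help_text, samples):
    """Render a counter in Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_counter(
    "service_requests_total", "Completed requests.",
    [({"operation": "create", "status": "2xx"}, 12842)],
)
print(text)
```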
+
+### 5.4 Naming Conventions
+
+- Metric names MUST be lowercase `snake_case`.
+- Metric names MUST include base units where applicable:
+ - `_seconds` for duration
+ - `_bytes` for size
+ - `_total` for counters
+- Metric names SHOULD be stable over time.
+- If a name must change, introduce the new metric and deprecate the old one
+ before removal.
+
+Good naming examples:
+
+- `http_server_requests_total`
+- `http_server_request_duration_seconds`
+- `job_execution_duration_seconds`
+- `process_resident_memory_bytes`
+
+Bad naming examples:
+
+- `HttpRequests`
+- `requestDurationMs`
+- `errors`
+
+### 5.5 Labels and Cardinality
+
+Labels add dimensionality to metrics but increase cardinality.
+Cardinality is the number of distinct label combinations a metric produces.
+
+High cardinality reduces metric usefulness because it creates too many
+series, increasing storage/query cost and weakening dashboard/alert signal.
+
+- Labels MUST use stable keys and bounded value sets.
+- Labels SHOULD describe dimensions such as:
+ - `service`
+ - `environment`
+ - `operation`
+ - `status`
+- Labels MUST NOT include unbounded identifiers such as:
+ - `request_id`
+ - `user_id`
+ - Raw URLs with path parameters
+ - UUIDs or timestamps
+- Label values SHOULD be normalized:
+ - Prefer route templates (for example `/api/v1/items/{id}`) over raw paths.
+ - Prefer status classes (`2xx`, `4xx`, `5xx`) when detail is not required.
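The normalisation rules above can be sketched as simple helpers. Both functions are illustrative assumptions: `normalize_path` replaces numeric and UUID-like path segments with a `{id}` placeholder, and `status_class` collapses HTTP status codes to their class:

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def normalize_path(path: str) -> str:
    """Turn a raw request path into a bounded route template."""
    parts = [
        "{id}" if seg.isdigit() or _UUID_RE.match(seg) else seg
        for seg in path.split("/")
    ]
    return "/".join(parts)

def status_class(code: int) -> str:
    """Collapse an HTTP status code into its class (2xx/4xx/5xx/...)."""
    return f"{code // 100}xx"
```

With these helpers, `/api/v1/items/123456` and `/api/v1/items/789` both produce the single label value `/api/v1/items/{id}`, keeping the series count bounded.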
+
+### 5.6 Required Baseline Metrics
+
+Application and service metrics:
+
+- Request/operation throughput counter
+ - Example: `service_requests_total`
+- Request/operation failure counter
+ - Example: `service_request_failures_total`
+- Request/operation duration histogram
+ - Example: `service_request_duration_seconds`
+- In-flight operation gauge (if applicable)
+ - Example: `service_requests_in_flight`
+
+Runtime/process metrics (where runtime supports them):
+
+- CPU usage
+- Memory usage
+- Uptime/start time
+- Runtime-specific health metrics (for example GC metrics)
+
+Batch/HPC job metrics (where applicable):
+
+- Job execution count by outcome
+- Job execution duration
+- Queue/wait duration
+
+### 5.7 Histogram Guidance
+
+- Histogram bucket boundaries SHOULD align with SLO/SLA objectives.
+- Bucket sets MUST remain consistent for the same metric across services and versions.
+- Bucket count SHOULD be limited to a practical set to control cost and query complexity.
+
+Example bucket set for service latency metric:
+
+- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`,
+ `10` seconds
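To make the cumulative `le` semantics concrete, the following sketch counts observations into this bucket set the way Prometheus reports them (each bucket is cumulative and includes values equal to its boundary; `bucket_counts` is a hypothetical helper, not a client-library API):

```python
import bisect

# Example boundary set from above, in seconds.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

def bucket_counts(observations):
    """Cumulative counts per `le` bucket, with a final +Inf bucket."""
    counts = [0] * (len(BUCKETS) + 1)
    for value in observations:
        # bisect_left puts a value equal to a boundary into that boundary's
        # bucket, matching Prometheus `le` (less-than-or-equal) semantics.
        counts[bisect.bisect_left(BUCKETS, value)] += 1
    for i in range(1, len(counts)):  # make cumulative
        counts[i] += counts[i - 1]
    return counts
```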
+
+### 5.8 Good and Bad Metric Examples
+
+Good examples:
+
+```text
+service_requests_total{service="example-service",environment="prod",operation="create",status="2xx"} 12842
+service_request_duration_seconds_bucket{service="example-service",environment="prod",operation="create",le="0.5"} 12011
+service_request_duration_seconds_sum{service="example-service",environment="prod",operation="create"} 3184.22
+service_request_duration_seconds_count{service="example-service",environment="prod",operation="create"} 12842
+```
+
+Bad examples:
+
+```text
+requests{request_id="d9fd0f7a-3d8e-4c17-9d8b-9b57f43dc40e",user_id="483992"} 1
+requestDurationMs{path="/api/v1/items/123456"} 187
+```
+
+### 5.9 Validation Checklist and Ownership
+
+Before release, teams SHOULD verify:
+
+- Metric names, units, and suffixes are compliant.
+- Required baseline metrics are present.
+- Label keys and values are bounded and normalized.
+- No high-cardinality identifiers are emitted as labels.
+- Histogram buckets are defined and justified.
+
+Ownership split for compliance:
+
+| Control | Development Team | Platform Engineering Team | Production Team |
+| --- | --- | --- | --- |
+| Instrument required baseline metrics | MUST | N/A | SHOULD review service-level usefulness |
+| Naming and unit compliance | MUST | SHOULD validate | SHOULD validate monitoring readiness |
+| Label cardinality discipline | MUST | SHOULD enforce guardrails | SHOULD flag operational risks |
+| Scrape/discovery pipeline configuration | N/A | MUST | SHOULD validate production coverage |
+| Central metric relabeling and hygiene checks | N/A | SHOULD | N/A |
+| Cost and cardinality monitoring at platform level | N/A | SHOULD | SHOULD provide operational feedback |
diff --git a/README.md b/README.md
index 5cb47c9..f644184 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM
- [Project Maturity](./Project%20Maturity)
- [Containerisation](./Containerisation)
- [Testing](./Testing)
+- [Observability](./Development%20Practices/Observability.md)
- [ECMWF Software EnginE (ESEE)](./ESEE)
- [Contributing to External Projects](./Contributing%20Externally/)
- [Incoming External Contributions](./External%20Contributions/)