
Add ECMWF observability guidelines for logging and metrics#39

Merged
tlmquintino merged 7 commits into main from codex/observability-guidelines
Mar 13, 2026

Conversation

@sametd (Member) commented Feb 11, 2026

Context

This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.

Review approach

Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.

Discussion scope for this PR

Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.

Next steps

After this review round:

  1. Incorporate feedback into a revised version.
  2. Add alerting guidance.
  3. Add environment-specific collection guidance.
  4. Add tracing guidance.

@jameshawkes marked this pull request as ready for review February 16, 2026 16:35
@peshence (Contributor) left a comment

Looks great!

@sametd requested a review from jameshawkes February 16, 2026 19:54
@cfkanesan left a comment

Hi @sametd, this is a very nicely written document in my opinion. I added some comments, and I think there could be some additional guidelines for the case where the application pushes logs to the collector itself (as opposed to having them collected from stdout). For instance, log delivery should be delegated to a sidecar thread or process so that it neither blocks the main thread nor adds latency, and failure to deliver logs, full buffers, or temporary downtime of the log collection infrastructure must not impact the uptime or performance of the service.
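The non-blocking hand-off described above can be sketched with Python's standard-library queue-based logging; the helper name and queue size here are illustrative, not part of the guideline:

```python
# Sketch: delegate log delivery to a background thread so the main thread
# never blocks on a slow or unavailable log sink.
import logging
import logging.handlers
import queue

def make_nonblocking_logger(name: str):
    """Route records through an in-memory queue; a listener thread forwards
    them to the real handler, so the application's emit never blocks."""
    log_queue = queue.Queue(maxsize=10_000)  # bounded: drop rather than stall
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The slow handler (stand-in for a network exporter) lives behind the listener.
    slow_handler = logging.StreamHandler()
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener

logger, listener = make_nonblocking_logger("app")
logger.info("Operation completed")
listener.stop()  # flushes queued records on shutdown
```

With a bounded queue, a full buffer is surfaced through the handler's error path instead of stalling the service, which matches the "logging must not impact uptime" point above.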

"severity_text": "INFO",
"severity_number": 9,
"body": "Operation completed",
"resource": {
Member:

We should also agree with CD on a set of fields for resource here. I'm already seeing different approaches in the test Opensearch infrastructure they're setting up.
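For discussion, a hypothetical shared resource block following the OpenTelemetry resource semantic conventions might look like the sketch below; the exact field set, placeholder values, and required-key choice are assumptions, not an agreed ECMWF standard:

```python
# Hypothetical shared "resource" block; field names follow OpenTelemetry
# resource semantic conventions, values are placeholders.
RESOURCE = {
    "service.name": "example-service",       # placeholder
    "service.version": "1.2.3",              # placeholder
    "service.namespace": "ecmwf",            # placeholder
    "deployment.environment": "production",  # SHOULD-level per this PR
    "host.name": "node-01",                  # placeholder
}

def validate_resource(resource: dict) -> list:
    """Return the required keys missing from a resource block.
    The required set here is an assumed minimum, pending agreement with CD."""
    required = {"service.name", "service.version"}
    return sorted(required - resource.keys())
```

A small validator like this could run in the collection pipeline to flag services that deviate from whatever field set is eventually agreed.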

@jameshawkes requested review from cauzm, dsarmany, simondsmart and tlmquintino and removed request for EddyCMWF, Ozaq, dsarmany and tbkr March 2, 2026 11:15
@simondsmart (Contributor) left a comment

There are a few things which I find it surprising not to be discussed at all, in the context of observability guidelines:

  • Access to logs (both how, and who should have access). This all seems to be delegated to something downstream. But it is really rather important - and not something which is the same for every service/universal.
  • Log lifetime. We have decades of contiguous log coverage for MARS/ECFS/others. We may not have the same requirements for other services. Currently we maintain and build the archive of logs ourselves - is this intended to be the responsibility of the service developer, or to move this to the logging infrastructure. If the latter, who is in charge of the lifetimes? And long term archival?
  • How do we handle outage of the telemetry system? In times of outage (power, network, ...) we have seen that lots of this infrastructure is not top priority when bringing things back up relative to operations. That means it is likely to lag our services. How do we access logs in that timeframe (which may be very urgent), and how do we ensure there are not gaps in our log coverage.
  • How do we interact with existing code bases and services. We have large, stable systems whose logs are already being processed and used for various purposes. It is/would be a huge undertaking to migrate MARS for instance (especially the server side) to only output JSON-based logs.
  • How do we handle logs for non-request-based tooling (startup/shutdown, housekeeping operations)?

### 4.7 Exception and Error Logging

- Log an exception once at the handling boundary.
- Avoid duplicate logging of the same error in multiple layers.
Contributor:

Hmmz. Typically there is a cascade (e.g. an I/O error causes a decode error causes a user-request failure). Logging that cascade has serious value, no?

Member Author:

Again, tricky. Section 4.7 was revised to preserve causal-chain context while avoiding duplicate full stack logs at every layer.
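One way to reconcile the two points can be sketched in Python: chain exceptions as they propagate (`raise ... from ...`) and log only once at the handling boundary, where `logging.exception` records the full causal traceback. The layer names below are illustrative, not taken from the document:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("request-handler")

def read_block() -> bytes:
    # Lowest layer: the original I/O failure.
    raise OSError("disk read failed")

def decode_field() -> dict:
    # Middle layer: wrap with context, but do NOT log here.
    try:
        return {"data": read_block()}
    except OSError as exc:
        raise ValueError("decode failed") from exc

def handle_request() -> None:
    # Handling boundary: log exactly once.
    try:
        decode_field()
    except ValueError:
        # logging.exception records the full chained traceback, so the
        # I/O -> decode -> request cascade is preserved in a single record.
        log.exception("request failed")

handle_request()
```

The cascade Simon describes is kept intact in one log record, while the intermediate layers stay silent, avoiding the duplicate full stack traces the guideline warns about.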

@sametd (Member Author) commented Mar 3, 2026

@simondsmart Appreciate the review, very helpful. I’d kept some topics out initially to keep things short, and I’ve now expanded the document and addressed the majority of your feedback.

  • Access to logs (both how, and who should have access). This all seems to be delegated to something downstream. But it is really rather important - and not something which is the same for every service/universal.

I added Section 3.2 “Log Access and Ownership” with access-path requirements, role approval, IAM process alignment, emergency access expectations, and ownership split across Development / Platform / Production. Please check if this matches the operational model you had in mind.

  • Log lifetime. We have decades of contiguous log coverage for MARS/ECFS/others. We may not have the same requirements for other services. Currently we maintain and build the archive of logs ourselves - is this intended to be the responsibility of the service developer, or to move this to the logging infrastructure. If the latter, who is in charge of the lifetimes? And long term archival?

I added Section 3.3 “Log Retention and Archival” with retention declaration at onboarding, default retention from central logging, override process, long-term archival declaration, and ownership responsibilities for implementation/review.

  • How do we handle outage of the telemetry system? In times of outage (power, network, ...) we have seen that lots of this infrastructure is not top priority when bringing things back up relative to operations. That means it is likely to lag our services. How do we access logs in that timeframe (which may be very urgent), and how do we ensure there are not gaps in our log coverage.

This was really something I was planning to add in the next iteration. Now, I added Section 3.4 “Telemetry Outage and Recovery” covering degraded-mode behavior for logs/metrics/traces, buffering/retry/backfill expectations, gap detection/reporting, and runbook requirements for urgent access during outages.
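The buffering and gap-accounting idea can be sketched as below, assuming a hypothetical BufferedExporter and a delivery callable that raises ConnectionError while the telemetry endpoint is down; none of these names come from the guideline:

```python
from collections import deque

class BufferedExporter:
    """Bounded buffer in front of a telemetry endpoint that may be down."""

    def __init__(self, send, max_records: int = 1000):
        self._send = send          # delivery callable; may raise ConnectionError
        self._buffer = deque()
        self._max = max_records
        self.dropped = 0           # gap counter, reportable after recovery

    def export(self, record: dict) -> None:
        if len(self._buffer) >= self._max:
            self._buffer.popleft()  # evict oldest to protect service memory
            self.dropped += 1       # count the gap instead of hiding it
        self._buffer.append(record)
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return              # endpoint still down: keep buffering
            self._buffer.popleft()
```

The key property is that an outage degrades telemetry (bounded loss, counted) rather than service uptime, and the `dropped` counter gives gap detection something concrete to report.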

  • How do we interact with existing code bases and services. We have large, stable systems whose logs are already being processed and used for various purposes. It is/would be a huge undertaking to migrate MARS for instance (especially the server side) to only output JSON-based logs.

Tricky one. I added Section 4.11 “Legacy Compatibility and Migration” to state target model vs transition model explicitly: no immediate JSON-only requirement for stable legacy services; collector/pipeline mapping is acceptable during phased migration.

  • How do we handle logs for non-request-based tooling (startup/shutdown, housekeeping operations)?

I think the guideline does not limit logging to request-bound events: startup/shutdown/housekeeping and other background tooling logs are first-class logs, with the same structure/severity/redaction expectations. Correlation fields such as trace_id are optional when no tracing/request context exists.
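As an illustration, a non-request startup event can carry the same structure while simply omitting the correlation fields; the event name below is a hypothetical guess at a naming scheme, not one defined in the document:

```python
import datetime
import json

def startup_event(service: str) -> str:
    """Emit a structured log for a non-request event (service startup)."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "severity_text": "INFO",
        "severity_number": 9,
        "body": "Service started",
        "attributes": {"event.name": "service.startup.success"},  # assumed example name
        "resource": {"service.name": service},
        # trace_id / span_id intentionally absent: no request context here
    }
    return json.dumps(record)

print(startup_event("housekeeping-job"))
```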

@sametd requested a review from simondsmart March 3, 2026 01:36
@cauzm left a comment

I don’t have any further comments from my side. This is not my main area of expertise, and Simon has already provided a number of relevant comments.

- Rename severity_text/severity_number to severityText/severityNumber
- Rename trace_id/span_id to traceId/spanId and move to top-level LogRecord fields
- Upgrade traceId requirement from SHOULD to MUST when available
- Split required fields table into LogRecord fields and Resource attributes
- Downgrade deployment.environment from MUST to SHOULD
- Add TRACE severity level and severityText-to-severityNumber mapping table
- Add OTel exception attributes: exception.type, exception.message, exception.stacktrace
- Remove deprecated event.domain attribute
- Replace error.message with exception.message per OTel semantic conventions
- Fix MUST not -> MUST NOT in library logging rules
- Fix lowercase normative keywords (should -> SHOULD/MUST NOT)
- Update event.name examples to follow three-part domain.action.result format
- Align 4.10 ownership table with deployment.environment SHOULD requirement
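The renamed fields and the severityText-to-severityNumber mapping in the list above can be sketched as follows; the base severity numbers come from the OpenTelemetry Logs Data Model, while the helper function and its parameters are illustrative:

```python
# Base severityNumber values per OpenTelemetry Logs Data Model.
SEVERITY_NUMBER = {
    "TRACE": 1, "DEBUG": 5, "INFO": 9,
    "WARN": 13, "ERROR": 17, "FATAL": 21,
}

def log_record(severity_text: str, body: str, trace_id=None) -> dict:
    """Build a LogRecord using the renamed camelCase fields."""
    record = {
        "severityText": severity_text,
        "severityNumber": SEVERITY_NUMBER[severity_text],
        "body": body,
    }
    if trace_id is not None:
        # Top-level LogRecord field (not an attribute); per the change
        # above it is a MUST when a trace context is available.
        record["traceId"] = trace_id
    return record
```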
@tlmquintino (Member) left a comment

The previous reviews have been excellent and very complete.
I have nothing to add.
But please reorganise the file into: Development Practices/Observability.md

@tlmquintino (Member) left a comment

Thanks! all good

@tlmquintino merged commit db3c5b7 into main Mar 13, 2026
1 check passed
@tlmquintino deleted the codex/observability-guidelines branch March 13, 2026 23:38

8 participants