
Add ECMWF observability guidelines for logging and metrics#39

Merged
tlmquintino merged 7 commits into main from codex/observability-guidelines
Mar 13, 2026

Conversation

@sametd (Member) commented Feb 11, 2026

Context

This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.

Review approach

Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.

Discussion scope for this PR

Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.

Next steps

After this review round:

  1. Incorporate feedback into a revised version.
  2. Add alerting guidance.
  3. Add environment-specific collection guidance.
  4. Add tracing guidance.

@jameshawkes marked this pull request as ready for review February 16, 2026 16:35
@peshence (Contributor) left a comment

Looks great!

@sametd requested a review from jameshawkes February 16, 2026 19:54
@cfkanesan left a comment

Hi @sametd, this is a very nicely written document in my opinion. I added some comments, and I think there could be some additional guidelines for the case where the application pushes logs to the collector itself (as opposed to having them collected from stdout). For instance, log delivery should be delegated to a sidecar thread or process so that it neither blocks the main thread nor adds latency, and failure to deliver logs, full buffers, or temporary downtime of the log collection infrastructure must not impact the uptime or performance of the service.
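The non-blocking hand-off described above can be sketched with Python's standard-library queue-based logging; the helper name and queue size here are illustrative, not part of the guideline:

```python
# Sketch: delegate log delivery to a background thread so the main thread
# never blocks on a slow or unavailable log sink.
import logging
import logging.handlers
import queue

def make_nonblocking_logger(name: str):
    """Route records through an in-memory queue; a listener thread forwards
    them to the real handler, so the application's emit never blocks."""
    log_queue = queue.Queue(maxsize=10_000)  # bounded: drop rather than stall
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The slow handler (stand-in for a network exporter) lives behind the listener.
    slow_handler = logging.StreamHandler()
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener

logger, listener = make_nonblocking_logger("app")
logger.info("Operation completed")
listener.stop()  # flushes queued records on shutdown
```

With a bounded queue, a full buffer is surfaced through the handler's error path instead of stalling the service, which matches the "logging must not impact uptime" point above.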

"severity_text": "INFO",
"severity_number": 9,
"body": "Operation completed",
"resource": {
Member:

We should also agree with CD on a set of fields for resource here. I'm already seeing different approaches in the test Opensearch infrastructure they're setting up.
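For discussion, a hypothetical shared resource block following the OpenTelemetry resource semantic conventions might look like the sketch below; the exact field set, placeholder values, and required-key choice are assumptions, not an agreed ECMWF standard:

```python
# Hypothetical shared "resource" block; field names follow OpenTelemetry
# resource semantic conventions, values are placeholders.
RESOURCE = {
    "service.name": "example-service",       # placeholder
    "service.version": "1.2.3",              # placeholder
    "service.namespace": "ecmwf",            # placeholder
    "deployment.environment": "production",  # SHOULD-level per this PR
    "host.name": "node-01",                  # placeholder
}

def validate_resource(resource: dict) -> list:
    """Return the required keys missing from a resource block.
    The required set here is an assumed minimum, pending agreement with CD."""
    required = {"service.name", "service.version"}
    return sorted(required - resource.keys())
```

A small validator like this could run in the collection pipeline to flag services that deviate from whatever field set is eventually agreed.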

@jameshawkes requested review from cauzm, dsarmany, simondsmart and tlmquintino and removed request for EddyCMWF, Ozaq, dsarmany and tbkr March 2, 2026 11:15
@simondsmart (Contributor) left a comment

There are a few things which I find it surprising not to be discussed at all, in the context of observability guidelines:

  • Access to logs (both how, and who should have access). This all seems to be delegated to something downstream. But it is really rather important - and not something which is the same for every service/universal.
  • Log lifetime. We have decades of contiguous log coverage for MARS/ECFS/others. We may not have the same requirements for other services. Currently we maintain and build the archive of logs ourselves - is this intended to be the responsibility of the service developer, or to move this to the logging infrastructure. If the latter, who is in charge of the lifetimes? And long term archival?
  • How do we handle outage of the telemetry system? In times of outage (power, network, ...) we have seen that lots of this infrastructure is not top priority when bringing things back up relative to operations. That means it is likely to lag our services. How do we access logs in that timeframe (which may be very urgent), and how do we ensure there are not gaps in our log coverage.
  • How do we interact with existing code bases and services. We have large, stable systems whose logs are already being processed and used for various purposes. It is/would be a huge undertaking to migrate MARS for instance (especially the server side) to only output JSON-based logs.
  • How do we handle logs for non-request-based tooling (startup/shutdown, housekeeping operations)?

### 4.7 Exception and Error Logging

- Log an exception once at the handling boundary.
- Avoid duplicate logging of the same error in multiple layers.
Contributor:

Hmmz. Typically there is a cascade (e.g. an I/O error causes a decode error causes a user-request failure). Logging that cascade has serious value, no?

Member Author:

Again, tricky. Section 4.7 was revised to preserve causal-chain context while avoiding duplicate full stack logs at every layer.
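One way to reconcile the two points can be sketched in Python: chain exceptions as they propagate (`raise ... from ...`) and log only once at the handling boundary, where `logging.exception` records the full causal traceback. The layer names below are illustrative, not taken from the document:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("request-handler")

def read_block() -> bytes:
    # Lowest layer: the original I/O failure.
    raise OSError("disk read failed")

def decode_field() -> dict:
    # Middle layer: wrap with context, but do NOT log here.
    try:
        return {"data": read_block()}
    except OSError as exc:
        raise ValueError("decode failed") from exc

def handle_request() -> None:
    # Handling boundary: log exactly once.
    try:
        decode_field()
    except ValueError:
        # logging.exception records the full chained traceback, so the
        # I/O -> decode -> request cascade is preserved in a single record.
        log.exception("request failed")

handle_request()
```

The cascade Simon describes is kept intact in one log record, while the intermediate layers stay silent, avoiding the duplicate full stack traces the guideline warns about.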

@sametd (Member Author) commented Mar 3, 2026

@simondsmart Appreciate the review, very helpful. I’d kept some topics out initially to keep things short, and I’ve now expanded the document and addressed the majority of your feedback.

  • Access to logs (both how, and who should have access). This all seems to be delegated to something downstream. But it is really rather important - and not something which is the same for every service/universal.

I added Section 3.2 “Log Access and Ownership” with access-path requirements, role approval, IAM process alignment, emergency access expectations, and ownership split across Development / Platform / Production. Please check if this matches the operational model you had in mind.

  • Log lifetime. We have decades of contiguous log coverage for MARS/ECFS/others. We may not have the same requirements for other services. Currently we maintain and build the archive of logs ourselves - is this intended to be the responsibility of the service developer, or to move this to the logging infrastructure. If the latter, who is in charge of the lifetimes? And long term archival?

I added Section 3.3 “Log Retention and Archival” with retention declaration at onboarding, default retention from central logging, override process, long-term archival declaration, and ownership responsibilities for implementation/review.

  • How do we handle outage of the telemetry system? In times of outage (power, network, ...) we have seen that lots of this infrastructure is not top priority when bringing things back up relative to operations. That means it is likely to lag our services. How do we access logs in that timeframe (which may be very urgent), and how do we ensure there are not gaps in our log coverage.

This was really something I was planning to add in the next iteration. Now, I added Section 3.4 “Telemetry Outage and Recovery” covering degraded-mode behavior for logs/metrics/traces, buffering/retry/backfill expectations, gap detection/reporting, and runbook requirements for urgent access during outages.
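The buffering and gap-accounting idea can be sketched as below, assuming a hypothetical BufferedExporter and a delivery callable that raises ConnectionError while the telemetry endpoint is down; none of these names come from the guideline:

```python
from collections import deque

class BufferedExporter:
    """Bounded buffer in front of a telemetry endpoint that may be down."""

    def __init__(self, send, max_records: int = 1000):
        self._send = send          # delivery callable; may raise ConnectionError
        self._buffer = deque()
        self._max = max_records
        self.dropped = 0           # gap counter, reportable after recovery

    def export(self, record: dict) -> None:
        if len(self._buffer) >= self._max:
            self._buffer.popleft()  # evict oldest to protect service memory
            self.dropped += 1       # count the gap instead of hiding it
        self._buffer.append(record)
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return              # endpoint still down: keep buffering
            self._buffer.popleft()
```

The key property is that an outage degrades telemetry (bounded loss, counted) rather than service uptime, and the `dropped` counter gives gap detection something concrete to report.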

  • How do we interact with existing code bases and services. We have large, stable systems whose logs are already being processed and used for various purposes. It is/would be a huge undertaking to migrate MARS for instance (especially the server side) to only output JSON-based logs.

Tricky one. I added Section 4.11 “Legacy Compatibility and Migration” to state target model vs transition model explicitly: no immediate JSON-only requirement for stable legacy services; collector/pipeline mapping is acceptable during phased migration.

  • How do we handle logs for non-request-based tooling (startup/shutdown, housekeeping operations)?

I think the guideline does not limit logging to request-bound events: startup/shutdown/housekeeping and other background tooling logs are first-class logs, with the same structure/severity/redaction expectations. Correlation fields such as trace_id are optional when no tracing/request context exists.
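As an illustration, a non-request startup event can carry the same structure while simply omitting the correlation fields; the event name below is a hypothetical guess at a naming scheme, not one defined in the document:

```python
import datetime
import json

def startup_event(service: str) -> str:
    """Emit a structured log for a non-request event (service startup)."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "severity_text": "INFO",
        "severity_number": 9,
        "body": "Service started",
        "attributes": {"event.name": "service.startup.success"},  # assumed example name
        "resource": {"service.name": service},
        # trace_id / span_id intentionally absent: no request context here
    }
    return json.dumps(record)

print(startup_event("housekeeping-job"))
```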

@sametd requested a review from simondsmart March 3, 2026 01:36
@cauzm left a comment

I don’t have any further comments from my side. This is not my main area of expertise, and Simon has already provided a number of relevant comments.

- Rename severity_text/severity_number to severityText/severityNumber
- Rename trace_id/span_id to traceId/spanId and move to top-level LogRecord fields
- Upgrade traceId requirement from SHOULD to MUST when available
- Split required fields table into LogRecord fields and Resource attributes
- Downgrade deployment.environment from MUST to SHOULD
- Add TRACE severity level and severityText-to-severityNumber mapping table
- Add OTel exception attributes: exception.type, exception.message, exception.stacktrace
- Remove deprecated event.domain attribute
- Replace error.message with exception.message per OTel semantic conventions
- Fix MUST not -> MUST NOT in library logging rules
- Fix lowercase normative keywords (should -> SHOULD/MUST NOT)
- Update event.name examples to follow three-part domain.action.result format
- Align 4.10 ownership table with deployment.environment SHOULD requirement
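The renamed fields and the severityText-to-severityNumber mapping in the list above can be sketched as follows; the base severity numbers come from the OpenTelemetry Logs Data Model, while the helper function and its parameters are illustrative:

```python
# Base severityNumber values per OpenTelemetry Logs Data Model.
SEVERITY_NUMBER = {
    "TRACE": 1, "DEBUG": 5, "INFO": 9,
    "WARN": 13, "ERROR": 17, "FATAL": 21,
}

def log_record(severity_text: str, body: str, trace_id=None) -> dict:
    """Build a LogRecord using the renamed camelCase fields."""
    record = {
        "severityText": severity_text,
        "severityNumber": SEVERITY_NUMBER[severity_text],
        "body": body,
    }
    if trace_id is not None:
        # Top-level LogRecord field (not an attribute); per the change
        # above it is a MUST when a trace context is available.
        record["traceId"] = trace_id
    return record
```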
@tlmquintino (Member) left a comment

The previous reviews have been excellent and very complete.
I have nothing to add.
But please reorganise the file into: Development Practices/Observability.md

@tlmquintino (Member) left a comment

Thanks! all good

@tlmquintino merged commit db3c5b7 into main Mar 13, 2026
1 check passed
@tlmquintino deleted the codex/observability-guidelines branch March 13, 2026 23:38

8 participants