WIP: OTEL context conformance with tlsdesc_v1_dev spec#347
WIP: OTEL context conformance with tlsdesc_v1_dev spec#347
Conversation
Implement feature-flagged context storage with two modes: - profiler (default): existing TLS-based storage with checksum - otel: ring buffer storage discoverable via /proc/<pid>/maps Key components: - ContextApi: unified abstraction layer for both modes - OtelContexts: mmap-based ring buffer with in_use flag protocol - ctxstorage option: select mode at profiler startup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Make libclang-rt-dev package conditional - only available on x64. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update implementation to match ctx-sharing-demo reference. Note on naming: "V2" refers to the TLS record format version (struct layout with flexible array), while "tlsdesc_v1_dev" is the schema/ protocol version string. This matches the reference implementation which uses customlabels_v2.h but schema_version="tlsdesc_v1_dev". TLS Record (V2 format): - Fix struct layout: 28-byte header (removed root_span_id) - Use flexible array for attrs_data - Correct field ordering per tlsdesc_v1_dev schema Process Context: - schema_version: string "tlsdesc_v1_dev" (was int) - attribute_key_map: Array encoding (was KvList) - Mapping: writable (rw-p/rw-s) per PR #34, 1 page - Remove mprotect to read-only Fixes: - clear() properly invalidates V2 record - Reader accepts both r-- and rw- permissions - Tests updated for writable mappings Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ivoanjo
left a comment
There was a problem hiding this comment.
Gave it a pass! There were a few of "uuuuh what? 👀 moments". I guess AI?
| CASE("ctxstorage") | ||
| if (value != NULL) { | ||
| if (strcmp(value, "otel") == 0) { | ||
| _context_storage = CTX_STORAGE_OTEL; | ||
| } else { | ||
| _context_storage = CTX_STORAGE_PROFILER; | ||
| } | ||
| } |
There was a problem hiding this comment.
I'm curious -- why keep both around? E.g. is it more of a "while we're testing things out and we can eventually clean it up", or do we suspect the previous approach will stick around for the long run?
| void ContextApi::set(u64 span_id, u64 root_span_id) { | ||
| // Map Datadog format to storage | ||
| // In OTEL mode: trace_id = (0, root_span_id), span_id = span_id | ||
| setOtel(0, root_span_id, span_id); | ||
| } | ||
|
|
||
| void ContextApi::setOtel(u64 trace_id_high, u64 trace_id_low, u64 span_id) { | ||
| // Use atomic load for mode check - may be called from signal handlers | ||
| ContextStorageMode mode = __atomic_load_n(&_mode, __ATOMIC_ACQUIRE); | ||
|
|
||
| TEST_LOG("ContextApi::setOtel: tid=%d mode=%s trace_high=0x%llx trace_low=0x%llx span=0x%llx", | ||
| OS::threadId(), mode == CTX_STORAGE_OTEL ? "OTEL" : "PROFILER", | ||
| (unsigned long long)trace_id_high, (unsigned long long)trace_id_low, | ||
| (unsigned long long)span_id); | ||
|
|
||
| if (mode == CTX_STORAGE_OTEL) { | ||
| OtelContexts::set(trace_id_high, trace_id_low, span_id); |
There was a problem hiding this comment.
This looks weird -- root_span_id becomes trace_id_low... 👀
We can also keep root_span_id for otel, by putting it as an extra attribute (goes in the attrs-data section); this way we don't lose any functionality between the previous format and otel context.
| bool ContextApi::getByTid(int tid, u64& span_id, u64& root_span_id) { | ||
| // Use atomic load for mode check - may be called from signal handlers | ||
| ContextStorageMode mode = __atomic_load_n(&_mode, __ATOMIC_ACQUIRE); | ||
|
|
||
| if (mode == CTX_STORAGE_OTEL) { | ||
| u64 trace_high, trace_low; | ||
| if (OtelContexts::getByTid(tid, trace_high, trace_low, span_id)) { | ||
| root_span_id = trace_low; | ||
| return true; | ||
| } | ||
| return false; | ||
| } else { | ||
| // Profiler mode: cannot read other thread's TLS | ||
| // This is a limitation - JVMTI wall-clock needs OTEL mode for remote reads | ||
| // Fall back to returning false (no context available) | ||
| span_id = 0; | ||
| root_span_id = 0; | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
In the otel code path it "looks like" we could make the read, but then it's hardcoded to always fail. So getByTid never succeeds -- maybe we should completely remove it? I'm not sure when a getByTid that never succeeds would be useful...
| // Tags still come from TLS Context (even in OTEL mode, for compatibility) | ||
| Context &context = Contexts::get(); | ||
| for (size_t i = 0; i < Profiler::instance()->numContextAttributes(); i++) { | ||
| Tag tag = context.get_tag(i); | ||
| buf->putVar32(tag.value); | ||
| } |
There was a problem hiding this comment.
We could put them in attrs-data for otel. In particular trace endpoint would be cool to keep, but in general we could fully support custom attributes for otel.
| @@ -469,12 +486,82 @@ Java_com_datadoghq_profiler_OTelContext_setProcessCtx0(JNIEnv *env, | |||
| .telemetry_sdk_language = const_cast<char*>("java"), | |||
| .telemetry_sdk_version = const_cast<char*>(tracer_version_str.c_str()), | |||
| .telemetry_sdk_name = const_cast<char*>("dd-trace-java"), | |||
| .resources = NULL // TODO: Arbitrary tags not supported yet for Java | |||
| .resources = NULL, // TODO: Arbitrary tags not supported yet for Java | |||
| .tls_config = &default_tls_config | |||
There was a problem hiding this comment.
Minor: max_record_size can be removed now
| /** | ||
| * Represents TLS context sharing configuration. | ||
| * | ||
| * <p>This configuration is used to expose thread-local storage context information | ||
| * to external profilers. The key map maps indices to attribute names, allowing | ||
| * external readers to decode compact TLS records. | ||
| */ | ||
| public static final class TlsConfig { | ||
| /** Default schema version for TLS context sharing (tlsdesc_v1_dev) */ | ||
| public static final String DEFAULT_SCHEMA_VERSION = "tlsdesc_v1_dev"; | ||
|
|
||
| /** TLS schema version string (e.g. "tlsdesc_v1_dev") */ | ||
| public final String schemaVersion; | ||
| /** Maximum bytes per TLS record */ | ||
| public final int maxRecordSize; | ||
| /** Key names in index order (position = key index, e.g. ["method", "route"]) */ | ||
| public final String[] attributeKeyMap; |
There was a problem hiding this comment.
It's kinda weird to see these details come up to the Java level (other than the attribute map) -- should they maybe be hidden in the cpp bits?
| // Verify mmap region naming in /proc/self/maps (informational) | ||
| // Note: PR_SET_VMA_ANON_NAME requires kernel 5.17+ and may not work in all environments | ||
| // The OTEL buffer still works for discovery via magic number scanning if naming fails | ||
| boolean hasNamedRegion = checkMapsContains("DD_OTEL_CTX"); | ||
| if (!hasNamedRegion) { | ||
| System.out.println("INFO: DD_OTEL_CTX mmap naming not available " + | ||
| "(requires kernel 5.17+ with PR_SET_VMA_ANON_NAME support)"); | ||
| } | ||
| } |
| /** | ||
| * Finds the OTEL_CTX mapping in /proc/self/maps. | ||
| * Supports both memfd mappings (/memfd:OTEL_CTX) and named anonymous mappings ([anon:OTEL_CTX]). | ||
| */ | ||
| private OtelMappingInfo findOtelMapping() throws IOException { | ||
| Path mapsFile = Paths.get("/proc/self/maps"); | ||
| if (!Files.exists(mapsFile)) { | ||
| return null; | ||
| } | ||
|
|
||
| Pattern otelPattern = Pattern.compile("^([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+\\S+\\s+\\S+\\s+\\S+\\s*\\[anon:OTEL_CTX\\].*$"); | ||
|
|
||
|
|
||
| // Pattern for named anonymous mapping: [anon:OTEL_CTX] | ||
| Pattern anonPattern = Pattern.compile("^([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+\\S+\\s+\\S+\\s+\\S+\\s*\\[anon:OTEL_CTX\\].*$"); | ||
| // Pattern for memfd mapping: /memfd:OTEL_CTX (deleted) | ||
| Pattern memfdPattern = Pattern.compile("^([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+.*?/memfd:OTEL_CTX.*$"); | ||
|
|
||
| try (BufferedReader reader = Files.newBufferedReader(mapsFile)) { | ||
| String line; | ||
| while ((line = reader.readLine()) != null) { | ||
| Matcher matcher = otelPattern.matcher(line); | ||
| if (matcher.matches()) { | ||
| Matcher anonMatcher = anonPattern.matcher(line); | ||
| if (anonMatcher.matches()) { | ||
| return new OtelMappingInfo( | ||
| anonMatcher.group(1), | ||
| anonMatcher.group(2), | ||
| anonMatcher.group(3) | ||
| ); | ||
| } | ||
| Matcher memfdMatcher = memfdPattern.matcher(line); | ||
| if (memfdMatcher.matches()) { | ||
| return new OtelMappingInfo( | ||
| matcher.group(1), | ||
| matcher.group(2), | ||
| matcher.group(3) | ||
| memfdMatcher.group(1), | ||
| memfdMatcher.group(2), | ||
| memfdMatcher.group(3) |
There was a problem hiding this comment.
I suggest matching only on OTEL_CTX -- may be easier?
|
|
||
| The OTEL Context Storage system provides two distinct context sharing mechanisms: | ||
|
|
||
| 1. **Thread-Level Context**: Ring buffer storage for per-thread trace/span context (existing implementation) |
There was a problem hiding this comment.
This looks to be outdated? In particular, it's not a ring?
There was a problem hiding this comment.
It's very probably referred as a ring somewhere in the docs/reference impl - claude was very persistent on referring to that as a ring buffer :)
| ### Process Context | ||
|
|
||
| 1. **Mapping Permissions**: Use `rw-p` (anonymous) or `rw-s` (memfd) | ||
| - Do NOT use `mprotect()` to make read-only | ||
| - Writable mappings allow in-place updates (PR #34) | ||
|
|
There was a problem hiding this comment.
I guess this got carried over from the time where the pages were stated to be read-only; it's kinda weird to mention it now since the spec says nothing about it anymore. Maybe worth giving a cleaning pass in general?
Yes. Ah, you shouldn't have spent a lot of time on this ... this is just a dirty AI generated scaffolding, hence still draft .. |
What does this PR do?:
Updates the OTEL context implementation to conform with the tlsdesc_v1_dev specification from the ctx-sharing-demo reference implementation. Fixes both TLS record format and process context encoding to match the expected format for external profiler discovery.
Motivation:
The current implementation had diverged from the reference specification in ctx-sharing-demo, causing compatibility issues with external profilers trying to read context. Key issues included:
Additional Notes:
Note on naming convention: "V2" refers to the TLS record format version (struct layout with flexible array member), while "tlsdesc_v1_dev" is the schema/protocol version string. This matches the reference implementation which uses customlabels_v2.h but schema_version="tlsdesc_v1_dev".
Changes include:
How to test the change?:
./gradlew buildDebug./gradlew :ddprof-test:test --tests "*OtelContext*"VALIDATE OK: [v2] thread=..., labels=[...]All existing tests pass, including ProcessContextTest and OtelContextStorageModeTest.
For Datadog employees:
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
Unsure? Have a question? Request a review!