Fix DD_APM_TRACING_ENABLED to work with LLMObs#10989
Conversation
|
Hi @smola , just a gentle ping. |
|
Sorry for the late review :( I think the PR fixes the immediate issue correctly, but it introduces a structural technical debt. This is something we had in mind when we started with the standalone implementations, but we didn’t move forward at the time since no other products needed it. The current approach introduces a new sampler class per product combination:
The next standalone product will require yet another class and another branch. Suggested alternative (just an example) Each product has three properties, its enum StandaloneProduct {
ASM (ProductTraceSource.ASM, SamplingMechanism.APPSEC, true),
LLMOBS(ProductTraceSource.LLMOBS, SamplingMechanism.DEFAULT, false);
}A single StandaloneSampler receives the list of active products, iterates them to detect product-marked traces, and falls back to rate-limiting (or drop) depending on whether any active product requires the billing trace. forConfig becomes flat and open for extension: if (!config.isApmTracingEnabled()) {
List<StandaloneProduct> active = new ArrayList<>();
if (isAsmEnabled(config)) active.add(StandaloneProduct.ASM);
if (config.isLlmObsEnabled()) active.add(StandaloneProduct.LLMOBS);
// next product: one line here + one entry in the enum
return active.isEmpty()
? new ForcePrioritySampler(SAMPLER_DROP, DEFAULT)
: new StandaloneSampler(active, Clock.systemUTC());
}The same generalisation applies to the manual && per product in Maybe I’m missing some context since this is an older topic, but what do you think about the suggestion? Do you see it as viable? cc:@smola |
|
Thanks for the detailed feedback, @jandro996 — this makes a lot of sense. I initially kept the change minimal to address the immediate issue, but I agree that the current approach doesn’t scale well as more standalone products are added, and your suggestion of using a single Also, good point about applying the same generalization to I’m happy to refactor the implementation along these lines. |
|
Hi @jandro996, I've refactored the implementation along your suggestion:
cc: @smola |
Thanks a lot for your changes! It looks good to me, just added a few comments related with testing to improve the coverage |
Thanks @jandro996! Addressed all three points below — let me know if any of them missed the mark. |
cf6696e to
df1ae9f
Compare
|
Hi @jandro996, thanks for the LGTM! The branch is now up to date with the base branch and should be ready to merge. When you have a moment, could you approve the CI workflows? Thanks! cc: @smola |
df1ae9f to
7f5c7ef
Compare
Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Fixes DataDog#10051 When DD_APM_TRACING_ENABLED=false is set, APM tracing should be disabled while allowing other products like LLM Observability to function. Previously, setting DD_APM_TRACING_ENABLED=false would inadvertently disable LLMObs or allow APM traces to leak through when no other products were enabled. Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
When APM tracing is disabled but both LLMObs and ASM are enabled, the previous code returned LlmObsStandaloneSampler which never sent the 1 APM trace/minute required by ASM for billing and service catalog. Introduce LlmObsAndAsmStandaloneSampler that keeps all LLMObs and ASM traces while rate-limiting plain APM traces to 1 per minute. Also clarify the log message for the ASM-only standalone case. Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Replace the three separate sampler classes (AsmStandaloneSampler, LlmObsStandaloneSampler, LlmObsAndAsmStandaloneSampler) with a single StandaloneSampler that iterates over a list of active StandaloneProduct entries, making it trivially extensible to future products. Also add ProductTraceSource.isAnyStandaloneProductMarked() to simplify the TraceCollector force-keep bypass check. Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
- Replace individual SamplerTest cases with @unroll matrix that covers all product-flag combinations and asserts activeProducts contents directly; add package-private getActiveProducts() getter to StandaloneSampler to enable this - Add StandaloneSamplerTest case for spans with both LLMOBS and ASM bits set simultaneously, verifying LLMOBS wins via list ordering - Add TraceCollectorTest exercising the full span-finish → CoreTracer write path to verify that setSamplingPriorityIfNecessary skips the sampler when APM is disabled, a standalone product flag is set, and priority is already non-UNSET Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
7f5c7ef to
e9010e7
Compare
|
@matsumo-and — found an edge case in the LLMObs trace-source propagation, repro in #11249.
TraceSegment segment = AgentTracer.get().getTraceSegment();
if (segment != null) {
segment.setTagTop(Tags.PROPAGATED_TRACE_SOURCE, ProductTraceSource.LLMOBS);
}
#11249 adds a Fix is probably just to use the LLMObs span's own context — it's already a if (span.context() instanceof TraceSegment) {
((TraceSegment) span.context()).setTagTop(Tags.PROPAGATED_TRACE_SOURCE, ProductTraceSource.LLMOBS);
}
|
|
Good catch, thank you @FouadWahabi! I’ll address the root cause by updating the implementation to use the span context directly. |
Thanks @matsumo-and! Could you please add the additional test coverage to this PR? I'll go ahead and close the other one since it was only created to help clarify the issue. |
When an LLMObs span is created without an active surrounding APM AgentScope, DDLLMObsSpan calls AgentTracer.get().getTraceSegment() which reads activeSpan() — but LLMObsContext.attach does not activate an AgentScope, so the segment is null and the LLMOBS trace-source bit is never set on the local root. StandaloneSampler then drops the trace because it only checks the root's traceSource bitfield. Add a controller endpoint that creates the LLMObs span on a fresh thread (no inherited APM scope) and a smoke test that asserts the standalone LLMObs trace is kept and carries the LLMOBS bit (0x20) in _dd.p.ts. The test fails on top of the current PR and will pass once the bit is propagated regardless of active scope. Signed-off-by: matsumo-and <yh134.toisanda@gmail.com> Co-authored-by: FouadWahabi <FouadWahabi@users.noreply.github.com>
DDLLMObsSpan set PROPAGATED_TRACE_SOURCE via AgentTracer.get().getTraceSegment(),which reads activeSpan(). When LLMObs is used without a surrounding APM scope (CLI, batch, fresh thread), activeSpan() is null, so the LLMOBS bit never lands on the local root. StandaloneSampler then drops the trace. Fix by using span.context() directly — DDSpanContext implements TraceSegment and its setTagTop() routes through getRootSpanContextOrThis(), so the bit is set on the local root regardless of which scope is active. Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Thank you, @FouadWahabi! Pulled the smoke test from #11249 into this PR and applied the fix. Root cause: Fix: Use The standalone LLMObs span (no surrounding APM scope) should be kept and carry the LLMOBS trace-source bit test now passes. |
|
Thanks for the review! Looks like the PR has approvals from both ASM and LLMObs sides now, but CI hasn't been triggered yet. Could someone help update the branch / trigger workflows when convenient? |
|
Merged master to resolve conflicts. While investigating the smoke test failure on this branch, I found that a crash tracking change merged from master (43fc0e8 Enable sending crashtracking reports to errors intake by default) introduced a NullPointerException in JVMFlagAccess.setValue when getStringFlag returns null on platforms like Apple Silicon. This caused the Opened a separate PR to fix it: #11354 |
What Does This Do
This PR fixes two issues when using LLM Observability without APM tracing:
DD_APM_TRACING_ENABLED=falseproperly drop APM traces while keeping LLMObs tracesNullPointerExceptionwhenDD_TRACE_ENABLED=falseis set with LLMObs enabledMotivation
Solves #10051
Currently, when
DD_APM_TRACING_ENABLED=falseis set withDD_LLMOBS_ENABLED=true, APM traces are still sent to the agent because the sampler selection logic falls through to the default sampler, which keeps all traces by default.Additionally, setting
DD_TRACE_ENABLED=falsecauses NPE in LLMObs methods because LLMObs tries to initialize even when all tracing is disabled.Additional Notes
Implementation approach:
TraceSegment.setTagTop()to propagate the LLMOBS flag from child spans to the root span, following the exact same pattern used by ASM (AppSecEventTracker, GatewayBridge, IAST Reporter)LLMObsSystem.start()when!config.isTraceEnabled()Testing:
dd-smoke-tests/apm-tracing-disabled/Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueNote: Once your PR is ready to merge, add it to the merge queue by commenting
/merge./merge -ccancels the queue request./merge -f --reason "reason"skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.