Fix DD_APM_TRACING_ENABLED to work with LLMObs#10989

Open
matsumo-and wants to merge 11 commits into DataDog:master from matsumo-and:fix/LLMOBS-10051

@matsumo-and matsumo-and commented Mar 28, 2026

What Does This Do

This PR fixes two issues when using LLM Observability without APM tracing:

  1. Makes DD_APM_TRACING_ENABLED=false properly drop APM traces while keeping LLMObs traces
  2. Prevents NullPointerException when DD_TRACE_ENABLED=false is set with LLMObs enabled

Motivation

Solves #10051

Currently, when DD_APM_TRACING_ENABLED=false is set with DD_LLMOBS_ENABLED=true, APM traces are still sent to the agent because the sampler selection logic falls through to the default sampler, which keeps all traces by default.

Additionally, setting DD_TRACE_ENABLED=false causes an NPE in LLMObs methods, because LLMObs still tries to initialize even when all tracing is disabled.
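
A minimal sketch of the kind of guard that avoids this failure: skip start-up entirely when tracing is off. `TracerConfig` and `LlmObsBootstrap` are hypothetical stand-ins written for illustration, not dd-trace-java classes, and the real start-up code does considerably more.

```java
import java.util.Optional;

/** Hypothetical stand-in for the tracer configuration; not a dd-trace-java class. */
final class TracerConfig {
  private final boolean traceEnabled;
  private final boolean llmObsEnabled;

  TracerConfig(boolean traceEnabled, boolean llmObsEnabled) {
    this.traceEnabled = traceEnabled;
    this.llmObsEnabled = llmObsEnabled;
  }

  boolean isTraceEnabled() {
    return traceEnabled;
  }

  boolean isLlmObsEnabled() {
    return llmObsEnabled;
  }
}

/** Hypothetical bootstrap that illustrates the early return. */
final class LlmObsBootstrap {
  private LlmObsBootstrap() {}

  /** Returns a marker when start-up ran, or empty when it was skipped. */
  static Optional<String> start(TracerConfig config) {
    if (!config.isTraceEnabled()) {
      // No tracer exists to attach to, so return before touching tracer-backed state.
      return Optional.empty();
    }
    if (!config.isLlmObsEnabled()) {
      return Optional.empty();
    }
    return Optional.of("llmobs-started");
  }
}
```

With tracing disabled, `LlmObsBootstrap.start(new TracerConfig(false, true))` returns an empty `Optional` instead of reaching code that would dereference a missing tracer.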

Additional Notes

Implementation approach:

  1. ProductTraceSource.LLMOBS flag (0x20): Added following the existing pattern for ASM/DSM/DBM products
  2. StandaloneProduct enum: encapsulates per-product behavior (trace source bit, sampling mechanism, whether a billing trace is needed). Currently LLMOBS and ASM.
  3. StandaloneSampler: unified sampler for when APM tracing is disabled but one or more standalone products are active.
  4. DDLLMObsSpan marking: Uses TraceSegment.setTagTop() to propagate the LLMOBS flag from child spans to the root span, following the exact same pattern used by ASM (AppSecEventTracker, GatewayBridge, IAST Reporter)
  5. NPE prevention: Added early return in LLMObsSystem.start() when !config.isTraceEnabled()
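
To make items 1 and 4 concrete, here is a small sketch of how a product bit can be set on, and read back from, a trace-source bitfield. Only the LLMOBS value (0x20) comes from this PR; the ASM value, the class name `TraceSourceBits`, and the two-digit hex rendering are assumptions made for illustration.

```java
/** Illustrative trace-source bit flags; not the real ProductTraceSource class. */
final class TraceSourceBits {
  static final int ASM = 0x02; // assumed value, for illustration only
  static final int LLMOBS = 0x20; // value stated in the PR description

  private TraceSourceBits() {}

  /** Returns the bitfield with the given product bit switched on. */
  static int mark(int bitfield, int productBit) {
    return bitfield | productBit;
  }

  /** True when the given product bit is present in the bitfield. */
  static boolean isMarked(int bitfield, int productBit) {
    return (bitfield & productBit) != 0;
  }

  /** Renders the bitfield as lowercase hex, zero-padded to two digits (assumed format). */
  static String toTagValue(int bitfield) {
    return String.format("%02x", bitfield);
  }
}
```

Because marking is a bitwise OR, several products can claim the same trace without overwriting each other: `mark(mark(0, ASM), LLMOBS)` leaves both bits set.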

Testing:

  • Added smoke tests in dd-smoke-tests/apm-tracing-disabled/
  • Verified on master: LLMObs works but APM traces not dropped
  • Verified on fix branch: LLMObs works and APM traces properly dropped

Contributor Checklist

Note: Once your PR is ready to merge, add it to the merge queue by commenting /merge. /merge -c cancels the queue request. /merge -f --reason "reason" skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.

@matsumo-and matsumo-and marked this pull request as ready for review March 28, 2026 10:13
@matsumo-and matsumo-and requested review from a team as code owners March 28, 2026 10:13
@bric3 bric3 requested a review from smola April 2, 2026 08:36
@matsumo-and
Author

Hi @smola, just a gentle ping.
I'd appreciate your review when you have time.
Thanks!

Comment thread dd-trace-core/src/main/java/datadog/trace/common/sampling/Sampler.java Outdated
Comment thread dd-trace-core/src/main/java/datadog/trace/common/sampling/Sampler.java Outdated
@matsumo-and matsumo-and requested a review from smola April 8, 2026 16:24
@jandro996
Member

jandro996 commented Apr 16, 2026

Sorry for the late review :(

I think the PR fixes the immediate issue correctly, but it introduces structural technical debt. This is something we had in mind when we started with the standalone implementations, but we didn’t move forward at the time since no other products needed it.

The current approach introduces a new sampler class per product combination:

  • APM disabled + ASM → AsmStandaloneSampler
  • APM disabled + LLMObs → LlmObsStandaloneSampler
  • APM disabled + ASM + LLMObs → LlmObsAndAsmStandaloneSampler

The next standalone product will require yet another class and another branch.

Suggested alternative (just an example)

Each product has three properties: its ProductTraceSource bit, the SamplingMechanism to use when keeping, and whether it needs the 1-per-minute billing trace:

enum StandaloneProduct {
  ASM(ProductTraceSource.ASM, SamplingMechanism.APPSEC, true),
  LLMOBS(ProductTraceSource.LLMOBS, SamplingMechanism.DEFAULT, false);
  // fields and constructor omitted
}

A single StandaloneSampler receives the list of active products, iterates them to detect product-marked traces, and falls back to rate-limiting (or drop) depending on whether any active product requires the billing trace. forConfig becomes flat and open for extension:

if (!config.isApmTracingEnabled()) {
  List<StandaloneProduct> active = new ArrayList<>();
  if (isAsmEnabled(config)) active.add(StandaloneProduct.ASM);
  if (config.isLlmObsEnabled()) active.add(StandaloneProduct.LLMOBS);
  // next product: one line here + one entry in the enum

  return active.isEmpty()
      ? new ForcePrioritySampler(SAMPLER_DROP, DEFAULT)
      : new StandaloneSampler(active, Clock.systemUTC());
}
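
The decision logic described above (keep product-marked traces, otherwise rate-limit or drop) might look roughly like the following self-contained sketch. `Product`, `Decision`, and `StandaloneSamplerSketch` are hypothetical names, the ASM bit value is assumed, and the sampling mechanism is left out; this is not the sampler from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical product descriptor; only the LLMOBS bit (0x20) is taken from this PR. */
enum Product {
  ASM(0x02, true), // assumed bit value
  LLMOBS(0x20, false);

  final int traceSourceBit;
  final boolean needsBillingTrace;

  Product(int traceSourceBit, boolean needsBillingTrace) {
    this.traceSourceBit = traceSourceBit;
    this.needsBillingTrace = needsBillingTrace;
  }
}

enum Decision { KEEP, DROP }

/** Sketch of a sampler for "APM disabled, standalone products active". */
final class StandaloneSamplerSketch {
  private static final long WINDOW_MILLIS = 60_000L; // one billing trace per minute

  private final List<Product> activeProducts;
  private final boolean billingTraceNeeded;
  private long nextBillingKeepMillis = Long.MIN_VALUE;

  StandaloneSamplerSketch(List<Product> activeProducts) {
    this.activeProducts = Collections.unmodifiableList(new ArrayList<>(activeProducts));
    boolean needed = false;
    for (Product product : activeProducts) {
      needed |= product.needsBillingTrace;
    }
    this.billingTraceNeeded = needed;
  }

  /** Decides on a trace given its trace-source bitfield and the current time. */
  synchronized Decision sample(int traceSource, long nowMillis) {
    // 1. Any trace claimed by an active product is kept.
    for (Product product : activeProducts) {
      if ((traceSource & product.traceSourceBit) != 0) {
        return Decision.KEEP;
      }
    }
    // 2. Unclaimed traces: keep at most one per window if a product needs a billing trace.
    if (billingTraceNeeded && nowMillis >= nextBillingKeepMillis) {
      nextBillingKeepMillis = nowMillis + WINDOW_MILLIS;
      return Decision.KEEP;
    }
    // 3. Otherwise drop.
    return Decision.DROP;
  }
}
```

With only `Product.LLMOBS` active, an unmarked trace is always dropped; adding `Product.ASM` lets one unmarked trace through per minute. Passing the time in as a parameter keeps the sketch deterministic under test.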

The same generalisation applies to the manual per-product && checks in TraceCollector.setSamplingPriorityIfNecessary; a helper like ProductTraceSource.isAnyStandaloneProductMarked(traceSource) would consolidate them.

Maybe I’m missing some context since this is an older topic, but what do you think about the suggestion? Do you see it as viable?

cc:@smola

@matsumo-and
Author

Thanks for the detailed feedback, @jandro996 — this makes a lot of sense.

I initially kept the change minimal to address the immediate issue, but I agree that the current approach doesn’t scale well as more standalone products are added, and your suggestion of using a single StandaloneSampler with a product-based configuration looks much cleaner and easier to extend.

Also, good point about applying the same generalization to TraceCollector. Consolidating the per-product checks into a helper like ProductTraceSource.isAnyStandaloneProductMarked would definitely improve maintainability and avoid duplicating logic.

I’m happy to refactor the implementation along these lines.

@matsumo-and
Author

matsumo-and commented Apr 16, 2026

Hi @jandro996, I've refactored the implementation along the lines you suggested:

  • Replaced the three separate sampler classes with a single StandaloneSampler that takes a List<StandaloneProduct>
  • Added a StandaloneProduct enum holding each product's trace source bit, sampling mechanism, and billing trace flag
  • Added ProductTraceSource.isAnyStandaloneProductMarked() to simplify the check in TraceCollector
  • Sampler.Builder.forConfig() is now flat and open for extension — adding a new product is just one line in the enum and one line in the builder
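
The helper from the third bullet can be sketched as a single mask test. The bypass condition around it is my reading of the commit message for this PR, not a copy of TraceCollector; `SamplingBypassSketch`, the ASM bit value, and the `PRIORITY_UNSET` sentinel are assumptions for illustration.

```java
/** Sketch of a consolidated "is any standalone product marked" check. */
final class SamplingBypassSketch {
  static final int ASM = 0x02; // assumed value
  static final int LLMOBS = 0x20; // value stated in this PR
  static final int STANDALONE_MASK = ASM | LLMOBS;
  static final int PRIORITY_UNSET = Integer.MIN_VALUE; // placeholder sentinel

  private SamplingBypassSketch() {}

  /** One mask test replaces a chain of per-product checks. */
  static boolean isAnyStandaloneProductMarked(int traceSource) {
    return (traceSource & STANDALONE_MASK) != 0;
  }

  /** Whether the sampler should (re)decide the priority for this trace. */
  static boolean shouldRunSampler(boolean apmTracingEnabled, int traceSource, int priority) {
    if (priority == PRIORITY_UNSET) {
      return true; // nothing has been decided yet
    }
    // A priority already exists. With APM disabled, re-sample unless a standalone
    // product has claimed the trace; with APM enabled, keep the existing decision (assumed).
    return !apmTracingEnabled && !isAnyStandaloneProductMarked(traceSource);
  }
}
```

Adding a product then means extending `STANDALONE_MASK` by one bit rather than adding another `&&` clause at each call site.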

6071b27

cc: @smola

Comment thread dd-trace-core/src/main/java/datadog/trace/core/TraceCollector.java
@jandro996
Member

Thanks a lot for your changes! It looks good to me. I just added a few comments related to testing, to improve the coverage.

@matsumo-and
Author

matsumo-and commented Apr 21, 2026

Thanks @jandro996! Addressed all three points below — let me know if any of them missed the mark.

cf6696e

@jandro996 jandro996 added comp: asm waf Application Security Management (WAF) comp: asm iast Application Security Management (IAST) type: enhancement Enhancements and improvements labels Apr 24, 2026
Member

@jandro996 jandro996 left a comment

LGTM!

@matsumo-and
Author

matsumo-and commented Apr 26, 2026

Hi @jandro996, thanks for the LGTM!
I’ve rebased this PR onto the latest master and amended all commits to include Signed-off-by.

The branch is now up to date with the base branch and should be ready to merge. When you have a moment, could you approve the CI workflows?

Thanks!

cc: @smola

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Fixes DataDog#10051

When DD_APM_TRACING_ENABLED=false is set, APM tracing should be disabled
while allowing other products like LLM Observability to function. Previously,
setting DD_APM_TRACING_ENABLED=false would inadvertently disable LLMObs or
allow APM traces to leak through when no other products were enabled.

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
When APM tracing is disabled but both LLMObs and ASM are enabled,
the previous code returned LlmObsStandaloneSampler which never sent
the 1 APM trace/minute required by ASM for billing and service catalog.

Introduce LlmObsAndAsmStandaloneSampler that keeps all LLMObs and
ASM traces while rate-limiting plain APM traces to 1 per minute.

Also clarify the log message for the ASM-only standalone case.

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Replace the three separate sampler classes (AsmStandaloneSampler,
LlmObsStandaloneSampler, LlmObsAndAsmStandaloneSampler) with a single
StandaloneSampler that iterates over a list of active StandaloneProduct
entries, making it trivially extensible to future products.

Also add ProductTraceSource.isAnyStandaloneProductMarked() to simplify
the TraceCollector force-keep bypass check.

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
- Replace individual SamplerTest cases with an @Unroll matrix that covers all
product-flag combinations and asserts activeProducts contents directly;
add package-private getActiveProducts() getter to StandaloneSampler to
enable this
- Add StandaloneSamplerTest case for spans with both LLMOBS and ASM bits set
simultaneously, verifying LLMOBS wins via list ordering
- Add TraceCollectorTest exercising the full span-finish → CoreTracer write
path to verify that setSamplingPriorityIfNecessary skips the sampler when
APM is disabled, a standalone product flag is set, and priority is already
non-UNSET

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
@FouadWahabi

@matsumo-and — found an edge case in the LLMObs trace-source propagation, repro in #11249.

DDLLMObsSpan marks the local root via:

TraceSegment segment = AgentTracer.get().getTraceSegment();
if (segment != null) {
  segment.setTagTop(Tags.PROPAGATED_TRACE_SOURCE, ProductTraceSource.LLMOBS);
}

getTraceSegment() resolves through activeSpan() (CoreTracer.java:1505-1515), but LLMObsContext.attach(span.context()) doesn't activate an AgentScope — it just writes to its own ContextKey. So as soon as LLMObs is used without a surrounding APM scope (CLI, batch worker, fresh thread, anything before an HTTP/servlet handler runs), activeSpan() is null, segment is null, and the LLMOBS bit never lands on the local root. StandaloneSampler then sees an empty traceSource bitfield and drops the trace.

LlmObsApmDisabledSmokeTest doesn't catch it because every call there runs inside a Spring servlet handler, so the request span is both the active scope and the local root.

#11249 adds a /rest-api/llmobs/standalone endpoint that runs LLMObs.startLLMSpan(...) on a fresh thread and asserts the trace exists, has _dd.p.ts=20, and is SAMPLER_KEEP. On top of this branch it fails — SAMPLER_DROP, no _dd.p.ts.

Fix is probably just to use the LLMObs span's own context — it's already a TraceSegment via DDSpanContext:

if (span.context() instanceof TraceSegment) {
  ((TraceSegment) span.context()).setTagTop(Tags.PROPAGATED_TRACE_SOURCE, ProductTraceSource.LLMOBS);
}

DDSpanContext.setTagTop already routes through getRootSpanContextOrThis() (DDSpanContext.java:1363) so it marks the local root regardless of active scope.
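
The difference between the two lookups can be reproduced with a toy model. `ToySpan` and `ToyTracer` are hypothetical classes written only for this illustration; they imitate a thread-local active span and a `setTagTop` that walks to the local root, and do not use any dd-trace-java API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy span that models only the local-root link and a trace-source bitfield. */
final class ToySpan {
  private final ToySpan parent;
  private final AtomicInteger traceSource = new AtomicInteger();

  ToySpan(ToySpan parent) {
    this.parent = parent;
  }

  ToySpan localRoot() {
    ToySpan span = this;
    while (span.parent != null) {
      span = span.parent;
    }
    return span;
  }

  /** Marks the local root, regardless of which span in the trace it is called on. */
  void setTagTop(int bit) {
    localRoot().traceSource.getAndUpdate(value -> value | bit);
  }

  int rootTraceSource() {
    return localRoot().traceSource.get();
  }
}

/** Toy tracer with a thread-local "active span", as scope-based lookups use. */
final class ToyTracer {
  static final int LLMOBS = 0x20;
  private static final ThreadLocal<ToySpan> ACTIVE = new ThreadLocal<>();

  private ToyTracer() {}

  static void activate(ToySpan span) {
    ACTIVE.set(span);
  }

  static void deactivate() {
    ACTIVE.remove();
  }

  /** Scope-based marking: silently does nothing when no span is active on this thread. */
  static void markViaActiveScope(int bit) {
    ToySpan active = ACTIVE.get();
    if (active != null) {
      active.setTagTop(bit);
    }
  }
}
```

On a thread where nothing was activated, `ToyTracer.markViaActiveScope(ToyTracer.LLMOBS)` leaves the root untouched, while calling `setTagTop` on the span itself always reaches the local root. That mirrors the failure mode and the proposed fix described above.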

@matsumo-and
Author

Good catch, thank you @FouadWahabi!

I’ll address the root cause by updating the implementation to use the span context directly.
Happy to leave the additional test coverage in your PR.

@FouadWahabi

Thanks @matsumo-and! Could you please add the additional test coverage to this PR? I'll go ahead and close the other one since it was only created to help clarify the issue.

matsumo-and and others added 2 commits May 2, 2026 02:58
When an LLMObs span is created without an active surrounding APM
AgentScope, DDLLMObsSpan calls AgentTracer.get().getTraceSegment()
which reads activeSpan() — but LLMObsContext.attach does not activate
an AgentScope, so the segment is null and the LLMOBS trace-source bit
is never set on the local root. StandaloneSampler then drops the trace
because it only checks the root's traceSource bitfield.

Add a controller endpoint that creates the LLMObs span on a fresh
thread (no inherited APM scope) and a smoke test that asserts the
standalone LLMObs trace is kept and carries the LLMOBS bit (0x20) in
_dd.p.ts. The test fails on top of the current PR and will pass once
the bit is propagated regardless of active scope.

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
Co-authored-by: FouadWahabi <FouadWahabi@users.noreply.github.com>
DDLLMObsSpan set PROPAGATED_TRACE_SOURCE via AgentTracer.get().getTraceSegment(), which reads activeSpan().
When LLMObs is used without a surrounding APM scope (CLI, batch, fresh thread), activeSpan() is null,
so the LLMOBS bit never lands on the local root. StandaloneSampler then drops the trace.

Fix by using span.context() directly — DDSpanContext implements TraceSegment and its setTagTop() routes through getRootSpanContextOrThis(),
so the bit is set on the local root regardless of which scope is active.

Signed-off-by: matsumo-and <yh134.toisanda@gmail.com>
@matsumo-and
Author

matsumo-and commented May 1, 2026

Thank you, @FouadWahabi! Pulled the smoke test from #11249 into this PR and applied the fix.

Root cause: DDLLMObsSpan was setting PROPAGATED_TRACE_SOURCE via AgentTracer.get().getTraceSegment(), which resolves through activeSpan(). With no surrounding APM scope, activeSpan() is null, so the LLMOBS bit never reached the local root and StandaloneSampler dropped the trace.

Fix: Use span.context() directly — DDSpanContext implements TraceSegment and its setTagTop() routes through getRootSpanContextOrThis(), so the bit lands on the local root regardless of active scope.

The test "standalone LLMObs span (no surrounding APM scope) should be kept and carry the LLMOBS trace-source bit" now passes.

@FouadWahabi FouadWahabi left a comment

LGTM from ml-observability side

@matsumo-and
Author

Thanks for the review!

Looks like the PR has approvals from both ASM and LLMObs sides now, but CI hasn't been triggered yet.

Could someone help update the branch / trigger workflows when convenient?

@matsumo-and
Author

matsumo-and commented May 12, 2026

Merged master to resolve conflicts.

While investigating the smoke test failure on this branch, I found that a crash tracking change merged from master (43fc0e8 Enable sending crashtracking reports to errors intake by default) introduced a NullPointerException in JVMFlagAccess.setValue when getStringFlag returns null on platforms like Apple Silicon.

This caused the !isLogPresent { it.contains("NullPointerException") } assertion to fail in LlmObsApmDisabledSmokeTest.groovy.

Opened a separate PR to fix it: #11354

Labels

comp: asm iast Application Security Management (IAST) comp: asm waf Application Security Management (WAF) type: enhancement Enhancements and improvements

6 participants