
Conversation

@nora-shap
Member

Problem

The LLM issue detection task was fetching full span data for every trace in Sentry, then sending bits of that telemetry to Seer in individual requests. We want to use EAPTrace instead, which includes much more data in a format better optimized for LLM analysis. This requires a significant restructuring of the request/response formats between this task and its Seer endpoint.

There was also a small bug in how we were selecting traces for each transaction; cleared that up and introduced a bit of variation to the trace selection logic.

Solution

Changed the request/response flow so Sentry sends only trace IDs to Seer in a single bundled request. Now, Seer fetches the full EAPTrace data itself via Sentry's existing get_trace_waterfall RPC endpoint and uses that as the input for LLM detection.

Changes to Sentry → Seer Request

Before:

  • Sentry sent truncated trace telemetry
  • Multiple fields: trace_id, project_id, transaction_name, total_spans, spans: list[Span]
  • Sent one trace at a time

After:

  • Sentry sends only trace metadata: trace_id and normalized transaction_name
  • Sends up to 50 traces in a single request
  • Seer fetches full EAPTrace data via RPC
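A rough sketch of the new bundled request. Only the `trace_id`/`transaction_name` fields and the 50-trace cap come from the description above; the constant name, helper name, and inclusion of org/project IDs are illustrative guesses:

```python
import json

MAX_TRACES_PER_REQUEST = 50  # "up to 50 traces in a single request"


def build_seer_request(organization_id: int, project_id: int, traces: list[dict]) -> dict:
    """Bundle only trace metadata; Seer fetches the full EAPTrace data via RPC."""
    return {
        "organization_id": organization_id,
        "project_id": project_id,
        "traces": [
            {"trace_id": t["trace_id"], "transaction_name": t["transaction_name"]}
            for t in traces[:MAX_TRACES_PER_REQUEST]
        ],
    }


body = json.dumps(
    build_seer_request(1, 2, [{"trace_id": "abc123", "transaction_name": "GET /api/0/projects/"}])
)
```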

Changes to Seer → Sentry Response

Updated DetectedIssue model to include context fields:

  • Added trace_id: str - which trace the issue was found in
  • Added transaction_name: str - normalized transaction name
  • These are pass-through fields Seer must return from the request
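As a stand-in for the updated shape (the real model is Pydantic; `title` here is a hypothetical example of the existing payload, and only the two pass-through fields come from the list above):

```python
from dataclasses import dataclass


@dataclass
class DetectedIssue:
    # Existing issue payload fields are elided; `title` is purely illustrative.
    title: str
    # New pass-through context fields Seer must echo back from the request:
    trace_id: str          # which trace the issue was found in
    transaction_name: str  # normalized transaction name


issue = DetectedIssue(
    title="N+1 query",
    trace_id="abc123",
    transaction_name="GET /api/0/projects/",
)
```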

Trace Selection Logic

  • Query top transactions by sum(span.duration) over 30-minute window
  • Deduplicate by normalized transaction name
  • For each unique transaction, select one representative trace using a randomized time sub-window (1-8 minute offset)

Breaking Changes

This is a breaking change to the Seer integration. Deployment requires:

  1. Stop the task (issue-detection.llm-detection.enabled = false)
  2. Deploy Seer changes to handle new request format and fetch traces via RPC
  3. Deploy this Sentry change
  4. Re-enable the task

This will not impact any customers.

@linear

linear bot commented Dec 5, 2025

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 5, 2025

NUM_TRANSACTIONS_TO_PROCESS = 20
LOWER_SPAN_LIMIT = 20
UPPER_SPAN_LIMIT = 500
Member Author


These will be handled on the Seer side.

class EvidenceTraceData(BaseModel):  # hate this name
    trace_id: str
    project_id: int
    transaction_name: str
Member

@roggenkemper commented Dec 5, 2025

do we need the transaction name in addition to the trace_id when fetching the EAPTrace? or is this just so we still have access to the transaction name for our own things?

Member Author

great q - transaction name is now just context data that we pass to Seer, and Seer passes back in the detected issue, because we need it to create the issue.
The EAPTrace fetch only needs trace_id + org_id.

@nora-shap nora-shap force-pushed the nora/ID-1121 branch 2 times, most recently from 52a714f to d0ece3c Compare December 5, 2025 23:09
@nora-shap nora-shap marked this pull request as ready for review December 5, 2025 23:18
@nora-shap nora-shap requested review from a team as code owners December 5, 2025 23:18
organization_id=organization_id,
response_data=response.data.decode("utf-8"),
error_message=str(e),
)
Contributor

Bug: Missing pydantic.ValidationError in exception handler

The exception handler catches (ValueError, TypeError) but IssueDetectionResponse.parse_obj() raises pydantic.ValidationError when the Seer response doesn't match the expected schema. Since the DetectedIssue model now requires trace_id and transaction_name fields that Seer must pass back, if Seer fails to return these fields or returns them with incorrect types, the pydantic.ValidationError will propagate uncaught instead of being wrapped in LLMIssueDetectionError. The codebase correctly catches pydantic.ValidationError elsewhere when using parse_obj.
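The guard this comment asks for, sketched with stand-ins (the real exception is `pydantic.ValidationError`, and `LLMIssueDetectionError` and the parse function here are simplified placeholders for the task's actual types):

```python
class ValidationError(Exception):
    """Stand-in for pydantic.ValidationError."""


class LLMIssueDetectionError(Exception):
    """Stand-in for the task's wrapper error."""


def parse_response(data: dict) -> list[dict]:
    # Stand-in for IssueDetectionResponse.parse_obj(data).
    if "issues" not in data:
        raise ValidationError("field required: issues")
    return data["issues"]


def parse_seer_response(data: dict) -> list[dict]:
    try:
        return parse_response(data)
    except (ValueError, TypeError, ValidationError) as e:
        # Wrap schema mismatches instead of letting them propagate uncaught.
        raise LLMIssueDetectionError(f"Seer response parsing error: {e}") from e
```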


Comment on lines 224 to 232
if not has_access:
    return

This comment was marked as outdated.

if processed_count >= NUM_TRANSACTIONS_TO_PROCESS:
    break
seer_request = {
    "telemetry": [{**trace.dict(), "kind": "trace"} for trace in evidence_traces],
Member

feels like we could use better variable names here since it's just the id/name instead of an actual trace now

Member Author

agree - cleaned it up on the seer side, updating this pr to match

@codecov

codecov bot commented Dec 5, 2025

❌ 1 Tests Failed:

Tests completed: 30285 | Failed: 1 | Passed: 30284 | Skipped: 240
View the top 1 failed test(s) by shortest run time
tests.sentry.tasks.test_llm_issue_detection.LLMIssueDetectionTest::test_detect_llm_issues_full_flow
Stack Traces | 2.25s run time
.../sentry/tasks/test_llm_issue_detection.py:234: in test_detect_llm_issues_full_flow
    detect_llm_issues_for_project(self.project.id)
.../sentry/taskworker/task.py:89: in __call__
    return self._func(*args, **kwargs)
.../tasks/llm_issue_detection/detection.py:252: in detect_llm_issues_for_project
    body=json.dumps(seer_request).encode("utf-8"),
.../sentry/utils/json.py:112: in dumps
    return _default_encoder.encode(value)
.venv/lib/python3.13....../site-packages/simplejson/encoder.py:296: in encode
    chunks = self.iterencode(o, _one_shot=True)
.venv/lib/python3.13....../site-packages/simplejson/encoder.py:378: in iterencode
    return _iterencode(o, 0)
.../sentry/utils/json.py:62: in better_default_encoder
    raise TypeError(repr(o) + " is not JSON serializable")
E   TypeError: IssueDetectionRequest(traces=[TraceMetadata(trace_id='trace_id_1', transaction_name='POST /some/thing'), TraceMetadata(trace_id='trace_id_2', transaction_name='GET /another/')], organization_id=4557250140504064, project_id=4557250140504064) is not JSON serializable

To view more test analytics, go to the Test Analytics Dashboard

Comment on lines +267 to +269
except (ValueError, TypeError) as e:
    raise LLMIssueDetectionError(
        message="Seer response parsing error",

Bug: Batch-level LLMIssueDetectionError is uncaught, leading to task crashes and loss of entire batches.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

The code will crash unexpectedly with an uncaught LLMIssueDetectionError if Seer returns a non-2xx HTTP response (e.g., network failure, server error, rate limiting) or if Seer's response cannot be parsed as a valid JSON/Pydantic model. This causes the entire task to fail, marked as failed in the task worker, without retries, leading to the loss of the entire batch of potential issues. The previous code explicitly handled these errors, but the new batch processing logic removes this critical error handling.

💡 Suggested Fix

Add a try/except block around the batch Seer request and response parsing to catch LLMIssueDetectionError and handle it gracefully, similar to the old code's per-trace error handling.
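That suggestion could look roughly like the following; every name here is a stand-in for the task's real helpers, and this is just one way to drop a failed batch without killing the task:

```python
import logging

logger = logging.getLogger(__name__)


class LLMIssueDetectionError(Exception):
    """Stand-in for the task's wrapper error."""


def detect_batch(send_batch) -> list:
    """Run one Seer batch; log and swallow failures so a single bad
    batch (non-2xx response, unparseable body) doesn't crash the task."""
    try:
        return send_batch()
    except LLMIssueDetectionError:
        logger.exception("llm_issue_detection.seer_request_failed")
        return []
```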

Location: src/sentry/tasks/llm_issue_detection/detection.py#L267-L269

response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request).encode("utf-8"),
Contributor

Bug: Pydantic model not serializable with json.dumps

json.dumps(seer_request) will raise a TypeError because seer_request is a Pydantic BaseModel (IssueDetectionRequest), which cannot be directly serialized by simplejson. The codebase uses Pydantic v1 which requires calling .dict() on models before JSON serialization, as demonstrated elsewhere in the codebase (e.g., seer/explorer/index_data.py line 537). The call needs to use json.dumps(seer_request.dict()) instead.
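The failure and fix can be illustrated with a dataclass stand-in (the real models are Pydantic v1, where the equivalent conversion is `seer_request.dict()`; `asdict` plays that role here):

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceMetadata:
    trace_id: str
    transaction_name: str


@dataclass
class IssueDetectionRequest:
    organization_id: int
    project_id: int
    traces: list[TraceMetadata] = field(default_factory=list)


req = IssueDetectionRequest(1, 2, [TraceMetadata("trace_id_1", "POST /some/thing")])
# json.dumps(req) would raise TypeError: ... is not JSON serializable.
# Convert to plain dicts first (Pydantic v1: req.dict(); dataclasses: asdict(req)):
body = json.dumps(asdict(req))
```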

