ref(llm-detection): Refactor Seer integration to fetch traces via RPC #104485
base: master
Conversation
```python
NUM_TRANSACTIONS_TO_PROCESS = 20
LOWER_SPAN_LIMIT = 20
UPPER_SPAN_LIMIT = 500
```
these will be handled on the seer side
Force-pushed c36bdc3 to f36abaf
```python
class EvidenceTraceData(BaseModel):  # hate this name
    trace_id: str
    project_id: int
    transaction_name: str
```
do we need the transaction name in addition to the trace_id when fetching the EAPTrace? or is this just so we still have access to the transaction name for our own things
great q - transaction name is now just context data that we pass to seer, and seer passes back in the detected issue, because we need it to create the issue.
the EAPTrace only needs trace_id + org_id
Force-pushed 52a714f to d0ece3c
```python
organization_id=organization_id,
response_data=response.data.decode("utf-8"),
error_message=str(e),
)
```
Bug: Missing pydantic.ValidationError in exception handler
The exception handler catches `(ValueError, TypeError)`, but `IssueDetectionResponse.parse_obj()` raises `pydantic.ValidationError` when the Seer response doesn't match the expected schema. Since the `DetectedIssue` model now requires `trace_id` and `transaction_name` fields that Seer must pass back, if Seer fails to return these fields or returns them with incorrect types, the `pydantic.ValidationError` will propagate uncaught instead of being wrapped in `LLMIssueDetectionError`. The codebase correctly catches `pydantic.ValidationError` elsewhere when using `parse_obj`.
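A minimal sketch of the suggested handling, using a simplified stand-in model (the real `IssueDetectionResponse`/`DetectedIssue` models live in the Sentry codebase). Listing `ValidationError` explicitly in the except tuple makes the intent clear regardless of which exception bases the installed pydantic version uses:

```python
import json

from pydantic import BaseModel, ValidationError


class DetectedIssue(BaseModel):  # simplified stand-in for the real model
    trace_id: str
    transaction_name: str


class LLMIssueDetectionError(Exception):
    pass


def parse_seer_response(raw: str) -> DetectedIssue:
    try:
        return DetectedIssue.parse_obj(json.loads(raw))
    except (ValueError, TypeError, ValidationError) as e:
        # Schema mismatches (missing fields, wrong types) are wrapped in the
        # task's own error type instead of propagating uncaught.
        raise LLMIssueDetectionError(f"Seer response parsing error: {e}") from e
```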
```python
if processed_count >= NUM_TRANSACTIONS_TO_PROCESS:
    break
seer_request = {
    "telemetry": [{**trace.dict(), "kind": "trace"} for trace in evidence_traces],
```
feels like we could use better variable names here since it's just the id/name instead of an actual trace now
agree - cleaned it up on the seer side, updating this pr to match
❌ 1 Test Failed
Force-pushed d0ece3c to ae4fc8a
```python
except (ValueError, TypeError) as e:
    raise LLMIssueDetectionError(
        message="Seer response parsing error",
```
Bug: Batch-level `LLMIssueDetectionError` is uncaught, leading to task crashes and loss of entire batches.
Severity: CRITICAL | Confidence: High
🔍 Detailed Analysis
The code will crash unexpectedly with an uncaught `LLMIssueDetectionError` if Seer returns a non-2xx HTTP response (e.g., network failure, server error, rate limiting) or if Seer's response cannot be parsed as a valid JSON/Pydantic model. This causes the entire task to fail, marked as failed in the task worker, without retries, leading to the loss of the entire batch of potential issues. The previous code explicitly handled these errors, but the new batch processing logic removes this critical error handling.
💡 Suggested Fix
Add a try/except block around the batch Seer request and response parsing to catch LLMIssueDetectionError and handle it gracefully, similar to the old code's per-trace error handling.
🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/tasks/llm_issue_detection/detection.py#L267-L269
Potential issue: The code will crash unexpectedly with an uncaught
`LLMIssueDetectionError` if Seer returns a non-2xx HTTP response (e.g., network failure,
server error, rate limiting) or if Seer's response cannot be parsed as a valid
JSON/Pydantic model. This causes the entire task to fail, marked as failed in the task
worker, without retries, leading to the loss of the entire batch of potential issues.
The previous code explicitly handled these errors, but the new batch processing logic
removes this critical error handling.
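One way to apply the suggested fix is to wrap each batch's Seer round trip so a single failure is logged and skipped rather than crashing the whole task. This is a sketch under assumptions: `send_to_seer` stands in for the signed Seer request plus response parsing, which is assumed to raise `LLMIssueDetectionError` on a non-2xx response or parse failure:

```python
import logging

logger = logging.getLogger(__name__)


class LLMIssueDetectionError(Exception):
    pass


def process_batches(batches, send_to_seer):
    # send_to_seer is a hypothetical stand-in for make_signed_seer_api_request
    # plus response parsing; it raises LLMIssueDetectionError on failure.
    detected = []
    for batch in batches:
        try:
            detected.extend(send_to_seer(batch))
        except LLMIssueDetectionError:
            # Log and continue so one bad batch doesn't fail the whole task.
            logger.exception("llm_issue_detection.seer_request_failed")
    return detected
```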
```python
response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request).encode("utf-8"),
```
Bug: Pydantic model not serializable with `json.dumps`
`json.dumps(seer_request)` will raise a `TypeError` because `seer_request` is a Pydantic `BaseModel` (`IssueDetectionRequest`), which cannot be directly serialized by simplejson. The codebase uses Pydantic v1, which requires calling `.dict()` on models before JSON serialization, as demonstrated elsewhere in the codebase (e.g., `seer/explorer/index_data.py` line 537). The call needs to use `json.dumps(seer_request.dict())` instead.
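The failure mode and fix can be reproduced in isolation. The model below is a simplified stand-in (the real `IssueDetectionRequest` has more fields); the point is that `json.dumps` cannot serialize a `BaseModel` directly, while serializing the plain dict from `.dict()` works:

```python
import json
from typing import List

from pydantic import BaseModel


class IssueDetectionRequest(BaseModel):  # simplified stand-in for the real model
    organization_id: int
    trace_ids: List[str]


req = IssueDetectionRequest(organization_id=1, trace_ids=["abc"])

try:
    body = json.dumps(req)  # raises TypeError: BaseModel is not JSON-serializable
except TypeError:
    # Pydantic v1 style: convert to a plain dict first, then serialize.
    body = json.dumps(req.dict())
```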
Problem
The LLM issue detection task was fetching full span data for every trace in Sentry, then sending bits of that telemetry to Seer in individual requests. We want to use `EAPTrace` instead, which includes much more data in a format better optimized for LLM analysis. This requires a significant restructuring of the request/response formats between this task and its Seer endpoint.
There was also a small bug in how we were selecting traces for each transaction - cleared that up and introduced a tiny bit of variation to trace selection logic.
Solution
Changed the request/response flow so Sentry sends only trace IDs to Seer in a single bundled request. Seer now fetches the full `EAPTrace` data itself via Sentry's existing `get_trace_waterfall` RPC endpoint and uses that as the input for LLM detection.

Changes to Sentry → Seer Request

Before:
- `trace_id`, `project_id`, `transaction_name`, `total_spans`, `spans: list[Span]`

After:
- `trace_id` and normalized `transaction_name`
- `EAPTrace` data fetched by Seer via RPC

Changes to Seer → Sentry Response

Updated the `DetectedIssue` model to include context fields:
- `trace_id: str` - which trace the issue was found in
- `transaction_name: str` - normalized transaction name

Trace Selection Logic
- `sum(span.duration)` over 30-minute window

Breaking Changes

This is a breaking change to the Seer integration. Deployment requires:
- disabling the feature first (`issue-detection.llm-detection.enabled = false`)

This will not impact any customers.
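The bundled Sentry → Seer request described above can be sketched as plain data. This is illustrative, not the real task code: the `telemetry`/`kind` shape comes from the quoted diff, while `organization_id` and the sample values are assumptions; the actual request is built from pydantic models in `detection.py`:

```python
import json

# Hypothetical trace references selected by the task (ids and names only;
# Seer fetches the full EAPTrace itself via the get_trace_waterfall RPC).
evidence_traces = [
    {"trace_id": "abc123", "transaction_name": "GET /api/0/projects/"},
]

# Single bundled request: every selected trace is tagged kind="trace".
seer_request = {
    "organization_id": 1,
    "telemetry": [{**trace, "kind": "trace"} for trace in evidence_traces],
}

body = json.dumps(seer_request).encode("utf-8")
```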