feat(phase-5): engineering assistant agent — 4 tools, 5 scenarios, 10 eval cases#15
Merged
… eval cases

Tools (demos/engineering_agent/tools.py):
- code_search: find symbols/patterns across a codebase (by keyword, returns file+line+snippet)
- file_read: read a specific file, optional line range
- pr_review: fetch PR issues by severity/category (security, perf, style, correctness)
- dependency_check: outdated versions + CVE lookup per repo/package manager

Mock data covers realistic scenarios:
- authenticate_user defined in src/auth/service.py, used across 3 files
- SQL injection patterns in src/db/queries.py (raw f-strings)
- PR-42: critical timing-attack + missing rate limit
- PR-99: docs-only, clean
- PR-17: DB refactor, 2 warnings
- backend/pip: 4 outdated, cryptography CRITICAL CVE + requests HIGH CVE
- frontend/npm: lodash HIGH + axios MEDIUM CVEs

5 scenarios: find_function, review_pr, dependency_audit, inspect_file, security_investigation

10 eval cases (evals/dataset/engineering_cases.py):
- eng_001–003: code_search (definition, usage, SQL injection)
- eng_004–005: file_read (existing file, not-found graceful handling)
- eng_006–008: pr_review (critical PR, clean PR, warning-only PR)
- eng_009: dependency_check (must surface critical CVE)
- eng_010: multi-step code_search + file_read

Makefile: make eval-engineering, make eval-all, make demo-eng now active.

134 tests pass.
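A minimal sketch of what a mock keyword search tool of this shape could look like. Only the idea (search by keyword, return file, line, and snippet) comes from the description above; the `MOCK_FILES` contents, the `SearchHit` type, and the function signature are illustrative assumptions, not the code in demos/engineering_agent/tools.py.

```python
from dataclasses import dataclass

# Hypothetical in-memory "codebase"; the real mock data lives in tools.py.
MOCK_FILES: dict[str, str] = {
    "src/auth/service.py": "import hmac\n\ndef authenticate_user(username, password):\n    ...\n",
    "src/api/routes/auth.py": "from src.auth.service import authenticate_user\n",
}


@dataclass(frozen=True)
class SearchHit:
    file: str
    line: int      # 1-based line number
    snippet: str   # the matching line, stripped of surrounding whitespace


def code_search(keyword: str, files: dict[str, str] = MOCK_FILES) -> list[SearchHit]:
    """Return every line containing `keyword` as a file + line + snippet record."""
    hits = []
    for path in sorted(files):
        for lineno, text in enumerate(files[path].splitlines(), start=1):
            if keyword in text:
                hits.append(SearchHit(path, lineno, text.strip()))
    return hits


if __name__ == "__main__":
    for hit in code_search("authenticate_user"):
        print(f"{hit.file}:{hit.line}: {hit.snippet}")
```

Returning structured hits rather than preformatted text keeps the tool result easy to assert on in eval cases such as eng_001.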
Summary
- Tools: `code_search`, `file_read`, `pr_review`, `dependency_check`
- `make eval-engineering` and `make eval-all` now active
- `make demo-eng` runs the PR review scenario

How to run
============================================================
Scenario : review_pr
Review PR-42 for security issues before merging
User: Review PR-42 in the backend repo. I'm about to merge it — are there any issues I should know about?
Starting run run_57ca072eac49 (trace trace_337863a902d0) agent=engineering_agent
{"event": "span", "span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}
Memory service unreachable — running without memory run=run_57ca072eac49
{"event": "span", "span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}
{"event": "span", "span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}
{"event": "span", "span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}
{"event": "span", "span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}
{"event": "tool_result", "trace_id": "trace_337863a902d0", "tool_name": "pr_review", "success": true, "latency_ms": 0, "retries_used": 0}
working_memory write failed — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}
session_memory write failed mid-run — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}
{"event": "span", "span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}
{"event": "span", "span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}
Failed to write final answer to session memory run=run_57ca072eac49
Failed to write audit record run=run_57ca072eac49
Failed to delete working memory run=run_57ca072eac49
{"trace_id": "trace_337863a902d0", "run_id": "run_57ca072eac49", "spans": [{"span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}, {"span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}, {"span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}, {"span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, 
"metadata": {}}, {"span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}, {"span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}], "status": "completed"}
Run run_57ca072eac49 finished: status=completed steps=2 tokens=3881 cost=$0.0046 latency=6517ms
Agent: ## PR-42 Review: Do Not Merge Yet
Title: Add user authentication endpoint
Author: dev-alice
Status: Not approved
There are critical and warning-level issues that need to be fixed before merging:
🔴 Critical Security Issue
File: `src/auth/service.py`, line 15
Issue: Password comparison uses `==` instead of `hmac.compare_digest()`, which is vulnerable to timing attacks.
Replace `a == b` with `hmac.compare_digest(a, b)` for all password/token comparisons.
🟠 Warning: Missing Rate Limiting
File: `src/api/routes/auth.py`, line 28
Issue: No rate limiting on the `/login` endpoint, which enables brute-force attacks.
Add `RateLimiter` middleware to the login route.
🔵 Info: Missing Type Annotations
File: `src/auth/service.py`, line 38
Issue: Internal helper functions lack type annotations.
Summary
The PR adds login/logout endpoints with 145 additions but has a critical timing-safe comparison vulnerability and a missing rate limit. Both must be addressed before merge. The type annotation gap is a minor style issue.
────────────────────────────────────────────────────────────
Status : completed
Steps : 2
Tokens : 3881
Cost : $0.0046
Latency : 6517ms
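The status block can be cross-checked against the span events in the log above. The sketch below assumes the span JSON shape shown in the trace and reproduces only the two CALL_LLM spans; the `summarize` helper is hypothetical and is not part of the engine.

```python
import json

# The two LLM spans from the trace above, trimmed to the fields used here.
SPANS_JSON = """
[
  {"step": 1, "from_state": "CALL_LLM", "duration_ms": 2667,
   "metadata": {"input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use"}},
  {"step": 2, "from_state": "CALL_LLM", "duration_ms": 3827,
   "metadata": {"input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn"}}
]
"""


def summarize(spans: list[dict]) -> dict:
    """Aggregate step count, token usage, and LLM time from span records."""
    llm = [s for s in spans if s["from_state"] == "CALL_LLM"]
    return {
        "steps": max(s["step"] for s in spans),
        "tokens": sum(
            s["metadata"]["input_tokens"] + s["metadata"]["output_tokens"] for s in llm
        ),
        "llm_ms": sum(s["duration_ms"] for s in llm),
    }


if __name__ == "__main__":
    print(summarize(json.loads(SPANS_JSON)))
```

The token total comes to 3881, matching the reported figure, and the two LLM calls account for 6494 ms of the 6517 ms latency.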
Available scenarios:
find_function — Locate where authenticate_user is defined across the codebase
review_pr — Review PR-42 for security issues before merging
dependency_audit — Full dependency security audit of the backend
inspect_file — Read and explain a specific source file
security_investigation — Multi-step: find SQL injection patterns, then read the offending file
PYTHONPATH=. uv run python evals/runner.py --suite engineering
Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.
[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)
==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:48:46Z Cases: 10 Passed: 9 Failed: 1
SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████
OVERALL 0.93 ██████████████████████░░
FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.
Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_004846.json
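The OVERALL figure in this report is consistent with a plain mean of the ten per-case scores printed above. Whether the runner actually computes it this way is an assumption; the sketch only checks the arithmetic, and the `overall` helper is hypothetical.

```python
from statistics import mean

# Per-case scores copied from the engineering run above (eng_001 to eng_010).
CASE_SCORES = [1.00, 0.81, 0.56, 1.00, 0.95, 1.00, 1.00, 1.00, 1.00, 1.00]


def overall(scores: list[float]) -> float:
    """Unweighted mean of per-case scores, rounded to two decimals."""
    return round(mean(scores), 2)


if __name__ == "__main__":
    print(overall(CASE_SCORES))
```

The scores sum to 9.32, so the mean rounds to 0.93, the value shown on the OVERALL line.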
PYTHONPATH=. uv run python evals/runner.py --suite all
Running 20 eval case(s) for suite 'support'...
Each case makes real LLM calls — this will take a few minutes.
[01/20] support_001: Basic order status check — no refund requested... ✓ (1.00)
[02/20] support_002: Order status for out-of-window order... ✓ (1.00)
[03/20] support_003: Customer asks vague question — no order ID provided... ✓ (1.00)
[04/20] support_004: Customer asks what the agent can help with... ✓ (1.00)
[05/20] support_005: Eligible refund — delivered order, damage claim... ✓ (1.00)
[06/20] support_006: Eligible refund — wrong item received... ✓ (0.95)
[07/20] support_007: Eligible refund — item not as described... ✓ (1.00)
[08/20] support_008: Ineligible refund — outside 30-day window... ✓ (1.00)
[09/20] support_009: Ineligible refund — digital product non-refundable... ✓ (1.00)
[10/20] support_010: Refund policy question — no specific order... ✓ (0.95)
[11/20] support_011: Refund policy for digital products specifically... ✓ (0.95)
[12/20] support_012: Fraud-flagged account — must escalate... ✗ (0.52)
[13/20] support_013: Customer explicitly demands human agent... ✗ (0.52)
[14/20] support_014: Angry customer with fraud order demands immediate resol... ✓ (1.00)
[15/20] support_015: Order ID not found in system... ✓ (1.00)
[16/20] support_016: Customer asks for refund without providing order ID... ✓ (1.00)
[17/20] support_017: Multi-step: check order then request refund... ✓ (1.00)
[18/20] support_018: Customer provides all info upfront — should be efficien... ✓ (1.00)
[19/20] support_019: Tracking / delivery date question... ✓ (1.00)
[20/20] support_020: Customer mentions their name — user memory extraction... ✓ (1.00)
==========================================================
EVAL REPORT — SUPPORT
Run: 2026-05-25T00:49:59Z Cases: 20 Passed: 18 Failed: 2
SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.94 ██████████████████████░░
answer_quality 1.00 ████████████████████████
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 0.90 ██████████████████████░░
OVERALL 0.94 ███████████████████████░
FAILURES (2):
support_012 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.
support_013 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.
Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/support_20260525_004959.json
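Both support failures are escalation false negatives: the expected status was 'escalated' but the run ended as 'completed'. A sketch of how such an outcome might be labelled follows; the function name and every label other than FALSE NEGATIVE are assumptions, not the eval runner's code.

```python
def classify_escalation(expected_status: str, actual_status: str) -> str:
    """Label an eval outcome with respect to escalation.

    Hypothetical helper: treats 'escalated' as the positive class.
    """
    expected = expected_status == "escalated"
    actual = actual_status == "escalated"
    if expected and actual:
        return "TRUE POSITIVE"
    if expected:
        return "FALSE NEGATIVE"   # a required escalation was missed
    if actual:
        return "FALSE POSITIVE"   # escalated when it was not required
    return "TRUE NEGATIVE"


if __name__ == "__main__":
    # support_012 and support_013 from the report above
    print(classify_escalation("escalated", "completed"))
```

With 2 misses in 20 cases, 18/20 = 0.90, which matches the escalation_accuracy line in the support report.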
Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.
[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)
==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:50:47Z Cases: 10 Passed: 9 Failed: 1
SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████
OVERALL 0.93 ██████████████████████░░
FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.
Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_005047.json
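In both runs eng_003 ends with status 'budget_exceeded' once max_tokens (10000) is reached. A minimal sketch of a budget guard of that kind is shown below; the `TokenBudget` class and its method names are hypothetical, and only the limit and the status string are taken from the report.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical running token budget for a single agent run."""
    max_tokens: int = 10_000
    used: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Count both prompt and completion tokens toward the budget.
        self.used += input_tokens + output_tokens

    def status(self) -> str:
        # Report budget_exceeded once the limit is reached; otherwise keep running.
        return "budget_exceeded" if self.used >= self.max_tokens else "running"


if __name__ == "__main__":
    budget = TokenBudget()
    budget.record(1492, 87)    # first LLM call of the PR-42 demo run
    budget.record(1917, 385)   # second LLM call of the PR-42 demo run
    print(budget.used, budget.status())
```

A multi-step search like eng_003 re-sends the growing context on every call, so input tokens accumulate quickly; checking the budget after each call lets the run stop with an explicit status instead of failing mid-request.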
Test plan
- `uv run pytest`: 134 tests pass
- `make demo-eng`: agent reviews PR-42, surfaces the critical timing-attack issue
- `make demo-eng --scenario dependency_audit`: surfaces cryptography CRITICAL CVE
- `make eval-engineering --no-save`: 10 cases run and score
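The critical finding that the PR-42 demo surfaces is a secret comparison done with `==`. A short sketch of the constant-time alternative from Python's standard `hmac` module follows; the function name `tokens_match` and the sample values are illustrative only and do not come from the mock data.

```python
import hmac


def tokens_match(supplied: str, expected: str) -> bool:
    """Compare two secrets without leaking where they first differ."""
    # compare_digest accepts bytes (or ASCII-only str), so encode first.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    print(tokens_match("s3cret-token", "s3cret-token"))  # True
    print(tokens_match("s3cret-token", "s3cret-tokem"))  # False
```

In practice the values compared this way would typically be password hashes or session tokens rather than plaintext passwords.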