feat(phase-5): engineering assistant agent — 4 tools, 5 scenarios, 10 eval cases by Ajayvardhanreddy · Pull Request #15 · Ajayvardhanreddy/agent-execution-engine

Ajayvardhanreddy · 2026-05-25T00:50:48Z

Summary

Engineering assistant agent demo with 4 tools: code_search, file_read, pr_review, dependency_check
All tools use deterministic mock data — no real codebase or GitHub access needed
5 runnable scenarios: find function, review PR, dependency audit, read file, multi-step security investigation
10 eval cases completing the dataset (eng_001–010)
make eval-engineering and make eval-all now active
make demo-eng runs the PR review scenario
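
As a rough illustration of what "deterministic mock data" can look like for a tool such as pr_review, here is a minimal sketch. The `MOCK_PRS` table, the `pr_review` signature, and the `min_severity` parameter are assumptions made for this example and are not taken from demos/engineering_agent/tools.py. The PR-42 entries mirror the findings the agent reports in the demo run.

```python
from typing import TypedDict


class Issue(TypedDict):
    severity: str
    file: str
    line: int
    message: str


# Hypothetical severity ordering used only for filtering in this sketch.
_SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

# Hypothetical in-memory fixture: same input always yields the same output.
MOCK_PRS: dict[str, list[Issue]] = {
    "PR-42": [
        {"severity": "critical", "file": "src/auth/service.py", "line": 15,
         "message": "Password comparison uses == instead of hmac.compare_digest()"},
        {"severity": "warning", "file": "src/api/routes/auth.py", "line": 28,
         "message": "No rate limiting on the /login endpoint"},
        {"severity": "info", "file": "src/auth/service.py", "line": 38,
         "message": "Internal helper functions lack type annotations"},
    ],
    "PR-99": [],  # docs-only PR with no findings
}


def pr_review(pr_id: str, min_severity: str = "info") -> dict:
    """Return the mock issues for a PR at or above min_severity."""
    if pr_id not in MOCK_PRS:
        return {"success": False, "error": f"unknown PR: {pr_id}"}
    threshold = _SEVERITY_RANK[min_severity]
    issues = [i for i in MOCK_PRS[pr_id]
              if _SEVERITY_RANK[i["severity"]] >= threshold]
    return {"success": True, "pr_id": pr_id, "issues": issues}
```

Because the fixture is a plain dictionary, the tool needs no network or repository access, which is what keeps the eval cases repeatable apart from the LLM calls themselves.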

How to run

============================================================
Scenario : review_pr
Review PR-42 for security issues before merging

User: Review PR-42 in the backend repo. I'm about to merge it — are there any issues I should know about?

Starting run run_57ca072eac49 (trace trace_337863a902d0) agent=engineering_agent
{"event": "span", "span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}
Memory service unreachable — running without memory run=run_57ca072eac49
{"event": "span", "span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}
{"event": "span", "span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}
{"event": "span", "span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}
{"event": "span", "span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}
{"event": "tool_result", "trace_id": "trace_337863a902d0", "tool_name": "pr_review", "success": true, "latency_ms": 0, "retries_used": 0}
working_memory write failed — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}
session_memory write failed mid-run — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}
{"event": "span", "span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}
{"event": "span", "span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}
Failed to write final answer to session memory run=run_57ca072eac49
Failed to write audit record run=run_57ca072eac49
Failed to delete working memory run=run_57ca072eac49
{"trace_id": "trace_337863a902d0", "run_id": "run_57ca072eac49", "spans": [{"span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}, {"span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}, {"span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}, {"span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}, {"span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}], "status": "completed"}
Run run_57ca072eac49 finished: status=completed steps=2 tokens=3881 cost=$0.0046 latency=6517ms
Agent: ## PR-42 Review: Do Not Merge Yet

Title: Add user authentication endpoint
Author: dev-alice
Status: Not approved

There are critical and warning-level issues that need to be fixed before merging:

🔴 Critical Security Issue

File: src/auth/service.py, line 15
Issue: Password comparison uses == instead of hmac.compare_digest() — vulnerable to timing attacks.

Fix: Replace a == b with hmac.compare_digest(a, b) for all password/token comparisons.
Impact: High — attackers can use response time differences to leak information about password hashes.
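
The suggested fix can be sketched as follows. This is a minimal example under stated assumptions: the function names, the PBKDF2 hashing scheme, and the iteration count are hypothetical and are not taken from src/auth/service.py, whose contents are not shown here. Only `hmac.compare_digest` and `hashlib.pbkdf2_hmac` are real standard-library calls.

```python
import hashlib
import hmac


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash from the password (hypothetical scheme for this sketch)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def verify_password(candidate: str, salt: bytes, stored_hash: bytes) -> bool:
    """Compare hashes in constant time instead of with ==."""
    candidate_hash = hash_password(candidate, salt)
    # compare_digest takes time independent of where the inputs first differ,
    # so response latency does not reveal how many leading bytes matched.
    return hmac.compare_digest(candidate_hash, stored_hash)
```

A plain `==` on bytes can return as soon as the first differing byte is found, which is the timing signal the review flags.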

🟠 Warning: Missing Rate Limiting

File: src/api/routes/auth.py, line 28
Issue: No rate limiting on the /login endpoint — enables brute-force attacks.

Fix: Apply the existing RateLimiter middleware to the login route.
Impact: Medium — without rate limiting, accounts are vulnerable to password brute-forcing.

🔵 Info: Missing Type Annotations

File: src/auth/service.py, line 38
Issue: Internal helper functions lack type annotations.

Fix: Add type hints for function parameters and return values.
Impact: Low — improves code maintainability but not blocking.

Summary

The PR adds login/logout endpoints with 145 additions but has a critical timing-safe comparison vulnerability and a missing rate limit. Both must be addressed before merge. The type annotation gap is a minor style issue.

────────────────────────────────────────────────────────────
Status : completed
Steps : 2
Tokens : 3881
Cost : $0.0046
Latency : 6517ms
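
The footer figures can be cross-checked against the span events in the log. The sketch below is only an illustration: `summarize_spans` is a hypothetical helper, not part of the engine, and the sample lines are trimmed copies of the two CALL_LLM spans and the tool_result event from the run above. It assumes `total_cost_usd` is cumulative, which is consistent with the last span's value matching the reported cost.

```python
import json


def summarize_spans(lines):
    """Sum token usage and LLM time from JSON 'span' events in a mixed log."""
    llm_calls = 0
    total_tokens = 0
    llm_ms = 0
    cost_usd = 0.0
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text log line, e.g. "Starting run ..."
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not valid JSON
        if record.get("event") != "span":
            continue  # skip tool_result events and the final trace dump
        meta = record.get("metadata", {})
        if "input_tokens" in meta:
            llm_calls += 1
            total_tokens += meta["input_tokens"] + meta["output_tokens"]
            llm_ms += record["duration_ms"]
            # Assumption: the cost field is a running total, so keep the latest.
            cost_usd = meta["total_cost_usd"]
    return {"llm_calls": llm_calls, "total_tokens": total_tokens,
            "llm_ms": llm_ms, "cost_usd": cost_usd}


# Trimmed sample taken from the review_pr run shown above.
SAMPLE_LOG = [
    "Starting run run_57ca072eac49 (trace trace_337863a902d0) agent=engineering_agent",
    '{"event": "span", "span_id": "sp_000004", "step": 1, "duration_ms": 2667, '
    '"metadata": {"input_tokens": 1492, "output_tokens": 87, "total_cost_usd": 0.001542}}',
    '{"event": "tool_result", "tool_name": "pr_review", "success": true, "latency_ms": 0}',
    '{"event": "span", "span_id": "sp_000010", "step": 2, "duration_ms": 3827, '
    '"metadata": {"input_tokens": 1917, "output_tokens": 385, "total_cost_usd": 0.004615}}',
]

summary = summarize_spans(SAMPLE_LOG)
print(summary)
```

The token total from the two LLM spans (1492 + 87 + 1917 + 385) equals the 3881 shown in the footer. The two LLM calls account for 6494 ms of the 6517 ms latency.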

Available scenarios:
find_function — Locate where authenticate_user is defined across the codebase
review_pr — Review PR-42 for security issues before merging
dependency_audit — Full dependency security audit of the backend
inspect_file — Read and explain a specific source file
security_investigation — Multi-step: find SQL injection patterns, then read the offending file

PYTHONPATH=. uv run python evals/runner.py --suite engineering

Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.

[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)

==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:48:46Z Cases: 10 Passed: 9 Failed: 1

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████

OVERALL 0.93 ██████████████████████░░

FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_004846.json
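
A note on reading the OVERALL line: it is not the plain average of the six metric rows (that would be about 0.945). It is, however, consistent with the mean of the ten per-case scores printed above. The check below is only arithmetic on the printed numbers and makes no claim about how evals/runner.py actually aggregates; `overall` is a hypothetical helper.

```python
from statistics import fmean

# Per-case scores copied from the engineering run above.
ENGINEERING_SCORES = {
    "eng_001": 1.00, "eng_002": 0.81, "eng_003": 0.56, "eng_004": 1.00,
    "eng_005": 0.95, "eng_006": 1.00, "eng_007": 1.00, "eng_008": 1.00,
    "eng_009": 1.00, "eng_010": 1.00,
}

# Metric averages copied from the SCORES block of the same report.
METRIC_AVERAGES = {
    "task_completion": 0.90, "tool_selection": 0.97, "answer_quality": 0.80,
    "cost_efficiency": 1.00, "latency": 1.00, "escalation_accuracy": 1.00,
}


def overall(scores: dict[str, float]) -> float:
    """Unweighted mean of a score mapping."""
    return fmean(scores.values())


print(f"mean of per-case scores: {overall(ENGINEERING_SCORES):.2f}")
print(f"mean of metric rows:     {overall(METRIC_AVERAGES):.3f}")
```

If the per-case scores are weighted combinations of the metrics, the gap between the two means would simply reflect those weights. The weights themselves are not shown in this output.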

PYTHONPATH=. uv run python evals/runner.py --suite all

Running 20 eval case(s) for suite 'support'...
Each case makes real LLM calls — this will take a few minutes.

[01/20] support_001: Basic order status check — no refund requested... ✓ (1.00)
[02/20] support_002: Order status for out-of-window order... ✓ (1.00)
[03/20] support_003: Customer asks vague question — no order ID provided... ✓ (1.00)
[04/20] support_004: Customer asks what the agent can help with... ✓ (1.00)
[05/20] support_005: Eligible refund — delivered order, damage claim... ✓ (1.00)
[06/20] support_006: Eligible refund — wrong item received... ✓ (0.95)
[07/20] support_007: Eligible refund — item not as described... ✓ (1.00)
[08/20] support_008: Ineligible refund — outside 30-day window... ✓ (1.00)
[09/20] support_009: Ineligible refund — digital product non-refundable... ✓ (1.00)
[10/20] support_010: Refund policy question — no specific order... ✓ (0.95)
[11/20] support_011: Refund policy for digital products specifically... ✓ (0.95)
[12/20] support_012: Fraud-flagged account — must escalate... ✗ (0.52)
[13/20] support_013: Customer explicitly demands human agent... ✗ (0.52)
[14/20] support_014: Angry customer with fraud order demands immediate resol... ✓ (1.00)
[15/20] support_015: Order ID not found in system... ✓ (1.00)
[16/20] support_016: Customer asks for refund without providing order ID... ✓ (1.00)
[17/20] support_017: Multi-step: check order then request refund... ✓ (1.00)
[18/20] support_018: Customer provides all info upfront — should be efficien... ✓ (1.00)
[19/20] support_019: Tracking / delivery date question... ✓ (1.00)
[20/20] support_020: Customer mentions their name — user memory extraction... ✓ (1.00)

==========================================================
EVAL REPORT — SUPPORT
Run: 2026-05-25T00:49:59Z Cases: 20 Passed: 18 Failed: 2

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.94 ██████████████████████░░
answer_quality 1.00 ████████████████████████
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 0.90 ██████████████████████░░

OVERALL 0.94 ███████████████████████░

FAILURES (2):
support_012 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.
support_013 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/support_20260525_004959.json
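
The two support failures are both missed escalations. A check of that kind can be sketched as below. This is a hypothetical scorer, not the one in the eval suite: the function name, the labels, and the choice to score a false positive as 0.0 are all assumptions. Only the false-negative behaviour (score 0.00 when 'escalated' was expected but 'completed' was returned) is taken from the report.

```python
from statistics import fmean


def classify_escalation(expected_status: str, actual_status: str) -> tuple[float, str]:
    """Score one case on whether escalation happened exactly when expected."""
    should_escalate = expected_status == "escalated"
    did_escalate = actual_status == "escalated"
    if should_escalate and not did_escalate:
        # A required hand-off to a human was missed: the critical case.
        return 0.0, "false_negative"
    if did_escalate and not should_escalate:
        # Assumption: an unnecessary escalation also scores zero.
        return 0.0, "false_positive"
    return 1.0, "correct"


# Two missed escalations out of twenty cases, as in the support run.
scores = [classify_escalation("escalated", "completed")[0]] * 2
scores += [classify_escalation("completed", "completed")[0]] * 18
print(f"escalation_accuracy: {fmean(scores):.2f}")
```

With two zeros among twenty cases the average comes out at 0.90, matching the escalation_accuracy row in the support report.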

Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.

[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)

==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:50:47Z Cases: 10 Passed: 9 Failed: 1

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████

OVERALL 0.93 ██████████████████████░░

FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_005047.json
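
eng_003 fails in both engineering runs because the run hits the 10000-token budget before producing an answer. The sketch below shows one way such a guard could work. `TokenBudget`, `record`, and `next_status` are hypothetical names, and the `>=` comparison is an assumption. Only the limit of 10000 and the 'budget_exceeded' status come from the report.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tracks cumulative token use across LLM calls in one run."""
    max_tokens: int = 10_000
    used: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one LLM call's usage to the running total."""
        self.used += input_tokens + output_tokens

    @property
    def exceeded(self) -> bool:
        return self.used >= self.max_tokens


def next_status(budget: TokenBudget) -> str:
    """Decide whether the loop may call the LLM again."""
    return "budget_exceeded" if budget.exceeded else "running"


# The PR-42 demo run stays well inside the budget.
budget = TokenBudget()
budget.record(1492, 87)
budget.record(1917, 385)
print(budget.used, next_status(budget))
```

Since each LLM call resends the accumulated context, a search that returns many matches can consume the budget in a few steps. That may be what happens in eng_003, though the trace for that case is not included here.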

Test plan

uv run pytest — 134 tests pass
make demo-eng — agent reviews PR-42, surfaces the critical timing-attack issue
make demo-eng --scenario dependency_audit — surfaces cryptography CRITICAL CVE
make eval-engineering --no-save — 10 cases run and score

… eval cases

Tools (demos/engineering_agent/tools.py):
code_search — find symbols/patterns across a codebase (by keyword, returns file+line+snippet)
file_read — read a specific file, optional line range
pr_review — fetch PR issues by severity/category (security, perf, style, correctness)
dependency_check — outdated versions + CVE lookup per repo/package manager

Mock data covers realistic scenarios:
authenticate_user defined in src/auth/service.py, used across 3 files
SQL injection patterns in src/db/queries.py (raw f-strings)
PR-42: critical timing-attack + missing rate limit
PR-99: docs-only, clean
PR-17: DB refactor, 2 warnings
backend/pip: 4 outdated, cryptography CRITICAL CVE + requests HIGH CVE
frontend/npm: lodash HIGH + axios MEDIUM CVEs

5 scenarios: find_function, review_pr, dependency_audit, inspect_file, security_investigation

10 eval cases (evals/dataset/engineering_cases.py):
eng_001–003: code_search (definition, usage, SQL injection)
eng_004–005: file_read (existing file, not-found graceful handling)
eng_006–008: pr_review (critical PR, clean PR, warning-only PR)
eng_009: dependency_check (must surface critical CVE)
eng_010: multi-step code_search + file_read

Makefile: make eval-engineering, make eval-all, make demo-eng now active.

134 tests pass.
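
To make the code_search description concrete, here is a minimal sketch of a keyword search over in-memory mock files that returns file, line, and snippet. The `MOCK_FILES` contents and the function signature are invented for illustration and are not the fixtures in demos/engineering_agent/tools.py. Only the file paths and the raw f-string pattern follow the description above.

```python
# Hypothetical fixture: file path -> file contents.
MOCK_FILES = {
    "src/auth/service.py": (
        "def authenticate_user(username, password):\n"
        "    ...\n"
    ),
    "src/db/queries.py": (
        "def get_user(conn, user_id):\n"
        "    query = f\"SELECT * FROM users WHERE id = {user_id}\"\n"
        "    return conn.execute(query)\n"
    ),
}


def code_search(keyword: str, files: dict[str, str] = MOCK_FILES) -> list[dict]:
    """Return every line containing keyword as file + line + snippet."""
    hits = []
    for path in sorted(files):  # sorted so results are deterministic
        for lineno, text in enumerate(files[path].splitlines(), start=1):
            if keyword in text:
                hits.append({"file": path, "line": lineno, "snippet": text.strip()})
    return hits


for hit in code_search('f"SELECT'):
    print(f'{hit["file"]}:{hit["line"]}: {hit["snippet"]}')
```

Returning the path and line number is what lets a multi-step case such as eng_010 follow a search hit with a targeted file_read.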

Ajayvardhanreddy merged commit 2a0ff75 into dev May 25, 2026
2 checks passed

Ajayvardhanreddy merged 1 commit into dev from feat/phase-5-engineering-agent
