feat(phase-5): engineering assistant agent — 4 tools, 5 scenarios, 10 eval cases#15

Merged: Ajayvardhanreddy merged 1 commit into dev from feat/phase-5-engineering-agent on May 25, 2026
@Ajayvardhanreddy
Owner

Summary

  • Engineering assistant agent demo with 4 tools: code_search, file_read, pr_review, dependency_check
  • All tools use deterministic mock data — no real codebase or GitHub access needed
  • 5 runnable scenarios: find function, review PR, dependency audit, read file, multi-step security investigation
  • 10 eval cases completing the dataset (eng_001–010)
  • make eval-engineering and make eval-all now active
  • make demo-eng runs the PR review scenario

How to run

============================================================
Scenario : review_pr
Review PR-42 for security issues before merging

User: Review PR-42 in the backend repo. I'm about to merge it — are there any issues I should know about?

Starting run run_57ca072eac49 (trace trace_337863a902d0) agent=engineering_agent
{"event": "span", "span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}
Memory service unreachable — running without memory run=run_57ca072eac49
{"event": "span", "span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}
{"event": "span", "span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}
{"event": "span", "span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}
{"event": "span", "span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}
{"event": "tool_result", "trace_id": "trace_337863a902d0", "tool_name": "pr_review", "success": true, "latency_ms": 0, "retries_used": 0}
working_memory write failed — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}
session_memory write failed mid-run — continuing run=run_57ca072eac49
{"event": "span", "span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}
{"event": "span", "span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, "metadata": {}}
HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
{"event": "span", "span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}
{"event": "span", "span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}
Failed to write final answer to session memory run=run_57ca072eac49
Failed to write audit record run=run_57ca072eac49
Failed to delete working memory run=run_57ca072eac49
{"trace_id": "trace_337863a902d0", "run_id": "run_57ca072eac49", "spans": [{"span_id": "sp_000001", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "START", "to_state": "LOAD_MEMORY", "timestamp_ms": 1779670069534, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000002", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "LOAD_MEMORY", "to_state": "BUILD_CONTEXT", "timestamp_ms": 1779670069534, "duration_ms": 6, "metadata": {}}, {"span_id": "sp_000003", "trace_id": "trace_337863a902d0", "step": 0, "from_state": "BUILD_CONTEXT", "to_state": "CALL_LLM", "timestamp_ms": 1779670069541, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000004", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670069541, "duration_ms": 2667, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1492, "output_tokens": 87, "stop_reason": "tool_use", "total_cost_usd": 0.001542}}, {"span_id": "sp_000005", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "PROCESS_RESPONSE", "to_state": "EXECUTE_TOOL", "timestamp_ms": 1779670072208, "duration_ms": 0, "metadata": {}}, {"span_id": "sp_000006", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "EXECUTE_TOOL", "to_state": "OBSERVE_RESULT", "timestamp_ms": 1779670072209, "duration_ms": 0, "metadata": {"tools_called": ["pr_review"], "all_succeeded": true}}, {"span_id": "sp_000007", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "OBSERVE_RESULT", "to_state": "WRITE_MEMORY", "timestamp_ms": 1779670072209, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000008", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "WRITE_MEMORY", "to_state": "CHECK_TERMINATION", "timestamp_ms": 1779670072211, "duration_ms": 1, "metadata": {}}, {"span_id": "sp_000009", "trace_id": "trace_337863a902d0", "step": 1, "from_state": "CHECK_TERMINATION", "to_state": "CALL_LLM", "timestamp_ms": 1779670072212, "duration_ms": 0, 
"metadata": {}}, {"span_id": "sp_000010", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "CALL_LLM", "to_state": "PROCESS_RESPONSE", "timestamp_ms": 1779670072212, "duration_ms": 3827, "metadata": {"model": "claude-haiku-4-5-20251001", "input_tokens": 1917, "output_tokens": 385, "stop_reason": "end_turn", "total_cost_usd": 0.004615}}, {"span_id": "sp_000011", "trace_id": "trace_337863a902d0", "step": 2, "from_state": "PROCESS_RESPONSE", "to_state": "RESPOND", "timestamp_ms": 1779670076040, "duration_ms": 0, "metadata": {}}], "status": "completed"}
Run run_57ca072eac49 finished: status=completed steps=2 tokens=3881 cost=$0.0046 latency=6517ms
Agent: ## PR-42 Review: Do Not Merge Yet

Title: Add user authentication endpoint
Author: dev-alice
Status: Not approved

There are critical and warning-level issues that need to be fixed before merging:

🔴 Critical Security Issue

File: src/auth/service.py, line 15
Issue: Password comparison uses == instead of hmac.compare_digest() — vulnerable to timing attacks.

  • Fix: Replace a == b with hmac.compare_digest(a, b) for all password/token comparisons.
  • Impact: High — attackers can use response time differences to leak information about password hashes.

🟠 Warning: Missing Rate Limiting

File: src/api/routes/auth.py, line 28
Issue: No rate limiting on the /login endpoint — enables brute-force attacks.

  • Fix: Apply the existing RateLimiter middleware to the login route.
  • Impact: Medium — without rate limiting, accounts are vulnerable to password brute-forcing.

🔵 Info: Missing Type Annotations

File: src/auth/service.py, line 38
Issue: Internal helper functions lack type annotations.

  • Fix: Add type hints for function parameters and return values.
  • Impact: Low — improves code maintainability but not blocking.

Summary

The PR adds login/logout endpoints with 145 additions but has a critical timing-safe comparison vulnerability and a missing rate limit. Both must be addressed before merge. The type annotation gap is a minor style issue.

────────────────────────────────────────────────────────────
Status : completed
Steps : 2
Tokens : 3881
Cost : $0.0046
Latency : 6517ms

Available scenarios:
find_function — Locate where authenticate_user is defined across the codebase
review_pr — Review PR-42 for security issues before merging
dependency_audit — Full dependency security audit of the backend
inspect_file — Read and explain a specific source file
security_investigation — Multi-step: find SQL injection patterns, then read the offending file
PYTHONPATH=. uv run python evals/runner.py --suite engineering

Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.

[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)

==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:48:46Z Cases: 10 Passed: 9 Failed: 1

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████

OVERALL 0.93 ██████████████████████░░

FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_004846.json

PYTHONPATH=. uv run python evals/runner.py --suite all

Running 20 eval case(s) for suite 'support'...
Each case makes real LLM calls — this will take a few minutes.

[01/20] support_001: Basic order status check — no refund requested... ✓ (1.00)
[02/20] support_002: Order status for out-of-window order... ✓ (1.00)
[03/20] support_003: Customer asks vague question — no order ID provided... ✓ (1.00)
[04/20] support_004: Customer asks what the agent can help with... ✓ (1.00)
[05/20] support_005: Eligible refund — delivered order, damage claim... ✓ (1.00)
[06/20] support_006: Eligible refund — wrong item received... ✓ (0.95)
[07/20] support_007: Eligible refund — item not as described... ✓ (1.00)
[08/20] support_008: Ineligible refund — outside 30-day window... ✓ (1.00)
[09/20] support_009: Ineligible refund — digital product non-refundable... ✓ (1.00)
[10/20] support_010: Refund policy question — no specific order... ✓ (0.95)
[11/20] support_011: Refund policy for digital products specifically... ✓ (0.95)
[12/20] support_012: Fraud-flagged account — must escalate... ✗ (0.52)
[13/20] support_013: Customer explicitly demands human agent... ✗ (0.52)
[14/20] support_014: Angry customer with fraud order demands immediate resol... ✓ (1.00)
[15/20] support_015: Order ID not found in system... ✓ (1.00)
[16/20] support_016: Customer asks for refund without providing order ID... ✓ (1.00)
[17/20] support_017: Multi-step: check order then request refund... ✓ (1.00)
[18/20] support_018: Customer provides all info upfront — should be efficien... ✓ (1.00)
[19/20] support_019: Tracking / delivery date question... ✓ (1.00)
[20/20] support_020: Customer mentions their name — user memory extraction... ✓ (1.00)

==========================================================
EVAL REPORT — SUPPORT
Run: 2026-05-25T00:49:59Z Cases: 20 Passed: 18 Failed: 2

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.94 ██████████████████████░░
answer_quality 1.00 ████████████████████████
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 0.90 ██████████████████████░░

OVERALL 0.94 ███████████████████████░

FAILURES (2):
support_012 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.
support_013 task_completion=0.00 — Expected status 'escalated', got 'completed'. Failure reason: none | escalation_accuracy=0.00 — FALSE NEGATIVE [CRITICAL] — should have escalated but got status 'completed'. A required escalation was missed. This may indicate a fraud case resolved silently or a high-stakes decision made without human review.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/support_20260525_004959.json

Running 10 eval case(s) for suite 'engineering'...
Each case makes real LLM calls — this will take a few minutes.

[01/10] eng_001: Find where authenticate_user is defined... ✓ (1.00)
[02/10] eng_002: Find all usages of rate_limit across the codebase... ✓ (0.81)
[03/10] eng_003: Search for SQL injection vulnerabilities... ✗ (0.56)
[04/10] eng_004: Read and explain src/auth/service.py... ✓ (1.00)
[05/10] eng_005: Read a file that does not exist — graceful handling... ✓ (0.95)
[06/10] eng_006: Review PR-42 — must surface the critical security issue... ✓ (1.00)
[07/10] eng_007: Review PR-99 — docs only, should be clean... ✓ (1.00)
[08/10] eng_008: Review PR-17 — DB refactor with warnings... ✓ (1.00)
[09/10] eng_009: Backend pip dependency audit — must surface critical CV... ✓ (1.00)
[10/10] eng_010: Multi-step: find SQL injection, then read the vulnerabl... ✓ (1.00)

==========================================================
EVAL REPORT — ENGINEERING
Run: 2026-05-25T00:50:47Z Cases: 10 Passed: 9 Failed: 1

SCORES (average across all cases):
task_completion 0.90 ██████████████████████░░
tool_selection 0.97 ███████████████████████░
answer_quality 0.80 ███████████████████░░░░░
cost_efficiency 1.00 ████████████████████████
latency 1.00 ████████████████████████
escalation_accuracy 1.00 ████████████████████████

OVERALL 0.93 ██████████████████████░░

FAILURES (1):
eng_003 task_completion=0.00 — Expected status 'completed', got 'budget_exceeded'. Failure reason: max_tokens (10000) reached | answer_quality=0.00 — No final answer returned for a completed run.

Report saved to: /Users/ajayvardhanreddy/Documents/Projects/ai-infra/agent-execution-engine/evals/reports/engineering_20260525_005047.json
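
The totals on the `Run run_57ca072eac49 finished` line of the demo output can be cross-checked against the two `CALL_LLM` spans. The sketch below is only a sanity check, not the engine's accounting code: `summarize` is a hypothetical helper, and the per-million-token rates ($0.80 input, $4.00 output) are an assumption inferred from the logged numbers rather than something the PR states.

```python
# Token counts copied from the two CALL_LLM -> PROCESS_RESPONSE spans in the demo log.
LLM_SPANS = [
    {"input_tokens": 1492, "output_tokens": 87},
    {"input_tokens": 1917, "output_tokens": 385},
]

# Assumed rates in USD per million tokens (inferred from the log, not confirmed).
INPUT_RATE = 0.80
OUTPUT_RATE = 4.00


def summarize(spans):
    """Return (total_tokens, total_cost_usd) for a list of LLM spans."""
    tokens = sum(s["input_tokens"] + s["output_tokens"] for s in spans)
    cost = sum(
        s["input_tokens"] * INPUT_RATE + s["output_tokens"] * OUTPUT_RATE
        for s in spans
    ) / 1_000_000
    return tokens, cost


tokens, cost = summarize(LLM_SPANS)
print(f"tokens={tokens} cost=${cost:.4f}")
```

Under those assumed rates the result agrees with the reported `tokens=3881 cost=$0.0046`. It also suggests that `total_cost_usd` in the span metadata is a running total (0.001542 after the first call, 0.004615 after the second), not a per-call figure.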

Test plan

  • uv run pytest — 134 tests pass
  • make demo-eng — agent reviews PR-42, surfaces the critical timing-attack issue
  • make demo-eng --scenario dependency_audit — surfaces cryptography CRITICAL CVE
  • make eval-engineering --no-save — 10 cases run and score
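
For context on the timing-attack finding that the PR-42 demo is expected to surface, here is a minimal sketch of the recommended fix. `tokens_match` is a hypothetical helper, not code from the mock repository; it only shows the shape of replacing `==` with `hmac.compare_digest()`.

```python
import hmac


def tokens_match(expected: str, provided: str) -> bool:
    """Compare two secrets without short-circuiting on the first mismatch."""
    # compare_digest accepts bytes or ASCII-only str, so encode to handle any text.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))


print(tokens_match("s3cret-hash", "s3cret-hash"))
print(tokens_match("s3cret-hash", "s3cret-hasH"))
```

A plain `==` can return as soon as it finds a differing character, which is what makes response time depend on how much of the secret matched.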

… eval cases

Tools (demos/engineering_agent/tools.py):
  code_search       — find symbols/patterns across a codebase (by keyword, returns file+line+snippet)
  file_read         — read a specific file, optional line range
  pr_review         — fetch PR issues by severity/category (security, perf, style, correctness)
  dependency_check  — outdated versions + CVE lookup per repo/package manager
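
To illustrate what a deterministic mock tool of this kind might look like, here is a minimal sketch of `pr_review` backed by an in-memory table. It is not the code in demos/engineering_agent/tools.py: the `Issue` dataclass, the `MOCK_PRS` table, the return shape, and the category assigned to each issue are all assumptions; only the PR-42 files, line numbers, and severities come from the demo output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    severity: str  # "critical", "warning", or "info"
    category: str  # e.g. "security", "perf", "style", "correctness"
    file: str
    line: int
    message: str


# Hypothetical mock table; category labels are guesses for illustration.
MOCK_PRS = {
    "PR-42": [
        Issue("critical", "security", "src/auth/service.py", 15,
              "Password comparison uses == instead of hmac.compare_digest()"),
        Issue("warning", "security", "src/api/routes/auth.py", 28,
              "No rate limiting on the /login endpoint"),
        Issue("info", "style", "src/auth/service.py", 38,
              "Internal helper functions lack type annotations"),
    ],
    "PR-99": [],  # docs-only PR with no findings
}


def pr_review(pr_id, severity=None, category=None):
    """Return the mock issues for a PR, optionally filtered."""
    if pr_id not in MOCK_PRS:
        return {"success": False, "error": f"unknown PR: {pr_id}"}
    issues = [
        issue
        for issue in MOCK_PRS[pr_id]
        if (severity is None or issue.severity == severity)
        and (category is None or issue.category == category)
    ]
    return {"success": True, "pr_id": pr_id, "issues": issues}


for issue in pr_review("PR-42", severity="critical")["issues"]:
    print(issue.file, issue.line, issue.message)
```

Because the data is a fixed table, the same input always yields the same tool result, which keeps eval scores comparable across runs even though the LLM calls themselves are real.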

Mock data covers realistic scenarios:
  authenticate_user defined in src/auth/service.py, used across 3 files
  SQL injection patterns in src/db/queries.py (raw f-strings)
  PR-42: critical timing-attack + missing rate limit
  PR-99: docs-only, clean
  PR-17: DB refactor, 2 warnings
  backend/pip: 4 outdated, cryptography CRITICAL CVE + requests HIGH CVE
  frontend/npm: lodash HIGH + axios MEDIUM CVEs
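
The "raw f-strings" pattern in the mock src/db/queries.py refers to building SQL text from user input. The sketch below reproduces the idea with the standard-library `sqlite3` module; the table, function names, and payload are made up for illustration and are not taken from the mock data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name):
    # Vulnerable: user input is spliced directly into the SQL text.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'"
    ).fetchall()


def find_user_safe(name):
    # Safer: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # every row matches the injected OR clause
print(len(find_user_safe(payload)))    # no user has this literal name
```

With the injected payload the f-string version matches both rows, while the parameterized version matches none.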

5 scenarios: find_function, review_pr, dependency_audit, inspect_file, security_investigation

10 eval cases (evals/dataset/engineering_cases.py):
  eng_001–003: code_search (definition, usage, SQL injection)
  eng_004–005: file_read (existing file, not-found graceful handling)
  eng_006–008: pr_review (critical PR, clean PR, warning-only PR)
  eng_009: dependency_check (must surface critical CVE)
  eng_010: multi-step code_search + file_read
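
As a rough picture of how a case's expected status could be scored, the sketch below produces the same kind of message seen in the eval report for eng_003. It is an assumption-heavy illustration: `EvalCase`, its fields, and `score_task_completion` are hypothetical and do not reflect the actual contents of evals/dataset/engineering_cases.py or the runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected_status: str = "completed"


def score_task_completion(case, actual_status, failure_reason=None):
    """Return (score, message); 1.0 only when the run ended in the expected status."""
    if actual_status == case.expected_status:
        return 1.0, "ok"
    return 0.0, (
        f"Expected status '{case.expected_status}', got '{actual_status}'. "
        f"Failure reason: {failure_reason or 'none'}"
    )


case = EvalCase("eng_003", "Search for SQL injection vulnerabilities")
score, message = score_task_completion(
    case, "budget_exceeded", "max_tokens (10000) reached"
)
print(score, message)
```

A binary status check like this would explain why a single budget overrun drops `task_completion` for that case to 0.00 regardless of how far the agent got.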

Makefile: make eval-engineering, make eval-all, make demo-eng now active.
134 tests pass.
@Ajayvardhanreddy Ajayvardhanreddy merged commit 2a0ff75 into dev May 25, 2026
2 checks passed