fix(control-plane): authorize execution note writes(#420)#575
fix(control-plane): authorize execution note writes(#420)#575Luffy2208 wants to merge 2 commits into
Conversation
📊 Coverage gateThresholds from
✅ Gate passedNo surface regressed past the allowed threshold and the aggregate stayed above the floor. |
📐 Patch coverage gateThreshold: 80% on lines this PR touches vs
✅ Patch gate passedEvery surface whose lines were touched by this PR has patch coverage at or above the threshold. |
b9fb5f5 to
72f59ee
Compare
|
I fixed the patch coverage failure by adding targeted tests in What was added:
|
santoshkumarradha
left a comment
There was a problem hiding this comment.
🔴 PR-AF Review — Needs Major Rework
Automated multi-agent code review · PR-AF built with AgentField
10 findings · 🔴 3 critical · 🟠 4 important · 🔵 0 suggestions · ⚪ 0 nitpicks
PR Overview
Summary
Fixes an IDOR in the execution notes write endpoint by enforcing execution ownership before appending a note.
File-specific changes:
-
control-plane/internal/handlers/execution_notes.go- Resolves the caller agent identity before writing a note.
- Compares the caller agent ID with the execution owner,
execution.AgentNodeID. - Returns
403 execution_ownership_mismatchwhen the caller does not own the execution. - Supports DID-authenticated callers by resolving the verified caller DID to an agent ID.
-
control-plane/internal/server/middleware/auth.go- Stores API-key caller identity in Gin context using
CallerAgentIDKey. - Uses
X-Caller-Agent-IDfirst, withX-Agent-Node-IDas fallback.
- Stores API-key caller identity in Gin context using
-
control-plane/internal/handlers/execution_notes_test.go- Adds coverage for owner write success.
- Adds coverage for non-owner API-key write returning
403. - Adds coverage for DID-authenticated owner and non-owner behavior.
-
control-plane/internal/server/middleware/auth_test.go- Adds coverage that API-key auth populates caller identity in Gin context.
-
control-plane/internal/handlers/coverage_handlers_90_test.go- Updates the existing successful note-write coverage test to include matching caller identity.
Testing
-
./scripts/test-all.sh - Additional verification:
cd control-plane && go test ./internal/handlers ./internal/server/middlewarecd control-plane && go test ./internal/handlers ./internal/server/middleware -coverprofile=/tmp/issue-420.coverprofilecd control-plane && golangci-lint run --new-from-rev=upstream/main ./internal/handlers ./internal/server/middleware- Manual E2E curl verification:
- Agent A adding a note to Agent A’s execution returns
200 OK - Agent A adding a note to Agent B’s execution returns
403 Forbidden - Agent B’s fresh execution remains with
notes: []after the blocked write
- Agent A adding a note to Agent A’s execution returns
Note: full repo lint currently reports pre-existing unrelated Go lint issues outside this PR’s changed files. Changed-line lint for the touched packages reports 0 issues.
Checklist
- I updated documentation where applicable.
- I added or updated tests (or none were needed).
- I updated
CHANGELOG.md(or this change does not warrant a changelog entry).
Screenshots (if UI-related)
Not UI-related.
Related issues
Fixes #420
Key Findings
7 issue(s) should be addressed before merge:
- 🔴 Complete ownership enforcement bypass when APIKey is empty and DID auth is off — the default configuration (
control-plane/internal/server/middleware/auth.go:24) — WhenAuthConfig.APIKeyis empty (the default in ALL deployment configurations:config/agentfield.yaml, Docker Compose atdeployments/docker/docker-compose.yml, and Helm at `deployments/helm/agen… - 🔴 Three-tier identity fallback in execution notes handler has no fail-closed mechanism: raw-header tier silently becomes primary identity source when all upstream auth middleware is configuration-disabled (
control-plane/internal/handlers/execution_notes.go:184) — The execution notes handler's executionNoteCallerAgentID implements a three-tier identity resolution cascade: (1) verified DID from DIDAuthMiddleware, (2) CallerAgentIDKey context value from APIKeyAut… - 🔴 Fixing the default authentication bypass by enabling DID auth silently activates diverging DID resolution paths — a 'fix F4, expose F1' trap (
control-plane/internal/handlers/execution_notes.go:184) — F4 establishes that under the default configuration (APIKey empty AND did_auth_enabled false), the execution notes handler accepts unvalidated raw headers as caller identity — a complete ownership enf… - 🟠 GetExecutionNotesHandler leaks execution notes to any authenticated caller with no ownership enforcement (
control-plane/internal/handlers/execution_notes.go:235) — The PR fixes an IDOR on the write path (AddExecutionNoteHandler) by enforcing execution ownership, but the read path (GetExecutionNotesHandler) remains completely open: any API-key-authenticated calle… - 🟠 DID auth middleware provides no defense against no-auth bypass — attacker can simply omit DID headers (
control-plane/internal/server/middleware/did_auth.go:177) — The assignment question asks: "Does the DIDAuthMiddleware at routes_middleware.go:77-88 provide any defense when API key auth is off?" No, not in any meaningful way. If `DID.Enabled && DIDAuthEna… - 🟠 Two-tier DID resolution reads semantically-different field names (AgentID vs AgentNodeID) from independent tables with no structural equivalence guarantee (
control-plane/internal/handlers/execution_notes.go:206) — The functionresolveExecutionNoteAgentIDByDIDat line 206 resolves a caller DID to an agent identifier through two independent code paths that read *differently-named columns from different tables… - 🟠 Type-unsafe CallerAgentIDKey enables silent reversion to attacker-controlled raw-header identity when any middleware writes a non-string value — turning a compile-time type error into a runtime authentication bypass (
control-plane/internal/handlers/execution_notes.go:189) — The combination of F2 (CallerAgentIDKey accepts any value type via Gin'sc.Set(key string, value any)) and F4 (executionNoteCallerAgentID falls through to raw X-Caller-Agent-ID / X-Agent-Node-ID hea…
Files with findings: control-plane/internal/handlers/execution_notes.go, control-plane/internal/server/middleware/auth.go, control-plane/internal/server/middleware/did_auth.go
All Findings by Severity
🔴 Critical (3)
- Complete ownership enforcement bypass when APIKey is empty and DID auth is off — the default configuration
control-plane/internal/server/middleware/auth.go:24 - Three-tier identity fallback in execution notes handler has no fail-closed mechanism: raw-header tier silently becomes primary identity source when all upstream auth middleware is configuration-disabled
control-plane/internal/handlers/execution_notes.go:184 - Fixing the default authentication bypass by enabling DID auth silently activates diverging DID resolution paths — a 'fix F4, expose F1' trap
control-plane/internal/handlers/execution_notes.go:184
🟠 Important (4)
- GetExecutionNotesHandler leaks execution notes to any authenticated caller with no ownership enforcement
control-plane/internal/handlers/execution_notes.go:235 - DID auth middleware provides no defense against no-auth bypass — attacker can simply omit DID headers
control-plane/internal/server/middleware/did_auth.go:177 - Two-tier DID resolution reads semantically-different field names (AgentID vs AgentNodeID) from independent tables with no structural equivalence guarantee
control-plane/internal/handlers/execution_notes.go:206 - Type-unsafe CallerAgentIDKey enables silent reversion to attacker-controlled raw-header identity when any middleware writes a non-string value — turning a compile-time type error into a runtime authentication bypass
control-plane/internal/handlers/execution_notes.go:189
Review Process Details
Dimensions Analyzed (15):
- No-Auth Mode Identity Spoofing Bypass — 4 file(s)
- CallerAgentIDKey Context Semantics Collision — 3 file(s)
- GET/POST Execution Notes Authorization Asymmetry — 2 file(s)
- DID Resolution Silent Degradation — Error vs Not-Found Conflation — 3 file(s)
- Context.Background() in GET Handler Bypasses Request Timeout — 2 file(s)
- DID resolution field-name divergence: AgentID vs AgentNodeID — 3 file(s)
- CallerAgentIDKey context type contract: non-string write → silent fallback — 3 file(s)
- No-auth middleware bypass: mechanical trace from empty APIKey to unauthenticated header reads — 4 file(s)
- Storage error propagation contract: errors.As dependency on UpdateExecutionRecord fidelity — 2 file(s)
- Context.Background() drift in GET handler: deadline propagation gap vs POST handler — 2 file(s)
- No-auth bypass of execution ownership enforcement via raw header fallback — 4 file(s)
- Triplicated caller resolution logic with diverged priority chains sharing one context namespace — 3 file(s)
- Auth error classification breaks if production storage provider wraps or replaces closure errors — 2 file(s)
- APIKeyAuth global broadcast of CallerAgentIDKey changes semantics for ALL authenticated routes — 2 file(s)
- DID-to-agent ID resolution produces different identifiers depending on resolution path (AgentID vs AgentNodeID field mismatch) — 3 file(s)
Meta-Dimension Lenses (3):
- Semantic — 5 dimension(s), 92% coverage confidence
- Mechanical — 5 dimension(s), 92% coverage confidence
- Systemic — 5 dimension(s), 92% coverage confidence
Cross-Reference & Adversary Analysis:
- 6 compound finding(s) synthesized
- 12 finding(s) adversarially tested: 12 confirmed, 0 challenged
Pipeline Stats
| Metric | Value |
|---|---|
| Duration | 4643.7s |
| Agent invocations | 65 |
| Coverage iterations | 0 |
| Estimated cost | N/A (provider does not report cost) |
| Budget exhausted | No |
| PR type | bugfix |
| Complexity | standard |
Review ID: rev_b6e41625c18a
| return nil | ||
| } | ||
|
|
||
| func executionNoteCallerAgentID(ctx context.Context, c *gin.Context, storageProvider ExecutionNoteStorage) (string, error) { |
There was a problem hiding this comment.
🔴 [CRITICAL] Raw-header fallback becomes sole identity source under default config
executionNoteCallerAgentID has 3 tiers: (1) verified DID, (2) CallerAgentIDKey context value, (3) raw X-Caller-Agent-ID/X-Agent-Node-ID headers. Tiers 1 & 2 are config-gated. Under defaults (APIKey="", did_auth_enabled=false), both are skipped — tier 3 accepts attacker-controlled headers with zero validation, flowing directly to ensureExecutionNoteOwnership.
Evidence:
routes_middleware.go:77— DID middleware not installed when disabledauth.go:26-28— APIKeyAuth no-ops whenAPIKey=="", never sets context keyexecution_notes.go:196-201— raw header read with no validation- When APIKeyAuth does run, it reads the same headers (
auth.go:118-124), so the fallback is either dead code or active bypass — never legitimate
Fix: Delete the raw-header fallback. Add a startup assertion in routes_middleware.go that refuses to register write routes when both auth methods are disabled.
Compound Analysis · confidence 95%
🤖 Reviewed by AgentField PR-AF
|
|
||
| // GetExecutionNotesHandler handles GET /api/v1/executions/:execution_id/notes | ||
| // Retrieves notes for a specific execution with optional tag filtering | ||
| func GetExecutionNotesHandler(storageProvider ExecutionNoteStorage) gin.HandlerFunc { |
There was a problem hiding this comment.
🟠 [IMPORTANT] Read path leaks execution notes — no ownership check
PR fixes IDOR on write path but GetExecutionNotesHandler (line 235) remains open: any API-key-authenticated caller can read any execution's notes by ID. Notes carry workflow state (phase progress, intermediate results, confidence reasoning — see examples/python_agent_nodes/agentic_rag/main.py:912-1097).
storageProvider.GetExecutionRecord() is called with no caller identity resolution or comparison against execution.AgentNodeID — same IDOR pattern just fixed on write.
PR mentions this is "deliberately NOT modified" but provides no rationale, no tests, no code comment. Likely oversight.
Fix: Mirror write path — resolve caller via executionNoteCallerAgentID, then ensureExecutionNoteOwnership, return 403 on mismatch. If intentional, document with code comment + test confirming open-read is the contract.
Authorization Asymmetry · confidence 95%
🤖 Reviewed by AgentField PR-AF
| return nil | ||
| } | ||
|
|
||
| func executionNoteCallerAgentID(ctx context.Context, c *gin.Context, storageProvider ExecutionNoteStorage) (string, error) { |
There was a problem hiding this comment.
🔴 [CRITICAL] Fix-F4-expose-F1 trap: enabling DID auth activates divergent resolution paths
Operator fixing F4 (default-config bypass) by setting did_auth_enabled=true unknowingly activates two DID→AgentNodeID resolution chains that read different stores:
resolveExecutionNoteAgentIDByDID(execution_notes.go:206-230) → SQL viaGetDIDDocument+ListAgentDIDsfromagent_didstableDIDService.ResolveAgentIDByDID(did_service.go:500-518) → in-memoryregistry.AgentNodes(used byMemoryPermissionMiddleware, memory_permission.go:140)
No transactional sync between SQL and in-memory registry. During register/unregister race windows, the two stores can return different AgentNodeID values for the same DID — causing false 403s or cross-agent writes.
Fix: (1) Unify DID resolution into a single source of truth. (2) Add startup assertion when write routes registered but no auth enabled. (3) Annotate resolved identity with provenance for audit logging.
Compound Analysis · confidence 82%
🤖 Reviewed by AgentField PR-AF
| return "", nil | ||
| } | ||
|
|
||
| func resolveExecutionNoteAgentIDByDID(ctx context.Context, storageProvider ExecutionNoteStorage, callerDID string) (string, error) { |
There was a problem hiding this comment.
🟠 [IMPORTANT] DID resolution reads differently-named columns from independent tables
resolveExecutionNoteAgentIDByDID (line 206) has two paths returning what should be the same identifier:
- Path 1 (line 209):
record.AgentIDfromdid_documents.agent_id(migration 019) - Path 2 (line 226):
info.AgentNodeIDfromagent_dids.agent_node_id(migration 002)
Both compared against execution.AgentNodeID. They converge today only because nodes_register.go:1027,1044 populates both from newNode.ID — pure code coincidence. No FK, unique constraint, or comment ties them. A future migration touching one but not the other silently breaks resolution, causing false 403s.
The naming divergence (AgentID vs AgentNodeID) obscures the invariant.
Fix: (1) Document the equivalence on DIDDocumentRecord.AgentID. (2) Rename to AgentNodeID for consistency. (3) Add invariant test that creates entries in both tables and verifies both paths return the same result.
Field-name divergence · confidence 85%
🤖 Reviewed by AgentField PR-AF
| return resolveExecutionNoteAgentIDByDID(ctx, storageProvider, callerDID) | ||
| } | ||
|
|
||
| if callerID, exists := c.Get(string(middleware.CallerAgentIDKey)); exists { |
There was a problem hiding this comment.
🟠 [IMPORTANT] Type-unsafe CallerAgentIDKey enables silent fallback to attacker headers
c.Set(string, any) accepts any value type — only convention enforces string. The handler at line 190 does id, ok := callerID.(string); on a non-string value, ok=false, id="", the inner block is skipped, and control falls through to raw X-Caller-Agent-ID / X-Agent-Node-ID headers (lines 196-201) with no log or error.
Attack path: an authenticated request hits a future middleware bug that writes a non-string to the key → identity silently discarded → handler reads attacker-controlled headers → ownership check passes if attacker knows the execution ID.
Proof: context_helpers_additional_test.go:19-20 shows c.Set(string(CallerAgentIDKey), 42) compiles cleanly and GetCallerAgentID returns "" — the language permits the bug.
Fix: (1) Typed setter SetCallerAgentID(c, agentID string) + forbidigo lint rule banning raw c.Set on this key. (2) When exists==true && ok==false, log WARN and return error — never fall to raw headers. (3) Remove the raw-header fallback entirely (see related comment).
Compound Analysis · confidence 70%
🤖 Reviewed by AgentField PR-AF
|
|
||
| func resolveExecutionNoteAgentIDByDID(ctx context.Context, storageProvider ExecutionNoteStorage, callerDID string) (string, error) { | ||
| if lookup, ok := storageProvider.(executionNoteDIDDocumentLookup); ok { | ||
| if record, err := lookup.GetDIDDocument(ctx, callerDID); err == nil && record != nil { |
There was a problem hiding this comment.
[HIGH] Revoked DIDs still pass ownership check
At line 208, GetDIDDocument result is accepted on err == nil && record != nil with no record.IsRevoked() check. LocalStorage.GetDIDDocument (local.go:8310-8328) returns revoked records with RevokedAt populated but err==nil. Compare GetDIDDocumentByAgentID (local.go:8345) which adds AND revoked_at IS NULL.
A caller holding a revoked DID can still resolve to their old agent ID and pass ownership checks on former executions.
Fix: After successful retrieval, if record.IsRevoked() { /* fall through or error */ }. Or push the filter into the SQL query as GetDIDDocumentByAgentID does.
Security · confidence 95%
🤖 Reviewed by AgentField PR-AF
| } | ||
|
|
||
| func resolveExecutionNoteAgentIDByDID(ctx context.Context, storageProvider ExecutionNoteStorage, callerDID string) (string, error) { | ||
| if lookup, ok := storageProvider.(executionNoteDIDDocumentLookup); ok { |
There was a problem hiding this comment.
[MEDIUM] GetDIDDocument errors swallowed; transient failures masked as not-found
resolveExecutionNoteAgentIDByDID (lines 207-211) checks err == nil && record != nil. GetDIDDocument (local.go:8323-8328) returns the same error type for both not-found and DB failures with no sentinel. Transient DB failure → falls through to ListAgentDIDs → if that also returns empty, client gets 403 "caller agent identity is required" while the real cause is infrastructure. Error info is lost.
Fix: (a) sentinel ErrDIDNotFound so only genuine not-found triggers fallback; or (b) log the GetDIDDocument error before falling through so operators can detect degradation.
Reliability · confidence 85%
🤖 Reviewed by AgentField PR-AF
|
@Luffy2208, just a heads-up that we're currently evaluating the review quality of https://github.com/Agent-Field/pr-af. If you notice any of these automated findings are off-base, noisy, or unhelpful, please let us know! Your feedback would be super helpful. |
Summary
Fixes an IDOR in the execution notes write endpoint by enforcing execution ownership before appending a note.
File-specific changes:
control-plane/internal/handlers/execution_notes.goexecution.AgentNodeID.403 execution_ownership_mismatchwhen the caller does not own the execution.control-plane/internal/server/middleware/auth.goCallerAgentIDKey.X-Caller-Agent-IDfirst, withX-Agent-Node-IDas fallback.control-plane/internal/handlers/execution_notes_test.go403.control-plane/internal/server/middleware/auth_test.gocontrol-plane/internal/handlers/coverage_handlers_90_test.goTesting
./scripts/test-all.shcd control-plane && go test ./internal/handlers ./internal/server/middlewarecd control-plane && go test ./internal/handlers ./internal/server/middleware -coverprofile=/tmp/issue-420.coverprofilecd control-plane && golangci-lint run --new-from-rev=upstream/main ./internal/handlers ./internal/server/middleware200 OK403 Forbiddennotes: []after the blocked writeNote: full repo lint currently reports pre-existing unrelated Go lint issues outside this PR’s changed files. Changed-line lint for the touched packages reports
0 issues.Checklist
CHANGELOG.md(or this change does not warrant a changelog entry).Screenshots (if UI-related)
Not UI-related.
Related issues
Fixes #420