-
Notifications
You must be signed in to change notification settings - Fork 33
docs: add ADR-0006 for agent injection architecture #364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
|
@cgwalters hey, here is what I was talking about bring-your-own: https://github.com/ambient-code/workflows/tree/main/workflows/template-workflow#directory-structure All that stuff gets cloned into your session PVC to be made available to claude and agents. I am using Workflows as the top level thing that teams use to "onboard" to the system. Workflows don't include a way to customize the runner image, either by having a custom image or customizing the system's base image. So... @MichaelClifford any concerns with this? |
d5715d0 to
97110ef
Compare
|
FTR I'm keeping this draft for now, I'm still refining this and playing with the experimental implementation. |
Claude Code ReviewSummaryThis PR introduces ADR-0006 which documents a workspace container architecture that enables users to provide custom container images while running Claude Code agent sessions. The implementation adds support for a two-container pattern with the agent running in one container and executing commands in a separate user-provided workspace container via kubectl exec. Overall Assessment: Strong architectural documentation with a working implementation. However, there are several critical security vulnerabilities and inconsistencies between the ADR documentation and actual implementation that must be addressed before merge. Issues by SeverityBlocker Issues
Critical Issues
Major Issues
Minor Issues
Positive Highlights
RecommendationsImmediate Actions (Required for Merge)
Summary: Strong architectural work with good intent, but must address critical security issues before merge. The command injection and RBAC cleanup issues are blockers. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Should we try to separate a pull request adding an ADR from one adding an implementation? (Kind of like RFC vs code?) |
|
OK I completely reworked this as I decided it's just way simpler to skip things like custom MCP servers and basically give the agent sugar on top of I also dropped out the PoC implementation work I had; feels cleaner to keep that as a separate PR. Going to have the agent try updating the PoC to match this latest implementation proposal to see how it works/feels. |
Claude Code ReviewSummaryThis ADR proposes a well-thought-out architecture for injecting agent orchestration capabilities into user-provided container images. The document is comprehensive, detailed, and demonstrates strong technical reasoning. The proposed solution (agent-orchestrated pod spawning via However, there are several security, documentation, and architectural concerns that should be addressed before merging, particularly around session isolation, RBAC scope clarity, and alignment with the ADR template format. Issues by Severity🔴 Critical Issues1. Session Isolation Security Model is Underspecified Location: Lines 412-471 (Session Isolation section) Issue: The ADR acknowledges that the current "soft isolation" model allows sessions within the same project namespace to interfere with each other (line 433-438), but treats this as acceptable without sufficient justification. Security implications:
The ADR states (line 471): "this is not a security boundary - an agent can bypass Problem: If this isn't a security boundary, what is? The platform allows arbitrary code execution (Claude Code agents), so isolation between sessions is critical. A malicious or compromised agent in one session could disrupt or exfiltrate data from other sessions in the same project. Recommendation:
2. RBAC Permissions Scope Ambiguity Location: Lines 320-376 (Agent Container Requirements) Issue: The RBAC section (line 322-368) shows namespace-scoped permissions but doesn't clearly state which namespace the ServiceAccount operates in. Confusion points:
Question: Is "session namespace" the same as "project namespace"? The ADR uses both terms inconsistently. From Session Isolation section (line 430): "All AgenticSessions within a project currently run in the same Kubernetes namespace" This confirms they're the same, but the term "session namespace" is misleading. Recommendation:
3. No Validation Section Location: Missing (should follow Implementation Notes per template) Issue: ADR-0001 and ADR-0002 both include "Validation" sections with success metrics and lessons learned. This ADR has no validation section, even though it's marked as Status: Proposed (line 3). Problem:
Recommendation: 🟡 Major Issues4. Missing Deciders and Technical Story Metadata Location: Lines 3-5 Issue: The ADR metadata is incomplete compared to the template (template.md:2-6). Missing Other ADRs include this:
Recommendation: Add missing metadata fields per ADR template. 5. Security Context Recommendations are Weak Location: Lines 385-410 (Security Considerations - Pod Escape Mitigation) Issue: Security settings are marked as "SHOULD" (line 386) rather than "MUST", and the example is buried in a reference (line 398). Current wording (line 386-391):
Problem:
Recommendation: Make security contexts MANDATORY by default, enforced by 6. ConfigMap Mounting Details Missing Location: Lines 64-88, 166-171 (ConfigMap creation and mounting) Issue: The ADR shows ConfigMap creation but doesn't address:
Recommendation: Add ConfigMap lifecycle clarification (immutability, size limits, update behavior). 7. Error Handling and Failure Modes Not Documented Location: Throughout implementation sections Issue: The ADR doesn't address failure scenarios:
Recommendation: Add "Error Handling" subsection documenting failure modes and graceful degradation. 🔵 Minor Issues8. Inconsistent YAML Formatting Location: Lines 65-88, 132-184, 187-215, etc. Issue: Some YAML blocks use comments inconsistently, and indentation varies. Recommend consistent comment style. 9. Alternatives Section Could Be More Concise Location: Lines 668-931 (Alternatives Considered) Issue: The "Purely Ephemeral Pods" alternative (lines 762-931) is 169 lines of detailed Python code. Consider collapsing detailed implementation into expandable section. 10. yq Dependency Not Mentioned in Requirements Location: Lines 98-106 (Agent uses template) Issue: The example shows using 11. Potential Name Collision for Default Workspace Pod Location: Line 252 (Default workspace pod naming) Issue: Default workspace pod is Recommendation: 12. No Discussion of Resource Limits on Workspace Pods Location: User Configuration Examples (lines 617-666) Issue: Examples show resource limits but don't discuss default limits or why they're important. Add resource management guidance with recommended defaults. Positive Highlights
RecommendationsBefore Merging (Blockers)
Strongly Recommended (Pre-Acceptance)
Nice-to-Have (Post-Acceptance)
Final AssessmentThis is a high-quality ADR with strong technical reasoning and comprehensive coverage. The proposed architecture is sound and well-aligned with Kubernetes patterns. However, the session isolation security model needs to be resolved before this can be accepted. The current "soft isolation" approach creates a significant security gap that must either be:
Once the critical security concerns are addressed and the Validation section is added, this ADR will be ready for acceptance. Recommendation: Request Changes (address Critical Issues #1-3 and Major Issue #4, then re-review) |
|
I'm continuing to iterate on this and experiment with a PoC implementation, I'll post an update soon |
Increase flexibility by allowing users to configure the container used by the agent's tools (separate from the container running the agent itself). Assisted-by: Claude Code (Opus 4.5)
Claude Code ReviewSummaryThis PR implements ADR-0006: Workspace Container Architecture, introducing a significant architectural enhancement that separates agent execution from the workspace environment. The implementation enables users to provide custom container images while maintaining security through a session-proxy sidecar pattern. Overall Assessment: Strong implementation with excellent security considerations. The code follows most established patterns, but there are several areas requiring attention before merge. Issues by Severity🚫 Blocker Issues1. Backend Handler Missing User Token Authentication (workspace_container.go) Location: Issue: Both // CURRENT CODE (lines 39-43)
_, reqDyn := GetK8sClientsForRequest(c)
if reqDyn == nil {
c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing authentication token"})
return
}Why it's a blocker: According to security standards (security-standards.md:13-18), we must check both clients. The current code only checks Fix: Check both return values or assign reqK8s, reqDyn := GetK8sClientsForRequest(c)
if reqK8s == nil || reqDyn == nil {
c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing authentication token"})
return
}Reference: 2. Unsafe Type Assertions in Backend Handler (workspace_container.go:62-91) Location: Issue: Direct type assertions without checking, violating Critical Rule #4: // Line 62 - UNSAFE
spec, _, _ := unstructured.NestedMap(obj.Object, "spec")
// Line 72-90 - Direct type assertion without checking 'ok'
if image, ok := wcMap["image"].(string); ok {
settings.Image = image
}Why it's a blocker: CLAUDE.md:358-362 explicitly forbids this pattern. If Fix: Check the spec, found, err := unstructured.NestedMap(obj.Object, "spec")
if !found || err != nil {
c.JSON(http.StatusOK, WorkspaceContainerSettings{})
return
}Reference: 🔴 Critical Issues3. Session Proxy Missing Rate Limiting Location: Issue: The Recommendation:
Security Impact: Medium - Could lead to resource exhaustion attacks. 4. Frontend Component Violates Type Standards Location: Issue: Using const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY", ...] as const, []);Why it's critical: Frontend standards require Fix: Define as a module-level constant type: type FixedKey =
| "ANTHROPIC_API_KEY"
| "GIT_USER_NAME"
| "GIT_USER_EMAIL"
| "GITHUB_TOKEN"
| "JIRA_URL"
| "JIRA_PROJECT"
| "JIRA_EMAIL"
| "JIRA_API_TOKEN"
| "GITLAB_TOKEN"
| "GITLAB_INSTANCE_URL";
const FIXED_KEYS: readonly FixedKey[] = [
"ANTHROPIC_API_KEY",
// ... rest
];Reference: 5. Operator Missing Error Context in Status Updates Location: Issue: When Vertex secret copy fails, the error message doesn't include the actual error: if err := copySecretToNamespace(copyCtx, ambientVertexSecret, sessionNamespace, currentObj); err != nil {
return fmt.Errorf("failed to copy %s secret from %s to %s (CLAUDE_CODE_USE_VERTEX=1): %w",
types.AmbientVertexSecretName, operatorNamespace, sessionNamespace, err)
}Good: Using But: The session status should be updated before returning the error (Pattern 2 in error-handling.md:146-160). Fix: Update session status to reflect error state: if err := copySecretToNamespace(...); err != nil {
updateAgenticSessionStatus(sessionNamespace, name, map[string]interface{}{
"phase": "Error",
"message": fmt.Sprintf("Failed to copy Vertex secret: %v", err),
})
return fmt.Errorf("failed to copy %s secret: %w", types.AmbientVertexSecretName, err)
}Reference: 🟡 Major Issues6. Operator Session Handler Extremely Long (sessions.go) Location: Issue: The Impact: Difficult to review, test, and maintain. Makes it hard to isolate bugs. Recommendation: Refactor into smaller functions:
Reference: Go best practices recommend functions under 50-100 lines. 7. Frontend Component Exceeds Line Limit Location: Issue: Component is 695 lines, far exceeding the 200-line guideline. Fix: Extract sub-components:
Reference: 8. Missing Security Context Validation in Session Proxy Location: Issue: The proxy executes commands in the workspace pod but doesn't verify the pod's SecurityContext before executing. A misconfigured workspace pod could run as root. Recommendation: Add validation in // Check pod security context
if pod.Spec.SecurityContext == nil ||
pod.Spec.SecurityContext.RunAsNonRoot == nil ||
!*pod.Spec.SecurityContext.RunAsNonRoot {
return "", fmt.Errorf("workspace pod %s does not have required SecurityContext", pod.Name)
}Security Impact: Medium - Defense in depth against privilege escalation. 🔵 Minor Issues9. Inconsistent Comment Style in Session Proxy Location: Issue: Package comment uses single-line style, but multi-line would be more appropriate for the detailed description. Fix: Use standard Go doc comment format: // Package proxy implements the streaming exec API for workspace containers.
//
// The proxy runs as a sidecar in the runner pod and handles kubectl exec
// operations to workspace pods, providing streaming command execution over HTTP.
package proxy10. Magic Numbers in Workspace MCP Server Location: Issue: Hardcoded timeout constants without explanation: DEFAULT_TIMEOUT = 300
MAX_TIMEOUT = 1800 # 30 minutes maxRecommendation: Make configurable via environment variables with fallbacks: DEFAULT_TIMEOUT = int(os.environ.get("WORKSPACE_DEFAULT_TIMEOUT", "300"))
MAX_TIMEOUT = int(os.environ.get("WORKSPACE_MAX_TIMEOUT", "1800"))11. Frontend React Query Hook Missing Error Handling Location: Issue: The mutation doesn't have Recommendation: Add error logging: onError: (error) => {
console.error('Failed to update workspace container settings:', error);
},Positive Highlights✅ Excellent ADR Documentation: ADR-0006 is comprehensive, well-structured, and provides clear rationale for architectural decisions. The "Alternatives Considered" section is particularly valuable. ✅ Security-First Design: The session-proxy sidecar pattern with privilege separation (token isolation, localhost-only API) demonstrates strong security awareness. ✅ Proper Owner References: Operator correctly sets OwnerReferences on workspace pods and PVCs for automatic cleanup (sessions.go:232-242). ✅ Frontend Follows React Query Patterns: New hooks ( ✅ Type-Safe Unstructured Access in Most Places: Operator code generally uses ✅ Streaming Implementation: Session proxy's streaming exec with chunked transfer encoding provides excellent real-time feedback (handler.go:125-150). ✅ Comprehensive CRD Schema: ProjectSettings CRD extension includes proper validation and optional fields for workspace container customization (projectsettings-crd.yaml). ✅ Backward Compatibility: Implementation maintains compatibility with existing sessions while enabling new workspace container mode. RecommendationsHigh Priority (Before Merge)
Medium Priority (Can Address in Follow-up)
Nice to Have
Final VerdictOverall: 🟡 Needs Work Before Merge This is a well-architected feature with strong security foundations, but the blocker issues must be addressed to comply with established backend development standards. The implementation demonstrates good understanding of K8s patterns and streaming APIs. Estimated effort to address blockers: 2-3 hours Recommendation: Request changes for blockers #1-2, strongly recommend addressing critical issues #3-5 before merge. Review conducted by Claude Code following standards in CLAUDE.md, .claude/context/, and .claude/patterns/ 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
| @@ -0,0 +1,7 @@ | |||
| apiVersion: v1 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry, some minikube-related stuff leaked into this PR too beacuse I'd been developing on top of #367 first. Will fix.
| @@ -0,0 +1,149 @@ | |||
| # Claude Agent SDK Patches | |||
|
|
|||
| This directory contains patches applied to the `claude-agent-sdk` package to ensure compliance with ADR-0006 (Agent Injection Architecture). | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was good for PoC but now we need to actually try upstreaming these changes without patching like this.
| @@ -0,0 +1,233 @@ | |||
| #!/bin/bash | |||
| # Minikube deployment script for Ambient Code Platform | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also remove this
Claude Code Review - PR #364SummaryADR-0006: Workspace Container Architecture implements separation of agent execution from tool execution environment (~3,000 lines across backend, frontend, operator, session-proxy). Assessment: Well-architected with strong security isolation, BUT has critical security gaps requiring fixes before merge. Blocker Issues (Must Fix)
Positive Highlights✅ Excellent two-pod architecture with localhost proxy RecommendationREQUEST CHANGES - Do not merge until blockers #1-8 resolved. Architecture is sound but implementation needs hardening. See full review with code examples and patterns: [detailed analysis above] Review by Claude Code (Sonnet 4.5) using memory system context 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Implement workspace container architecture per ADR-0006: - Add session-proxy sidecar for streaming command execution - Add workspace MCP server for agent command interface - Add workspace container settings to ProjectSettings CRD - Add backend handlers for workspace container configuration - Add frontend UI for workspace container settings - Update operator to create workspace pods with user-configurable images Security improvements from review: - Add rate limiting to session proxy (max concurrent execs) - Add exec timeout (configurable via environment) - Fix authentication checks in backend handlers - Add proper error handling for unstructured K8s access - Update session status on Vertex secret copy failure Assisted-by: Claude Code (Opus 4.5)
Claude Code ReviewSummaryThis PR implements ADR-0006, introducing workspace container architecture with agent isolation. The implementation spans backend API endpoints, frontend UI, operator logic, and a new session-proxy sidecar. The code quality is generally strong with good adherence to established patterns, but there are several critical security and architectural concerns that should be addressed. Overall Assessment: The PR demonstrates solid architectural thinking and mostly follows the project's established patterns. However, there are blocker-level security issues around RBAC validation and service account usage that must be fixed before merge. 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Blocker Issues (Must Fix Before Merge)1. Missing RBAC Validation in Backend HandlersLocation: Both Required: Add SelfSubjectAccessReview checks before Get/Update operations per backend-development.md patterns. 2. User-Scoped Client Used for CR Write OperationsLocation: The handler uses user-scoped client ( Required: Switch to 3. Missing OwnerReferences on Workspace PodLocation: Operator session handler Per CLAUDE.md Critical Rule #5, all child resources must have OwnerReferences set for automatic cleanup. Without OwnerReferences, workspace pods won't be cleaned up when the AgenticSession is deleted, causing resource leaks. Required: Verify OwnerReferences are set on workspace pods in operator. |
|
OK, there's now conflicts with #367 on git main (which I was using to develop this). I may try testing it this out instead on a full OCP (or maybe https://github.com/minc-org/minc instead of minikube), but anyways this is probably far enough along that folks can comment on the overall architecture idea. Does anyone think it might make sense to just always have a complete "sub-namespace" per workspace where the agent can run arbitrary pods by default say? Or maybe we go all the way to requiring https://github.com/projectcapsule/capsule (feels a bit like overkill by default, but we should certainly make it easy to integrate with that, as well as being easily able to provision completely distinct cloud resources that are lifecycle-bound; lots of precedent in that space). |
Critical Issues (Should Fix Before Merge)4. Type Assertions Without Safety ChecksLocation: Direct type assertions like Fix: Use 5. Frontend:
|
Major Issues (Address in Follow-Up)7. Frontend: Manual State SynchronizationLocation: Manual 8. Hardcoded String LiteralsLocation: The singleton ProjectSettings CR name 9. Missing Loading State for Workspace SettingsLocation: The workspace container section doesn't show a loading state while fetching settings. Per DESIGN_GUIDELINES.md, all data operations should handle loading states. Minor Issues
|
Positive Highlights
|
RecommendationsMust Fix Before Merge (Blockers)
Should Fix Before Merge (Critical)
Testing Recommendations
ConclusionThis is a significant architectural enhancement with strong design thinking. The ADR is excellent and the implementation follows most project patterns. However, the critical security issues around RBAC validation and service account usage must be fixed before merge to maintain the security guarantees established in the codebase. Recommendation: Request changes for blockers #1, #2, and verification of #3. The rest can be addressed in follow-up PRs if time-sensitive. |
Add architectural decision record documenting how to support user-provided container images with agent code injection. This addresses the requirement to let users run agentic sessions in their own container images with their own tools without forcing them to extend our base image.