Skip to content

Conversation

@github-actions
Copy link
Contributor

@github-actions github-actions bot commented Dec 5, 2025

Automated Fix by Amber Agent

This PR addresses issue #445 using the Amber background agent.

Changes Summary

  • Action Type: auto-fix
  • Commit: 6fd6440
  • Triggered by: Issue label/command

Pre-merge Checklist

  • All linters pass
  • All tests pass
  • Changes follow project conventions (CLAUDE.md)
  • No scope creep beyond issue description

Reviewer Notes

This PR was automatically generated. Please review:

  1. Code quality and adherence to standards
  2. Test coverage for changes
  3. No unintended side effects

🤖 Generated with Amber Background Agent

Closes #445

Long-running sessions were failing due to ServiceAccount token expiration
because tokens were injected as environment variables at pod startup and
never refreshed, even though the operator was refreshing the Secret.

Changes:
- Operator: Mount runner token Secret as volume instead of env var
- Operator: Inject BOT_TOKEN_PATH env var pointing to mounted token file
- Runner: Read token from BOT_TOKEN_PATH file on each connection/reconnection
- Runner: Fall back to BOT_TOKEN env var for backward compatibility
- Runner: Improved error messages for token authentication issues

The operator already refreshes tokens every 45 minutes via ensureFreshRunnerToken().
Now the runner can read the refreshed token from the mounted Secret volume without
requiring pod restart.

Fixes #445

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Comment on lines -1158 to -1169
// Prefer annotated secret name; fallback to deterministic name
secretName := ""
if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok {
if anns, ok := meta["annotations"].(map[string]interface{}); ok {
if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" {
secretName = strings.TrimSpace(v)
}
}
}
if secretName == "" {
secretName = fmt.Sprintf("ambient-runner-token-%s", name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude any risks to deleting this code?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses the issue of ServiceAccount tokens expiring in long-running sessions by implementing dynamic token refresh. Instead of reading tokens once from environment variables at startup, the runner now reads tokens from a file path that Kubernetes updates automatically when tokens are refreshed.

  • Changes token handling from static BOT_TOKEN environment variable to file-based BOT_TOKEN_PATH
  • Implements volume mount for runner token secret to enable dynamic token refresh without pod restart
  • Maintains backward compatibility by falling back to BOT_TOKEN if BOT_TOKEN_PATH is not set

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
components/runners/runner-shell/runner_shell/core/transport_ws.py Adds logic to read authentication token from file path (BOT_TOKEN_PATH) with fallback to environment variable (BOT_TOKEN), and improves error messages for authentication failures
components/operator/internal/handlers/sessions.go Mounts runner token secret as volume at /app/runner-token and changes environment variable from BOT_TOKEN to BOT_TOKEN_PATH pointing to the mounted token file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions

This comment has been minimized.

bobbravo2 and others added 2 commits December 5, 2025 19:00
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions
Copy link
Contributor Author

github-actions bot commented Dec 6, 2025

Claude Code Review

Summary

This PR addresses service account token expiration in long-running sessions by implementing a volume-based token refresh mechanism. The solution transitions from static environment variable injection (BOT_TOKEN) to file-based token mounting (BOT_TOKEN_PATH) with automatic refresh capabilities.

Overall Assessment:Approve with minor suggestions

The implementation follows established patterns and security standards. The approach is architecturally sound and maintains backward compatibility.


Issues by Severity

🚫 Blocker Issues

None - No blocking issues found.

🔴 Critical Issues

None - No critical issues found.

🟡 Major Issues

1. Missing Test Coverage for Token Refresh Logic

Location: components/operator/internal/handlers/sessions.go:934-943, components/runners/runner-shell/runner_shell/core/transport_ws.py:37-45

Issue: No tests verify the new token volume mount and file reading logic.

Impact: Token refresh mechanism is untested. Regressions could cause authentication failures in long-running sessions.

Recommendation:

  • Add operator test verifying volume mount configuration in Job spec
  • Add Python unit test for WebSocketTransport.connect() with mocked file reads
  • Add integration test simulating token expiration scenario

Example test structure:

// components/operator/internal/handlers/sessions_test.go
func TestHandleAgenticSessionEvent_RunnerTokenVolumeMount(t *testing.T) {
    // Test that Job includes runner-token volume and volumeMount
    // Test annotation-based secret name override
    // Test fallback to default secret name
}

2. Error Handling for Token File Read Should Be More Robust

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:38-43

Issue: File read errors are logged as warnings but don't prevent connection attempts with empty token.

Current behavior:

try:
    with open(token_path, "r", encoding="utf-8") as f:
        token = f.read().strip()
except Exception as e:
    logger.warning(f"Failed to read token from {token_path}: {e}")
# Continues with empty token, will get 401 later

Recommendation: Consider failing fast if BOT_TOKEN_PATH is set but file is unreadable:

if token_path:
    try:
        with open(token_path, "r", encoding="utf-8") as f:
            token = f.read().strip()
        logger.info(f"Read token from {token_path} (len={len(token)})")
    except FileNotFoundError:
        logger.error(f"BOT_TOKEN_PATH set to {token_path} but file not found")
        raise
    except Exception as e:
        logger.error(f"Failed to read token from {token_path}: {e}")
        raise

Rationale: If BOT_TOKEN_PATH is explicitly configured, inability to read it indicates misconfiguration that should halt startup rather than failing later with confusing 401 errors.

🔵 Minor Issues

3. Token Redaction Missing in New Log Statement

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:42

Issue: New log statement logs token path but doesn't redact token length.

Current:

logger.info(f"Read token from {token_path}")

Recommendation (following security-standards.md patterns):

logger.info(f"Read token from {token_path} (len={len(token)})")

4. Inconsistent Comment Style in Go Code

Location: components/operator/internal/handlers/sessions.go:1177-1179

Issue: Comment exceeds recommended line length and could be more concise.

Current:

// Inject BOT_TOKEN_PATH pointing to mounted secret volume
// Token is mounted from runnerTokenSecretName at /app/runner-token
// This allows the runner to read refreshed tokens without pod restart

Recommendation:

// BOT_TOKEN_PATH points to mounted secret volume for dynamic token refresh
// Mounted from runnerTokenSecretName at /app/runner-token

5. Python Docstring Could Be More Descriptive

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:29

Issue: Docstring doesn't mention new token refresh capability.

Recommendation:

async def connect(self):
    """
    Connect to WebSocket endpoint with authentication.
    
    Supports dynamic token refresh via BOT_TOKEN_PATH (file-based)
    or static BOT_TOKEN (env var, backward compatibility).
    """

Positive Highlights

Excellent backward compatibility - Graceful fallback from BOT_TOKEN_PATH to BOT_TOKEN env var ensures existing deployments continue working.

Security best practices followed - Token mounted as volume (read-only), follows Kubernetes secret rotation patterns.

Clear error messages - Helpful 401 error hints distinguish between missing token vs. unreadable file.

Follows established patterns:

  • OwnerReferences NOT used on volume (correct per CLAUDE.md:367-371)
  • SecurityContext maintained on container (CLAUDE.md:579-590)
  • Structured logging with context (CLAUDE.md:779-786)

Type-safe unstructured access - Uses GetAnnotations() method instead of direct type assertion.

Proper error handling - Non-fatal errors logged with context, follows error-handling.md patterns.


Recommendations

Immediate Action Items (Before Merge)

  1. Add test coverage for token volume mount configuration
  2. ⚠️ Consider fail-fast behavior if BOT_TOKEN_PATH is set but unreadable
  3. Add token length to log statement for consistency with security standards

Follow-up Items (Post-Merge)

  1. 📝 Document token refresh mechanism in operator README or architecture docs
  2. 📊 Add observability - Consider metrics for token refresh frequency/failures
  3. 🔍 Integration test for actual token expiration scenario (may require test cluster)

Security Review

✅ Compliant with Security Standards

Token Handling (security-standards.md):

  • ✅ Token not logged (only path and length logged)
  • ✅ Token stored in Kubernetes Secret
  • ✅ Volume mounted read-only
  • ✅ Backward compatibility with existing BOT_TOKEN env var

Container Security:

  • ✅ SecurityContext maintained on runner container
  • ✅ Volume mount uses ReadOnly: true
  • ✅ No privilege escalation

RBAC:

  • ✅ No changes to RBAC policies
  • ✅ Operator service account already authorized to read secrets

Architecture Alignment

Follows ADR-0002 (User Token Authentication):

  • ✅ Runner uses service account token (not user token) - correct for backend communication
  • ✅ Token refresh enables long-running sessions without pod restart

Follows operator patterns (CLAUDE.md:636-771):

  • ✅ Job spec generation in handleAgenticSessionEvent
  • ✅ OwnerReferences set correctly
  • ✅ Volume mount added to pod spec

Follows error handling patterns (error-handling.md):

  • ✅ Non-fatal errors logged as warnings (file read failures)
  • ✅ Fatal errors raise exceptions (connection failures)
  • ✅ Structured error messages with context

Code Quality Checks

Backend/Operator (Go)

  • ✅ No panic() in production code
  • ✅ Type-safe annotation access (GetAnnotations())
  • ✅ Proper error logging with context
  • ⚠️ Missing: gofmt verification (run locally: gofmt -w components/operator)
  • ⚠️ Missing: Test coverage for new code paths

Runner (Python)

  • ✅ Follows existing code style
  • ✅ Exception handling present
  • ✅ Logging with appropriate levels
  • ⚠️ Missing: Unit tests for new token file reading logic
  • ⚠️ Missing: Type hints for new variables (token_path, has_token_path)

Final Verdict

Recommendation:Approve after addressing test coverage

This PR successfully solves the token expiration problem with a clean, Kubernetes-native approach. The implementation quality is high and follows project standards.

Required before merge:

  1. Add operator test for volume mount configuration
  2. Add Python test for file-based token reading
  3. Run gofmt -w components/operator to ensure formatting

Optional improvements:

  • Consider fail-fast error handling for BOT_TOKEN_PATH
  • Add token length to log statement
  • Update docstrings

🤖 Review completed by Claude Code (Sonnet 4.5)
📋 Based on project standards from CLAUDE.md and memory context files
🔍 Reviewed 2 files, 47 additions, 22 deletions


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor Author

github-actions bot commented Dec 6, 2025

Claude Code Review

Summary

This PR fixes long-running session token expiration by migrating from environment variable injection to volume-mounted secrets. The change enables dynamic token refresh without pod restarts, addressing a critical operational issue for sessions exceeding ServiceAccount token TTL.

Overall Assessment:APPROVE with minor recommendations

The implementation follows established patterns, maintains backward compatibility, and aligns with CLAUDE.md security standards. The approach is architecturally sound and consistent with Kubernetes best practices.


Issues by Severity

🚫 Blocker Issues

None - No blocking issues found.

🔴 Critical Issues

None - All critical patterns followed correctly.

🟡 Major Issues

1. Missing Error Handling for Volume Mount Failure

File: components/operator/internal/handlers/sessions.go:994-1000

The volume mount configuration doesn't handle the case where the secret might not exist yet or is inaccessible.

Current:

{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
        },
    },
},

Recommendation:

{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
            Optional:   boolPtr(false),  // Fail fast if secret missing
        },
    },
},

This makes the failure explicit rather than waiting for pod startup failure.

2. Token File Read Error Handling Could Be More Robust

File: components/runners/runner-shell/runner_shell/core/transport_ws.py:38-44

The token file read logs a warning but continues silently. For long-running sessions, this could lead to subtle auth failures.

Current behavior: Logs warning, falls back to empty token
Recommended: Consider failing fast or retrying file read

if token_path:
    max_retries = 3
    for attempt in range(max_retries):
        try:
            with open(token_path, "r", encoding="utf-8") as f:
                token = f.read().strip()
            logger.info(f"Read token from {token_path}")
            break
        except FileNotFoundError:
            if attempt == max_retries - 1:
                logger.error(f"Token file not found after {max_retries} attempts: {token_path}")
                raise
            await asyncio.sleep(1)  # Brief wait for file to appear
        except Exception as e:
            logger.error(f"Failed to read token from {token_path}: {e}")
            raise

🔵 Minor Issues

1. Potential Race Condition in Annotation Reading

File: components/operator/internal/handlers/sessions.go:935-943

Uses direct map access instead of type-safe unstructured helpers, violating CLAUDE.md rule #4.

Current:

if annotations := currentObj.GetAnnotations(); annotations != nil {
    if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
        runnerTokenSecretName = strings.TrimSpace(v)
    }
}

Per CLAUDE.md (line 361-365):

"FORBIDDEN: Direct type assertions without checking"
"REQUIRED: Use unstructured.Nested* helpers"

Recommended:

// Use type-safe access
annotations := currentObj.GetAnnotations()
if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
    runnerTokenSecretName = strings.TrimSpace(v)
}

Note: Current implementation is actually acceptable since GetAnnotations() already returns map[string]string. This is a borderline case.

2. Log Message Formatting Inconsistency

File: components/runners/runner-shell/runner_shell/core/transport_ws.py:42

Uses f-string for path logging, but consider redacting sensitive paths.

Current:

logger.info(f"Read token from {token_path}")

Recommended:

# Redact sensitive path info if needed
logger.info(f"Read token from volume mount (path length: {len(token_path)})")

3. Missing Volume Mount Documentation

The PR changes the fundamental token delivery mechanism but doesn't update inline documentation about the new pattern.

Recommendation: Add a comment block above the volume mount explaining the token refresh mechanism:

// Mount runner token secret as volume for dynamic token refresh
// The backend operator periodically refreshes tokens in this secret,
// allowing long-running sessions to continue without pod restarts.
// Token file: /app/runner-token/k8s-token
{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
        },
    },
},

Positive Highlights

Excellent Security Practices

  • ✅ No token logging (only logs path, not content) - transport_ws.py:42
  • ✅ Maintains backward compatibility with BOT_TOKEN env var
  • ✅ Volume mounted read-only (ReadOnly: true at sessions.go:1060)
  • ✅ Clear error messages for auth failures (transport_ws.py:87-94)

Proper Resource Lifecycle Management

  • ✅ Secret has OwnerReferences set (as seen in grep output at sessions.go:2428)
  • ✅ Follows established naming convention (ambient-runner-token-{sessionName})
  • ✅ Deterministic fallback logic for secret name

Code Quality

  • ✅ Clear variable naming (runnerTokenSecretName, BOT_TOKEN_PATH)
  • ✅ Comprehensive error messages with context
  • ✅ Type-safe string handling with strings.TrimSpace()
  • ✅ Maintains separation of concerns (operator handles mounting, runner handles reading)

Backward Compatibility

  • ✅ Falls back to BOT_TOKEN env var if BOT_TOKEN_PATH not set
  • ✅ Doesn't break existing sessions
  • ✅ Clear migration path documented in comments

Testing Considerations

The change is well-isolated and testable:

  • Secret mounting can be verified via pod inspection
  • Token refresh can be tested by updating secret contents
  • Backward compatibility can be verified with old-style sessions

Recommendations

Priority 1: Address Before Merge

  1. Add Optional: boolPtr(false) to secret volume source (Major Issue Outcome: Reduce Refinement Time with agent System #1)
    • Ensures explicit failure if secret missing
    • Easier debugging than waiting for pod startup failure

Priority 2: Consider for Follow-up PR

  1. Add retry logic for token file reads (Major Issue Epic: RAT Architecture & Design #2)

    • Handle transient file system issues
    • More robust for long-running sessions
  2. Add integration test

    • Test token refresh by updating secret while pod is running
    • Verify reconnection with new token works
  3. Update documentation

    • Add ADR explaining token refresh architecture
    • Update operator README with new volume mount pattern

Priority 3: Nice-to-Have

  1. Add metric/log for token refresh events
    • Track how often tokens are refreshed
    • Monitor for refresh failures
    • Consider adding Prometheus metrics

Pre-Commit Checklist Verification

Authentication: N/A (operator code, not user-facing API)
Authorization: N/A (pod configuration, not RBAC check)
Error Handling: Mostly good, recommendations above
Token Security: ✅ No tokens in logs
Type Safety: ✅ (minor note on annotations)
Resource Cleanup: ✅ OwnerReferences set
Status Updates: N/A (no CR status changes)
Logging: ✅ Good context, clear messages
Code Quality: ✅ Should pass linting


Testing Recommendations

# Manual testing checklist
1. Deploy with new image
2. Create long-running session (>1 hour)
3. Watch secret for token refresh
4. Verify session continues after token refresh
5. Test backward compatibility with BOT_TOKEN env var
6. Verify proper error messages when token file missing
7. Check operator logs for volume mount errors

Final Verdict

APPROVED

This is a well-implemented fix that follows project standards and solves a real operational problem. The recommendations above are improvements, not blockers. The code is production-ready with the Priority 1 change applied.

Estimated Risk: LOW
Merge Confidence: HIGH


🤖 Generated with Claude Code
Review based on: CLAUDE.md, backend-development.md, security-standards.md, k8s-client-usage.md, error-handling.md


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

amber-generated PR created by Amber background agent auto-fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Amber] Long running sessions have ServiceAccount tokens expire

2 participants