[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446

github-actions · 2025-12-05T22:56:19Z

Automated Fix by Amber Agent

This PR addresses issue #445 using the Amber background agent.

Changes Summary

Action Type: auto-fix
Commit: 6fd6440
Triggered by: Issue label/command

Pre-merge Checklist

All linters pass
All tests pass
Changes follow project conventions (CLAUDE.md)
No scope creep beyond issue description

Reviewer Notes

This PR was automatically generated. Please review:

Code quality and adherence to standards
Test coverage for changes
No unintended side effects

🤖 Generated with Amber Background Agent

Closes #445

Long-running sessions were failing due to ServiceAccount token expiration because tokens were injected as environment variables at pod startup and never refreshed, even though the operator was refreshing the Secret. Changes: - Operator: Mount runner token Secret as volume instead of env var - Operator: Inject BOT_TOKEN_PATH env var pointing to mounted token file - Runner: Read token from BOT_TOKEN_PATH file on each connection/reconnection - Runner: Fall back to BOT_TOKEN env var for backward compatibility - Runner: Improved error messages for token authentication issues The operator already refreshes tokens every 45 minutes via ensureFreshRunnerToken(). Now the runner can read the refreshed token from the mounted Secret volume without requiring pod restart. Fixes #445 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

bobbravo2 · 2025-12-05T23:44:10Z

components/operator/internal/handlers/sessions.go

-								// Prefer annotated secret name; fallback to deterministic name
-								secretName := ""
-								if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok {
-									if anns, ok := meta["annotations"].(map[string]interface{}); ok {
-										if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" {
-											secretName = strings.TrimSpace(v)
-										}
-									}
-								}
-								if secretName == "" {
-									secretName = fmt.Sprintf("ambient-runner-token-%s", name)
-								}


@claude any risks to deleting this code?

Copilot

Pull request overview

This PR addresses the issue of ServiceAccount tokens expiring in long-running sessions by implementing dynamic token refresh. Instead of reading tokens once from environment variables at startup, the runner now reads tokens from a file path that Kubernetes updates automatically when tokens are refreshed.

Changes token handling from static BOT_TOKEN environment variable to file-based BOT_TOKEN_PATH
Implements volume mount for runner token secret to enable dynamic token refresh without pod restart
Maintains backward compatibility by falling back to BOT_TOKEN if BOT_TOKEN_PATH is not set

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
components/runners/runner-shell/runner_shell/core/transport_ws.py	Adds logic to read authentication token from file path (`BOT_TOKEN_PATH`) with fallback to environment variable (`BOT_TOKEN`), and improves error messages for authentication failures
components/operator/internal/handlers/sessions.go	Mounts runner token secret as volume at `/app/runner-token` and changes environment variable from `BOT_TOKEN` to `BOT_TOKEN_PATH` pointing to the mounted token file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

components/runners/runner-shell/runner_shell/core/transport_ws.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions · 2025-12-06T00:02:18Z

Claude Code Review

Summary

This PR addresses service account token expiration in long-running sessions by implementing a volume-based token refresh mechanism. The solution transitions from static environment variable injection (BOT_TOKEN) to file-based token mounting (BOT_TOKEN_PATH) with automatic refresh capabilities.

Overall Assessment: ✅ Approve with minor suggestions

The implementation follows established patterns and security standards. The approach is architecturally sound and maintains backward compatibility.

Issues by Severity

🚫 Blocker Issues

None - No blocking issues found.

🔴 Critical Issues

None - No critical issues found.

🟡 Major Issues

1. Missing Test Coverage for Token Refresh Logic

Location: components/operator/internal/handlers/sessions.go:934-943, components/runners/runner-shell/runner_shell/core/transport_ws.py:37-45

Issue: No tests verify the new token volume mount and file reading logic.

Impact: Token refresh mechanism is untested. Regressions could cause authentication failures in long-running sessions.

Recommendation:

Add operator test verifying volume mount configuration in Job spec
Add Python unit test for WebSocketTransport.connect() with mocked file reads
Add integration test simulating token expiration scenario

Example test structure:

// components/operator/internal/handlers/sessions_test.go
func TestHandleAgenticSessionEvent_RunnerTokenVolumeMount(t *testing.T) {
    // Test that Job includes runner-token volume and volumeMount
    // Test annotation-based secret name override
    // Test fallback to default secret name
}

2. Error Handling for Token File Read Should Be More Robust

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:38-43

Issue: File read errors are logged as warnings but don't prevent connection attempts with empty token.

Current behavior:

try:
    with open(token_path, "r", encoding="utf-8") as f:
        token = f.read().strip()
except Exception as e:
    logger.warning(f"Failed to read token from {token_path}: {e}")
# Continues with empty token, will get 401 later

Recommendation: Consider failing fast if BOT_TOKEN_PATH is set but file is unreadable:

if token_path:
    try:
        with open(token_path, "r", encoding="utf-8") as f:
            token = f.read().strip()
        logger.info(f"Read token from {token_path} (len={len(token)})")
    except FileNotFoundError:
        logger.error(f"BOT_TOKEN_PATH set to {token_path} but file not found")
        raise
    except Exception as e:
        logger.error(f"Failed to read token from {token_path}: {e}")
        raise

Rationale: If BOT_TOKEN_PATH is explicitly configured, inability to read it indicates misconfiguration that should halt startup rather than failing later with confusing 401 errors.

🔵 Minor Issues

3. Token Redaction Missing in New Log Statement

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:42

Issue: New log statement logs token path but doesn't redact token length.

Current:

logger.info(f"Read token from {token_path}")

Recommendation (following security-standards.md patterns):

logger.info(f"Read token from {token_path} (len={len(token)})")

4. Inconsistent Comment Style in Go Code

Location: components/operator/internal/handlers/sessions.go:1177-1179

Issue: Comment exceeds recommended line length and could be more concise.

Current:

// Inject BOT_TOKEN_PATH pointing to mounted secret volume
// Token is mounted from runnerTokenSecretName at /app/runner-token
// This allows the runner to read refreshed tokens without pod restart

Recommendation:

// BOT_TOKEN_PATH points to mounted secret volume for dynamic token refresh
// Mounted from runnerTokenSecretName at /app/runner-token

5. Python Docstring Could Be More Descriptive

Location: components/runners/runner-shell/runner_shell/core/transport_ws.py:29

Issue: Docstring doesn't mention new token refresh capability.

Recommendation:

async def connect(self):
    """
    Connect to WebSocket endpoint with authentication.
    
    Supports dynamic token refresh via BOT_TOKEN_PATH (file-based)
    or static BOT_TOKEN (env var, backward compatibility).
    """

Positive Highlights

✅ Excellent backward compatibility - Graceful fallback from BOT_TOKEN_PATH to BOT_TOKEN env var ensures existing deployments continue working.

✅ Security best practices followed - Token mounted as volume (read-only), follows Kubernetes secret rotation patterns.

✅ Clear error messages - Helpful 401 error hints distinguish between missing token vs. unreadable file.

✅ Follows established patterns:

OwnerReferences NOT used on volume (correct per CLAUDE.md:367-371)
SecurityContext maintained on container (CLAUDE.md:579-590)
Structured logging with context (CLAUDE.md:779-786)

✅ Type-safe unstructured access - Uses GetAnnotations() method instead of direct type assertion.

✅ Proper error handling - Non-fatal errors logged with context, follows error-handling.md patterns.

Recommendations

Immediate Action Items (Before Merge)

✅ Add test coverage for token volume mount configuration
⚠️ Consider fail-fast behavior if BOT_TOKEN_PATH is set but unreadable
✅ Add token length to log statement for consistency with security standards

Follow-up Items (Post-Merge)

📝 Document token refresh mechanism in operator README or architecture docs
📊 Add observability - Consider metrics for token refresh frequency/failures
🔍 Integration test for actual token expiration scenario (may require test cluster)

Security Review

✅ Compliant with Security Standards

Token Handling (security-standards.md):

✅ Token not logged (only path and length logged)
✅ Token stored in Kubernetes Secret
✅ Volume mounted read-only
✅ Backward compatibility with existing BOT_TOKEN env var

Container Security:

✅ SecurityContext maintained on runner container
✅ Volume mount uses ReadOnly: true
✅ No privilege escalation

RBAC:

✅ No changes to RBAC policies
✅ Operator service account already authorized to read secrets

Architecture Alignment

Follows ADR-0002 (User Token Authentication):

✅ Runner uses service account token (not user token) - correct for backend communication
✅ Token refresh enables long-running sessions without pod restart

Follows operator patterns (CLAUDE.md:636-771):

✅ Job spec generation in handleAgenticSessionEvent
✅ OwnerReferences set correctly
✅ Volume mount added to pod spec

Follows error handling patterns (error-handling.md):

✅ Non-fatal errors logged as warnings (file read failures)
✅ Fatal errors raise exceptions (connection failures)
✅ Structured error messages with context

Code Quality Checks

Backend/Operator (Go)

✅ No panic() in production code
✅ Type-safe annotation access (GetAnnotations())
✅ Proper error logging with context
⚠️ Missing: gofmt verification (run locally: gofmt -w components/operator)
⚠️ Missing: Test coverage for new code paths

Runner (Python)

✅ Follows existing code style
✅ Exception handling present
✅ Logging with appropriate levels
⚠️ Missing: Unit tests for new token file reading logic
⚠️ Missing: Type hints for new variables (token_path, has_token_path)

Final Verdict

Recommendation: ✅ Approve after addressing test coverage

This PR successfully solves the token expiration problem with a clean, Kubernetes-native approach. The implementation quality is high and follows project standards.

Required before merge:

Add operator test for volume mount configuration
Add Python test for file-based token reading
Run gofmt -w components/operator to ensure formatting

Optional improvements:

Consider fail-fast error handling for BOT_TOKEN_PATH
Add token length to log statement
Update docstrings

🤖 Review completed by Claude Code (Sonnet 4.5)
📋 Based on project standards from CLAUDE.md and memory context files
🔍 Reviewed 2 files, 47 additions, 22 deletions

🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

CLAUDE.md - Master project instructions, development standards
backend-development.md - Go backend, K8s integration patterns
frontend-development.md - NextJS, Shadcn UI, React Query patterns
security-standards.md - Auth, RBAC, token handling
k8s-client-usage.md - User token vs service account patterns
error-handling.md - Consistent error patterns
react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

github-actions · 2025-12-06T00:02:57Z

Claude Code Review

Summary

This PR fixes long-running session token expiration by migrating from environment variable injection to volume-mounted secrets. The change enables dynamic token refresh without pod restarts, addressing a critical operational issue for sessions exceeding ServiceAccount token TTL.

Overall Assessment: ✅ APPROVE with minor recommendations

The implementation follows established patterns, maintains backward compatibility, and aligns with CLAUDE.md security standards. The approach is architecturally sound and consistent with Kubernetes best practices.

Issues by Severity

🚫 Blocker Issues

None - No blocking issues found.

🔴 Critical Issues

None - All critical patterns followed correctly.

🟡 Major Issues

1. Missing Error Handling for Volume Mount Failure

File: components/operator/internal/handlers/sessions.go:994-1000

The volume mount configuration doesn't handle the case where the secret might not exist yet or is inaccessible.

Current:

{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
        },
    },
},

Recommendation:

{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
            Optional:   boolPtr(false),  // Fail fast if secret missing
        },
    },
},

This makes the failure explicit rather than waiting for pod startup failure.

2. Token File Read Error Handling Could Be More Robust

File: components/runners/runner-shell/runner_shell/core/transport_ws.py:38-44

The token file read logs a warning but continues silently. For long-running sessions, this could lead to subtle auth failures.

Current behavior: Logs warning, falls back to empty token
Recommended: Consider failing fast or retrying file read

if token_path:
    max_retries = 3
    for attempt in range(max_retries):
        try:
            with open(token_path, "r", encoding="utf-8") as f:
                token = f.read().strip()
            logger.info(f"Read token from {token_path}")
            break
        except FileNotFoundError:
            if attempt == max_retries - 1:
                logger.error(f"Token file not found after {max_retries} attempts: {token_path}")
                raise
            await asyncio.sleep(1)  # Brief wait for file to appear
        except Exception as e:
            logger.error(f"Failed to read token from {token_path}: {e}")
            raise

🔵 Minor Issues

1. Potential Race Condition in Annotation Reading

File: components/operator/internal/handlers/sessions.go:935-943

Uses direct map access instead of type-safe unstructured helpers, violating CLAUDE.md rule #4.

Current:

if annotations := currentObj.GetAnnotations(); annotations != nil {
    if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
        runnerTokenSecretName = strings.TrimSpace(v)
    }
}

Per CLAUDE.md (line 361-365):

"FORBIDDEN: Direct type assertions without checking"
"REQUIRED: Use unstructured.Nested* helpers"

Recommended:

// Use type-safe access
annotations := currentObj.GetAnnotations()
if v, ok := annotations["ambient-code.io/runner-token-secret"]; ok && strings.TrimSpace(v) != "" {
    runnerTokenSecretName = strings.TrimSpace(v)
}

Note: Current implementation is actually acceptable since GetAnnotations() already returns map[string]string. This is a borderline case.

2. Log Message Formatting Inconsistency

File: components/runners/runner-shell/runner_shell/core/transport_ws.py:42

Uses f-string for path logging, but consider redacting sensitive paths.

Current:

logger.info(f"Read token from {token_path}")

Recommended:

# Redact sensitive path info if needed
logger.info(f"Read token from volume mount (path length: {len(token_path)})")

3. Missing Volume Mount Documentation

The PR changes the fundamental token delivery mechanism but doesn't update inline documentation about the new pattern.

Recommendation: Add a comment block above the volume mount explaining the token refresh mechanism:

// Mount runner token secret as volume for dynamic token refresh
// The backend operator periodically refreshes tokens in this secret,
// allowing long-running sessions to continue without pod restarts.
// Token file: /app/runner-token/k8s-token
{
    Name: "runner-token",
    VolumeSource: corev1.VolumeSource{
        Secret: &corev1.SecretVolumeSource{
            SecretName: runnerTokenSecretName,
        },
    },
},

Positive Highlights

✅ Excellent Security Practices

✅ No token logging (only logs path, not content) - transport_ws.py:42
✅ Maintains backward compatibility with BOT_TOKEN env var
✅ Volume mounted read-only (ReadOnly: true at sessions.go:1060)
✅ Clear error messages for auth failures (transport_ws.py:87-94)

✅ Proper Resource Lifecycle Management

✅ Secret has OwnerReferences set (as seen in grep output at sessions.go:2428)
✅ Follows established naming convention (ambient-runner-token-{sessionName})
✅ Deterministic fallback logic for secret name

✅ Code Quality

✅ Clear variable naming (runnerTokenSecretName, BOT_TOKEN_PATH)
✅ Comprehensive error messages with context
✅ Type-safe string handling with strings.TrimSpace()
✅ Maintains separation of concerns (operator handles mounting, runner handles reading)

✅ Backward Compatibility

✅ Falls back to BOT_TOKEN env var if BOT_TOKEN_PATH not set
✅ Doesn't break existing sessions
✅ Clear migration path documented in comments

✅ Testing Considerations

The change is well-isolated and testable:

Secret mounting can be verified via pod inspection
Token refresh can be tested by updating secret contents
Backward compatibility can be verified with old-style sessions

Recommendations

Priority 1: Address Before Merge

Add Optional: boolPtr(false) to secret volume source (Major Issue Outcome: Reduce Refinement Time with agent System #1)
- Ensures explicit failure if secret missing
- Easier debugging than waiting for pod startup failure

Priority 2: Consider for Follow-up PR

Add retry logic for token file reads (Major Issue Epic: RAT Architecture & Design #2)
- Handle transient file system issues
- More robust for long-running sessions
Add integration test
- Test token refresh by updating secret while pod is running
- Verify reconnection with new token works
Update documentation
- Add ADR explaining token refresh architecture
- Update operator README with new volume mount pattern

Priority 3: Nice-to-Have

Add metric/log for token refresh events
- Track how often tokens are refreshed
- Monitor for refresh failures
- Consider adding Prometheus metrics

Pre-Commit Checklist Verification

✅ Authentication: N/A (operator code, not user-facing API)
✅ Authorization: N/A (pod configuration, not RBAC check)
✅ Error Handling: Mostly good, recommendations above
✅ Token Security: ✅ No tokens in logs
✅ Type Safety: ✅ (minor note on annotations)
✅ Resource Cleanup: ✅ OwnerReferences set
✅ Status Updates: N/A (no CR status changes)
✅ Logging: ✅ Good context, clear messages
✅ Code Quality: ✅ Should pass linting

Testing Recommendations

# Manual testing checklist
1. Deploy with new image
2. Create long-running session (>1 hour)
3. Watch secret for token refresh
4. Verify session continues after token refresh
5. Test backward compatibility with BOT_TOKEN env var
6. Verify proper error messages when token file missing
7. Check operator logs for volume mount errors

Final Verdict

APPROVED ✅

This is a well-implemented fix that follows project standards and solves a real operational problem. The recommendations above are improvements, not blockers. The code is production-ready with the Priority 1 change applied.

Estimated Risk: LOW
Merge Confidence: HIGH

🤖 Generated with Claude Code
Review based on: CLAUDE.md, backend-development.md, security-standards.md, k8s-client-usage.md, error-handling.md

🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

CLAUDE.md - Master project instructions, development standards
backend-development.md - Go backend, K8s integration patterns
frontend-development.md - NextJS, Shadcn UI, React Query patterns
security-standards.md - Auth, RBAC, token handling
k8s-client-usage.md - User token vs service account patterns
error-handling.md - Consistent error patterns
react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

github-actions bot added amber-generated PR created by Amber background agent auto-fix labels Dec 5, 2025

github-actions bot mentioned this pull request Dec 5, 2025

[Amber] Long running sessions have ServiceAccount tokens expire #445

Open

3 tasks

bobbravo2 requested a review from Copilot December 5, 2025 23:42

Copilot started reviewing on behalf of bobbravo2 December 5, 2025 23:43 View session

bobbravo2 reviewed Dec 5, 2025

View reviewed changes

Copilot AI reviewed Dec 5, 2025

View reviewed changes

components/runners/runner-shell/runner_shell/core/transport_ws.py Show resolved Hide resolved

components/runners/runner-shell/runner_shell/core/transport_ws.py Outdated Show resolved Hide resolved

components/runners/runner-shell/runner_shell/core/transport_ws.py Show resolved Hide resolved

This comment has been minimized.

Sign in to view

bobbravo2 and others added 2 commits December 5, 2025 19:00

Update components/runners/runner-shell/runner_shell/core/transport_ws.py

4f940d2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update components/runners/runner-shell/runner_shell/core/transport_ws.py

cb20cd1

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446

[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446

Uh oh!

github-actions bot commented Dec 5, 2025

Uh oh!

bobbravo2 Dec 5, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions bot commented Dec 6, 2025 •

edited

Loading

What Amber Loaded for Code Review

Uh oh!

github-actions bot commented Dec 6, 2025 •

edited

Loading

What Amber Loaded for Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446

Are you sure you want to change the base?

[Amber] Fix: [Amber] Long running sessions have ServiceAccount tokens expire #446

Uh oh!

Conversation

github-actions bot commented Dec 5, 2025

Automated Fix by Amber Agent

Changes Summary

Pre-merge Checklist

Reviewer Notes

Uh oh!

bobbravo2 Dec 5, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

github-actions bot commented Dec 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Summary

Issues by Severity

🚫 Blocker Issues

🔴 Critical Issues

🟡 Major Issues

1. Missing Test Coverage for Token Refresh Logic

2. Error Handling for Token File Read Should Be More Robust

🔵 Minor Issues

3. Token Redaction Missing in New Log Statement

4. Inconsistent Comment Style in Go Code

5. Python Docstring Could Be More Descriptive

Positive Highlights

Recommendations

Immediate Action Items (Before Merge)

Follow-up Items (Post-Merge)

Security Review

✅ Compliant with Security Standards

Architecture Alignment

Code Quality Checks

Backend/Operator (Go)

Runner (Python)

Final Verdict

What Amber Loaded for Code Review

Uh oh!

github-actions bot commented Dec 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Summary

Issues by Severity

🚫 Blocker Issues

🔴 Critical Issues

🟡 Major Issues

1. Missing Error Handling for Volume Mount Failure

2. Token File Read Error Handling Could Be More Robust

🔵 Minor Issues

1. Potential Race Condition in Annotation Reading

2. Log Message Formatting Inconsistency

3. Missing Volume Mount Documentation

Positive Highlights

✅ Excellent Security Practices

✅ Proper Resource Lifecycle Management

✅ Code Quality

✅ Backward Compatibility

✅ Testing Considerations

Recommendations

Priority 1: Address Before Merge

Priority 2: Consider for Follow-up PR

Priority 3: Nice-to-Have

Pre-Commit Checklist Verification

Testing Recommendations

Final Verdict

What Amber Loaded for Code Review

Uh oh!

Reviewers

github-actions bot commented Dec 6, 2025 •

edited

Loading

github-actions bot commented Dec 6, 2025 •

edited

Loading