feat: add Keycard provider for sandbox identity credential management #748

@kamil-keycard

Problem Statement

Sandboxes need per-instance Keycard identities for service-to-service authentication. Today, providers are passive credential stores — no provider performs API calls during the sandbox lifecycle. We need a new Keycard provider that creates an APPLICATION with a unique SPIFFE ID before sandbox provisioning, generates ephemeral password credentials injected as KEYCARD_CLIENT_ID and KEYCARD_CLIENT_SECRET env vars, and cleans up the APPLICATION when the sandbox is decommissioned. Credentials must never be stored long-term — they are read once from the Keycard API and scoped to the sandbox lifetime.

Technical Context

The current provider system is designed around passive credential storage: a provider record holds static credentials and config in key-value maps, and the sandbox supervisor fetches them at boot via GetSandboxProviderEnvironment. No existing provider makes external API calls or participates in sandbox lifecycle events. The Keycard integration introduces a fundamentally new pattern — an "active" or "lifecycle-aware" provider that must:

  1. Call the Keycard API to create an APPLICATION before the sandbox starts
  2. Generate per-sandbox credentials dynamically
  3. Clean up the APPLICATION when the sandbox is deleted

This is architecturally novel for the provider system and requires new hook points in the server-side sandbox lifecycle.

Affected Components

| Component | Key Files | Role |
|---|---|---|
| Provider system | crates/openshell-providers/src/lib.rs, providers/mod.rs | Provider plugin trait, registry, discovery |
| Gateway server (sandbox lifecycle) | crates/openshell-server/src/grpc.rs | create_sandbox(), delete_sandbox(), resolve_provider_environment() |
| Sandbox supervisor | crates/openshell-sandbox/src/lib.rs, grpc_client.rs, secrets.rs, process.rs | Fetches provider env, injects into child processes |
| Proto definitions | proto/datamodel.proto, proto/openshell.proto | Provider message, sandbox spec, gRPC services |
| Architecture docs | architecture/sandbox-providers.md | Provider architecture documentation |

Technical Investigation

Architecture Overview

Current provider flow:

  1. Providers are created via CLI/gRPC with a type, credentials map, and config map — persisted in the server's object store.
  2. At sandbox creation (create_sandbox() in grpc.rs:178-315), SandboxSpec.providers lists provider names. The server validates existence (fail-fast) but does NOT inject credentials into the pod spec.
  3. At sandbox boot, the supervisor calls GetSandboxProviderEnvironment (grpc.rs:914-945), which resolves provider names → credential env vars via resolve_provider_environment() (grpc.rs:3641-3672).
  4. The supervisor creates placeholder values (openshell:resolve:env:KEY) and holds real secrets in memory for proxy-time resolution via SecretResolver.
  5. At sandbox deletion (delete_sandbox() in grpc.rs:601-701), the server deletes K8s resources, SSH sessions, and settings. There is no provider cleanup hook.
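The placeholder mechanism in step 4 can be sketched as follows. This is a minimal illustration, not the supervisor's actual code: `PlaceholderResolver` and its methods are hypothetical stand-ins for `SecretResolver`, and only the `openshell:resolve:env:KEY` placeholder format comes from the description above.

```rust
use std::collections::HashMap;

/// Placeholder prefix, taken from the format quoted in step 4.
const PLACEHOLDER_PREFIX: &str = "openshell:resolve:env:";

/// Hypothetical stand-in for the supervisor's SecretResolver:
/// it keeps the real secret values in memory only.
pub struct PlaceholderResolver {
    secrets: HashMap<String, String>,
}

impl PlaceholderResolver {
    /// Splits a provider env map into the placeholder env handed to child
    /// processes and a resolver that still knows the real values.
    pub fn from_provider_env(env: HashMap<String, String>) -> (HashMap<String, String>, Self) {
        let placeholders = env
            .keys()
            .map(|key| (key.clone(), format!("{PLACEHOLDER_PREFIX}{key}")))
            .collect();
        (placeholders, Self { secrets: env })
    }

    /// Maps a placeholder back to the real secret; returns None for any
    /// value that is not a known placeholder.
    pub fn resolve(&self, value: &str) -> Option<&str> {
        let key = value.strip_prefix(PLACEHOLDER_PREFIX)?;
        self.secrets.get(key).map(String::as_str)
    }
}
```

Under this sketch the child process only ever sees `openshell:resolve:env:KEYCARD_CLIENT_SECRET`, and the real value is substituted at proxy time.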

Key observation: The ProviderPlugin trait has an apply_to_sandbox() method with a default no-op implementation. It takes &Provider and returns Result<(), ProviderError>. It has no access to the sandbox ID, has no async support, and is not called anywhere in the current codebase, so it cannot be used as-is for lifecycle hooks.

Code References

| Location | Description |
|---|---|
| crates/openshell-providers/src/lib.rs | ProviderPlugin trait: discover(), apply_to_sandbox() (no-op), environment_variables() |
| crates/openshell-providers/src/providers/mod.rs | Provider module registry, where the keycard module would be added |
| crates/openshell-server/src/grpc.rs:178-315 | create_sandbox(): sandbox ID generated at line 229, K8s creation at ~line 276. The pre-provision hook window is between these |
| crates/openshell-server/src/grpc.rs:601-701 | delete_sandbox(): sandbox cleanup. Keycard APPLICATION deletion would go here |
| crates/openshell-server/src/grpc.rs:3641-3672 | resolve_provider_environment(): iterates providers, builds env map. Needs Keycard-specific credential resolution logic |
| crates/openshell-server/src/grpc.rs:914-945 | get_sandbox_provider_environment() gRPC handler |
| crates/openshell-server/src/grpc.rs:4116-4268 | create_provider_record(): validation and persistence for new providers |
| crates/openshell-sandbox/src/lib.rs:187-205 | Supervisor fetches provider env at startup via gRPC |
| crates/openshell-sandbox/src/secrets.rs | SecretResolver::from_provider_env() builds placeholder/resolver pair |
| crates/openshell-sandbox/src/process.rs:27-28 | inject_provider_env() into child processes |
| proto/datamodel.proto:79-88 | Provider message: id, name, type, credentials, config maps |
| proto/datamodel.proto:26-36 | SandboxSpec with providers field |
| architecture/sandbox-providers.md | Provider architecture documentation |

Current Behavior

When create_sandbox() runs:

  1. Request is validated, sandbox ID is generated (uuid::Uuid::new_v4())
  2. Listed providers are checked for existence in the store (fail-fast)
  3. Sandbox is persisted to object store, then created as a K8s resource
  4. No provider-specific actions are triggered during creation

When delete_sandbox() runs:

  1. Sandbox is fetched from store, SSH sessions and settings cleaned up
  2. K8s resource is deleted
  3. No provider-specific cleanup occurs

When resolve_provider_environment() runs:

  1. Iterates provider names from the sandbox spec
  2. Fetches each provider record from the store
  3. Concatenates all credentials map entries into a flat env map
  4. All credentials are blindly injected — no per-provider filtering
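The resolution steps above amount to a flat merge, roughly as in the sketch below. `ProviderRecord` and `resolve_env_flat` are simplified, hypothetical stand-ins for the stored Provider message and the real function; the error on a missing provider and the "later provider wins" behaviour on key collisions are assumptions, not confirmed behaviour.

```rust
use std::collections::{BTreeMap, HashMap};

/// Simplified stand-in for the stored Provider record.
pub struct ProviderRecord {
    pub provider_type: String,
    pub credentials: BTreeMap<String, String>,
    pub config: BTreeMap<String, String>,
}

/// Merges the credentials map of every listed provider into one env map.
pub fn resolve_env_flat(
    names: &[&str],
    store: &HashMap<String, ProviderRecord>,
) -> Result<BTreeMap<String, String>, String> {
    let mut env = BTreeMap::new();
    for name in names {
        let record = store
            .get(*name)
            .ok_or_else(|| format!("provider not found: {name}"))?;
        // No per-provider filtering: every credential entry becomes an env var.
        // On a key collision the later provider overwrites the earlier one
        // (an assumption of this sketch).
        env.extend(record.credentials.iter().map(|(k, v)| (k.clone(), v.clone())));
    }
    Ok(env)
}
```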

What Would Need to Change

New provider plugin (keycard.rs):

  • Implements ProviderPlugin for discovery and type registration
  • Defines expected config keys: base_url, zone_id, client_id, client_secret
  • Defines output env vars: KEYCARD_CLIENT_ID, KEYCARD_CLIENT_SECRET
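As a rough illustration of the config keys and output env vars listed above, validation at provider-creation time might look like the sketch below. The function name and the rule that blank values count as missing are assumptions; only the key and env var names come from this issue.

```rust
use std::collections::BTreeMap;

/// Config keys the Keycard provider is expected to carry (from the list above).
pub const REQUIRED_CONFIG_KEYS: [&str; 4] = ["base_url", "zone_id", "client_id", "client_secret"];

/// Env vars the provider is expected to emit into the sandbox.
pub const OUTPUT_ENV_VARS: [&str; 2] = ["KEYCARD_CLIENT_ID", "KEYCARD_CLIENT_SECRET"];

/// Returns the required keys that are absent or blank, in declaration order.
/// Treating whitespace-only values as missing is an assumption of this sketch.
pub fn missing_config_keys(config: &BTreeMap<String, String>) -> Vec<&'static str> {
    REQUIRED_CONFIG_KEYS
        .iter()
        .copied()
        .filter(|key| config.get(*key).map_or(true, |value| value.trim().is_empty()))
        .collect()
}
```

An empty result means the record can be persisted; a non-empty one could be turned into an invalid-argument error by the caller.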

New Keycard HTTP client module (server-side):

  • POST /zones/{zoneId}/applications — create APPLICATION with SPIFFE ID as identifier
  • POST /zones/{zoneId}/application-credentials — create password credential, read identifier (client ID) and password (client secret)
  • DELETE /zones/{zoneId}/applications/{id} — delete APPLICATION
  • Basic auth using admin client_id/client_secret from provider config
  • Error handling with retries for transient failures
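The endpoint layout and retry behaviour could be sketched as below, using only the standard library. The three paths come from the list above; `KeycardEndpoints`, `CallError`, `with_retries`, and the exponential backoff policy are hypothetical. The HTTP calls themselves (for example with reqwest) are deliberately left out.

```rust
use std::{thread, time::Duration};

/// Builds the three Keycard URLs from the provider's base_url and zone_id.
pub struct KeycardEndpoints {
    base_url: String,
    zone_id: String,
}

impl KeycardEndpoints {
    pub fn new(base_url: &str, zone_id: &str) -> Self {
        Self {
            // Drop trailing slashes so joined paths do not contain "//".
            base_url: base_url.trim_end_matches('/').to_string(),
            zone_id: zone_id.to_string(),
        }
    }

    /// POST target for creating an APPLICATION.
    pub fn applications(&self) -> String {
        format!("{}/zones/{}/applications", self.base_url, self.zone_id)
    }

    /// POST target for creating a password credential.
    pub fn application_credentials(&self) -> String {
        format!("{}/zones/{}/application-credentials", self.base_url, self.zone_id)
    }

    /// DELETE target for one APPLICATION.
    pub fn application(&self, id: &str) -> String {
        format!("{}/{}", self.applications(), id)
    }
}

/// Hypothetical error split: only transient failures are retried.
#[derive(Debug, PartialEq)]
pub enum CallError {
    Transient(String),
    Permanent(String),
}

/// Calls `call` up to `max_attempts` times (at least once), doubling the
/// delay after each transient failure. The policy is an assumption.
pub fn with_retries<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut call: impl FnMut() -> Result<T, CallError>,
) -> Result<T, CallError> {
    let mut attempt = 1;
    loop {
        match call() {
            Err(CallError::Transient(_)) if attempt < max_attempts => {
                thread::sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            // Success, a permanent error, or the final transient error.
            other => return other,
        }
    }
}
```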

Modified create_sandbox() in grpc.rs:

  • After sandbox ID generation (line 229), before K8s creation (~line 276):
    • Check if any listed provider is type keycard
    • Call Keycard API to create APPLICATION with SPIFFE ID spiffe://openshell/sandbox/{sandbox_id}
    • Call Keycard API to create password credential
    • Store the credential response (identifier + password) for injection
  • Handle partial failures: if Keycard call fails, abort sandbox creation; if K8s creation fails after Keycard succeeds, clean up the Keycard APPLICATION
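The ordering and rollback described above might be structured as in this sketch. `KeycardApi`, `SandboxCredential`, `provision_with_rollback`, and `FakeKeycard` are hypothetical names; the K8s step is passed in as a closure so the sketch stays self-contained, and cleanup is best-effort (a failed delete would leave an orphan, as noted under Risks).

```rust
/// Per-sandbox credential returned for env injection.
#[derive(Debug, PartialEq)]
pub struct SandboxCredential {
    pub application_id: String,
    pub client_id: String,
    pub client_secret: String,
}

/// Hypothetical abstraction over the three Keycard calls.
pub trait KeycardApi {
    fn create_application(&mut self, spiffe_id: &str) -> Result<String, String>;
    fn create_password_credential(&mut self, application_id: &str) -> Result<(String, String), String>;
    fn delete_application(&mut self, application_id: &str) -> Result<(), String>;
}

/// Creates the APPLICATION and credential, then runs the K8s step.
/// Any failure after the APPLICATION exists triggers its deletion.
pub fn provision_with_rollback(
    keycard: &mut dyn KeycardApi,
    sandbox_id: &str,
    create_k8s: impl FnOnce() -> Result<(), String>,
) -> Result<SandboxCredential, String> {
    let spiffe_id = format!("spiffe://openshell/sandbox/{sandbox_id}");
    // If this call fails, nothing has been created yet, so just abort.
    let application_id = keycard.create_application(&spiffe_id)?;

    let result = keycard
        .create_password_credential(&application_id)
        .and_then(|(client_id, client_secret)| {
            create_k8s()?;
            Ok(SandboxCredential {
                application_id: application_id.clone(),
                client_id,
                client_secret,
            })
        });

    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = keycard.delete_application(&application_id);
    }
    result
}

/// In-memory fake that records deletions, useful for exercising rollback.
#[derive(Default)]
pub struct FakeKeycard {
    pub deleted: Vec<String>,
}

impl KeycardApi for FakeKeycard {
    fn create_application(&mut self, _spiffe_id: &str) -> Result<String, String> {
        Ok("app-1".to_string())
    }
    fn create_password_credential(&mut self, _application_id: &str) -> Result<(String, String), String> {
        Ok(("cid".to_string(), "secret".to_string()))
    }
    fn delete_application(&mut self, application_id: &str) -> Result<(), String> {
        self.deleted.push(application_id.to_string());
        Ok(())
    }
}
```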

Modified delete_sandbox() in grpc.rs:

  • Before or after K8s resource deletion:
    • Check if sandbox has a Keycard provider
    • Retrieve the Keycard APPLICATION ID (stored in sandbox metadata or a mapping)
    • Call Keycard API to delete the APPLICATION
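A minimal sketch of the deletion path, assuming the APPLICATION ID is kept in sandbox metadata under a hypothetical key (`keycard.application_id`). Where the ID actually lives is still an open question in this issue, and the delete call is passed in as a closure to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key under which the APPLICATION ID is stored.
pub const KEYCARD_APP_ID_KEY: &str = "keycard.application_id";

/// Outcome of the cleanup step.
#[derive(Debug, PartialEq)]
pub enum Cleanup {
    Deleted(String),
    NothingToDelete,
}

/// Deletes the sandbox's Keycard APPLICATION if one was recorded.
/// Sandboxes without a recorded ID are treated as having no Keycard provider.
pub fn cleanup_keycard_application(
    metadata: &HashMap<String, String>,
    mut delete_application: impl FnMut(&str) -> Result<(), String>,
) -> Result<Cleanup, String> {
    match metadata.get(KEYCARD_APP_ID_KEY) {
        None => Ok(Cleanup::NothingToDelete),
        Some(id) => {
            delete_application(id)?;
            Ok(Cleanup::Deleted(id.clone()))
        }
    }
}
```

Whether a failed delete should block sandbox deletion or only be logged is a policy choice this sketch leaves to the caller.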

Modified resolve_provider_environment() in grpc.rs:

  • For Keycard providers, distinguish between admin credentials (used by server, NOT injected) and sandbox credentials (injected as KEYCARD_CLIENT_ID, KEYCARD_CLIENT_SECRET)
  • Admin credentials in config map should not leak into sandbox env
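One way to express this separation is a per-provider branch in the resolver, sketched below. `ProviderRecord`, `SandboxCredential`, and `provider_env` are hypothetical; the sketch assumes the per-sandbox credential is handed in by the caller rather than read from the provider record.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the stored Provider record.
pub struct ProviderRecord {
    pub provider_type: String,
    pub credentials: BTreeMap<String, String>,
    pub config: BTreeMap<String, String>,
}

/// Per-sandbox credential generated through the Keycard API.
pub struct SandboxCredential {
    pub client_id: String,
    pub client_secret: String,
}

/// Builds the env contribution of a single provider.
pub fn provider_env(
    record: &ProviderRecord,
    sandbox_credential: Option<&SandboxCredential>,
) -> BTreeMap<String, String> {
    if record.provider_type != "keycard" {
        // Existing behaviour: inject the credentials map unchanged.
        return record.credentials.clone();
    }
    // Keycard: nothing stored on the provider record is injected, so admin
    // credentials cannot reach the sandbox. Only the per-sandbox pair is exposed.
    let mut env = BTreeMap::new();
    if let Some(cred) = sandbox_credential {
        env.insert("KEYCARD_CLIENT_ID".to_string(), cred.client_id.clone());
        env.insert("KEYCARD_CLIENT_SECRET".to_string(), cred.client_secret.clone());
    }
    env
}
```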

Provider registration:

  • Add keycard module to providers/mod.rs
  • Register in ProviderRegistry::new() in lib.rs
  • Add "keycard" to normalize_provider_type()

Alternative Approaches Considered

1. Where does Keycard lifecycle logic live?

  • Option A: Inline in grpc.rs — Add Keycard API calls directly in create_sandbox() and delete_sandbox(). Simplest, but couples server to a specific provider.
  • Option B: Provider lifecycle trait — Extend ProviderPlugin or create LifecycleProvider trait with on_sandbox_create()/on_sandbox_delete() async methods. Cleaner abstraction but more engineering upfront.
  • Option C: Generic provider hook system — Event-based dispatch in the server for all provider lifecycle events. Most flexible, most complex, likely premature.
  • Trade-off: Option A is pragmatic for a first implementation. Option B should be considered if a second lifecycle provider emerges. This is a decision for human review.
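For comparison, Option B could take roughly the following shape. This is a sketch only: the trait and `run_create_hooks` are hypothetical, and the methods are synchronous here so the example needs no async runtime, whereas the option as described calls for async methods.

```rust
use std::collections::BTreeMap;

pub type EnvMap = BTreeMap<String, String>;

/// Hypothetical lifecycle trait for providers that act on sandbox events.
pub trait LifecycleProvider {
    /// Runs before the sandbox is provisioned; returns env vars to inject.
    fn on_sandbox_create(&mut self, sandbox_id: &str) -> Result<EnvMap, String>;
    /// Runs when the sandbox is decommissioned.
    fn on_sandbox_delete(&mut self, sandbox_id: &str) -> Result<(), String>;
}

/// Runs every create hook in order. If one fails, the hooks that already
/// succeeded are undone in reverse order and the error is returned.
pub fn run_create_hooks(
    providers: &mut [&mut dyn LifecycleProvider],
    sandbox_id: &str,
) -> Result<EnvMap, String> {
    let mut env = EnvMap::new();
    for i in 0..providers.len() {
        match providers[i].on_sandbox_create(sandbox_id) {
            Ok(vars) => env.extend(vars),
            Err(err) => {
                for done in providers[..i].iter_mut().rev() {
                    // Best-effort undo; the original error is reported.
                    let _ = done.on_sandbox_delete(sandbox_id);
                }
                return Err(err);
            }
        }
    }
    Ok(env)
}
```

With this shape, create_sandbox() would not need to know about Keycard specifically, which is the decoupling Option B is after.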

2. Per-sandbox credential storage strategy:

  • Option A: Store in provider credentials map — Use the existing credentials map on a per-sandbox Keycard provider record. Flows through the existing injection path seamlessly. Contradicts a strict reading of the "never stores" requirement, but the credentials are ephemeral (deleted with the sandbox).
  • Option B: Ephemeral credential store — New data structure scoped to sandbox lifetime, not persisted to provider store. Architecturally novel, more complex.
  • Option C: SandboxSpec environment — Inject directly into pod env. Simpler but bypasses the secret placeholder/proxy-resolution system.
  • Trade-off: Option A is pragmatic — the credentials ARE stored temporarily but are scoped to the sandbox lifetime and cleaned up on decommission. The "never stores" requirement should be interpreted as "never stores long-lived credentials."

3. SPIFFE ID format:

  • No SPIFFE framework exists in the codebase. The SPIFFE ID would be a string convention, not a full SPIFFE runtime.
  • Proposed format: spiffe://openshell/sandbox/{sandbox_id} — needs human input on trust domain and hierarchy.
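Treated as a string convention, the proposed format could be produced by a small helper like the one below. The `openshell` trust domain and `/sandbox/` path come from the proposal above and are still open to change; the function name is hypothetical. The character check follows the SPIFFE ID rule that path segments contain only letters, digits, dots, dashes, and underscores, and that `.` and `..` are not valid segments.

```rust
/// Trust domain from the proposed format; still subject to review.
pub const TRUST_DOMAIN: &str = "openshell";

/// Builds the per-sandbox SPIFFE ID, rejecting IDs that would not form a
/// valid single path segment.
pub fn sandbox_spiffe_id(sandbox_id: &str) -> Result<String, String> {
    let valid_chars = sandbox_id
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_'));
    let valid = !sandbox_id.is_empty() && valid_chars && sandbox_id != "." && sandbox_id != "..";
    if !valid {
        return Err(format!("invalid sandbox id for SPIFFE path segment: {sandbox_id:?}"));
    }
    Ok(format!("spiffe://{TRUST_DOMAIN}/sandbox/{sandbox_id}"))
}
```

A hyphenated UUID, as generated by `uuid::Uuid::new_v4()`, passes this check unchanged.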

Patterns to Follow

  1. Provider plugin pattern: Follow github.rs or claude.rs for ProviderPlugin implementation — same registration, discovery, and environment variable patterns.
  2. HTTP client: Use reqwest with rustls-tls (already in workspace Cargo.toml at version 0.12).
  3. Error handling: Use tonic::Status for gRPC errors in server code.
  4. Testing: Use wiremock (already a dev-dependency of openshell-server) for mocking Keycard HTTP API. Use Store::connect("sqlite::memory:") for persistence tests.
  5. Provider docs: Update architecture/sandbox-providers.md following its existing structure (has sections per provider type).

Proposed Approach

Introduce a new keycard provider type that holds admin API credentials (base URL, zone ID, client ID, client secret) in its config map. During sandbox creation, the server detects the Keycard provider, calls the Keycard API to create an APPLICATION with a SPIFFE-formatted identifier and a password credential, then makes the generated identifier/password available as KEYCARD_CLIENT_ID/KEYCARD_CLIENT_SECRET for sandbox env injection. During sandbox deletion, the server calls the Keycard API to delete the APPLICATION. The initial implementation places the lifecycle logic inline in create_sandbox()/delete_sandbox() with a clear path toward a lifecycle trait abstraction if needed later.

Scope Assessment

  • Complexity: Medium-High
  • Confidence: Medium — clear path for core functionality, several design decisions need human input (credential storage strategy, SPIFFE ID format, lifecycle logic location)
  • Estimated files to change: ~8-10
  • Issue type: feat

Risks & Open Questions

  • Failure modes during provisioning: If Keycard API is unreachable, should sandbox creation be blocked entirely or proceed without credentials? Partial failures (APPLICATION created but credential creation fails) need rollback logic.
  • Orphaned Keycard APPLICATIONs: If sandbox K8s creation fails after the Keycard APPLICATION was created, or if the server crashes between the two operations, Keycard APPLICATIONs could be orphaned. A reconciliation/cleanup mechanism may be needed.
  • Admin vs sandbox credential separation: The current resolve_provider_environment() injects all credentials from a provider — the Keycard admin credentials must be explicitly excluded from sandbox env injection.
  • SPIFFE ID format and trust domain: What should the trust domain be? Should it include the gateway namespace or cluster identity? (Proposed: spiffe://openshell/sandbox/{sandbox_id})
  • Credential lifecycle mapping: Where is the Keycard APPLICATION ID stored so it can be retrieved during sandbox deletion? Options: sandbox metadata, a dedicated mapping table, or derived from the SPIFFE ID.
  • Per-sandbox provider record vs dynamic resolution: Should a per-sandbox Keycard provider record be created, or should a single shared provider record hold admin creds with per-sandbox credentials resolved dynamically?

Test Considerations

  • Unit tests for Keycard HTTP client: Use wiremock to mock all three Keycard API endpoints (create application, create credential, delete application). Test success paths, error responses, and timeout handling.
  • Unit tests for modified resolve_provider_environment(): Extend the existing 7 tests to cover Keycard-specific credential resolution and admin credential filtering.
  • Integration tests for sandbox lifecycle: Test that create_sandbox() calls Keycard API and injects credentials, and delete_sandbox() calls Keycard API to clean up.
  • Failure/rollback tests: Test partial failure scenarios — Keycard success then K8s failure, Keycard credential creation failure after APPLICATION creation.
  • Provider validation tests: Test that Keycard provider creation validates required config keys (zone_id, client_id, client_secret).
  • E2E tests may be needed if a test Keycard environment is available, but unit tests with wiremock should provide sufficient coverage for the initial implementation.

Created by spike investigation. Use build-from-issue to plan and implement.
