feat: add Keycard provider for sandbox identity credential management #748
Description
Problem Statement
Sandboxes need per-instance Keycard identities for service-to-service authentication. Today, providers are passive credential stores — no provider performs API calls during the sandbox lifecycle. We need a new Keycard provider that creates an APPLICATION with a unique SPIFFE ID before sandbox provisioning, generates ephemeral password credentials injected as KEYCARD_CLIENT_ID and KEYCARD_CLIENT_SECRET env vars, and cleans up the APPLICATION when the sandbox is decommissioned. Credentials must never be stored long-term — they are read once from the Keycard API and scoped to the sandbox lifetime.
Technical Context
The current provider system is designed around passive credential storage: a provider record holds static credentials and config in key-value maps, and the sandbox supervisor fetches them at boot via GetSandboxProviderEnvironment. No existing provider makes external API calls or participates in sandbox lifecycle events. The Keycard integration introduces a fundamentally new pattern — an "active" or "lifecycle-aware" provider that must:
- Call the Keycard API to create an APPLICATION before the sandbox starts
- Generate per-sandbox credentials dynamically
- Clean up the APPLICATION when the sandbox is deleted
This is architecturally novel for the provider system and requires new hook points in the server-side sandbox lifecycle.
Affected Components
| Component | Key Files | Role |
|---|---|---|
| Provider system | `crates/openshell-providers/src/lib.rs`, `providers/mod.rs` | Provider plugin trait, registry, discovery |
| Gateway server (sandbox lifecycle) | `crates/openshell-server/src/grpc.rs` | `create_sandbox()`, `delete_sandbox()`, `resolve_provider_environment()` |
| Sandbox supervisor | `crates/openshell-sandbox/src/lib.rs`, `grpc_client.rs`, `secrets.rs`, `process.rs` | Fetches provider env, injects into child processes |
| Proto definitions | `proto/datamodel.proto`, `proto/openshell.proto` | Provider message, sandbox spec, gRPC services |
| Architecture docs | `architecture/sandbox-providers.md` | Provider architecture documentation |
Technical Investigation
Architecture Overview
Current provider flow:
- Providers are created via CLI/gRPC with a `type`, `credentials` map, and `config` map, all persisted in the server's object store.
- At sandbox creation (`create_sandbox()` in `grpc.rs:178-315`), `SandboxSpec.providers` lists provider names. The server validates existence (fail-fast) but does NOT inject credentials into the pod spec.
- At sandbox boot, the supervisor calls `GetSandboxProviderEnvironment` (`grpc.rs:914-945`), which resolves provider names → credential env vars via `resolve_provider_environment()` (`grpc.rs:3641-3672`).
- The supervisor creates placeholder values (`openshell:resolve:env:KEY`) and holds real secrets in memory for proxy-time resolution via `SecretResolver`.
- At sandbox deletion (`delete_sandbox()` in `grpc.rs:601-701`), the server deletes K8s resources, SSH sessions, and settings. There is no provider cleanup hook.
Key observation: The ProviderPlugin trait has an apply_to_sandbox() method that exists as a default no-op. It takes &Provider and returns Result<(), ProviderError>. It has no access to sandbox ID, no async support, and is called from nowhere in the current codebase. This cannot be used as-is for lifecycle hooks.
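To make the limitation concrete, here is a rough reconstruction of the method shape described above. The `Provider` and `ProviderError` definitions are simplified stand-ins (the real types live in the crate and proto files), and the other trait methods are omitted.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the stored provider record (assumption, not the proto type).
pub struct Provider {
    pub name: String,
    pub provider_type: String,
    pub credentials: HashMap<String, String>,
    pub config: HashMap<String, String>,
}

/// Simplified stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
pub struct ProviderError(pub String);

pub trait ProviderPlugin {
    /// Default no-op as described in the issue: synchronous, and the only
    /// input is the provider record. There is no sandbox ID to build a
    /// per-sandbox identity from, and no way to await an HTTP call.
    fn apply_to_sandbox(&self, _provider: &Provider) -> Result<(), ProviderError> {
        Ok(())
    }
}

/// A plugin that does not override the method inherits the no-op.
pub struct KeycardPlugin;
impl ProviderPlugin for KeycardPlugin {}
```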
Code References
| Location | Description |
|---|---|
| `crates/openshell-providers/src/lib.rs` | `ProviderPlugin` trait: `discover()`, `apply_to_sandbox()` (no-op), `environment_variables()` |
| `crates/openshell-providers/src/providers/mod.rs` | Provider module registry, where the `keycard` module would be added |
| `crates/openshell-server/src/grpc.rs:178-315` | `create_sandbox()`: sandbox ID generated at line 229, K8s creation at ~line 276. The pre-provision hook window is between these |
| `crates/openshell-server/src/grpc.rs:601-701` | `delete_sandbox()`: sandbox cleanup. Keycard APPLICATION deletion would go here |
| `crates/openshell-server/src/grpc.rs:3641-3672` | `resolve_provider_environment()`: iterates providers, builds env map. Needs Keycard-specific credential resolution logic |
| `crates/openshell-server/src/grpc.rs:914-945` | `get_sandbox_provider_environment()` gRPC handler |
| `crates/openshell-server/src/grpc.rs:4116-4268` | `create_provider_record()`: validation and persistence for new providers |
| `crates/openshell-sandbox/src/lib.rs:187-205` | Supervisor fetches provider env at startup via gRPC |
| `crates/openshell-sandbox/src/secrets.rs` | `SecretResolver::from_provider_env()` builds placeholder/resolver pair |
| `crates/openshell-sandbox/src/process.rs:27-28` | `inject_provider_env()` into child processes |
| `proto/datamodel.proto:79-88` | `Provider` message: id, name, type, credentials, config maps |
| `proto/datamodel.proto:26-36` | `SandboxSpec` with `providers` field |
| `architecture/sandbox-providers.md` | Provider architecture documentation |
Current Behavior
When create_sandbox() runs:
- Request is validated, sandbox ID is generated (`uuid::Uuid::new_v4()`)
- Listed providers are checked for existence in the store (fail-fast)
- Sandbox is persisted to object store, then created as a K8s resource
- No provider-specific actions are triggered during creation
When delete_sandbox() runs:
- Sandbox is fetched from store, SSH sessions and settings cleaned up
- K8s resource is deleted
- No provider-specific cleanup occurs
When resolve_provider_environment() runs:
- Iterates provider names from the sandbox spec
- Fetches each provider record from the store
- Concatenates all `credentials` map entries into a flat env map
- All credentials are blindly injected, with no per-provider filtering
What Would Need to Change
New provider plugin (keycard.rs):
- Implements `ProviderPlugin` for discovery and type registration
- Defines expected config keys: `base_url`, `zone_id`, `client_id`, `client_secret`
- Defines output env vars: `KEYCARD_CLIENT_ID`, `KEYCARD_CLIENT_SECRET`
New Keycard HTTP client module (server-side):
- `POST /zones/{zoneId}/applications`: create APPLICATION with SPIFFE ID as identifier
- `POST /zones/{zoneId}/application-credentials`: create password credential, read `identifier` (client ID) and `password` (client secret)
- `DELETE /zones/{zoneId}/applications/{id}`: delete APPLICATION
- Basic auth using admin `client_id`/`client_secret` from provider config
- Error handling with retries for transient failures
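A dependency-free sketch of how the three calls and the retry behaviour might be modelled before wiring in `reqwest`. The `KeycardCall` enum, the `Method` enum, and `with_retries` are hypothetical names. Only the endpoint paths come from the list above. Basic auth headers, request bodies, and backoff delays are deliberately left out.

```rust
#[derive(Debug, PartialEq)]
pub enum Method {
    Post,
    Delete,
}

/// The three Keycard API calls listed in this issue.
pub enum KeycardCall<'a> {
    CreateApplication { zone_id: &'a str },
    CreateCredential { zone_id: &'a str },
    DeleteApplication { zone_id: &'a str, application_id: &'a str },
}

impl KeycardCall<'_> {
    pub fn method(&self) -> Method {
        match self {
            KeycardCall::CreateApplication { .. } | KeycardCall::CreateCredential { .. } => Method::Post,
            KeycardCall::DeleteApplication { .. } => Method::Delete,
        }
    }

    /// Joins the configured `base_url` with the endpoint path.
    pub fn url(&self, base_url: &str) -> String {
        let base = base_url.trim_end_matches('/');
        match self {
            KeycardCall::CreateApplication { zone_id } => format!("{}/zones/{}/applications", base, zone_id),
            KeycardCall::CreateCredential { zone_id } => {
                format!("{}/zones/{}/application-credentials", base, zone_id)
            }
            KeycardCall::DeleteApplication { zone_id, application_id } => {
                format!("{}/zones/{}/applications/{}", base, zone_id, application_id)
            }
        }
    }
}

/// Runs `op` up to `max_attempts` times, retrying only errors that
/// `is_transient` accepts. No sleep between attempts in this sketch.
pub fn with_retries<T, E>(
    max_attempts: u32,
    is_transient: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_transient(&err) => attempt += 1,
            Err(err) => return Err(err),
        }
    }
}
```

Which errors count as transient (timeouts, 5xx responses, connection resets) is left to the caller here and still needs a decision.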
Modified create_sandbox() in grpc.rs:
- After sandbox ID generation (line 229), before K8s creation (~line 276):
  - Check if any listed provider is type `keycard`
  - Call Keycard API to create APPLICATION with SPIFFE ID `spiffe://openshell/sandbox/{sandbox_id}`
  - Call Keycard API to create password credential
  - Store the credential response (`identifier` + `password`) for injection
- Handle partial failures: if Keycard call fails, abort sandbox creation; if K8s creation fails after Keycard succeeds, clean up the Keycard APPLICATION
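The ordering and rollback rules above could look roughly like this. It is a synchronous sketch under stated assumptions: the `KeycardApi` trait, `SandboxCredentials`, `provision_with_rollback`, and the in-memory `FakeKeycard` are all hypothetical, and the real code would be async and return `tonic::Status`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct SandboxCredentials {
    pub application_id: String,
    pub client_id: String,
    pub client_secret: String,
}

/// Hypothetical abstraction over the Keycard HTTP client.
pub trait KeycardApi {
    fn create_application(&mut self, spiffe_id: &str) -> Result<String, String>;
    fn create_password_credential(&mut self, application_id: &str) -> Result<(String, String), String>;
    fn delete_application(&mut self, application_id: &str) -> Result<(), String>;
}

/// Creates the APPLICATION and credential, then runs K8s creation.
/// Any failure after the APPLICATION exists triggers a best-effort delete.
pub fn provision_with_rollback<K: KeycardApi>(
    keycard: &mut K,
    sandbox_id: &str,
    create_k8s: impl FnOnce() -> Result<(), String>,
) -> Result<SandboxCredentials, String> {
    let spiffe_id = format!("spiffe://openshell/sandbox/{}", sandbox_id);
    // If this fails, nothing exists yet, so sandbox creation simply aborts.
    let application_id = keycard.create_application(&spiffe_id)?;

    let (client_id, client_secret) = match keycard.create_password_credential(&application_id) {
        Ok(pair) => pair,
        Err(err) => {
            // Cleanup errors are ignored here; a reconciler would have to catch them.
            let _ = keycard.delete_application(&application_id);
            return Err(err);
        }
    };

    if let Err(err) = create_k8s() {
        let _ = keycard.delete_application(&application_id);
        return Err(err);
    }
    Ok(SandboxCredentials { application_id, client_id, client_secret })
}

/// In-memory fake used to exercise the rollback path.
#[derive(Default)]
pub struct FakeKeycard {
    pub live: Vec<String>,
    next: u32,
}

impl KeycardApi for FakeKeycard {
    fn create_application(&mut self, _spiffe_id: &str) -> Result<String, String> {
        self.next += 1;
        let id = format!("app-{}", self.next);
        self.live.push(id.clone());
        Ok(id)
    }
    fn create_password_credential(&mut self, application_id: &str) -> Result<(String, String), String> {
        Ok((format!("{}-client", application_id), "generated-secret".to_string()))
    }
    fn delete_application(&mut self, application_id: &str) -> Result<(), String> {
        self.live.retain(|id| id.as_str() != application_id);
        Ok(())
    }
}
```

A server crash between the Keycard call and K8s creation is not covered by in-process rollback, which is why orphan cleanup remains an open question.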
Modified delete_sandbox() in grpc.rs:
- Before or after K8s resource deletion:
  - Check if sandbox has a Keycard provider
  - Retrieve the Keycard APPLICATION ID (stored in sandbox metadata or a mapping)
  - Call Keycard API to delete the APPLICATION
Modified resolve_provider_environment() in grpc.rs:
- For Keycard providers, distinguish between admin credentials (used by server, NOT injected) and sandbox credentials (injected as `KEYCARD_CLIENT_ID`, `KEYCARD_CLIENT_SECRET`)
- Admin credentials in `config` map should not leak into sandbox env
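One possible shape for the filtering, as a sketch only. `ProviderRecord` is a simplified stand-in for the stored record, and passing the per-sandbox credentials in as an `Option` is an assumption, since where those credentials live is still an open design question.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a stored provider record.
pub struct ProviderRecord {
    pub provider_type: String,
    pub credentials: HashMap<String, String>,
    pub config: HashMap<String, String>,
}

/// Builds the sandbox env map. Keycard providers contribute only the
/// per-sandbox credential pair; every other provider keeps today's
/// behaviour of injecting its whole `credentials` map.
pub fn resolve_env(
    providers: &[ProviderRecord],
    keycard_sandbox_creds: Option<(&str, &str)>,
) -> HashMap<String, String> {
    let mut env = HashMap::new();
    for provider in providers {
        if provider.provider_type == "keycard" {
            if let Some((client_id, client_secret)) = keycard_sandbox_creds {
                env.insert("KEYCARD_CLIENT_ID".to_string(), client_id.to_string());
                env.insert("KEYCARD_CLIENT_SECRET".to_string(), client_secret.to_string());
            }
            // Admin values stored on the keycard record are never copied.
            continue;
        }
        env.extend(provider.credentials.iter().map(|(k, v)| (k.clone(), v.clone())));
    }
    env
}
```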
Provider registration:
- Add `keycard` module to `providers/mod.rs`
- Register in `ProviderRegistry::new()` in `lib.rs`
- Add `"keycard"` to `normalize_provider_type()`
Alternative Approaches Considered
1. Where does Keycard lifecycle logic live?
- Option A (inline in `grpc.rs`): Add Keycard API calls directly in `create_sandbox()` and `delete_sandbox()`. Simplest, but couples server to a specific provider.
- Option B (provider lifecycle trait): Extend `ProviderPlugin` or create a `LifecycleProvider` trait with `on_sandbox_create()` / `on_sandbox_delete()` async methods. Cleaner abstraction but more engineering upfront.
- Option C (generic provider hook system): Event-based dispatch in the server for all provider lifecycle events. Most flexible, most complex, likely premature.
- Trade-off: Option A is pragmatic for a first implementation. Option B should be considered if a second lifecycle provider emerges. This is a decision for human review.
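For reference, Option B might look something like the following. This is an illustrative sketch: the methods are synchronous here to stay dependency-free (the option as stated calls for async), and `SandboxContext`, `PassiveProvider`, `RecordingProvider`, and the `EXAMPLE_KEY` env var are hypothetical.

```rust
/// Hypothetical context handed to lifecycle hooks.
pub struct SandboxContext<'a> {
    pub sandbox_id: &'a str,
}

pub trait LifecycleProvider {
    /// Runs before K8s creation; returns env vars to inject. Default: nothing.
    fn on_sandbox_create(&mut self, _ctx: &SandboxContext) -> Result<Vec<(String, String)>, String> {
        Ok(Vec::new())
    }

    /// Runs during sandbox deletion. Default: nothing.
    fn on_sandbox_delete(&mut self, _ctx: &SandboxContext) -> Result<(), String> {
        Ok(())
    }
}

/// Existing passive providers would inherit the no-op defaults unchanged.
pub struct PassiveProvider;
impl LifecycleProvider for PassiveProvider {}

/// Example active provider that records which hooks fired.
#[derive(Default)]
pub struct RecordingProvider {
    pub events: Vec<String>,
}

impl LifecycleProvider for RecordingProvider {
    fn on_sandbox_create(&mut self, ctx: &SandboxContext) -> Result<Vec<(String, String)>, String> {
        self.events.push(format!("create:{}", ctx.sandbox_id));
        Ok(vec![("EXAMPLE_KEY".to_string(), ctx.sandbox_id.to_string())])
    }

    fn on_sandbox_delete(&mut self, ctx: &SandboxContext) -> Result<(), String> {
        self.events.push(format!("delete:{}", ctx.sandbox_id));
        Ok(())
    }
}
```

The default methods are what would let existing providers stay untouched, which is the main argument for Option B over adding provider-specific branches in `grpc.rs`.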
2. Per-sandbox credential storage strategy:
- Option A (store in provider credentials map): Use the existing `credentials` map on a per-sandbox Keycard provider record. Flows through existing injection path seamlessly. Contradicts strict "never stores" but credentials are ephemeral (deleted with sandbox).
- Option B (ephemeral credential store): New data structure scoped to sandbox lifetime, not persisted to provider store. Architecturally novel, more complex.
- Option C (SandboxSpec environment): Inject directly into pod env. Simpler but bypasses the secret placeholder/proxy-resolution system.
- Trade-off: Option A is pragmatic. The credentials ARE stored temporarily but are scoped to the sandbox lifetime and cleaned up on decommission. The "never stores" requirement should be interpreted as "never stores long-lived credentials."
3. SPIFFE ID format:
- No SPIFFE framework exists in the codebase. The SPIFFE ID would be a string convention, not a full SPIFFE runtime.
- Proposed format: `spiffe://openshell/sandbox/{sandbox_id}` (needs human input on trust domain and hierarchy).
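Treated as a plain string convention, the proposed format could be built and parsed with helpers like these. The function names and the conservative character check are assumptions. The `openshell` trust domain is the proposal from this issue and is not final.

```rust
/// Trust domain from the proposed format; still subject to review.
pub const TRUST_DOMAIN: &str = "openshell";

/// Builds the SPIFFE ID for a sandbox, rejecting IDs that would not form
/// a single clean path segment (empty, or containing anything other than
/// ASCII letters, digits, '-', '_' or '.').
pub fn sandbox_spiffe_id(sandbox_id: &str) -> Result<String, String> {
    let valid = !sandbox_id.is_empty()
        && sandbox_id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.'));
    if !valid {
        return Err(format!("invalid sandbox id for SPIFFE path: {:?}", sandbox_id));
    }
    Ok(format!("spiffe://{}/sandbox/{}", TRUST_DOMAIN, sandbox_id))
}

/// Recovers the sandbox ID from a SPIFFE ID in the proposed format.
pub fn sandbox_id_from_spiffe(spiffe_id: &str) -> Option<&str> {
    spiffe_id
        .strip_prefix("spiffe://openshell/sandbox/")
        .filter(|rest| !rest.is_empty() && !rest.contains('/'))
}
```

Because the mapping is reversible, the SPIFFE ID could double as the lookup key at deletion time if Keycard allows finding an APPLICATION by identifier.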
Patterns to Follow
- Provider plugin pattern: Follow `github.rs` or `claude.rs` for the `ProviderPlugin` implementation, using the same registration, discovery, and environment variable patterns.
- HTTP client: Use `reqwest` with `rustls-tls` (already in workspace `Cargo.toml` at version 0.12).
- Error handling: Use `tonic::Status` for gRPC errors in server code.
- Testing: Use `wiremock` (already a dev-dependency of `openshell-server`) for mocking Keycard HTTP API. Use `Store::connect("sqlite::memory:")` for persistence tests.
- Provider docs: Update `architecture/sandbox-providers.md` following its existing structure (has sections per provider type).
Proposed Approach
Introduce a new keycard provider type that holds admin API credentials (base URL, zone ID, client ID, client secret) in its config map. During sandbox creation, the server detects the Keycard provider, calls the Keycard API to create an APPLICATION with a SPIFFE-formatted identifier and a password credential, then makes the generated identifier/password available as KEYCARD_CLIENT_ID/KEYCARD_CLIENT_SECRET for sandbox env injection. During sandbox deletion, the server calls the Keycard API to delete the APPLICATION. The initial implementation places the lifecycle logic inline in create_sandbox()/delete_sandbox() with a clear path toward a lifecycle trait abstraction if needed later.
Scope Assessment
- Complexity: Medium-High
- Confidence: Medium — clear path for core functionality, several design decisions need human input (credential storage strategy, SPIFFE ID format, lifecycle logic location)
- Estimated files to change: ~8-10
- Issue type: `feat`
Risks & Open Questions
- Failure modes during provisioning: If Keycard API is unreachable, should sandbox creation be blocked entirely or proceed without credentials? Partial failures (APPLICATION created but credential creation fails) need rollback logic.
- Orphaned Keycard APPLICATIONs: If sandbox K8s creation fails after the Keycard APPLICATION was created, or if the server crashes between the two operations, Keycard APPLICATIONs could be orphaned. A reconciliation/cleanup mechanism may be needed.
- Admin vs sandbox credential separation: The current `resolve_provider_environment()` injects all credentials from a provider, so the Keycard admin credentials must be explicitly excluded from sandbox env injection.
- SPIFFE ID format and trust domain: What should the trust domain be? Should it include the gateway namespace or cluster identity? (Proposed: `spiffe://openshell/sandbox/{sandbox_id}`)
- Credential lifecycle mapping: Where is the Keycard APPLICATION ID stored so it can be retrieved during sandbox deletion? Options: sandbox metadata, a dedicated mapping table, or derived from the SPIFFE ID.
- Per-sandbox provider record vs dynamic resolution: Should a per-sandbox Keycard provider record be created, or should a single shared provider record hold admin creds with per-sandbox credentials resolved dynamically?
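On the orphaned-APPLICATION risk, a reconciliation pass could be as simple as a set difference. This sketch assumes the server can obtain the identifiers of existing Keycard APPLICATIONs, which this issue does not confirm (only create and delete endpoints are listed). The function name is hypothetical.

```rust
use std::collections::HashSet;

/// Returns the Keycard identifiers that follow the sandbox SPIFFE convention
/// but whose sandbox no longer exists. Identifiers outside the convention
/// are left alone, since they were not created by this server.
pub fn orphaned_applications<'a>(
    keycard_identifiers: &'a [String],
    live_sandbox_ids: &HashSet<String>,
) -> Vec<&'a str> {
    const PREFIX: &str = "spiffe://openshell/sandbox/";
    keycard_identifiers
        .iter()
        .filter(|id| match id.strip_prefix(PREFIX) {
            Some(sandbox_id) => !live_sandbox_ids.contains(sandbox_id),
            None => false,
        })
        .map(String::as_str)
        .collect()
}
```

A real reconciler would also need a grace period so that an APPLICATION created moments before its sandbox record is persisted is not deleted by mistake.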
Test Considerations
- Unit tests for Keycard HTTP client: Use `wiremock` to mock all three Keycard API endpoints (create application, create credential, delete application). Test success paths, error responses, and timeout handling.
- Unit tests for modified `resolve_provider_environment()`: Extend the existing 7 tests to cover Keycard-specific credential resolution and admin credential filtering.
- Integration tests for sandbox lifecycle: Test that `create_sandbox()` calls Keycard API and injects credentials, and `delete_sandbox()` calls Keycard API to clean up.
- Failure/rollback tests: Test partial failure scenarios, such as Keycard success then K8s failure, and Keycard credential creation failure after APPLICATION creation.
- Provider validation tests: Test that Keycard provider creation validates required config keys (zone_id, client_id, client_secret).
- E2e tests may be needed if a test Keycard environment is available, but unit tests with `wiremock` should provide sufficient coverage for the initial implementation.
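The provider validation case could be driven by a small helper like the one below. It is a sketch: the function name is hypothetical, and treating `base_url` as required alongside the three keys named above is an assumption based on the expected config keys listed earlier.

```rust
use std::collections::HashMap;

/// Config keys the Keycard provider is expected to carry.
pub const REQUIRED_KEYS: [&str; 4] = ["base_url", "zone_id", "client_id", "client_secret"];

/// Returns the required keys that are absent or blank, in declaration order.
/// An empty result means the config passes validation.
pub fn missing_keycard_config_keys(config: &HashMap<String, String>) -> Vec<&'static str> {
    REQUIRED_KEYS
        .iter()
        .copied()
        .filter(|key| config.get(*key).map_or(true, |value| value.trim().is_empty()))
        .collect()
}
```

In `create_provider_record()`, a non-empty result would presumably map to an invalid-argument style gRPC error that names the missing keys.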
Created by spike investigation. Use build-from-issue to plan and implement.