
PoC: OpenShell integration#393

Draft
josh-pritchard wants to merge 4 commits into agentregistry-dev:main from josh-pritchard:openshell-poc

@josh-pritchard josh-pritchard commented Mar 20, 2026

Summary

Proof-of-concept integration of OpenShell as a third deployment platform alongside local (Docker Compose) and kubernetes. OpenShell provides secure, sandboxed agent execution with defense-in-depth isolation (Landlock LSM, seccomp-bpf, network namespaces, inference routing).

This PoC validates the end-to-end flow: register agent → build image → deploy to OpenShell sandbox → sandbox reaches READY. It is not merge-ready — see "Not working / Known gaps" below.

What's working

  • gRPC client (client.go): Connects to an OpenShell gateway with mTLS authentication. Supports endpoint discovery via env vars (OPENSHELL_GATEWAY_ENDPOINT) or filesystem config (~/.config/openshell/gateways/). Lazy client initialization — AR server starts fine without OpenShell installed.
  • Deployment adapter (deployment_adapter.go): Full DeploymentPlatformAdapter implementation — Deploy, Undeploy, GetLogs, Cancel. Polls GetSandbox until the sandbox reaches READY phase (120s timeout).
  • Provider registration: openshell platform registered in provider adapters and deployment platform map. DB migration seeds an openshell-default provider.
  • Proto vendoring: Makefile target (make sync-openshell-proto) fetches protos from NVIDIA/OpenShell at a pinned version and generates Go stubs. Both protos and generated code are checked in so builds don't require protoc.
  • UI deploy target selector: Deploy dialog fetches available providers and shows a dropdown instead of hardcoding providerId: "local". Platform display names map openshell → "OpenShell".
  • Docker Compose config: AR server container configured with OpenShell gateway endpoint and mTLS cert mount for local dev.
  • Unit tests: Client and deployment adapter have full mock-based test coverage.

Not working / Known gaps

  • E2E tests: Test scaffolding is in place (skip when OpenShell unavailable, image loading into K3s, sandbox verification/cleanup helpers) but tests have not been run end-to-end in CI. They require an OpenShell gateway running locally.
  • Image compatibility: OpenShell's supervisor requires images to include iproute2 (for ip binary) and a sandbox user/group. Standard agent images will fail without these. See "Image requirements" below.
  • CLI timeout: arctl deploy create blocks synchronously for up to 120s while the sandbox provisions. The arctl HTTP client can time out before the server-side deploy completes. Needs an async deploy pattern (return immediately, poll status).
  • Agent invocation: No way to send prompts to deployed agents. arctl agent run only works with the local Docker Compose platform. There is no deploy invoke or deploy chat command.
  • SPA routing fix: Includes an unrelated fix to server.go for Next.js static export routing (.html suffix resolution, SPA fallback). Should be split into a separate PR before merge.
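For the CLI timeout gap, the "return immediately, poll status" pattern could look something like the sketch below. It is a minimal in-memory illustration and not a proposal for the actual AR API: the Tracker type, its methods, and the status names are hypothetical, and a real implementation would presumably persist status on the deployment record instead of a map.

```go
package main

import "sync"

// Status values are hypothetical; AR's real deployment states may differ.
type Status string

const (
	StatusDeploying Status = "deploying"
	StatusDeployed  Status = "deployed"
	StatusFailed    Status = "failed"
)

// Tracker records deployment status so a client can poll instead of blocking.
type Tracker struct {
	mu       sync.Mutex
	statuses map[string]Status
	wg       sync.WaitGroup
}

func NewTracker() *Tracker {
	return &Tracker{statuses: make(map[string]Status)}
}

func (t *Tracker) set(id string, s Status) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.statuses[id] = s
}

// Start marks the deployment as in progress, runs provision in the
// background, and returns without waiting for it to finish.
func (t *Tracker) Start(id string, provision func() error) {
	t.set(id, StatusDeploying)
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		if err := provision(); err != nil {
			t.set(id, StatusFailed)
			return
		}
		t.set(id, StatusDeployed)
	}()
}

// Status returns the last recorded status for id, if any.
func (t *Tracker) Status(id string) (Status, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.statuses[id]
	return s, ok
}

// Wait blocks until all background provisioning has finished
// (useful for tests and graceful shutdown).
func (t *Tracker) Wait() { t.wg.Wait() }
```

The HTTP handler would call Start and respond right away with the deployment ID; arctl would then poll a status endpoint backed by Status until it sees a terminal state.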

Image requirements for OpenShell

OpenShell's supervisor binary is sideloaded into user containers at runtime. It requires:

  1. iproute2 package — supervisor shells out to ip for network namespace creation (veth pairs, netns). Without it, sandbox creation fails with ENOENT.
  2. iptables package (optional) — enables network bypass detection.
  3. sandbox user and group — supervisor drops privileges to this user after setup.

For Alpine/Wolfi-based images (like kagent-adk):

USER root
RUN apk add --no-cache iproute2 iptables && \
    addgroup -S sandbox && adduser -S -G sandbox -s /bin/sh sandbox

These requirements should eventually be handled automatically by arctl agent build when targeting OpenShell, rather than requiring manual Dockerfile changes.

Why OpenShell is a good fit for AR

  • Secure local alternative: The local platform (Docker Compose) has zero isolation. OpenShell provides Landlock, seccomp, network namespaces, and inference routing out of the box — same UX, defense-in-depth security.
  • Purpose-built API: gRPC gateway with mTLS is designed for programmatic sandbox management. Cleaner integration surface than shelling out to Docker/kubectl.
  • Runtime, not infrastructure: K3s is an implementation detail hidden from users. They interact with sandboxes, not pods. AR manages the registry; OpenShell manages execution.
  • Multi-gateway architecture: One AR instance can target multiple OpenShell gateways (dev laptop, staging server, production). Each provider maps to a gateway endpoint. This naturally could extend AR to multi-environment deployment.
  • Inference routing opportunity: OpenShell can enforce which LLM providers an agent can reach. AR could integrate this for provider-level access control — a capability neither local nor kubernetes offers.

Local testing steps

# 1. Start OpenShell gateway
openshell gateway start --name ar-dev

# 2. Start AgentRegistry
make docker-compose-up

# 3. Build an agent with OpenShell requirements
arctl agent init adk python my-agent
# Edit Dockerfile to add iproute2 + sandbox user (see above)
arctl agent build my-agent

# 4. Load image into OpenShell's K3s
docker save my-agent:latest | \
  docker exec -i $(docker ps --filter name=openshell-cluster- --format '{{.Names}}') \
  ctr -n k8s.io images import --all-platforms -

# 5. Set image pull policy (needed for :latest tags)
kubectl set env statefulset/openshell -n openshell \
  OPENSHELL_SANDBOX_IMAGE_PULL_POLICY=IfNotPresent

# 6. Deploy
arctl deploy create my-agent --type agent --provider-id openshell-default

# 7. Verify
openshell sandbox list
arctl deploy show <deployment-id>


Copilot AI left a comment


Pull request overview

Proof-of-concept integration of OpenShell as a third deployment platform (alongside local and kubernetes), including vendored OpenShell gRPC protos/clients, adapter wiring, provider seeding, UI provider selection, and E2E scaffolding. The PR also includes an unrelated UI static-export routing change in the API server.

Changes:

  • Add openshell deployment adapter + gRPC client (mTLS + gateway discovery) with unit tests.
  • Vendor OpenShell protos + generated Go stubs, plus a Makefile sync target and a DB seed migration for openshell-default.
  • Update UI deploy dialog to select a provider dynamically; extend E2E deploy targets and local docker-compose dev config.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| internal/registry/platforms/openshell/client.go | OpenShell gRPC client (endpoint discovery + mTLS) and basic sandbox operations |
| internal/registry/platforms/openshell/client_test.go | Unit tests for gateway metadata / TLS loading and small helpers |
| internal/registry/platforms/openshell/deployment_adapter.go | OpenShell DeploymentPlatformAdapter implementation (deploy, undeploy, logs, cancel) |
| internal/registry/platforms/openshell/deployment_adapter_test.go | Mock-based unit tests for adapter behavior (deploy polling, etc.) |
| internal/registry/platforms/openshell/provider_config.go | Provider-level config type for the OpenShell platform |
| internal/registry/platforms/openshell/proto/OPENSHELL_PROTO_VERSION | Pinned upstream proto version marker |
| internal/registry/platforms/openshell/proto/openshell.proto | Vendored OpenShell service proto |
| internal/registry/platforms/openshell/proto/datamodel.proto | Vendored OpenShell datamodel proto |
| internal/registry/platforms/openshell/proto/sandbox.proto | Vendored sandbox policy proto |
| internal/registry/platforms/openshell/proto/inference.proto | Vendored inference proto |
| internal/registry/platforms/openshell/proto/gen/openshell.pb.go | Generated Go stubs for openshell.proto |
| internal/registry/platforms/openshell/proto/gen/openshell_grpc.pb.go | Generated Go gRPC stubs for OpenShell service |
| internal/registry/platforms/openshell/proto/gen/datamodel.pb.go | Generated Go stubs for datamodel.proto |
| internal/registry/platforms/openshell/proto/gen/sandbox.pb.go | Generated Go stubs for sandbox.proto |
| internal/registry/platforms/openshell/proto/gen/inference.pb.go | Generated Go stubs for inference.proto |
| internal/registry/platforms/openshell/proto/gen/inference_grpc.pb.go | Generated Go gRPC stubs for Inference service |
| internal/registry/registry_app.go | Wire openshell into the deployment platform adapter map |
| internal/registry/api/handlers/v0/provider_adapters.go | Register openshell provider adapter |
| internal/registry/database/migrations/011_seed_openshell_provider.sql | Seed openshell-default provider |
| internal/daemon/docker-compose.yml | Local dev env vars + mTLS mount for OpenShell gateway access |
| Makefile | Add sync-openshell-proto target to fetch protos and regenerate stubs |
| e2e/deploy_test.go | Add OpenShell deploy targets + helpers for image loading / sandbox verification |
| ui/components/deploy-server-dialog.tsx | UI deploy dialog: provider dropdown (instead of hardcoded local) |
| ui/lib/platform-display.ts | Platform display mapping for UI labels/descriptions |
| internal/registry/api/server.go | SPA/static-export routing logic for embedded UI assets |


Comment on lines +134 to +138
return nil, fmt.Errorf("deployment is required: %w", database.ErrInvalidInput)
}
sandboxName := sandboxNameForDeployment(deployment)
return client.GetSandboxLogs(ctx, sandboxName)
}

Copilot AI Mar 20, 2026

GetLogs() passes the sandbox name into Client.GetSandboxLogs(), but the gRPC request uses GetSandboxLogsRequest.sandbox_id (see grpcClient.GetSandboxLogs). Unless OpenShell treats name == id, this will fail to fetch logs. Suggested fix: persist the sandbox ID in DeploymentActionResult.ProviderMetadata during Deploy (from CreateSandbox response) and use that for log retrieval, or resolve name->id via GetSandbox before calling GetSandboxLogs.

Comment on lines +42 to +47
// Try the exact path first (handles static assets like .js, .css, etc.)
if f, err := httpFS.Open(path); err == nil {
f.Close()
// Check if it's a file (not a directory) — serve it directly
if stat, err := f.(interface{ Stat() (fs.FileInfo, error) }).Stat(); err == nil && !stat.IsDir() {
fileServer.ServeHTTP(w, r)

Copilot AI Mar 20, 2026

The UI handler’s existence check is broken: it calls httpFS.Open(r.URL.Path) with a leading "/" (http.FS expects paths without the leading slash), and it closes the file before calling Stat() on it. As written, this will fail to detect/serve existing files and can incorrectly fall through to the 404 or SPA-fallback path. Fix by normalizing the path (e.g., strings.TrimPrefix(path, "/")) for all Open() calls, and only closing after Stat() (use a defer close).
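A minimal sketch of the Stat-before-Close ordering follows. The isRegularFile and demoAssets helpers are hypothetical and not the code in server.go. The sketch is written against http.FileSystem, whose Open takes rooted paths such as "/app.js"; if the handler works on a raw fs.FS instead, the leading slash would need trimming first, as the comment suggests.

```go
package main

import (
	"net/http"
	"testing/fstest"
)

// isRegularFile reports whether name exists in fsys and is not a directory.
// Close is deferred so that Stat runs on a file that is still open.
func isRegularFile(fsys http.FileSystem, name string) bool {
	f, err := fsys.Open(name)
	if err != nil {
		return false
	}
	defer f.Close()

	info, err := f.Stat()
	return err == nil && !info.IsDir()
}

// demoAssets builds a tiny in-memory asset tree for illustration only.
func demoAssets() http.FileSystem {
	return http.FS(fstest.MapFS{
		"app.js":          {Data: []byte("console.log('hi')")},
		"docs/index.html": {Data: []byte("<html></html>")},
	})
}
```

A handler could then serve the exact path when isRegularFile returns true, and otherwise try the .html suffix before falling back to the SPA entry point.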

Comment on lines +72 to +80
slog.Info("openshell: deploy started", "server", req.ServerName, "provider", req.ProviderID)
client, err := a.getClient()
if err != nil {
return nil, err
}
slog.Info("openshell: client ready")
if err := utils.ValidateDeploymentRequest(req, false); err != nil {
return nil, err
}

Copilot AI Mar 20, 2026

Deploy() logs req.ServerName/req.ProviderID before validating that req is non-nil. ValidateDeploymentRequest already checks for nil, but it’s called after this log line, so a nil request will panic. Move validation before any field access/logging of req.

Suggested change
- slog.Info("openshell: deploy started", "server", req.ServerName, "provider", req.ProviderID)
- client, err := a.getClient()
- if err != nil {
- return nil, err
- }
- slog.Info("openshell: client ready")
- if err := utils.ValidateDeploymentRequest(req, false); err != nil {
- return nil, err
- }
+ if err := utils.ValidateDeploymentRequest(req, false); err != nil {
+ return nil, err
+ }
+ slog.Info("openshell: deploy started", "server", req.ServerName, "provider", req.ProviderID)
+ client, err := a.getClient()
+ if err != nil {
+ return nil, err
+ }
+ slog.Info("openshell: client ready")

Comment on lines +109 to +120
func (a *openshellDeploymentAdapter) Undeploy(_ context.Context, deployment *models.Deployment) error {
client, err := a.getClient()
if err != nil {
return err
}
if err := utils.ValidateDeploymentRequest(deployment, true); err != nil {
return err
}

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()


Copilot AI Mar 20, 2026

Undeploy() discards the caller’s context and always uses context.Background() with a new timeout. This prevents request cancellation/deadlines from propagating (and differs from the local/kubernetes adapters which use the passed ctx). Prefer deriving the timeout from the incoming ctx (context.WithTimeout(ctx, …)).

@peterj peterj self-assigned this Mar 20, 2026
peterj added 2 commits March 20, 2026 14:08
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
peterj commented Mar 20, 2026

I was able to partially get it to work:

  • made changes to the Dockerfile to use the sandbox user and install iproute2/iptables
  • automatically creating the provider based on the model set in the agent (note that for gemini, we have to use generic as openshell doesn't have a dedicated type for that yet)
  • creating the policy that allows egress to the model + allows executing the python binary

Deployment works: you can deploy an agent to OpenShell. However, for whatever reason I wasn't able to set OPENSHELL_SANDBOX_COMMAND to the actual kagent-adk command we need to launch the agent. OpenShell keeps changing it to sleep infinity (might have to look at this with fresh eyes).

Anyway, so you can still make it work manually after that:

# getting a shell inside the sandbox
RUST_LOG=debug openshell sandbox connect myagent

# in the shell
kagent-adk run --host 0.0.0.0 --port 9999 myagent --local

Then, start port forward to the sandbox:

RUST_LOG=debug openshell forward start 9999 myagent

And then you can send the A2A request to localhost:9999 and that will work.

Interestingly, if I provide the dockerfile directly to the sandbox create command, that works without issues:

RUST_LOG=debug openshell sandbox create \
    --from Dockerfile \
    --forward 9999 \
    -- kagent-adk run --host 0.0.0.0 --port 8080 myagent --local
