diff --git a/README.md b/README.md
index 7ab75c2..bfebe3d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,7 @@
 # rhdh-fullsend
 
-Custom fullsend sandbox images for the RHDH team's agent infrastructure.
+Custom fullsend sandbox images and deployment documentation for the RHDH
+team's agent infrastructure.
 
 ## Why this repo exists
 
@@ -14,6 +15,16 @@ workaround (a `host_files`-mounted shell script) is fragile.
 This repo builds a single image that extends `fullsend-code:latest` with
 corepack and yarn pre-activated.
 
+## Documentation
+
+| Doc | What it covers |
+|-----|---------------|
+| [Local Setup](docs/local-setup.md) | Podman VM, OpenShell gateway, GCP credentials, running agents locally |
+| [Repo Onboarding](docs/repo-onboarding.md) | Installing fullsend on a new RHDH repo (standard and manual methods) |
+| [GCP Infrastructure](docs/gcp-infrastructure.md) | GCP project, WIF providers, IAM, service accounts |
+| [Sandbox Networking](docs/sandbox-networking.md) | DNS inside OpenShell sandboxes — why it fails, workarounds |
+| [Known Issues](docs/known-issues.md) | Active friction points, workarounds, upstream tracking |
+
 ## Image
 
 ```
@@ -52,8 +63,7 @@ This replaces the `sandbox-yarn-setup.sh` + `host_files` workaround.
 
 ## Local agent runs
 
-See [docs/local-setup.md](docs/local-setup.md) for the full guide: Podman VM,
-OpenShell gateway, GCP credentials, SSH tunnel, and running agents end-to-end.
+See [Local Setup](docs/local-setup.md) for running agents on macOS.
 
 ## Local build
 
diff --git a/docs/gcp-infrastructure.md b/docs/gcp-infrastructure.md
new file mode 100644
index 0000000..0694760
--- /dev/null
+++ b/docs/gcp-infrastructure.md
@@ -0,0 +1,233 @@
+# GCP Infrastructure
+
+GCP project, Workload Identity Federation, IAM, and service account reference
+for the RHDH fullsend setup.
+
+## Project context
+
+| Field | Value |
+|-------|-------|
+| GCP project ID | `rhdh-sidekick-167988` |
+| GCP project number | `189673402608` |
+| Vertex AI region | `us-east5` |
+| WIF pool | `fullsend-pool` (ACTIVE) |
+| IAM admin group | `rhdh-sidekick@redhat.com` |
+
+The project lives under `IT Public Cloud > Sandbox > Customers` in the GCP
+org hierarchy. The admin group has `iam.workloadIdentityPoolAdmin`,
+`iam.serviceAccountAdmin`, and `iam.serviceAccountKeyAdmin` — sufficient to
+self-provision WIF providers and service accounts without fullsend team
+involvement.
+
+**Conditional IAM restriction:** The `projectIamAdmin` role on this project
+is restricted to granting only `roles/aiplatform.user`:
+
+```
+expression: api.getAttribute('iam.googleapis.com/modifiedGrantsByRole',
+  []).hasOnly(['roles/aiplatform.user'])
+```
+
+This means you cannot grant yourself additional roles or enable APIs. All
+changes beyond `aiplatform.user` must go through IT (ServiceNow ticket).
+
+## WIF providers
+
+Each repo gets its own WIF provider, scoped via `attribute-condition` to
+that specific repository.
+
+### Current providers
+
+| Provider | Repo scope | State |
+|----------|-----------|-------|
+| `gh-redhat-developer-rhdh-agentic` | `redhat-developer/rhdh-agentic` | ACTIVE |
+| `gh-redhat-developer-rhdh-plugins` | `redhat-developer/rhdh-plugins` | ACTIVE |
+| `gh-rhdeveloper-plugin-export` | `redhat-developer/rhdh-plugin-export-overlays` | ACTIVE |
+
+### Creating a new WIF provider
+
+```bash
+PROVIDER_NAME="gh-redhat-developer-" # max 32 chars
+PROVIDER_PATH="projects/189673402608/locations/global/workloadIdentityPools/fullsend-pool/providers/${PROVIDER_NAME}"
+
+gcloud iam workload-identity-pools providers create-oidc "$PROVIDER_NAME" \
+  --location=global \
+  --workload-identity-pool=fullsend-pool \
+  --project=rhdh-sidekick-167988 \
+  --issuer-uri=https://token.actions.githubusercontent.com \
+  --allowed-audiences="fullsend-mint,https://iam.googleapis.com/${PROVIDER_PATH}" \
+  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository,attribute.repository_owner=assertion.repository_owner" \
+  --attribute-condition="assertion.repository == '/'"
+```
+
+### Dual-audience requirement
+
+Two audiences are required in `--allowed-audiences`:
+
+| Audience | Used by | Step |
+|----------|---------|------|
+| `fullsend-mint` | Mint token exchange | GitHub OIDC → fullsend session token |
+| `https://iam.googleapis.com/projects/189673402608/.../providers/` | `google-github-actions/auth` | GCP credential setup for Vertex AI |
+
+Omitting the second audience causes an `audience mismatch` error at the
+"Setup GCP" step in the workflow. The `fullsend admin install` CLI sets
+both automatically; manual provider creation must include both.
+
+### IAM binding
+
+The existing `aiplatform.user` binding covers all `redhat-developer` repos
+via the `attribute.repository_owner` principal set:
+
+```
+principalSet://iam.googleapis.com/projects/189673402608/locations/global/workloadIdentityPools/fullsend-pool/attribute.repository_owner/redhat-developer
+```
+
+No per-repo IAM binding is needed after the initial setup.
+
+## Service accounts
+
+For local agent runs (not CI). See also
+[Local Setup — GCP Credentials](local-setup.md#step-3-gcp-credentials).
+
+### Creating a service account
+
+```bash
+gcloud iam service-accounts create fullsend-local \
+  --display-name="Fullsend local agent runner" \
+  --project=rhdh-sidekick-167988
+```
+
+There is a propagation delay of a few seconds before the SA can be used in
+IAM bindings.
+
+### Granting Vertex AI access
+
+```bash
+gcloud projects add-iam-policy-binding rhdh-sidekick-167988 \
+  --member="serviceAccount:fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com" \
+  --role="roles/aiplatform.user" \
+  --condition=None
+```
+
+`--condition=None` is required because the project has conditional IAM
+bindings. Without it, `gcloud` prompts interactively.
+
+### Creating a JSON key
+
+```bash
+gcloud iam service-accounts keys create \
+  ~/.config/fullsend/fullsend-local-credentials.json \
+  --iam-account=fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com
+
+chmod 600 ~/.config/fullsend/fullsend-local-credentials.json
+```
+
+The key file contains a private key. Do not commit it to git or share via
+Slack. If compromised:
+
+```bash
+KEY_ID=$(python3 -c "import json,sys; print(json.load(sys.stdin)['private_key_id'])" \
+  < ~/.config/fullsend/fullsend-local-credentials.json)
+gcloud iam service-accounts keys delete "$KEY_ID" \
+  --iam-account=fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com
+```
+
+### Per-person service accounts
+
+For individual usage tracking, create per-person SAs:
+
+```bash
+NAME="fullsend-local-" # kebab-case, max 30 chars
+
+gcloud iam service-accounts create "$NAME" \
+  --display-name="Fullsend local – " \
+  --project=rhdh-sidekick-167988
+
+gcloud projects add-iam-policy-binding rhdh-sidekick-167988 \
+  --member="serviceAccount:${NAME}@rhdh-sidekick-167988.iam.gserviceaccount.com" \
+  --role="roles/aiplatform.user" \
+  --condition=None
+
+gcloud iam service-accounts keys create "/tmp/${NAME}-credentials.json" \
+  --iam-account="${NAME}@rhdh-sidekick-167988.iam.gserviceaccount.com"
+```
+
+Share the key file securely (Bitwarden, 1Password — never Slack or email)
+and delete the local copy.
+
+### Key rotation
+
+Create a new key before deleting the old one to avoid downtime:
+
+```bash
+gcloud iam service-accounts keys create \
+  ~/.config/fullsend/fullsend-local-credentials-new.json \
+  --iam-account=fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com
+
+# Test with the new key, then:
+OLD_KEY_ID=$(python3 -c "import json,sys; print(json.load(sys.stdin)['private_key_id'])" \
+  < ~/.config/fullsend/fullsend-local-credentials.json)
+gcloud iam service-accounts keys delete "$OLD_KEY_ID" \
+  --iam-account=fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com
+
+mv ~/.config/fullsend/fullsend-local-credentials-new.json \
+  ~/.config/fullsend/fullsend-local-credentials.json
+```
+
+## IAM troubleshooting
+
+### "Permission 'aiplatform.endpoints.predict' denied"
+
+The WIF principal has no `roles/aiplatform.user` binding. Verify:
+
+```bash
+gcloud projects get-iam-policy rhdh-sidekick-167988 \
+  --flatten="bindings[].members" \
+  --filter="bindings.members:principalSet" \
+  --format="table(bindings.role, bindings.members)"
+```
+
+If the binding is missing, add it using the org-level principal set (covers
+all repos under `redhat-developer`):
+
+```bash
+gcloud projects add-iam-policy-binding rhdh-sidekick-167988 \
+  --role="roles/aiplatform.user" \
+  --member="principalSet://iam.googleapis.com/projects/189673402608/locations/global/workloadIdentityPools/fullsend-pool/attribute.repository_owner/redhat-developer" \
+  --condition=None
+```
+
+### Installer claims success but binding is missing
+
+The `fullsend admin install` CLI may report "granted roles/aiplatform.user"
+even when the conditional `projectIamAdmin` role silently blocks the grant.
+Always verify with `gcloud projects get-iam-policy` after install. IAM
+propagation can take up to 7 minutes.
+
+### "audience mismatch" at Setup GCP step
+
+The WIF provider was created with only one allowed audience. Update it to
+include both:
+
+```bash
+gcloud iam workload-identity-pools providers update-oidc \
+  --location=global \
+  --workload-identity-pool=fullsend-pool \
+  --project=rhdh-sidekick-167988 \
+  --allowed-audiences="fullsend-mint,https://iam.googleapis.com/"
+```
+
+### Monitoring Vertex AI usage
+
+Via GCP Console: Vertex AI → Model Garden → Usage page. Filter by service
+account for per-SA token consumption.
+
+Via CLI:
+
+```bash
+gcloud logging read \
+  'resource.type="aiplatform.googleapis.com/Endpoint" AND
+   protoPayload.authenticationInfo.principalEmail="fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com"' \
+  --project=rhdh-sidekick-167988 \
+  --limit=10 \
+  --format="table(timestamp, protoPayload.request.model, protoPayload.response.usageMetadata)"
+```
diff --git a/docs/known-issues.md b/docs/known-issues.md
new file mode 100644
index 0000000..ba194ec
--- /dev/null
+++ b/docs/known-issues.md
@@ -0,0 +1,83 @@
+# Known Issues
+
+Active friction points, workarounds, and upstream tracking for the RHDH
+fullsend setup. Last updated: 2026-06-09.
+
+## Sandbox and tooling
+
+| Issue | Impact | Workaround | Status |
+|-------|--------|------------|--------|
+| DNS broken inside sandboxes | `yarn install`, `pip install`, `git clone` fail with `getaddrinfo EAI_AGAIN` | Explicit `httpProxy`/`httpsProxy` in `.yarnrc.yml` pointing to the L7 proxy | By design — see [Sandbox Networking](sandbox-networking.md) |
+| `yarn install` takes 10-15 min in sandbox | Monorepo overhead for large workspaces | Custom image with yarn pre-installed eliminates bootstrap; consider pre-installing deps | Open |
+| Git hooks (husky) need yarn in PATH | Hooks run in subprocesses without the agent's PATH | Custom image with `/usr/local/bin/yarn` wrapper — see [rhdh-fullsend-code image](../README.md) | Solved |
+| Sandbox creation timeout (60s) for large images | Code agent uses `fullsend-code:latest` (larger than triage sandbox) | Upstream fix exists (pre-pull + retry + 120s timeout) but not in `@v0` tag. Set `FULLSEND_SANDBOX_READY_TIMEOUT=180` as env var. | Fixed upstream, pending `@v0` release |
+| `/etc/resolv.conf` points to unreachable nameserver | Tools timeout instead of failing fast | None — consider filing OpenShell issue | Open |
+
+## Agent behavior
+
+| Issue | Impact | Workaround | Status |
+|-------|--------|------------|--------|
+| Triage doesn't auto-trigger on `issues/opened` | Must use `/fs-triage` slash command | Post `/fs-triage` as issue comment | By design — dispatcher only handles `issues/labeled` |
+| Coder doesn't auto-trigger from triage | Triage labels `triaged`, not `ready-to-code` | Post `/fs-code` manually after triage | By design |
+| Fix only triggers from bot reviews | Human `changes_requested` reviews don't trigger fix agent | Post `/fs-fix` manually | By design |
+| Retro dropped by concurrency group collision | Retro job gets cancelled by other dispatch jobs | Post `/fs-retro` manually in a quiet window | Open |
+| Custom agent stages not supported in per-repo mode | Cannot register custom `/fs-*` slash commands | Extend existing agents with custom skills instead of building standalone agents | Architectural limitation |
+
+## Monorepo-specific
+
+| Issue | Impact | Workaround | Status |
+|-------|--------|------------|--------|
+| No workspace awareness | Agent sees full repo context, not just the workspace a PR touches | `paths` filter on `pull_request_target` for workspace-level triggering | Partial — shim-level only |
+| Routing skill: label priority | Agent guesses workspace from title/body instead of trusting `workspace/*` label | Improve routing skill to prioritize existing labels | Open |
+| Routing skill not in triage harness | Triage has no workspace awareness — can misroute issues | Add routing skill to triage harness | Open |
+| `workspace/*` labels not automated | Must manually create labels when adding workspaces | Automate label creation when a new workspace is added | Open |
+
+## Observability
+
+| Issue | Impact | Workaround | Status |
+|-------|--------|------------|--------|
+| Agent transcript not visible inline in GHA logs | Must download artifact separately | `gh run download --name fullsend-code` | Open |
+| No summary in GHA step output | Hard to see what the agent did at a glance | Consider post-script step extracting key actions from transcript | Open |
+
+## Upstream harness drift
+
+Customized harness and policy files are **copies** of upstream (baseline
+2026-06-05). When upstream changes, our copies need manual sync:
+
+| File | Repo |
+|------|------|
+| `harness/code.yaml` | rhdh-plugins |
+| `harness/fix.yaml` | rhdh-plugins |
+| `policies/code.yaml` | rhdh-plugins |
+| `policies/fix.yaml` | rhdh-plugins |
+| `agents/code.md` | rhdh-plugins |
+
+## Upstream feature requests
+
+| Issue | Description | Status |
+|-------|-------------|--------|
+| [fullsend#1937](https://github.com/fullsend-ai/fullsend/issues/1937) | Native `working_dir` field in harness schema | Filed |
+| `repo.yarnpkg.com` missing from upstream policies | Any JS monorepo using corepack + yarn hits this | Not yet filed |
+| `sandbox_init_script` in harness schema | Pre-agent env setup without relying on `.env.d` or skills | Not yet filed |
+| [OpenShell#1107](https://github.com/NVIDIA/OpenShell/issues/1107) | `/etc/hosts` injection for policy-allowed hostnames | Open, assigned |
+
+## `@v0` tag regression risk
+
+Commit `709d8af0` (2026-05-15) fixed per-repo retro/prioritize routing by
+removing the `retro|prioritize → fullsend` stage-to-role mapping. However,
+PR #1187 (`005ac0a1`, 2026-05-19) re-introduced the old mapping on `main`.
+The `@v0` tag predates this regression, so per-repo mode is currently safe.
+
+**Risk:** If `@v0` advances past PR #1187, per-repo retro and prioritize
+will silently break for all consumer orgs whose config lists
+`retro`/`prioritize` instead of `fullsend`.
+
+## Public repo security
+
+Fullsend's `issue_comment` trigger routes to agents without checking
+`author_association`. Any external user posting `/fs-review` on a public
+repo's PR triggers Vertex AI inference on the repo owner's GCP project.
+
+**Mitigation:** Add an `author_association` check to the dispatch job in
+the shim workflow. Applied in rhdh-plugins and rhdh-plugin-export-overlays.
+See [Repo Onboarding — Method 2](repo-onboarding.md#method-2-manual-install-customized-shim).
diff --git a/docs/local-setup.md b/docs/local-setup.md
index 700f4a0..9f37066 100644
--- a/docs/local-setup.md
+++ b/docs/local-setup.md
@@ -100,23 +100,9 @@ service account key for the `rhdh-sidekick-167988` project with the
 **If your team lead provides the key file:** save it to
 `~/.config/fullsend/fullsend-local-credentials.json` and `chmod 600` it.
 
-**If you need to create the SA yourself:**
-
-```bash
-gcloud iam service-accounts create fullsend-local \
-  --display-name="Fullsend local agent runner" \
-  --project=rhdh-sidekick-167988
-
-gcloud projects add-iam-policy-binding rhdh-sidekick-167988 \
-  --member="serviceAccount:fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com" \
-  --role="roles/aiplatform.user" \
-  --condition=None
-
-gcloud iam service-accounts keys create ~/.config/fullsend/fullsend-local-credentials.json \
-  --iam-account=fullsend-local@rhdh-sidekick-167988.iam.gserviceaccount.com
-
-chmod 600 ~/.config/fullsend/fullsend-local-credentials.json
-```
+**If you need to create the SA yourself:** see
+[GCP Infrastructure — Service Accounts](gcp-infrastructure.md#service-accounts)
+for the full `gcloud` commands (create SA, grant role, generate key, rotate).
 
 ## Step 4: Create env files
 
diff --git a/docs/repo-onboarding.md b/docs/repo-onboarding.md
new file mode 100644
index 0000000..076ee4e
--- /dev/null
+++ b/docs/repo-onboarding.md
@@ -0,0 +1,184 @@
+# Repo Onboarding Guide
+
+How to install fullsend on a new repo in the `redhat-developer` GitHub org.
+
+## Architecture
+
+The RHDH team uses a **hybrid model**:
+
+- **Inference** — own GCP project `rhdh-sidekick-167988` (cost accounting
+  goes to the RHDH cost center)
+- **Token mint** — shared service at `fullsend-mint-gljhbkcloq-uc.a.run.app`
+  (managed by the fullsend team, heading toward fully public managed service)
+
+```
+redhat-developer/
+  └── .github/workflows/fullsend.yaml (shim)
+        └── calls fullsend-ai/fullsend reusable workflows @v0
+              └── authenticates via WIF
+                    ├── rhdh-sidekick-167988 (inference, cost accounting)
+                    └── fullsend-mint (shared mint — GitHub App tokens)
+```
+
+Each repo gets its own WIF provider scoped to that single repo. The fullsend
+GitHub Apps are the public `fullsend-ai-*` app set (shared across orgs).
+
+## Method 1: Standard install (`fullsend admin install`)
+
+Use this for repos that need no customization — the installer creates a
+default shim workflow and scaffold.
+
+```bash
+fullsend admin install / \
+  --inference-project rhdh-sidekick-167988 \
+  --mint-url https://fullsend-mint-gljhbkcloq-uc.a.run.app \
+  --skip-mint-check
+```
+
+The installer auto-provisions:
+- WIF pool + provider in `rhdh-sidekick-167988`
+- IAM binding (`aiplatform.user`) for the repo
+- `.fullsend/config.yaml` and scaffold directories
+- `.github/workflows/fullsend.yaml` shim workflow
+- GitHub repository variables and secrets
+
+### What `--skip-mint-check` does
+
+Skips GCP API calls to the mint project. Required because we don't have
+access to the fullsend team's mint project (`it-gcp-konflux-dev-fullsend`).
+
+## Method 2: Manual install (customized shim)
+
+Use this for repos that need a **customized shim workflow** — auth gate for
+public repos, `paths` filter for monorepo workspace scoping, or custom
+agents/skills. The installer would overwrite these customizations.
+
+### Prerequisites
+
+- `gcloud` CLI authenticated with a user in the `rhdh-sidekick@redhat.com`
+  group (needs `iam.workloadIdentityPoolAdmin` on `rhdh-sidekick-167988`)
+- `gh` CLI authenticated with `repo` + `workflow` scopes and admin access
+  to the target repo
+- WIF pool `fullsend-pool` already exists on `rhdh-sidekick-167988`
+- fullsend GitHub Apps already installed in `redhat-developer` org with
+  access to the target repo
+
+### Step 1: Create WIF provider
+
+See [GCP Infrastructure — WIF Providers](gcp-infrastructure.md#wif-providers)
+for the full `gcloud` command and the dual-audience requirement.
+
+Provider name must be max 32 characters, lowercase alphanumeric + dashes.
+
+### Step 2: Set GitHub variables and secrets
+
+```bash
+gh variable set FULLSEND_MINT_URL --repo / \
+  --body "https://fullsend-mint-gljhbkcloq-uc.a.run.app"
+gh variable set FULLSEND_GCP_REGION --repo / \
+  --body "global"
+gh secret set FULLSEND_GCP_WIF_PROVIDER --repo / \
+  --body ""
+gh secret set FULLSEND_GCP_PROJECT_ID --repo / \
+  --body "rhdh-sidekick-167988"
+```
+
+### Step 3: Commit fullsend files via PR
+
+Create a PR with:
+
+| File | Purpose |
+|------|---------|
+| `.fullsend/config.yaml` | Enabled roles (triage, coder, review, fix) |
+| `.github/workflows/fullsend.yaml` | Customized shim (auth gate, paths filter) |
+| `.fullsend/customized/` | Scaffold dirs + custom agents/skills/harness overrides |
+| `.github/CODEOWNERS` | Protect `.fullsend/` and `.github/workflows/fullsend.yaml` |
+
+**For public repos**, add an `author_association` auth gate to the dispatch
+job's `if` condition. Without it, any GitHub user can post `/fs-review` and
+burn Vertex AI tokens on your GCP project. Gate to
+`OWNER/MEMBER/COLLABORATOR`:
+
+```yaml
+# In the dispatch job
+if: >-
+  github.event.comment &&
+  contains(fromJSON('["OWNER","MEMBER","COLLABORATOR"]'),
+    github.event.comment.author_association)
+```
+
+**For monorepos**, add a `paths` filter on `pull_request_target` to scope
+auto-review to a specific workspace:
+
+```yaml
+on:
+  pull_request_target:
+    paths:
+      - 'workspaces//**'
+```
+
+### Step 4: Grant fullsend GitHub Apps access
+
+Ensure the target repo is added to the repository access list for each
+fullsend-ai GitHub App. Manage via org settings → Installed GitHub Apps.
+
+The 5 per-repo apps:
+
+| App | Roles |
+|-----|-------|
+| `fullsend-ai-triage` | triage |
+| `fullsend-ai-coder` | coder, fix |
+| `fullsend-ai-review` | review |
+| `fullsend-ai-retro` | retro |
+| `fullsend-ai-prioritize` | prioritize |
+
+### Step 5: Verify
+
+After merge, post `/fs-review` on a PR to trigger the review agent:
+
+```bash
+gh run list --workflow=fullsend.yaml --repo / --limit 3
+gh run view --repo / --log
+```
+
+## Current repo status
+
+| Repo | Status | Install method | Notes |
+|------|--------|----------------|-------|
+| `rhdh-agentic` | Live (2026-05-20) | `fullsend admin install` | Custom review agent with OpenSpec skill |
+| `rhdh-plugins` | Live (2026-06-02) | Manual | Review scoped to `workspaces/scorecard/`, auth-gated slash commands |
+| `rhdh-plugin-export-overlays` | WIF configured, PR pending | Manual | Custom workspace-review skill, scoped to `backstage-plugins-for-aws` |
+
+## Key learnings
+
+1. **Use per-repo mode** — simpler than org mode, works for private repos,
+   no org-wide config repo needed.
+
+2. **Use the shared mint** — don't self-host unless you have a strong
+   reason. The fullsend team manages it and is heading toward a public
+   managed service.
+
+3. **Get your own GCP project for inference** — even if you start on the
+   shared project, switch early for cost tracking.
+
+4. **Pre-install GitHub Apps before running the CLI** — smoother than the
+   interactive flow that opens browser windows.
+
+5. **Add CODEOWNERS immediately** — protect `.fullsend/` and
+   `.github/workflows/` from agent modifications.
+
+6. **Expect slash commands, not automation** — most agents need manual
+   triggering via `/fs-*`. Only Review auto-triggers reliably on PR
+   open/update.
+
+7. **Budget 3-5 days for GCP/IT coordination** — the biggest time sink is
+   permissions, not the install itself. IT sandbox projects use conditional
+   IAM that prevents self-service.
+
+8. **Add an auth gate on slash commands for public repos** — fullsend
+   doesn't check `author_association` on `issue_comment` events.
+
+9. **WIF providers need two allowed audiences** — `fullsend-mint` for the
+   mint token step, and the full provider URL for `google-github-actions/auth`.
+   Omitting either causes auth failures. See
+   [GCP Infrastructure](gcp-infrastructure.md#wif-providers).
diff --git a/docs/sandbox-networking.md b/docs/sandbox-networking.md
new file mode 100644
index 0000000..e972e0a
--- /dev/null
+++ b/docs/sandbox-networking.md
@@ -0,0 +1,130 @@
+# Sandbox Networking
+
+DNS and networking inside OpenShell sandboxes — why `yarn install` (and any
+tool that resolves DNS directly) fails, and how to work around it.
+
+## Two-layer network namespace architecture
+
+OpenShell sandboxes are **not** plain containers. Each sandbox is a
+**nested network namespace** inside a container. The agent process runs one
+layer deeper than the container's network.
+
+```
+┌─── Host (macOS / CI runner) ────────────────────────────────────┐
+│                                                                 │
+│  Podman network "openshell" (10.89.0.0/24)                      │
+│  aardvark-dns on bridge interface (10.89.0.1:53) ✅             │
+│                                                                 │
+│  ┌─── Container (10.89.0.3 on "openshell" network) ───────────┐ │
+│  │  /etc/resolv.conf → nameserver 10.89.0.1 (works here)      │ │
+│  │  openshell-sandbox process (PID 1 = supervisor)            │ │
+│  │                                                            │ │
+│  │  veth-h-* (10.200.0.1) ← host side of veth pair            │ │
+│  │    └── :3128 L7 transparent proxy ✅ (only listener)       │ │
+│  │    └── :53 ❌ nothing listening                            │ │
+│  │        │                                                   │ │
+│  │  ┌──── │ ──── Inner netns (sandbox-*) ───────────────────┐ │ │
+│  │  │  veth-s-* (10.200.0.2) ← sandbox side                 │ │ │
+│  │  │  default route → 10.200.0.1                           │ │ │
+│  │  │  /etc/resolv.conf → nameserver 10.89.0.1 (INHERITED)  │ │ │
+│  │  │                     ^^^^^^^^^^^^^^^^^^^               │ │ │
+│  │  │                     UNREACHABLE from 10.200.0.0/24    │ │ │
+│  │  │                                                       │ │ │
+│  │  │  Agent (Claude Code), yarn, node, git run here        │ │ │
+│  │  └───────────────────────────────────────────────────────┘ │ │
+│  └────────────────────────────────────────────────────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Why DNS fails
+
+1. **HTTP/HTTPS traffic works.** The supervisor sets up iptables rules that
+   intercept all outbound TCP from the inner netns and redirect it to the
+   L7 proxy at `10.200.0.1:3128`. The proxy resolves DNS from the container
+   netns (where `10.89.0.1` IS reachable), applies policy, and forwards.
+
+2. **DNS does NOT work.** The inner netns inherits `/etc/resolv.conf` from
+   the container, pointing to `10.89.0.1` (Podman's aardvark-dns). But the
+   inner netns only has a route to `10.200.0.0/24` — it cannot reach
+   `10.89.0.0/24`. All direct DNS queries fail with `getaddrinfo EAI_AGAIN`.
+
+3. **This is by design.** OpenShell deliberately provides no DNS inside the
+   sandbox. The security model requires all network access to go through
+   the L7 proxy, which resolves DNS on behalf of the client. DNS is a
+   separate channel that would bypass policy enforcement (DNS exfiltration).
+
+## What works vs. what doesn't
+
+| Tool | Works? | Why |
+|------|--------|-----|
+| `curl https://...` | ✅ | Intercepted by transparent proxy |
+| `node -e "fetch('https://...')"` | ✅ | Node's fetch connects to IP; proxy intercepts TCP |
+| `gh api ...` | ✅ | Uses HTTPS |
+| `yarn install` | ❌ | Yarn Berry calls `getaddrinfo` before `fetch` |
+| `pip install` | ❌ | Same — resolves DNS first |
+| `go get` | ❌ | Same |
+| `git clone` (HTTPS) | ❌ | libcurl resolves DNS first |
+| `nslookup`, `dig` | ❌ | Direct DNS queries |
+| `dns.resolve()` (Node.js) | ❌ | Direct DNS queries |
+
+## Workaround: explicit proxy in .yarnrc.yml
+
+Set `httpProxy`/`httpsProxy` pointing to the L7 proxy. Yarn sends HTTP
+CONNECT to the proxy, which resolves DNS and forwards the request.
+
+**Tested approach:** Use an env file that maps OpenShell's `HTTP_PROXY` to
+Yarn's proxy config (no hardcoded IPs):
+
+```
+# .fullsend/customized/env/yarn-proxy.env
+YARN_HTTP_PROXY=${HTTP_PROXY:-http://10.200.0.1:3128}
+YARN_HTTPS_PROXY=${HTTPS_PROXY:-http://10.200.0.1:3128}
+```
+
+Smoke test result (`yarn add is-odd` with proxy config):
+```
+➤ YN0000: · Yarn 4.6.0
+➤ YN0085: │ + is-odd@npm:3.0.1, is-number@npm:6.0.0
+➤ YN0000: └ Completed in 0s 226ms
+➤ YN0013: │ 2 packages were added to the project (+ 16.65 KiB).
+➤ YN0000: └ Completed in 0s 497ms
+```
+
+**Not yet validated:** Full `yarn install` on the rhdh-plugins monorepo
+(hundreds of packages). The smoke test covered a single package.
+
+## Upstream issue tracking
+
+| Issue | Status | Relevance |
+|-------|--------|-----------|
+| [OpenShell#364](https://github.com/NVIDIA/OpenShell/issues/364) | Closed wontfix | DNS resolution fails — maintainer: "DNS is incidental. Tools should use HTTPS_PROXY." |
+| [OpenShell#1107](https://github.com/NVIDIA/OpenShell/issues/1107) | Open, assigned | Proposes `/etc/hosts` injection for policy-allowed hostnames at sandbox creation. Would fix this cleanly. |
+| [OpenShell#1169](https://github.com/NVIDIA/OpenShell/issues/1169) | Closed (fixed) | Why DNS is intentionally blocked — DNS exfiltration bypass vector |
+
+The OpenShell team's position: the sandbox intentionally has no DNS. All
+network access must go through the L7 proxy. This is a security design
+decision, not a bug.
+
+## Why it seems to work in CI
+
+CI (GitHub Actions) uses the same stack — Podman + OpenShell + same
+`action.yml`. The inner netns is identical. `yarn install` works in CI
+because Node.js's `undici` (used by yarn's fetch) handles connections in a
+way that gets intercepted by the transparent proxy before DNS resolution
+is needed.
+
+## Verified facts (2026-06-09)
+
+Captured from a live running sandbox:
+
+```
+# Container netns (supervisor)
+$ ip addr → eth0: 10.89.0.3/24, veth-h-*: 10.200.0.1/24
+$ nslookup registry.npmjs.org 10.89.0.1 → ✅ 104.16.x.34
+$ ss -tlnp → 10.200.0.1:3128 (proxy)
+
+# Inner sandbox netns (agent)
+$ ip addr → veth-s-*: 10.200.0.2/24
+$ nslookup registry.npmjs.org 10.89.0.1 → ❌ connection refused
+$ nslookup registry.npmjs.org 10.200.0.1 → ❌ connection refused
+```
diff --git a/images/code/Containerfile b/images/code/Containerfile
index 718ece3..5b2a2d4 100644
--- a/images/code/Containerfile
+++ b/images/code/Containerfile
@@ -6,8 +6,9 @@
 # bootstrap inside read-only /usr sandbox)
 # - Node.js is already in the base image (installed by Claude Code)
 #
-# This image eliminates the need for the sandbox-yarn-setup.sh
-# host_files workaround in harness configs.
+# Yarn is available on PATH immediately — no runtime corepack setup needed.
+# DNS proxy config is handled by env/yarn-proxy.env (maps OpenShell's
+# HTTP_PROXY to YARN_HTTP_PROXY).
 #
 # Build (native arch):
 # podman build -t rhdh-fullsend-code:local \