
feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner) #595

Open
lilongen wants to merge 88 commits into boxlite-ai:main from lilongen:feat/cloud-mvp

Conversation


@lilongen lilongen commented May 26, 2026

Summary

Adds apps/infra-local/ — a fully self-hosted BoxLite cloud-MVP
control plane that runs on a single Apple Silicon Mac. One command
(make stack-up) brings up the entire stack and lets a developer create
real microVM sandboxes through the dashboard — no AWS, no Docker daemon
for the application boxes; everything runs natively on M5.

Ships as milestone/infra-local/v0.1.0. Cold-start → working sandbox +
browser terminal (root@boxlite:~#) in ~80 s.

Dogfood principle: the control-plane services run as BoxLite boxes
(not docker-compose), so any BoxLite weakness is felt by the team
immediately.

Architecture

| Layer | What | Where |
| --- | --- | --- |
| L1: infra-local | 10 BoxLite microVM boxes: PostgreSQL, Redis, MinIO, OCI registry, Dex (OIDC), Caddy, Jaeger, pgAdmin, registry-ui, OpenTelemetry collector (+ one-shot minio-init) | libkrun microVMs on macOS Hypervisor.framework, orchestrated by the boxlite_local Python package |
| L2: control plane | 4 native processes: NestJS API (:3001), Go Runner (:3003), Go Proxy (:4000), Vite Dashboard (:3000) | Native macOS arm64, driven by make stack-* |
| L3: user sandboxes | N user-created Ubuntu/Alpine sandboxes | libkrun microVMs spawned by the L2 Runner in ~/.boxlite-runner/ |

The Runner runs natively on M5 (HVF + libkrun) — single runner host.
Multi-host autoscaler testing (the old Lima route) is intentionally out
of scope here and parked in a separate worktree.

What works end-to-end (verified)

  • Cold start: make stack-nuke && make stack-up boots all services + auto-seeds + waits for the default snapshot — ~80 s.
  • OIDC login via Dex (admin@boxlite.dev / password, plus a regular test user test01@boxlite.dev); API auto-creates user + Personal org + owner row on first login.
  • Sandbox lifecycle: create → start → stop → destroy via dashboard or POST /api/sandbox.
  • Live terminal: dashboard Terminal → Connect → real interactive shell inside the microVM (browser → Caddy → Proxy → Runner → microVM).
  • Snapshot pull: API auto-creates the ubuntu:22.04 default snapshot; runner pulls the arm64 layer from the local registry.
  • Region + API-key management, audit logs, etc.

Verified twice via full cold-start + browser-driven E2E (login → snapshot Active → create sandbox → terminal root@boxlite:~#).

Operate-by-make

make stack-up is the single, self-healing entry point — works from
a fresh checkout, after a reboot, or after make stack-down:

  • auto-runs make install if the orchestrator package isn't importable
  • auto-builds the native runner/proxy binaries if missing (e.g. /tmp cleared on reboot)
  • load-schema is idempotent — skips cleanly when the PG data volume survived a reboot

Other targets: stack-status, stack-logs COMPONENT=…, stack-restart COMPONENTS=…, stack-rebuild-l1-box BOX=…, and tiered cleanup via
stack-reset / stack-reset-hard / stack-nuke. See
docs/apps/infra-local-usage.md.
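
The self-healing behaviour described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the actual stack-up.sh: the function name `plan_bringup` and the step labels are hypothetical, and the real script performs these checks in shell.

```python
import importlib.util
from pathlib import Path


def plan_bringup(package: str, binaries: list[Path]) -> list[str]:
    """Return the steps a self-healing bring-up would run, in order.

    Hypothetical helper: the step names are labels for illustration only.
    """
    steps = []
    # Install the orchestrator only when the package is not importable.
    if importlib.util.find_spec(package) is None:
        steps.append("install")
    # Rebuild the native binaries only when at least one of them is missing.
    if any(not path.is_file() for path in binaries):
        steps.append("build")
    # Bringing the stack up is always the last step.
    steps.append("up")
    return steps
```

On the common restart path (package importable, binaries present) the plan collapses to `["up"]`, which is why the conditional checks add no cost to a daily restart.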

Notable implementation notes

  • Dev runner-score override (stack-up.sh): the Go runner reports
    host-wide CPU/RAM/disk to the API. On a dev Mac sharing RAM with
    IDE/Chrome/Docker, that drags availabilityScore below the prod
    threshold and the API rejects sandbox-create with "No available
    runners". stack-up.sh exports RUNNER_AVAILABILITY_SCORE_THRESHOLD=5
    / RUNNER_MEMORY_PENALTY_THRESHOLD=95 / RUNNER_DISK_PENALTY_THRESHOLD=95
    before launching the API. Documented in CONNECTIONS.md; structural
    fix (runner reports only its own usage) tracked as follow-up.
  • Supporting app changes to make the milestone work: PostHog
    feature-flag bootstrap (api + dashboard), runner runtime.GOARCH
    image pull (fixes ENOEXEC on arm64), jwt.strategy user-create
    fix, Caddy host-regexp → proxy for sandbox port-preview URLs.
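
To make the runner-score override concrete, here is a minimal sketch of a threshold gate. It assumes the API keeps a runner only when its availabilityScore is at least the configured threshold; the helper names, the `>=` comparison, and the default of 50 are assumptions for illustration, since the PR does not state the production values or the scoring formula.

```python
import os


def availability_threshold(env=None, default=50.0):
    """Read the score threshold from the environment.

    The default of 50.0 is a placeholder, not the real production value.
    """
    env = os.environ if env is None else env
    return float(env.get("RUNNER_AVAILABILITY_SCORE_THRESHOLD", default))


def eligible_runners(scores, threshold):
    """Keep the runners whose availability score meets the threshold."""
    return [name for name, score in scores.items() if score >= threshold]
```

With a busy dev Mac reporting a low score, an unset variable leaves no eligible runner (the "No available runners" case), while exporting `RUNNER_AVAILABILITY_SCORE_THRESHOLD=5` lets the same runner through.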

Known limitations (intentional, not bugs)

PostHog, Billing (Stripe), Svix webhooks, Snapshot Manager (Dockerfile
builds), SSH Gateway, ClickHouse, OpenSearch, SMTP, dns-shim + TLS are
mocked / deferred; see the "Known limitations" table in the milestone doc.
The production-parity gap (HVF vs KVM libkrun backend) is acknowledged and
will be revisited when the autoscaler lands.

Docs

Test plan

  • make stack-up from fresh checkout (auto install + build + up)
  • make stack-nuke && make stack-up cold start → ~80 s to working stack
  • Reboot → make stack-up (load-schema idempotent skip, binaries rebuilt)
  • Dashboard login via Dex → snapshot ubuntu:22.04 Active
  • Create sandbox via UI → state Started → terminal root@boxlite:~#
  • All committed docs English-only (CLAUDE.md); no Lima refs in scripts/code

🤖 Generated with Claude Code

lile and others added 30 commits May 20, 2026 17:31
…gn docs

Migrated from session work on fix/sandbox-from-image-alpine to start
dedicated feat/cloud-mvp track. Contents:

apps/infra-local/
  - goal.md
  - poc/single_service.py        Phase 0 — single postgres box PoC (✅)
  - poc/multi_service.py         Phase 1 — multi-service + host-as-hub (✅)
  - poc/diagnose_network.py      network diagnostic for box-to-box
  - poc/diagnose_network.result  captured diagnostic output
  - poc/README.md                Phase 0/1 docs + pass criteria

docs/apps/
  - cloud-mvp-plan.md            Foundation-first MVP roadmap (rewritten)
  - cloud-mvp-plan.md.bak-mvp-deadline-version  prior team/deadline version
  - own-dog-food-local-infra-solution.md  dogfood orchestrator design + Phase 1 results
  - infra-vs-local-infra.md      apps/infra vs apps/infra-local comparison
  - apps-overview.md             apps/ one-pager
  - apps-comprehensive.md        apps/ full breakdown
  - apps-api-overview.md         apps/api NestJS/TypeORM walkthrough
  - api-client-go.md             apps/api-client-go auto-gen overview
  - sdk-feedback/                dogfood-surfaced SDK gaps:
    - 01-host-boxlite-internal-unwired.md  (+ linear-friendly variants)
    - 02-postgres-trust-via-host-as-hub.md

BoxLite cloud MVP.md             input PRD-style brief

PoC status:
  - Phase 0 ✅ single postgres box, all 7 sub-phases pass
  - Phase 1 ✅ multi-box, host-as-hub via Mac LAN IP, detach=True works
  - 2 SDK bugs surfaced + documented

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 1 multi-service + box-to-box networking via host.boxlite.internal
validated end-to-end (12/12 phases pass). Revert PoC to non-default host
ports (25432/26379) as the durable hygiene rule for any service that
could collide with a local dev install; lift the same rule into the
design doc (§3.8) and bake it into the planned doctor preflight (§1.7.F).

Concretizes parent design doc §12.2 into a Phase-2 implementation
contract: walking skeleton (postgres-only end-to-end) before scaling to
the full 10-service orchestrator. Flat package layout, explicit
SERVICES registry, doctor port-preflight, integration tests on real
BoxLite. Ready for handoff to writing-plans.

Bite-sized 10-task plan derived from the Phase 2 spec. TDD where the
logic is testable in isolation (config, lsof parsing, topo_sort);
integration test for the end-to-end orchestrator flow. Ready for
subagent-driven or inline execution.
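
The doctor port-preflight mentioned in these commits could look roughly like the sketch below. The commits say the real check parses lsof output; this version swaps in a plain TCP connect probe instead, and the names `port_in_use` and `preflight` are hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def preflight(ports: dict[str, int]) -> list[str]:
    """Return the names of services whose host port is already taken."""
    return [name for name, port in ports.items() if port_in_use(port)]
```

A call such as `preflight({"postgres": 25432, "redis": 26379})` uses the non-default host ports named in the commit above and returns an empty list when both are free.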
…althy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g_dir + empty-graph coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ipe assertion + skip-doctor in itest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

5-service stack: pg (Phase 2) + redis/minio/minio-init/registry (3a).
Introduces http_url healthcheck, one_shot lifecycle, repo_root resolution.
Closes Phase-2 debt #1 (narrow start_service exception); defers debt #2
and tcp_port (no caller in 3a). Autonomous execution per /goal directive.

Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator
helpers (_http_probe + _is_already_running_error); integration test
proves end-to-end 5-service round-trip on real BoxLite.
… exception (debt #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

minio/minio:latest (RHEL UBI base) ships layers with directories having
no owner-write bit. The SDK's per-start rootfs merge then fails with
"storage error: Failed to ... Permission denied (os error 13)" or a
"RustPanic" at write time. Apply owner-write idempotently to the extracted
image cache before each start. Idempotent + cheap (~10ms). Remove when
SDK fixes rootfs-merge to relax dir perms at extract time.
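
A minimal sketch of the owner-write workaround described above, assuming the extracted image cache is an ordinary directory tree the current user owns. The function name `ensure_owner_write` is hypothetical and the real cache location is not shown in this PR excerpt.

```python
import os
import stat
from pathlib import Path


def ensure_owner_write(root: Path) -> int:
    """Add the owner-write bit to every directory under root.

    Returns how many directories were changed, so a second run returns 0.
    """
    changed = 0
    # Top-down walk: a parent is fixed before its children are visited.
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(dirpath, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

Because directories that already carry the bit are skipped, running this before every start is safe and cheap, which matches the idempotency the commit message calls out.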

exporters:
  debug:
    verbosity: basic
Member


Otel collector is wired but doesn't feed Jaeger — local Jaeger UI will be empty.

Prod's OtelCollector (apps/infra/sst.config.ts:438) is BoxLite's custom build with the ClickHouse exporter. Here the local collector uses upstream otel/opentelemetry-collector:latest with exporters: debug only — traces land on stdout, never reach SPEC_JAEGER.

Two clean fixes, pick one:

  • (a) Wire it through: add otlp exporter pointing at jaeger:4317 and add jaeger to the pipelines.traces.exporters list. Local Jaeger UI becomes useful.
  • (b) Drop SPEC_OTEL entirely: Jaeger 1.67 accepts OTLP natively (COLLECTOR_OTLP_ENABLED=true is already set on SPEC_JAEGER) — have api/runner OTLP straight at jaeger and remove the collector hop.

Right now we ship both services but they don't talk, which is worse than shipping neither. Worth fixing in this PR or immediate follow-up.

Author


Fixed in 01850227 — went with option (a), keeping the collector in the path for prod parity.

  • _otel_config now adds an otlp/jaeger exporter and puts it on the traces pipeline alongside debug (metrics/logs stay debug-only — Jaeger doesn't ingest them).
  • Jaeger's OTLP gRPC receiver is now host-mapped (26687 → box :4317); the collector reaches it via host-as-hub (host.boxlite.internal:26687). COLLECTOR_OTLP_ENABLED=true was already set.
  • SPEC_OTEL now depends_on=["jaeger"].

Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the collector (:24318) → the Jaeger query API returns the service + trace within ~2s, so the Jaeger UI is no longer empty. Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8). The only remaining gap vs prod is the custom boxlite_exporter build (ClickHouse / api push-back), still noted as a limitation.
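
For readers who want to see the shape of the fix, the following is a sketch of a collector config with the extra exporter. It is not the actual `_otel_config` from services.py: the function name `build_otel_config` and the receiver settings are assumptions, while the exporter name, endpoint, insecure TLS, and pipeline layout follow the description above.

```python
import json


def build_otel_config(jaeger_endpoint: str) -> dict:
    """Collector config: traces go to debug and Jaeger, metrics/logs to debug only."""
    return {
        # Receiver settings are assumed; the PR does not show them.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {
            "debug": {"verbosity": "basic"},
            "otlp/jaeger": {"endpoint": jaeger_endpoint, "tls": {"insecure": True}},
        },
        "service": {
            "pipelines": {
                "traces": {"receivers": ["otlp"], "exporters": ["debug", "otlp/jaeger"]},
                "metrics": {"receivers": ["otlp"], "exporters": ["debug"]},
                "logs": {"receivers": ["otlp"], "exporters": ["debug"]},
            }
        },
    }


config = build_otel_config("host.boxlite.internal:26687")
# JSON is also valid YAML, so this text can serve as the collector's config file.
print(json.dumps(config, indent=2))
```

Keeping `debug` on the traces pipeline preserves the stdout output that existed before the fix, so both destinations can be compared while debugging.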

lile and others added 2 commits May 27, 2026 11:12
New principle on this branch: local infra targets the M5 native runner
only. Lima-based multi-host runner support is being explored in a
separate worktree and is deliberately not in scope here.

Doc / comment edits:

- apps/infra-local/goal.md: translated from Chinese to English per
  CLAUDE.md "Documentation Language" rule.
- apps/infra-local/tests/integration/test_e2e_full.py: drop
  "and Lima runner VM later" from the resource-budget docstring.
- docs/apps/infra-local-status.md: drop "no Lima" qualifiers from
  the platform line and the L2 runner row.
- docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md:
  drop "no Lima" from the headline; reword to "everything runs
  natively on M5".
- docs/apps/infra-vs-local-infra.md: replace the entire §2 "Why
  Lima instead of HVF" decision archive (180+ lines) with a short
  "Runner placement on this branch — M5 native (HVF)" section.
  Update §1 topology + design decisions, §3 comparison-table rows
  (runner / sandbox isolation / autoscaler InfraProvider /
  multi-runner support), §4.1/4.2 asymmetries, §6 file pointers,
  and §7 one-sentence summary to match the M5-native reality.
  Production-parity tradeoff is acknowledged but flagged as future
  work outside this milestone.
- docs/apps/own-dog-food-local-infra-solution.md: rewrite §2.2,
  §2.4 (runner path), key-design-choices list, repo layout tree,
  §5.1 resource budget, §11 decision table, and §12.2 phase plan
  to describe the M5 native runner instead of a runner-in-Lima.

Verification:
- grep -iwE "lima|limactl|LimaInfraProvider" → 0 hits across all
  PR-scope files.
- Python CJK regex check → 0 CJK chars across all PR-scope files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke install

The previous §1 jumped from "yarn / go / python already installed"
straight to `make stack-build && make stack-up`, skipping two things
a fresh checkout actually needs:

1. The Python orchestrator package isn't installed yet — `python -m
   boxlite_local` doesn't work until `make install` runs `pip install
   -e ".[test]"`.
2. The `boxlite` Python SDK and CLI must already be present in the
   active environment (it's a transitive dep of `boxlite_local`, not
   installed by `make install`).

Restructured §1 into three sub-sections:

- §1.1 Prereqs — table listing the actual required tools + versions,
  plus a 3-line sanity check that surfaces missing prereqs before
  `make` runs and produces a less-actionable failure.
- §1.2 Three-step bring-up — now correctly shows
    make install        (pip install the orchestrator package)
    make stack-build    (yarn + go builds)
    make stack-up       (L1 + L2 + seed)
  with timing expectations (5-7 min cold, ~30 s-1 min warm).
- §1.3 First-time dashboard login — explicit credentials + the
  end-to-end smoke (create sandbox → terminal → root@boxlite:~#)
  so first-time users know what success looks like.

No behavior change — purely documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 04:00
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (4)

apps/runner/pkg/boxlite/registry.go:1

  • linuxAmd64Platform no longer represents amd64 specifically (it’s now host-arch dependent). Rename to something like linuxHostPlatform/defaultLinuxPlatform to avoid misleading future readers and call sites.

apps/package.json:1

  • Using a caret range in resolutions can reduce determinism (you may get different transitive trees over time). Prefer pinning @types/express to an exact version here (and ensure the lockfile resolves consistently) to avoid unexpected type changes across installs.

tsconfig.base.json:1

  • This file content is not valid JSON unless it’s being committed as a git symlink target. Ensure tsconfig.base.json is actually a symlink (mode 120000) to apps/tsconfig.base.json; otherwise TypeScript tooling will fail to parse it.

docs/apps/infra-vs-local-infra.md:1

  • These port numbers conflict with the rest of the infra-local docs/scripts in this PR (API is started on :3001; dashboard on :3000). Update the table to reflect the actual local ports so readers don’t follow incorrect endpoints.

Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment on lines +388 to +412
http.get(`${API_URL}/*`, async ({ request }) => {
  console.log('[MSW catch-all GET]', request.url)
  if (request.url.includes('paginated')) {
    return HttpResponse.json({ items: [], totalItems: 0, totalPages: 0 })
  }
  return HttpResponse.json([])
}),

// Catch-all POST/PUT/PATCH/DELETE — return {} so mutations don't error.
http.post(`${API_URL}/*`, async ({ request }) => {
  console.log('[MSW catch-all POST]', request.url)
  return HttpResponse.json({})
}),
http.put(`${API_URL}/*`, async ({ request }) => {
  console.log('[MSW catch-all PUT]', request.url)
  return HttpResponse.json({})
}),
http.patch(`${API_URL}/*`, async ({ request }) => {
  console.log('[MSW catch-all PATCH]', request.url)
  return HttpResponse.json({})
}),
http.delete(`${API_URL}/*`, async ({ request }) => {
  console.log('[MSW catch-all DELETE]', request.url)
  return HttpResponse.json({})
}),
Comment on lines +69 to +83
<PostHogProvider
  apiKey="phc_local_dev_no_op"
  options={{
    // api_host is required by posthog-js but never used because
    // capturing is opted out and flags are bootstrapped below.
    api_host: 'https://localhost.invalid',
    bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
    advanced_disable_feature_flags: true, // skip /decide network call
    autocapture: false,
    capture_pageview: false,
    capture_pageleave: false,
    opt_out_capturing_by_default: true,
    disable_session_recording: true,
    loaded: (posthog) => posthog.opt_out_capturing(),
  }}
Comment thread apps/infra-local/CONNECTIONS.md
lile and others added 2 commits May 27, 2026 12:17
… setup

Add the first-time bring-up commands (make install + make stack-build
+ make stack-up) at the top of the TL;DR cheat sheet so the entire
day-one workflow is visible in one block, without having to scroll to
§1.2. Day-to-day flow keeps its own bring-up line for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

After a machine reboot the L1 microVM boxes are gone but the postgres
data volume (~/.boxlite-local/data/pg/) persists on disk with the full
schema. `make stack-up` sees no postgres box, runs `make up-with-schema`,
which brings the box back (schema already present) and then runs
`make load-schema` — which previously hard-failed:

    ERR: public schema already has 27 table(s).
    Schema baseline is not idempotent. Run 'make wipe && make up' first

So every post-reboot `make stack-up` died at the schema step.

Fix: apply-schema.sh now treats an already-loaded schema as a no-op
instead of an error. When the public schema is non-empty it checks the
`migrations` table to distinguish:

  - COMPLETE prior load (tables + migrations recorded) → skip, exit 0
  - PARTIAL half-applied baseline (tables but no migrations) → still
    refuse with exit 3 (genuinely broken state; needs `make wipe`)

This makes load-schema / up-with-schema / stack-up all idempotent across
reboots. The non-idempotent baseline itself is unchanged — we just stop
trying to re-apply it when it's already there.
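
The three cases can be summarised as a small decision function. This is a sketch of the logic only: apply-schema.sh is a shell script, and the names `SchemaAction` and `decide_schema_action` are hypothetical.

```python
from enum import Enum


class SchemaAction(Enum):
    LOAD = "load"      # empty database: apply the baseline
    SKIP = "skip"      # complete prior load: nothing to do, exit 0
    REFUSE = "refuse"  # half-applied baseline: exit 3, needs a wipe


def decide_schema_action(table_count: int, migration_count: int) -> SchemaAction:
    """Map the state of the public schema to what the loader should do."""
    if table_count == 0:
        return SchemaAction.LOAD
    if migration_count > 0:
        return SchemaAction.SKIP
    return SchemaAction.REFUSE
```

For example, `decide_schema_action(27, 0)` models the partial state (tables present, no migrations recorded) and yields `SchemaAction.REFUSE`.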

Verified on a live post-reboot stack:
    Schema already loaded (27 tables, 88 migrations recorded) — skipping.
    exit=0
and `make stack-up` then proceeds to L2 (api/runner/proxy/dashboard all up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 08:22
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (5)

tsconfig.base.json:1

  • This file content is not valid JSON. If the intent is to make the repo-root tsconfig.base.json a symlink to apps/tsconfig.base.json, ensure it is committed as a git symlink (mode 120000). If it lands as a regular file, TypeScript tooling reading tsconfig.base.json will fail to parse it.

libs:1

  • Like tsconfig.base.json, this looks like it’s meant to be a symlink. If it is committed as a normal file containing the text apps/libs, any tooling expecting libs/ to be a directory (or a workspace path) will break. Ensure this is a git symlink (mode 120000) rather than a regular file.

apps/infra-local/scripts/stack-up.sh:1

  • On macOS/BSD, xargs does not support -r (GNU-only), so this will error and can break set -e flows. Replace the pipeline with a macOS-compatible pattern (e.g., capture PIDs and conditionally kill, or use an xargs invocation that does not rely on -r). Apply the same fix to the other xargs -r occurrences in this script.

apps/runner/pkg/boxlite/registry.go:1

  • The variable name linuxAmd64Platform is now inaccurate since it can be arm64 (or others). Rename it to reflect the new behavior (e.g., linuxDefaultPlatform, linuxHostArchPlatform, etc.) to avoid confusing future readers.

apps/package.json:1

  • As of Aug 2025, @swc/core versions were not in the 1.15.x range; ^1.15.33 may be a non-existent release and would break installs. Please confirm the intended @swc/core version exists in npm (and align it with the repo’s SWC toolchain expectations).

defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment on lines +338 to +342
const KEY = '__msw_api_keys__'
const load = (): ApiKey[] => {
  try { return JSON.parse(sessionStorage.getItem(KEY) ?? '[]') } catch { return [] }
}
const save = (s: ApiKey[]) => sessionStorage.setItem(KEY, JSON.stringify(s))
Comment on lines +69 to +84
<PostHogProvider
  apiKey="phc_local_dev_no_op"
  options={{
    // api_host is required by posthog-js but never used because
    // capturing is opted out and flags are bootstrapped below.
    api_host: 'https://localhost.invalid',
    bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
    advanced_disable_feature_flags: true, // skip /decide network call
    autocapture: false,
    capture_pageview: false,
    capture_pageleave: false,
    opt_out_capturing_by_default: true,
    disable_session_recording: true,
    loaded: (posthog) => posthog.opt_out_capturing(),
  }}
>
Comment thread apps/infra-local/CONNECTIONS.md Outdated
lile and others added 2 commits May 28, 2026 15:39
Goal: a single `make stack-up` should work from a fresh checkout, after
a reboot, or after `make stack-down` — no need to remember to run
`make install` / `make stack-build` first.

Rather than wiring install/stack-build as hard `make` prerequisites
(which would force a pip-resolve + go-build on *every* stack-up,
including the fast daily restart loop), stack-up.sh now does both
checks *conditionally* so the common restart path pays nothing:

- New: if `python -c "import boxlite_local"` fails, run `make install`
  before bringing up L1 (which calls `python -m boxlite_local`).
- Existing (kept): if /tmp/boxlite-runner or /tmp/boxlite-proxy is
  missing, run stack-build.sh. Clarified in a comment that it only
  builds when missing — use `make stack-restart COMPONENTS=runner` to
  rebuild after a source change.

Combined with the load-schema idempotency fix, `make stack-up` is now
the single entry point in all scenarios:
  fresh checkout → install + up-with-schema + build + L2
  post-reboot    → up-with-schema (schema skip) + build (/tmp cleared) + L2
  post-down      → up-with-schema (boxes back) + L2  (binaries + pkg present)

Docs updated (README §Quick start, infra-local-usage §0 + §1.2) to
present `make stack-up` as the one command, with the explicit targets
kept as optional for forcing a rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e→stack-up dashboard

Schema loading
- Add scripts/build-all-in-one-sql.py: consolidates every apps/api TypeORM
  migration (legacy + pre/post-deploy, 87 total → 539 stmts) into a single
  sql/merged-schema.auto-gen.sql. Resolves TS-side ${...} interpolations,
  inlines parameterized queries, and mirrors TypeORM's enum/constraint
  auto-renames so the output loads cleanly from zero.
- `make load-schema` now regenerates + loads the merged schema; apply-schema.sh
  defaults to it and accepts a SCHEMA_SQL_FILE override.
- Drop sql/schema-baseline.sql + sql/REFRESH.md: the prod pg_dump is no longer
  the load source; schema is now generated from migrations (kept reachable via
  SCHEMA_SQL_FILE if ever needed for an A/B comparison).
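
As an illustration of the interpolation step, the sketch below replaces `${name}` placeholders in a SQL string from a dictionary of known constants. It is not the code in build-all-in-one-sql.py: the function name, the placeholder grammar (a single identifier inside the braces), and the fail-on-unknown behaviour are all assumptions.

```python
import re

# Matches ${identifier}; more complex TypeScript expressions are out of scope here.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve_interpolations(sql: str, constants: dict[str, str]) -> str:
    """Replace each ${name} with its value and fail loudly on unknown names."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in constants:
            raise KeyError(f"unresolved interpolation: {name}")
        return constants[name]

    return _PLACEHOLDER.sub(substitute, sql)
```

Raising on an unknown name, rather than leaving the placeholder in place, keeps a half-resolved statement from reaching the merged SQL file unnoticed.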

Fix: wipe → stack-up left the dashboard non-functional
- Root cause: `make down`/`wipe` tore down only the L1 boxes, leaving the L2
  native procs (api/runner/proxy/dashboard) running. After a wipe the stale API
  held connections to the destroyed-and-recreated DB and never re-ran
  onApplicationBootstrap against the fresh DB → no admin user/org/region →
  dashboard loads but is unusable. (Confirmed independent of the schema swap:
  reproduced identically with the old baseline dump.)
- `make down`/`wipe` now stop L2 first (stack-down.sh).
- stack-up.sh stops any stale L2 when it (re)creates L1, covering teardown paths
  that bypass make (stack-rebuild-l1-box, direct `boxlite rm`,
  `python -m boxlite_local down`).

Verified: working-stack → make wipe → make stack-up → 27 tables / 87 migrations,
admin user + org + region seeded, dashboard :3000 + /api proxy + dex all HTTP 200.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 10:01
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (4)

apps/infra-local/scripts/stack-up.sh:1

  • xargs -r is not supported by macOS/BSD xargs, so this command will fail on the target platform. Replace with a portable pattern (e.g., capture PIDs into a variable and only call kill when non-empty, or use xargs kill -9 guarded by if/grep -q .). Apply the same fix to the runner/proxy/dashboard blocks that use xargs -r.

apps/runner/pkg/boxlite/registry.go:1

  • The variable name linuxAmd64Platform is now misleading because it can resolve to non-amd64 architectures. Rename it to something accurate (e.g., linuxHostPlatform or defaultLinuxPlatform) to avoid confusion and incorrect future usage.

apps/package.json:1

  • Using a caret range in resolutions can reduce determinism (the resolved version may vary over time and across tooling). Consider pinning @types/express to an exact version to keep installs reproducible and avoid surprise type changes.

tsconfig.base.json:1

  • This appears to be a Git symlink (stored as a file containing the target path). On platforms/environments that don’t support symlinks (or where Git symlinks are disabled, common on Windows), this will become a plain text file and break TypeScript tooling expecting valid JSON at tsconfig.base.json. If Windows support is needed, consider a real JSON file that extends apps/tsconfig.base.json instead of a symlink.

defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment on lines +48 to +51
defaultRegionId: 'local',
defaultRegion: 'local',
role: 'OWNER',
}
userId: _USER_ID,
email: 'admin@boxlite.dev',
name: 'Local Admin',
role: 'owner',
Comment on lines +292 to +298
return HttpResponse.json({
  id: _USER_ID,
  email: 'admin@boxlite.dev',
  name: 'Local Admin',
  role: 'OWNER',
  personalOrganizationId: _ORG_ID,
})
Comment on lines +69 to +84
<PostHogProvider
  apiKey="phc_local_dev_no_op"
  options={{
    // api_host is required by posthog-js but never used because
    // capturing is opted out and flags are bootstrapped below.
    api_host: 'https://localhost.invalid',
    bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
    advanced_disable_feature_flags: true, // skip /decide network call
    autocapture: false,
    capture_pageview: false,
    capture_pageleave: false,
    opt_out_capturing_by_default: true,
    disable_session_recording: true,
    loaded: (posthog) => posthog.opt_out_capturing(),
  }}
>
@lilongen lilongen changed the title from "docs(infra-local): align with v0.1.0 + dev runner-score override" to "feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner)" on May 29, 2026
infra-vs-local-infra.md described the original docker-compose + Lima
plan and a lengthy "Why Lima vs HVF" comparison that no longer reflects
the team's direction (per review feedback). What actually shipped is the
dogfood approach with an M5-native runner, fully documented in
own-dog-food-local-infra-solution.md. Rather than keep patching a
superseded comparison doc, drop it.

Also fix the two dangling references that pointed at it:
- apps/infra-local/goal.md: the design-basis line now points at the
  existing apps/infra-local/ design instead of the deleted doc.
- docs/apps/own-dog-food-local-infra-solution.md: drop the "previous
  version of the proposal" link and the stale "update
  infra-vs-local-infra.md" follow-up item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 09:44
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

apps/runner/pkg/boxlite/registry.go:1

  • The variable name linuxAmd64Platform is now misleading because Architecture is derived from runtime.GOARCH (often arm64 on Apple Silicon). Rename it to something architecture-agnostic (e.g., linuxDefaultPlatform / linuxHostPlatform) to avoid confusion and prevent future misuse.

Comment on lines +70 to 75
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
Comment thread apps/infra-local/CONNECTIONS.md
Comment thread apps/infra-local/scripts/build-all-in-one-sql.py Outdated
lile and others added 2 commits May 29, 2026 17:53
Before: the local otel-collector exported every pipeline to `debug`
(stdout) only, and the jaeger box's OTLP receiver wasn't host-mapped or
fed by anything. Net result — the Jaeger UI at :26686 was always empty;
traces sent to the collector died at the debug exporter. (Flagged in
review by DorianZheng.)

Fix (option a — keep the collector in the path for prod parity):
- SPEC_JAEGER: host-map jaeger's OTLP gRPC receiver (26687 → box :4317).
  jaeger 1.67 all-in-one already has COLLECTOR_OTLP_ENABLED=true.
- _OTEL_CONFIG → _otel_config(cfg): add an `otlp/jaeger` exporter
  (host.boxlite.internal:26687, tls insecure) and add it to the
  *traces* pipeline alongside debug. metrics/logs stay debug-only
  (Jaeger doesn't ingest them).
- SPEC_OTEL: depends_on=["jaeger"] so the export target is up first.

Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the
collector (:24318) → within 2s the Jaeger query API returns the service
and the trace. 35 unit tests pass (topo sort handles the new
otel→jaeger edge).

Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8) — the
"Jaeger pipeline not connected" limitation is removed; the remaining
gap is only the custom boxlite_exporter build (ClickHouse/api push-back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two review comments:

- CONNECTIONS.md §10: the route table claimed `/` proxies to host :4000
  (dashboard). It doesn't — the Caddyfile `handle /` returns a static
  help page, and the Proxy (:4000) is reached only by the signed
  port-preview Host matcher (`<port>-<token>.localhost`). Table now
  matches `_caddyfile()`.
- services.py module docstring still said "Phase 3b … otel-collector
  deferred". otel + caddy have shipped; rewrote it to list the actual
  10 boxes + 1 one-shot and point at `_otel_config()` / `_caddyfile()`.
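
To illustrate the port-preview Host matcher, here is a sketch that splits a `<port>-<token>.localhost` host into its parts. The real matching happens in the Caddyfile; the function name and the token alphabet (letters, digits, underscore) are assumptions, since the PR does not show the token format.

```python
import re

# Optional trailing :port covers Host headers that carry the listener port.
_PREVIEW_HOST = re.compile(
    r"^(?P<port>\d{1,5})-(?P<token>[A-Za-z0-9_]+)\.localhost(?::\d+)?$"
)


def parse_preview_host(host: str):
    """Return (port, token) for a port-preview Host header, else None."""
    match = _PREVIEW_HOST.match(host)
    if match is None:
        return None
    port = int(match.group("port"))
    # Reject values outside the valid TCP port range.
    if not 1 <= port <= 65535:
        return None
    return port, match.group("token")
```

A plain `localhost` host returns `None`, which corresponds to the case where Caddy serves the static help page instead of forwarding to the Proxy.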

Docs/comment only — no behavior change. 35 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 10:05

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

tsconfig.base.json:1
  • tsconfig.base.json must be valid JSON for TypeScript/Nx tooling; the current content is a plain path string and will cause parsing failures. If the intent is to share apps/tsconfig.base.json, either commit a real symlink (so the repo root tsconfig.base.json resolves to the apps file) or replace this with a minimal JSON file that extends the apps base config.
libs:1
  • A root-level libs entry is expected to be a directory (or a symlink to a directory) in most workspaces; committing it as a plain file containing a path will break module resolution and filesystem expectations. If this is intended to be a symlink to apps/libs, commit it as an actual symlink (or adjust workspace configuration so code references apps/libs directly).
apps/runner/pkg/boxlite/registry.go:1
  • The identifier linuxAmd64Platform is now misleading because it no longer hardcodes amd64 and can evaluate to arm64, etc. Rename it to something architecture-agnostic (e.g., linuxHostPlatform / defaultLinuxPlatform) to avoid incorrect assumptions elsewhere.
apps/package.json:1
  • The specified node-forge version ^1.4.0 may not exist on npm (as of common published versions, node-forge is typically 1.3.x). If this version is not actually published, installs will fail; please verify the published version and pin to an existing release (or update to the correct package/version if a different fork is intended).

Comment on lines +72 to +74
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
Comment on lines +75 to +76
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
Comment thread apps/infra-local/boxlite_local/config.py Outdated
lile and others added 2 commits May 29, 2026 18:10
…one-sql.py

The file carried `Copyright Daytona Platforms Inc. / SPDX-License-Identifier:
AGPL-3.0` — an accidental copy-paste from a Daytona-derived apps/api file it
was modeled on. It's the only one of the 30 infra-local source files with
that header; the other 29 (orchestrator.py, services.py, all stack-*.sh, …)
carry none. Removing it for consistency with the rest of infra-local. No
license-header CI check requires it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The README/CONNECTIONS docs tell users they can set
`BOXLITE_DATA_DIR=~/.boxlite-local/data`, but `Path("~/...")` does not
expand `~` — it would create a literal "~" directory under the cwd and
silently put PG/MinIO/registry data there. `.load()` now applies
`.expanduser()` to the constructed data_dir path.

Added test_load_expands_tilde_in_data_dir_env. 36 unit tests pass.
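
A minimal sketch of the behavior this fix relies on. `resolve_data_dir` and its default value are hypothetical stand-ins for the relevant part of `.load()`, which is not shown here; the `pathlib` calls are standard library.

```python
from pathlib import Path


def resolve_data_dir(env: dict[str, str], default: str = "~/.boxlite-local/data") -> Path:
    """Hypothetical helper: read BOXLITE_DATA_DIR and expand a leading `~`."""
    raw = env.get("BOXLITE_DATA_DIR", default)
    # Path() alone keeps "~" as a literal path component; expanduser()
    # replaces it with the current user's home directory.
    return Path(raw).expanduser()


# Without expanduser(), the first component is the literal string "~".
unexpanded = Path("~/.boxlite-local/data")
expanded = resolve_data_dir({"BOXLITE_DATA_DIR": "~/.boxlite-local/data"})
```

Absolute paths pass through `expanduser()` unchanged, so applying it unconditionally is safe for values that contain no `~`.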

(Flagged by Copilot review.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 10:12

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (6)

apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now misleading since it can be arm64 (or other). Rename to something architecture-neutral (e.g. linuxHostPlatform / defaultLinuxPlatform) so callers don’t infer it is always amd64.
tsconfig.base.json:1
  • This appears to be a committed symlink (Git stores symlinks as a file whose contents is the target path). Symlinks can be problematic on Windows and in some packaging/CI environments; consider replacing this with a real tsconfig.base.json that uses extends to reference apps/tsconfig.base.json, or document/enforce symlink support in contributing/CI.
libs:1
  • Similar to tsconfig.base.json, this looks like a committed symlink to apps/libs. If cross-platform support matters, consider using an actual directory or a tooling-level path mapping instead of a symlink, or ensure CI and team dev environments can reliably handle symlinks.
apps/package.json:1
  • Based on versions available up to my knowledge cutoff (Aug 2025), node-forge@^1.4.0 and @swc/core@^1.15.33 look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
apps/package.json:1
  • Using a caret range in resolutions can undermine determinism and make installs vary across time. Prefer pinning an exact version in resolutions (and relying on the lockfile) unless there’s a specific reason to allow upgrades here.

Comment on lines +30 to +42
ts: TopologicalSorter[str] = TopologicalSorter()
for name, spec in services.items():
    ts.add(name, *spec.depends_on)
ts.prepare()
layers: list[list[str]] = []
while ts.is_active():
    layer = sorted(ts.get_ready())
    if not layer:
        break
    layers.append(layer)
    for name in layer:
        ts.done(name)
return layers
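
The excerpt above groups services into start-up layers with `graphlib.TopologicalSorter`. A self-contained sketch of that technique follows; `ServiceSpec` and `start_layers` are hypothetical stand-ins, since the enclosing function and the spec type are not part of the quoted lines.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ServiceSpec:
    """Hypothetical stand-in for the real spec type; only depends_on is used."""
    depends_on: list[str] = field(default_factory=list)


def start_layers(services: dict[str, ServiceSpec]) -> list[list[str]]:
    """Group services into layers; each layer depends only on earlier layers."""
    ts: TopologicalSorter[str] = TopologicalSorter()
    for name, spec in services.items():
        ts.add(name, *spec.depends_on)
    ts.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    layers: list[list[str]] = []
    while ts.is_active():
        # Everything returned by get_ready() has all of its dependencies done.
        layer = sorted(ts.get_ready())
        if not layer:
            break
        layers.append(layer)
        for name in layer:
            ts.done(name)
    return layers


services = {
    "jaeger": ServiceSpec(),
    "otel-collector": ServiceSpec(depends_on=["jaeger"]),
    "postgres": ServiceSpec(),
}
layers = start_layers(services)  # [["jaeger", "postgres"], ["otel-collector"]]
```

Sorting each layer keeps the start order deterministic, which makes the result easy to assert on in unit tests.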
Comment on lines +70 to 75
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
Comment thread apps/api/project.json
"main": "apps/api/src/main.ts",
"tsConfig": "apps/api/tsconfig.app.json",
"generatePackageJson": true,
"generatePackageJson": false,
lile and others added 2 commits May 29, 2026 18:21
# Conflicts:
#	.gitignore
#	apps/go.work
…on after main merge

Merging origin/main silently reverted an infra-local host-mode
adaptation. main refactored its tsconfig layout (api extends
`../tsconfig.base.json` = apps/tsconfig.base.json); feat/cloud-mvp had
deliberately pointed the api at the repo-root base config
(`../../tsconfig.base.json`, a symlink to the same file). Relative to
the merge-base only main touched this line, so git took main's value
with no conflict — a silent semantic regression.

Why it matters: under host-mode `nx serve api` (cwd=apps/, with the
`apps/apps -> .` symlink), the path used to extend the base config
decides how tsc resolves `rootDir: "."`:
  - ../../tsconfig.base.json → rootDir = repo root → apps/libs/* in range
  - ../tsconfig.base.json    → rootDir = apps/     → apps/libs/runner-api-client
                                                     out of range → TS6059 ×529
The api imports @boxlite-ai/runner-api-client (lives in apps/libs/), so
the collector/runner client sources must be inside rootDir. With main's
value the API failed to compile → `make stack-up` died at "api failed to
become healthy".

Restoring `../../tsconfig.base.json` re-applies the adaptation. Verified:
`make stack-up` now boots the API clean (webpack compiled successfully)
through to an active default snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 11:39

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

tsconfig.base.json:1
  • tsconfig.base.json is not valid JSON as committed (it contains only a path). If this is intended to be a symlink, ensure it’s committed as a symlink in git (mode 120000); otherwise TypeScript tooling will fail when reading the root tsconfig. If symlinks are not desired/portable, replace this with an actual JSON tsconfig that extends ./apps/tsconfig.base.json.
libs:1
  • This file content looks like a symlink target (and as a regular file it’s not a valid directory). If the intent is a repository-root libs symlink to apps/libs, make sure it’s committed as a symlink; otherwise consumers expecting a libs/ directory at repo root will break.
apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now misleading because it no longer hardcodes amd64 (it uses runtime.GOARCH). Rename it to reflect the new semantics (e.g., linuxHostPlatform, linuxDefaultPlatform, or linuxPlatformForHostArch) to avoid confusing future readers and callers.
apps/package.json:1
  • As of my knowledge cutoff (Aug 2025), node-forge’s latest published version is 1.3.1 and ^1.4.0 is likely not resolvable from npm. Please verify the intended version exists; otherwise installs will fail in CI. If the goal is simply to add node-forge, consider pinning to ^1.3.1 (or whatever the current published latest is) and updating @types/node-forge accordingly.

Comment on lines +69 to +83
<PostHogProvider
  apiKey="phc_local_dev_no_op"
  options={{
    // api_host is required by posthog-js but never used because
    // capturing is opted out and flags are bootstrapped below.
    api_host: 'https://localhost.invalid',
    bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
    advanced_disable_feature_flags: true, // skip /decide network call
    autocapture: false,
    capture_pageview: false,
    capture_pageleave: false,
    opt_out_capturing_by_default: true,
    disable_session_recording: true,
    loaded: (posthog) => posthog.opt_out_capturing(),
  }}
Comment on lines +69 to +76
proxyToolboxUrl: 'http://localhost:28080',
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
})
} as Partial<BoxliteConfiguration>)
Comment on lines +105 to 111
// Bootstrap-aware (mirror evaluateFlag's behavior for object resolution).
if (!this.isConfigured || !this.client) {
// Object flags are rare in our usage; not added to bootstrapFlags map
// (typed as boolean | string | number). Fall through to default.
logger.debug(`PostHog not configured, returning default value for flag ${flagKey}`)
return {
value: defaultValue,

3 participants