feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner) #595
lilongen wants to merge 88 commits into
…gn docs
Migrated from session work on fix/sandbox-from-image-alpine to start
dedicated feat/cloud-mvp track. Contents:
apps/infra-local/
  - goal.md
  - poc/single_service.py           Phase 0: single postgres box PoC (✅)
  - poc/multi_service.py            Phase 1: multi-service + host-as-hub (✅)
  - poc/diagnose_network.py         network diagnostic for box-to-box
  - poc/diagnose_network.result     captured diagnostic output
  - poc/README.md                   Phase 0/1 docs + pass criteria
docs/apps/
  - cloud-mvp-plan.md                             Foundation-first MVP roadmap (rewritten)
  - cloud-mvp-plan.md.bak-mvp-deadline-version    prior team/deadline version
  - own-dog-food-local-infra-solution.md          dogfood orchestrator design + Phase 1 results
  - infra-vs-local-infra.md                       apps/infra vs apps/infra-local comparison
  - apps-overview.md                              apps/ one-pager
  - apps-comprehensive.md                         apps/ full breakdown
  - apps-api-overview.md                          apps/api NestJS/TypeORM walkthrough
  - api-client-go.md                              apps/api-client-go auto-gen overview
  - sdk-feedback/                                 dogfood-surfaced SDK gaps:
    - 01-host-boxlite-internal-unwired.md (+ linear-friendly variants)
    - 02-postgres-trust-via-host-as-hub.md
BoxLite cloud MVP.md                input PRD-style brief
PoC status:
- Phase 0 ✅ single postgres box, all 7 sub-phases pass
- Phase 1 ✅ multi-box, host-as-hub via Mac LAN IP, detach=True works
- 2 SDK bugs surfaced + documented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 multi-service + box-to-box networking via host.boxlite.internal validated end-to-end (12/12 phases pass). Revert PoC to non-default host ports (25432/26379) as the durable hygiene rule for any service that could collide with a local dev install; lift the same rule into the design doc (§3.8) and bake it into the planned doctor preflight (§1.7.F).
Concretizes parent design doc §12.2 into a Phase-2 implementation contract: walking skeleton (postgres-only end-to-end) before scaling to the full 10-service orchestrator. Flat package layout, explicit SERVICES registry, doctor port-preflight, integration tests on real BoxLite. Ready for handoff to writing-plans.
Bite-sized 10-task plan derived from the Phase 2 spec. TDD where the logic is testable in isolation (config, lsof parsing, topo_sort); integration test for the end-to-end orchestrator flow. Ready for subagent-driven or inline execution.
…pr + cover data_dir fallback
…ter test coverage
…althy) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g_dir + empty-graph coverage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ipe assertion + skip-doctor in itest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator helpers (_http_probe + _is_already_running_error); integration test proves end-to-end 5-service round-trip on real BoxLite.
… exception (debt #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
minio/minio:latest (RHEL UBI base) ships layers with directories having no owner-write bit. The SDK's per-start rootfs merge then fails with "storage error: Failed to ... Permission denied (os error 13)" or a "RustPanic" at write time. Apply owner-write idempotently to the extracted image cache before each start. Idempotent + cheap (~10ms). Remove when SDK fixes rootfs-merge to relax dir perms at extract time.
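For readers who want to see the shape of the workaround described above, here is a minimal sketch of an idempotent owner-write pass over a directory tree. It is not the code from this PR: the function name `ensure_owner_write` and the idea of passing the cache root as an argument are assumptions made for illustration, and the real image-cache location is not stated in the commit message.

```python
import os
import stat
from pathlib import Path


def ensure_owner_write(root: Path) -> int:
    """Add the owner-write bit to every directory under `root` that lacks it.

    Returns the number of directories changed, so a second run over the
    same tree returns 0 (the pass is idempotent).
    """
    changed = 0
    # os.walk is top-down by default, so a parent is fixed before its children.
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = stat.S_IMODE(os.lstat(dirpath).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(dirpath, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

Only directories are touched, which matches the failure described (directories without an owner-write bit); file permissions are left as extracted.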
exporters:
  debug:
    verbosity: basic
Otel collector is wired but doesn't feed Jaeger — local Jaeger UI will be empty.
Prod's OtelCollector (apps/infra/sst.config.ts:438) is BoxLite's custom build with the ClickHouse exporter. Here the local collector uses upstream otel/opentelemetry-collector:latest with exporters: debug only — traces land on stdout, never reach SPEC_JAEGER.
Two clean fixes, pick one:
- (a) Wire it through: add an `otlp` exporter pointing at `jaeger:4317` and add jaeger to the `pipelines.traces.exporters` list. Local Jaeger UI becomes useful.
- (b) Drop `SPEC_OTEL` entirely: Jaeger 1.67 accepts OTLP natively (`COLLECTOR_OTLP_ENABLED=true` is already set on `SPEC_JAEGER`); have api/runner send OTLP straight to jaeger and remove the collector hop.
Right now we ship both services but they don't talk, which is worse than shipping neither. Worth fixing in this PR or immediate follow-up.
There was a problem hiding this comment.
Fixed in 01850227 — went with option (a), keeping the collector in the path for prod parity.
- `_otel_config` now adds an `otlp/jaeger` exporter and puts it on the traces pipeline alongside `debug` (metrics/logs stay debug-only; Jaeger doesn't ingest them).
- Jaeger's OTLP gRPC receiver is now host-mapped (`26687` → box `:4317`); the collector reaches it via host-as-hub (`host.boxlite.internal:26687`). `COLLECTOR_OTLP_ENABLED=true` was already set.
- `SPEC_OTEL` now `depends_on=["jaeger"]`.
Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the collector (:24318) → the Jaeger query API returns the service + trace within ~2s, so the Jaeger UI is no longer empty. Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8). The only remaining gap vs prod is the custom boxlite_exporter build (ClickHouse / api push-back), still noted as a limitation.
New principle on this branch: local infra targets the M5 native runner only. Lima-based multi-host runner support is being explored in a separate worktree and is deliberately not in scope here.
Doc / comment edits:
- apps/infra-local/goal.md: translated from Chinese to English per CLAUDE.md "Documentation Language" rule.
- apps/infra-local/tests/integration/test_e2e_full.py: drop "and Lima runner VM later" from the resource-budget docstring.
- docs/apps/infra-local-status.md: drop "no Lima" qualifiers from the platform line and the L2 runner row.
- docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md: drop "no Lima" from the headline; reword to "everything runs natively on M5".
- docs/apps/infra-vs-local-infra.md: replace the entire §2 "Why Lima instead of HVF" decision archive (180+ lines) with a short "Runner placement on this branch — M5 native (HVF)" section. Update §1 topology + design decisions, §3 comparison-table rows (runner / sandbox isolation / autoscaler InfraProvider / multi-runner support), §4.1/4.2 asymmetries, §6 file pointers, and §7 one-sentence summary to match the M5-native reality. Production-parity tradeoff is acknowledged but flagged as future work outside this milestone.
- docs/apps/own-dog-food-local-infra-solution.md: rewrite §2.2, §2.4 (runner path), key-design-choices list, repo layout tree, §5.1 resource budget, §11 decision table, and §12.2 phase plan to describe the M5 native runner instead of a runner-in-Lima.
Verification:
- grep -iw "lima|limactl|LimaInfraProvider" → 0 hits across all PR-scope files.
- Python CJK regex check → 0 CJK chars across all PR-scope files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke install
The previous §1 jumped from "yarn / go / python already installed"
straight to `make stack-build && make stack-up`, skipping two things
a fresh checkout actually needs:
1. The Python orchestrator package isn't installed yet — `python -m
boxlite_local` doesn't work until `make install` runs `pip install
-e ".[test]"`.
2. The `boxlite` Python SDK and CLI must already be present in the
active environment (it's a transitive dep of `boxlite_local`, not
installed by `make install`).
Restructured §1 into three sub-sections:
- §1.1 Prereqs — table listing the actual required tools + versions,
plus a 3-line sanity check that surfaces missing prereqs before
`make` runs and produces a less-actionable failure.
- §1.2 Three-step bring-up — now correctly shows
make install (pip install the orchestrator package)
make stack-build (yarn + go builds)
make stack-up (L1 + L2 + seed)
with timing expectations (5-7 min cold, ~30 s-1 min warm).
- §1.3 First-time dashboard login — explicit credentials + the
end-to-end smoke (create sandbox → terminal → root@boxlite:~#)
so first-time users know what success looks like.
No behavior change — purely documentation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (4)
apps/runner/pkg/boxlite/registry.go:1
- `linuxAmd64Platform` no longer represents amd64 specifically (it’s now host-arch dependent). Rename to something like `linuxHostPlatform`/`defaultLinuxPlatform` to avoid misleading future readers and call sites.
apps/package.json:1
- Using a caret range in `resolutions` can reduce determinism (you may get different transitive trees over time). Prefer pinning `@types/express` to an exact version here (and ensure the lockfile resolves consistently) to avoid unexpected type changes across installs.
tsconfig.base.json:1
- This file content is not valid JSON unless it’s being committed as a git symlink target. Ensure `tsconfig.base.json` is actually a symlink (mode 120000) to `apps/tsconfig.base.json`; otherwise TypeScript tooling will fail to parse it.
docs/apps/infra-vs-local-infra.md:1
- These port numbers conflict with the rest of the infra-local docs/scripts in this PR (API is started on :3001; dashboard on :3000). Update the table to reflect the actual local ports so readers don’t follow incorrect endpoints.
  defaultSnapshot: 'ubuntu:22.04',
  dashboardUrl: 'http://localhost:3000',
  maxAutoArchiveInterval: 43200,
  maintananceMode: false,
  http.get(`${API_URL}/*`, async ({ request }) => {
    console.log('[MSW catch-all GET]', request.url)
    if (request.url.includes('paginated')) {
      return HttpResponse.json({ items: [], totalItems: 0, totalPages: 0 })
    }
    return HttpResponse.json([])
  }),

  // Catch-all POST/PUT/PATCH/DELETE — return {} so mutations don't error.
  http.post(`${API_URL}/*`, async ({ request }) => {
    console.log('[MSW catch-all POST]', request.url)
    return HttpResponse.json({})
  }),
  http.put(`${API_URL}/*`, async ({ request }) => {
    console.log('[MSW catch-all PUT]', request.url)
    return HttpResponse.json({})
  }),
  http.patch(`${API_URL}/*`, async ({ request }) => {
    console.log('[MSW catch-all PATCH]', request.url)
    return HttpResponse.json({})
  }),
  http.delete(`${API_URL}/*`, async ({ request }) => {
    console.log('[MSW catch-all DELETE]', request.url)
    return HttpResponse.json({})
  }),
<PostHogProvider
  apiKey="phc_local_dev_no_op"
  options={{
    // api_host is required by posthog-js but never used because
    // capturing is opted out and flags are bootstrapped below.
    api_host: 'https://localhost.invalid',
    bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
    advanced_disable_feature_flags: true, // skip /decide network call
    autocapture: false,
    capture_pageview: false,
    capture_pageleave: false,
    opt_out_capturing_by_default: true,
    disable_session_recording: true,
    loaded: (posthog) => posthog.opt_out_capturing(),
  }}
… setup Add the first-time bring-up commands (make install + make stack-build + make stack-up) at the top of the TL;DR cheat sheet so the entire day-one workflow is visible in one block, without having to scroll to §1.2. Day-to-day flow keeps its own bring-up line for clarity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot
After a machine reboot the L1 microVM boxes are gone but the postgres
data volume (~/.boxlite-local/data/pg/) persists on disk with the full
schema. `make stack-up` sees no postgres box, runs `make up-with-schema`,
which brings the box back (schema already present) and then runs
`make load-schema` — which previously hard-failed:
ERR: public schema already has 27 table(s).
Schema baseline is not idempotent. Run 'make wipe && make up' first
So every post-reboot `make stack-up` died at the schema step.
Fix: apply-schema.sh now treats an already-loaded schema as a no-op
instead of an error. When the public schema is non-empty it checks the
`migrations` table to distinguish:
- COMPLETE prior load (tables + migrations recorded) → skip, exit 0
- PARTIAL half-applied baseline (tables but no migrations) → still
refuse with exit 3 (genuinely broken state; needs `make wipe`)
This makes load-schema / up-with-schema / stack-up all idempotent across
reboots. The non-idempotent baseline itself is unchanged — we just stop
trying to re-apply it when it's already there.
Verified on a live post-reboot stack:
Schema already loaded (27 tables, 88 migrations recorded) — skipping.
exit=0
and `make stack-up` then proceeds to L2 (api/runner/proxy/dashboard all up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
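The skip/refuse decision described in the commit message above can be sketched as a small pure function. This is an illustration only, not the logic in apply-schema.sh (which is a shell script): `SchemaState`, `classify_schema`, and `EXIT_CODES` are hypothetical names, and the sketch assumes the caller has already counted the tables in the public schema and the rows in `migrations` (treating a missing `migrations` table as zero rows).

```python
from enum import Enum


class SchemaState(Enum):
    EMPTY = "empty"        # no tables yet: safe to load the schema
    COMPLETE = "complete"  # tables + recorded migrations: skip, exit 0
    PARTIAL = "partial"    # tables but no migrations: refuse, exit 3


def classify_schema(table_count: int, migration_count: int) -> SchemaState:
    """Map the two counts onto the three states from the commit message."""
    if table_count == 0:
        return SchemaState.EMPTY
    if migration_count > 0:
        return SchemaState.COMPLETE
    return SchemaState.PARTIAL


# Exit codes as stated in the commit message. EMPTY has no fixed code here
# because the result depends on whether the load itself succeeds.
EXIT_CODES = {SchemaState.COMPLETE: 0, SchemaState.PARTIAL: 3}
```

With the post-reboot numbers quoted above, `classify_schema(27, 88)` falls in the `COMPLETE` branch, which corresponds to the "skipping" message and exit 0.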
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (5)
tsconfig.base.json:1
- This file content is not valid JSON. If the intent is to make the repo-root `tsconfig.base.json` a symlink to `apps/tsconfig.base.json`, ensure it is committed as a git symlink (mode 120000). If it lands as a regular file, TypeScript tooling reading `tsconfig.base.json` will fail to parse it.
libs:1
- Like `tsconfig.base.json`, this looks like it’s meant to be a symlink. If it is committed as a normal file containing the text `apps/libs`, any tooling expecting `libs/` to be a directory (or a workspace path) will break. Ensure this is a git symlink (mode 120000) rather than a regular file.
apps/infra-local/scripts/stack-up.sh:1
- On macOS/BSD, `xargs` does not support `-r` (GNU-only), so this will error and can break `set -e` flows. Replace the pipeline with a macOS-compatible pattern (e.g., capture PIDs and conditionally `kill`, or use an `xargs` invocation that does not rely on `-r`). Apply the same fix to the other `xargs -r` occurrences in this script.
apps/runner/pkg/boxlite/registry.go:1
- The variable name `linuxAmd64Platform` is now inaccurate since it can be `arm64` (or others). Rename it to reflect the new behavior (e.g., `linuxDefaultPlatform`, `linuxHostArchPlatform`, etc.) to avoid confusing future readers.
apps/package.json:1
- As of Aug 2025, `@swc/core` versions were not in the `1.15.x` range; `^1.15.33` may be a non-existent release and would break installs. Please confirm the intended `@swc/core` version exists in npm (and align it with the repo’s SWC toolchain expectations).
const KEY = '__msw_api_keys__'
const load = (): ApiKey[] => {
  try { return JSON.parse(sessionStorage.getItem(KEY) ?? '[]') } catch { return [] }
}
const save = (s: ApiKey[]) => sessionStorage.setItem(KEY, JSON.stringify(s))
Goal: a single `make stack-up` should work from a fresh checkout, after a reboot, or after `make stack-down`, with no need to remember to run `make install` / `make stack-build` first.
Rather than wiring install/stack-build as hard `make` prerequisites (which would force a pip-resolve + go-build on *every* stack-up, including the fast daily restart loop), stack-up.sh now does both checks *conditionally* so the common restart path pays nothing:
- New: if `python -c "import boxlite_local"` fails, run `make install` before bringing up L1 (which calls `python -m boxlite_local`).
- Existing (kept): if /tmp/boxlite-runner or /tmp/boxlite-proxy is missing, run stack-build.sh. Clarified in a comment that it only builds when missing; use `make stack-restart COMPONENTS=runner` to rebuild after a source change.
Combined with the load-schema idempotency fix, `make stack-up` is now the single entry point in all scenarios:
  fresh checkout → install + up-with-schema + build + L2
  post-reboot    → up-with-schema (schema skip) + build (/tmp cleared) + L2
  post-down      → up-with-schema (boxes back) + L2 (binaries + pkg present)
Docs updated (README §Quick start, infra-local-usage §0 + §1.2) to present `make stack-up` as the one command, with the explicit targets kept as optional for forcing a rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
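As a rough illustration of the two conditional checks in this commit, the sketch below computes which preparatory steps a bring-up would need. It is not stack-up.sh (which is a shell script); `pending_steps` is a hypothetical helper, and it approximates `python -c "import boxlite_local"` with an import-spec lookup rather than a real import.

```python
import importlib.util
from pathlib import Path

# Binary locations as named in the commit message.
DEFAULT_BINARIES = (Path("/tmp/boxlite-runner"), Path("/tmp/boxlite-proxy"))


def pending_steps(package: str = "boxlite_local",
                  binaries: tuple = DEFAULT_BINARIES) -> list:
    """Return the preparatory steps needed before bring-up, in order."""
    steps = []
    # Install only when the orchestrator package cannot be found.
    if importlib.util.find_spec(package) is None:
        steps.append("make install")
    # Build only when at least one native binary is missing.
    if any(not Path(b).exists() for b in binaries):
        steps.append("stack-build.sh")
    return steps
```

On the daily restart path both checks pass and the list is empty, which is the "common restart path pays nothing" property the commit describes.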
…e→stack-up dashboard
Schema loading
- Add scripts/build-all-in-one-sql.py: consolidates every apps/api TypeORM
migration (legacy + pre/post-deploy, 87 total → 539 stmts) into a single
sql/merged-schema.auto-gen.sql. Resolves TS-side ${...} interpolations,
inlines parameterized queries, and mirrors TypeORM's enum/constraint
auto-renames so the output loads cleanly from zero.
- `make load-schema` now regenerates + loads the merged schema; apply-schema.sh
defaults to it and accepts a SCHEMA_SQL_FILE override.
- Drop sql/schema-baseline.sql + sql/REFRESH.md: the prod pg_dump is no longer
the load source; schema is now generated from migrations (kept reachable via
SCHEMA_SQL_FILE if ever needed for an A/B comparison).
Fix: wipe → stack-up left the dashboard non-functional
- Root cause: `make down`/`wipe` tore down only the L1 boxes, leaving the L2
native procs (api/runner/proxy/dashboard) running. After a wipe the stale API
held connections to the destroyed-and-recreated DB and never re-ran
onApplicationBootstrap against the fresh DB → no admin user/org/region →
dashboard loads but is unusable. (Confirmed independent of the schema swap:
reproduced identically with the old baseline dump.)
- `make down`/`wipe` now stop L2 first (stack-down.sh).
- stack-up.sh stops any stale L2 when it (re)creates L1, covering teardown paths
that bypass make (stack-rebuild-l1-box, direct `boxlite rm`,
`python -m boxlite_local down`).
Verified: working-stack → make wipe → make stack-up → 27 tables / 87 migrations,
admin user + org + region seeded, dashboard :3000 + /api proxy + dex all HTTP 200.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (4)
apps/infra-local/scripts/stack-up.sh:1
- `xargs -r` is not supported by macOS/BSD `xargs`, so this command will fail on the target platform. Replace with a portable pattern (e.g., capture PIDs into a variable and only call `kill` when non-empty, or use `xargs kill -9` guarded by `if`/`grep -q .`). Apply the same fix to the runner/proxy/dashboard blocks that use `xargs -r`.
apps/runner/pkg/boxlite/registry.go:1
- The variable name `linuxAmd64Platform` is now misleading because it can resolve to non-amd64 architectures. Rename it to something accurate (e.g., `linuxHostPlatform` or `defaultLinuxPlatform`) to avoid confusion and incorrect future usage.
apps/package.json:1
- Using a caret range in `resolutions` can reduce determinism (the resolved version may vary over time and across tooling). Consider pinning `@types/express` to an exact version to keep installs reproducible and avoid surprise type changes.
tsconfig.base.json:1
- This appears to be a Git symlink (stored as a file containing the target path). On platforms/environments that don’t support symlinks (or where Git symlinks are disabled, common on Windows), this will become a plain text file and break TypeScript tooling expecting valid JSON at `tsconfig.base.json`. If Windows support is needed, consider a real JSON file that `extends` `apps/tsconfig.base.json` instead of a symlink.
  defaultRegionId: 'local',
  defaultRegion: 'local',
  role: 'OWNER',
}

  userId: _USER_ID,
  email: 'admin@boxlite.dev',
  name: 'Local Admin',
  role: 'owner',

return HttpResponse.json({
  id: _USER_ID,
  email: 'admin@boxlite.dev',
  name: 'Local Admin',
  role: 'OWNER',
  personalOrganizationId: _ORG_ID,
})
infra-vs-local-infra.md described the original docker-compose + Lima plan and a lengthy "Why Lima vs HVF" comparison that no longer reflects the team's direction (per review feedback). What actually shipped is the dogfood approach with an M5-native runner, fully documented in own-dog-food-local-infra-solution.md. Rather than keep patching a superseded comparison doc, drop it.
Also fix the two dangling references that pointed at it:
- apps/infra-local/goal.md: the design-basis line now points at the existing apps/infra-local/ design instead of the deleted doc.
- docs/apps/own-dog-food-local-infra-solution.md: drop the "previous version of the proposal" link and the stale "update infra-vs-local-infra.md" follow-up item.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
apps/runner/pkg/boxlite/registry.go:1
- The variable name `linuxAmd64Platform` is now misleading because `Architecture` is derived from `runtime.GOARCH` (often `arm64` on Apple Silicon). Rename it to something architecture-agnostic (e.g., `linuxDefaultPlatform`/`linuxHostPlatform`) to avoid confusion and prevent future misuse.
  defaultSnapshot: 'ubuntu:22.04',
  dashboardUrl: 'http://localhost:3000',
  maxAutoArchiveInterval: 43200,
  maintananceMode: false,
  environment: 'local',
  billingApiUrl: BILLING_API_URL,
Before: the local otel-collector exported every pipeline to `debug` (stdout) only, and the jaeger box's OTLP receiver wasn't host-mapped or fed by anything. Net result: the Jaeger UI at :26686 was always empty; traces sent to the collector died at the debug exporter. (Flagged in review by DorianZheng.)
Fix (option a: keep the collector in the path for prod parity):
- SPEC_JAEGER: host-map jaeger's OTLP gRPC receiver (26687 → box :4317). jaeger 1.67 all-in-one already has COLLECTOR_OTLP_ENABLED=true.
- _OTEL_CONFIG → _otel_config(cfg): add an `otlp/jaeger` exporter (host.boxlite.internal:26687, tls insecure) and add it to the *traces* pipeline alongside debug. metrics/logs stay debug-only (Jaeger doesn't ingest them).
- SPEC_OTEL: depends_on=["jaeger"] so the export target is up first.
Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the collector (:24318) → within 2s the Jaeger query API returns the service and the trace. 35 unit tests pass (topo sort handles the new otel→jaeger edge).
Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8): the "Jaeger pipeline not connected" limitation is removed; the remaining gap is only the custom boxlite_exporter build (ClickHouse/api push-back).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
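To make the resulting collector wiring concrete, here is a hedged sketch of a config builder that yields the pipeline layout this commit describes. It is not the PR's `_otel_config(cfg)`: the function name `build_otel_config` is hypothetical, and the `receivers` section (OTLP over gRPC and HTTP with default settings) is an assumption, since the commit message only specifies the exporters and pipelines.

```python
import json


def build_otel_config(jaeger_endpoint: str = "host.boxlite.internal:26687") -> dict:
    """Collector config as a dict: traces fan out to debug + Jaeger."""
    return {
        # Assumed receiver block; the commit message does not spell it out.
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {
            "debug": {"verbosity": "basic"},
            # Plain-text OTLP/gRPC to the host-mapped Jaeger receiver.
            "otlp/jaeger": {
                "endpoint": jaeger_endpoint,
                "tls": {"insecure": True},
            },
        },
        "service": {
            "pipelines": {
                "traces": {"receivers": ["otlp"],
                           "exporters": ["debug", "otlp/jaeger"]},
                # Jaeger does not ingest metrics or logs: debug only.
                "metrics": {"receivers": ["otlp"], "exporters": ["debug"]},
                "logs": {"receivers": ["otlp"], "exporters": ["debug"]},
            }
        },
    }


if __name__ == "__main__":
    # JSON is a subset of YAML, so this output can be fed to a YAML loader.
    print(json.dumps(build_otel_config(), indent=2))
```

Only the traces pipeline lists `otlp/jaeger`, mirroring the point that metrics and logs stay debug-only.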
Addresses two review comments:
- CONNECTIONS.md §10: the route table claimed `/` proxies to host :4000 (dashboard). It doesn't: the Caddyfile `handle /` returns a static help page, and the Proxy (:4000) is reached only by the signed port-preview Host matcher (`<port>-<token>.localhost`). Table now matches `_caddyfile()`.
- services.py module docstring still said "Phase 3b … otel-collector deferred". otel + caddy have shipped; rewrote it to list the actual 10 boxes + 1 one-shot and point at `_otel_config()` / `_caddyfile()`.
Docs/comment only, no behavior change. 35 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
tsconfig.base.json:1
- `tsconfig.base.json` must be valid JSON for TypeScript/Nx tooling; the current content is a plain path string and will cause parsing failures. If the intent is to share `apps/tsconfig.base.json`, either commit a real symlink (so the repo root `tsconfig.base.json` resolves to the apps file) or replace this with a minimal JSON file that `extends` the apps base config.
libs:1
- A root-level `libs` entry is expected to be a directory (or a symlink to a directory) in most workspaces; committing it as a plain file containing a path will break module resolution and filesystem expectations. If this is intended to be a symlink to `apps/libs`, commit it as an actual symlink (or adjust workspace configuration so code references `apps/libs` directly).
apps/runner/pkg/boxlite/registry.go:1
- The identifier `linuxAmd64Platform` is now misleading because it no longer hardcodes `amd64` and can evaluate to `arm64`, etc. Rename it to something architecture-agnostic (e.g., `linuxHostPlatform`/`defaultLinuxPlatform`) to avoid incorrect assumptions elsewhere.
apps/package.json:1
- The specified `node-forge` version `^1.4.0` may not exist on npm (as of common published versions, `node-forge` is typically `1.3.x`). If this version is not actually published, installs will fail; please verify the published version and pin to an existing release (or update to the correct package/version if a different fork is intended).
…one-sql.py The file carried `Copyright Daytona Platforms Inc. / SPDX-License-Identifier: AGPL-3.0` — an accidental copy-paste from a Daytona-derived apps/api file it was modeled on. It's the only one of the 30 infra-local source files with that header; the other 29 (orchestrator.py, services.py, all stack-*.sh, …) carry none. Removing it for consistency with the rest of infra-local. No license-header CI check requires it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README/CONNECTIONS docs tell users they can set
`BOXLITE_DATA_DIR=~/.boxlite-local/data`, but `Path("~/...")` does not
expand `~` — it would create a literal "~" directory under the cwd and
silently put PG/MinIO/registry data there. `.load()` now applies
`.expanduser()` to the constructed data_dir path.
Added test_load_expands_tilde_in_data_dir_env. 36 unit tests pass.
(Flagged by Copilot review.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
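The pitfall fixed here is easy to reproduce in isolation. The sketch below is not the PR's `.load()`; `resolve_data_dir` is a hypothetical helper that reads `BOXLITE_DATA_DIR` from a mapping (defaulting to the process environment) and applies `Path.expanduser()`, which is the standard-library call the commit message names.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None, default: str = "~/.boxlite-local/data") -> Path:
    """Resolve the data directory, expanding a leading `~` to the home dir."""
    env = os.environ if env is None else env
    raw = env.get("BOXLITE_DATA_DIR", default)
    # Path("~/x") keeps "~" as a literal path component; expanduser() is
    # what turns it into the user's home directory.
    return Path(raw).expanduser()
```

Without the `expanduser()` call, `Path("~/.boxlite-local/data")` is a relative path whose first component is literally `~`, which is how data could end up in a `~` directory under the current working directory.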
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (6)
apps/runner/pkg/boxlite/registry.go:1
- The variable name `linuxAmd64Platform` is now misleading since it can be `arm64` (or other). Rename to something architecture-neutral (e.g. `linuxHostPlatform`/`defaultLinuxPlatform`) so callers don’t infer it is always amd64.
tsconfig.base.json:1
- This appears to be a committed symlink (Git stores symlinks as a file whose contents is the target path). Symlinks can be problematic on Windows and in some packaging/CI environments; consider replacing this with a real `tsconfig.base.json` that uses `extends` to reference `apps/tsconfig.base.json`, or document/enforce symlink support in contributing/CI.
libs:1
- Similar to `tsconfig.base.json`, this looks like a committed symlink to `apps/libs`. If cross-platform support matters, consider using an actual directory or a tooling-level path mapping instead of a symlink, or ensure CI and team dev environments can reliably handle symlinks.
apps/package.json:1
- Based on versions available up to my knowledge cutoff (Aug 2025), `node-forge@^1.4.0` and `@swc/core@^1.15.33` look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
apps/package.json:1
- Based on versions available up to my knowledge cutoff (Aug 2025), `node-forge@^1.4.0` and `@swc/core@^1.15.33` look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
apps/package.json:1
- Using a caret range in `resolutions` can undermine determinism and make installs vary across time. Prefer pinning an exact version in `resolutions` (and relying on the lockfile) unless there’s a specific reason to allow upgrades here.
    ts: TopologicalSorter[str] = TopologicalSorter()
    for name, spec in services.items():
        ts.add(name, *spec.depends_on)
    ts.prepare()
    layers: list[list[str]] = []
    while ts.is_active():
        layer = sorted(ts.get_ready())
        if not layer:
            break
        layers.append(layer)
        for name in layer:
            ts.done(name)
    return layers
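For anyone reading the hunk above without the surrounding file, the following self-contained sketch wraps the same layered-sort loop in a runnable function using the standard-library `graphlib`. The `ServiceSpec` dataclass and the function name `topo_layers` are stand-ins invented for this example; only the `depends_on` attribute is taken from the hunk.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ServiceSpec:
    # Stand-in for the real spec type; only depends_on matters here.
    depends_on: list = field(default_factory=list)


def topo_layers(services: dict) -> list:
    """Group service names into start-up layers, dependencies first."""
    ts = TopologicalSorter()
    for name, spec in services.items():
        ts.add(name, *spec.depends_on)
    ts.prepare()  # raises graphlib.CycleError on a dependency cycle
    layers = []
    while ts.is_active():
        # Everything currently ready has all of its dependencies done.
        layer = sorted(ts.get_ready())
        if not layer:
            break
        layers.append(layer)
        for name in layer:
            ts.done(name)
    return layers
```

Each layer contains services that can start in parallel; with an `otel` service that depends on `jaeger`, `jaeger` lands in an earlier layer than `otel`.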
  "main": "apps/api/src/main.ts",
  "tsConfig": "apps/api/tsconfig.app.json",
- "generatePackageJson": true,
+ "generatePackageJson": false,
# Conflicts:
#	.gitignore
#	apps/go.work
…on after main merge
Merging origin/main silently reverted an infra-local host-mode
adaptation. main refactored its tsconfig layout (api extends
`../tsconfig.base.json` = apps/tsconfig.base.json); feat/cloud-mvp had
deliberately pointed the api at the repo-root base config
(`../../tsconfig.base.json`, a symlink to the same file). Relative to
the merge-base only main touched this line, so git took main's value
with no conflict — a silent semantic regression.
Why it matters: under host-mode `nx serve api` (cwd=apps/, with the
`apps/apps -> .` symlink), the path used to extend the base config
decides how tsc resolves `rootDir: "."`:
- ../../tsconfig.base.json → rootDir = repo root → apps/libs/* in range
- ../tsconfig.base.json → rootDir = apps/ → apps/libs/runner-api-client
out of range → TS6059 ×529
The api imports @boxlite-ai/runner-api-client (lives in apps/libs/), so
the collector/runner client sources must be inside rootDir. With main's
value the API failed to compile → `make stack-up` died at "api failed to
become healthy".
Restoring `../../tsconfig.base.json` re-applies the adaptation. Verified:
`make stack-up` now boots the API clean (webpack compiled successfully)
through to an active default snapshot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
tsconfig.base.json:1
- `tsconfig.base.json` is not valid JSON as committed (it contains only a path). If this is intended to be a symlink, ensure it’s committed as a symlink in git (mode 120000); otherwise TypeScript tooling will fail when reading the root tsconfig. If symlinks are not desired/portable, replace this with an actual JSON tsconfig that extends `./apps/tsconfig.base.json`.
libs:1
- This file content looks like a symlink target (and as a regular file it’s not a valid directory). If the intent is a repository-root `libs` symlink to `apps/libs`, make sure it’s committed as a symlink; otherwise consumers expecting a `libs/` directory at repo root will break.
apps/runner/pkg/boxlite/registry.go:1
- The variable name `linuxAmd64Platform` is now misleading because it no longer hardcodes `amd64` (it uses `runtime.GOARCH`). Rename it to reflect the new semantics (e.g., `linuxHostPlatform`, `linuxDefaultPlatform`, or `linuxPlatformForHostArch`) to avoid confusing future readers and callers.
apps/package.json:1
- As of my knowledge cutoff (Aug 2025), `node-forge`’s latest published version is `1.3.1` and `^1.4.0` is likely not resolvable from npm. Please verify the intended version exists; otherwise installs will fail in CI. If the goal is simply to add `node-forge`, consider pinning to `^1.3.1` (or whatever the current published latest is) and updating `@types/node-forge` accordingly.
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} |
| proxyToolboxUrl: 'http://localhost:28080', | ||
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', | ||
| billingApiUrl: BILLING_API_URL, | ||
| }) | ||
| } as Partial<BoxliteConfiguration>) |
| // Bootstrap-aware (mirror evaluateFlag's behavior for object resolution). | ||
| if (!this.isConfigured || !this.client) { | ||
| // Object flags are rare in our usage; not added to bootstrapFlags map | ||
| // (typed as boolean | string | number). Fall through to default. | ||
| logger.debug(`PostHog not configured, returning default value for flag ${flagKey}`) | ||
| return { | ||
| value: defaultValue, |
Summary
Adds `apps/infra-local/`, a fully self-hosted BoxLite cloud-MVP control plane that runs on a single Apple Silicon Mac. One command (`make stack-up`) brings up the entire stack and lets a developer create real microVM sandboxes through the dashboard: no AWS, no Docker daemon for the application boxes; everything runs natively on M5.
Ships as `milestone/infra-local/v0.1.0`. Cold-start → working sandbox + browser terminal (`root@boxlite:~#`) in ~80 s.
Architecture
- `boxlite_local` Python package
- … `:3001`), Go Runner (`:3003`), Go Proxy (`:4000`), Vite Dashboard (`:3000`)
- `make stack-*`
- `~/.boxlite-runner/`
The Runner runs natively on M5 (HVF + libkrun) as a single runner host.
Multi-host autoscaler testing (the old Lima route) is intentionally out
of scope here and parked in a separate worktree.
What works end-to-end (verified)
- `make stack-nuke && make stack-up` boots all services + auto-seeds + waits for the default snapshot: ~80 s.
- … `admin@boxlite.dev` / `password`, plus a normal `test01@boxlite.dev`); API auto-creates user + Personal org + owner row on first login.
- … `POST /api/sandbox`.
- … `ubuntu:22.04` default snapshot; runner pulls the arm64 layer from the local registry.
Verified twice via full cold-start + browser-driven E2E (login → snapshot Active → create sandbox → terminal `root@boxlite:~#`).
Operate-by-make
`make stack-up` is the single, self-healing entry point. It works from a fresh checkout, after a reboot, or after `make stack-down`:
- … `make install` if the orchestrator package isn't importable
- … `/tmp` cleared on reboot)
- `load-schema` is idempotent: it skips cleanly when the PG data volume survived a reboot
Other targets: `stack-status`, `stack-logs COMPONENT=…`, `stack-restart COMPONENTS=…`, `stack-rebuild-l1-box BOX=…`, tiered cleanup `stack-reset` / `stack-reset-hard` / `stack-nuke`. See `docs/apps/infra-local-usage.md`.
Notable implementation notes
- … `stack-up.sh`): the Go runner reports host-wide CPU/RAM/disk to the API. On a dev Mac sharing RAM with IDE/Chrome/Docker, that drags `availabilityScore` below the prod threshold and the API rejects sandbox-create with "No available runners". `stack-up.sh` exports `RUNNER_AVAILABILITY_SCORE_THRESHOLD=5` / `RUNNER_MEMORY_PENALTY_THRESHOLD=95` / `RUNNER_DISK_PENALTY_THRESHOLD=95` before launching the API. Documented in `CONNECTIONS.md`; structural fix (runner reports only its own usage) tracked as follow-up.
- … feature-flag bootstrap (api + dashboard), runner `runtime.GOARCH` image pull (fixes `ENOEXEC` on arm64), `jwt.strategy` user-create fix, Caddy host-regexp → proxy for sandbox port-preview URLs.
Known limitations (intentional, not bugs)
PostHog, Billing (Stripe), Svix webhooks, Snapshot Manager (Dockerfile
builds), SSH Gateway, ClickHouse, OpenSearch, SMTP, dns-shim + TLS are
mocked / deferred; see the milestone doc "Known limitations" table.
The production-parity gap (HVF vs KVM libkrun backend) is acknowledged
and will be revisited when the autoscaler lands.
Docs
apps/infra-local/README.md— quick start + make targets + architectureapps/infra-local/CONNECTIONS.md— endpoint / credential / env-var referencedocs/apps/infra-local-usage.md— day-to-day workflow + tiered cleanup decision treedocs/apps/infra-local-status.md— real vs mock vs missing inventorydocs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md— milestone summaryTest plan
make stack-upfrom fresh checkout (auto install + build + up)make stack-nuke && make stack-upcold start → ~80 s to working stackmake stack-up(load-schema idempotent skip, binaries rebuilt)ubuntu:22.04Activeroot@boxlite:~#🤖 Generated with Claude Code