103 changes: 103 additions & 0 deletions docs/integration/zid-dylib-issues.md
@@ -0,0 +1,103 @@
# Zid × agentOS dylib integration — issue tracker

Living tracker for the Zid team's dylib-preview migration. Covers **their** reported
blockers and **our** (agentOS / secure-exec) bugs surfaced while investigating them.

**Versions under test (their pins):**
- `@rivet-dev/agentos*` / `agentos-core` / `agentos-pi` / `agentos-sidecar`: `0.0.0-integrate-dylib-into-main.815fcda`
- `rivetkit`: `0.0.0-feat-dylib-actor-plugin.c44621f`
- `@agentos-software/common`: `0.3.0-rc.2`; `@secure-exec/core`: `0.3.0`
- Their runtime: Node 22, **linux-x64-gnu prebuilts under OrbStack + Rosetta x86-emulation on macOS** (prod = Railway, native linux-x64).

> Note: the dylib stack lives on the `integrate-dylib-into-main` branch (HEAD = `815fcda`),
> **not** on `main` yet. `815fcda` is what their version is built from.

## How they integrate
They do **not** use the native `agentOs()` actor. Instead they built their own RivetKit JS actor
that holds an `AgentOs` core instance (`c.vars.agentOs`) and drives it directly, as a replacement
for the removed `rivetkit/agent-os` actor. It has ~12 custom actions, a host toolkit (in-VM Pi
extension → HTTP to host), and server-side `onSessionEvent`.

## Reproduction status (key finding)
Their flow was replayed on **native linux-x64** with their pinned versions, escalating to **their actual code**:
1. Faithful replay of their actor sequence (create → seed writes → their three
`createInMemoryFileSystem` JS-driver mounts → `createSession("pi")` with `cwd:"/workspace"`
and `cwd:"/"`) — **all pass.**
2. **Their actual custom adapter, bundled with their actual `build-adapter.mjs`** (Proxy-stubbing
8 packages + `eval(require())` + minify), swapped over the stock `agentos-pi` adapter →
`createSession("pi")` reaches `session/new` and returns a sessionId in ~1.4s. **No chdir ENOENT.**
3. **Their adapter UNBUNDLED** (runtime `import "@agentclientprotocol/sdk"`) → also resolves and
`createSession` succeeds. (The *stock* dylib adapter likewise imports `@agentclientprotocol/sdk`
at runtime and resolves.)

**So Q1–Q4 do NOT reproduce on native, even with their exact adapter + flow + mounts.** Their
blockers are **environment-specific**. Two causes, the first confirmed and the second inferred by elimination:
- **Q1 (resolution)** = **node_modules hoist layout.** Core mounts the agent package's hoisted
`node_modules` tree at `/root/node_modules`; the adapter resolves `@agentclientprotocol/sdk` by
walking up from `/root/node_modules/@rivet-dev/agentos-pi`. In a **flat npm install** the dep is
hoisted top-level → resolves. In their monorepo install it isn't on that chain → "not found."
Bundling (their workaround) is correct; alternatively mount/hoist the dep onto the chain.
- **Q2/Q3/Q4 (empty guest FS / no sh / writes invisible)** = **no native reproduction and no code
explanation** → the remaining uncontrolled variable is **OrbStack + Rosetta x86-emulation** of
the linux-x64 prebuilt (guest VFS/mount/exec syscalls misbehaving under emulation). Could be a
single shared cause (the sidecar's mount/VFS layer failing under emulation, which would explain
all three at once). **Not directly reproduced** (this host is native x64, can't run Rosetta).

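The Q1 hoist explanation can be illustrated with a small sketch. It assumes the guest resolver follows Node's standard `node_modules` walk-up, which is how the behaviour above reads but is not confirmed against the secure-exec source. `nodeModulesChain`, `findOnChain`, and the two sample install sets are hypothetical and exist only for illustration.

```js
// Sketch only: assumes Node-style node_modules lookup inside the guest.
// `nodeModulesChain` and `findOnChain` are illustrative names, not agentOS APIs.

// Directories searched for a bare specifier imported from `startDir`,
// nearest first. An ancestor that is itself named "node_modules" is skipped.
function nodeModulesChain(startDir) {
  const parts = startDir.split("/").filter(Boolean);
  const dirs = [];
  for (let i = parts.length; i >= 0; i--) {
    if (i > 0 && parts[i - 1] === "node_modules") continue;
    dirs.push("/" + [...parts.slice(0, i), "node_modules"].join("/"));
  }
  return dirs;
}

// Return the first chain directory that contains `specifier`, or null.
// `installed` is a Set of absolute package directories visible in the guest.
function findOnChain(chain, specifier, installed) {
  for (const dir of chain) {
    if (installed.has(`${dir}/${specifier}`)) return dir;
  }
  return null;
}

const chain = nodeModulesChain("/root/node_modules/@rivet-dev/agentos-pi");
console.log(chain);

// Flat npm install: the dep is hoisted to the top of the mounted tree.
const flat = new Set(["/root/node_modules/@agentclientprotocol/sdk"]);
console.log(findOnChain(chain, "@agentclientprotocol/sdk", flat)); // /root/node_modules

// Hypothetical monorepo layout: the dep never landed in the mounted tree.
const monorepo = new Set(["/root/node_modules/@rivet-dev/agentos-pi"]);
console.log(findOnChain(chain, "@agentclientprotocol/sdk", monorepo)); // null
```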
**Decisive test for them:** run on **native linux-x64** (their Railway prod is native
linux-x64-glibc). If it works there but fails in OrbStack/Rosetta → emulation confirmed.

Their own repro scripts (`scripts/diag-adapter.mjs`, `scripts/smoke-agentos.mjs`) drive their
rivetkit server + swapped adapter; the VM flow + adapter they wrap is what was reproduced here.

---

## Their reported blockers

| # | Issue | Finding | Status |
|---|---|---|---|
| 1 | **Q0** — native `agentOs()` actor can't host their custom actions / host toolkit / `onSessionEvent`; is wrapping the core class supported? | Yes — core-direct is the documented pattern (all 13 quickstarts). All 4 of their native-actor claims verified TRUE (`actions:{}`, no JS callbacks, callbacks parsed-and-dropped, `toolKits` not serialized). | ✅ Answered |
| 2 | **Q1** — adapter `import @agentclientprotocol/sdk` → `_resolveModule returned non-string` | **Root cause: node_modules hoist layout.** Reproduced both stock & their adapter resolving the dep when hoisted flat; "not found" means it isn't on the `/root/node_modules` chain core mounts. Bundling (their workaround) is correct; else hoist/mount the dep. | ◐ Root-caused; error-message fix in **secure-exec PR #114** (diagnostics only) |
| 3 | **Q2** — `chdir` ENOENT for every path incl `/` | Base rootfs **is** provisioned — proven on their version with **their actual bundled adapter** + full flow; `createSession` reaches `session/new`. Not reproducible on native → **Rosetta x86-emulation** is the leading cause. | ✅ Diagnosed (not an SDK bug) |
| 4 | **Q3** — `command not found: sh` | `sh`/`bash` **do** ship (in `@agentos-software/coreutils`, inside `common`); works in repro. Their "common dropped sh" belief is wrong (only the package *description* omits "sh"). Likely same env (mount layer under emulation). | ✅ Diagnosed (not an SDK bug) |
| 5 | **Q4** — host `writeFile`/`mkdir` not visible to guest | Visible core-direct, **no mount required** (proven w/ their flow). Likely write-after-`createSession` ordering, a shadowing mount, or the same emulation env. | ✅ Diagnosed (not an SDK bug) |
| 6 | **Q5** — inherent to the VM model or core-class-specific? | **Neither** — full core-direct path (incl. `createSession("pi")`) reproduced working. | ✅ Answered |

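Row 2's confusion came from a failed lookup surfacing as a type complaint rather than as "not found". A minimal sketch of the clearer shape, assuming a resolver callback that returns a string on success and anything else on failure. `resolveOrThrow` and the message wording are hypothetical and do not reflect the actual text in secure-exec PR #114; the `ERR_MODULE_NOT_FOUND` code is borrowed from Node's own convention.

```js
// Sketch only: `resolveOrThrow` is a hypothetical wrapper, not secure-exec code.
// `resolve(specifier, fromDir)` stands in for whatever resolver the runtime uses.
function resolveOrThrow(resolve, specifier, fromDir) {
  const resolved = resolve(specifier, fromDir);
  if (typeof resolved === "string") return resolved;

  // Report the real condition (module not found) and where the lookup started,
  // instead of describing the resolver's return type.
  const err = new Error(
    `Cannot find module '${specifier}' imported from '${fromDir}'`
  );
  err.code = "ERR_MODULE_NOT_FOUND";
  throw err;
}

// Usage with a stub resolver that always fails.
try {
  resolveOrThrow(
    () => null,
    "@agentclientprotocol/sdk",
    "/root/node_modules/@rivet-dev/agentos-pi"
  );
} catch (err) {
  console.log(err.code, err.message);
}
```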
---

## Our bugs / gaps (agentOS + secure-exec)

| # | Issue | Impact | Status | Proposed fix |
|---|---|---|---|---|
| 7 | `toolKit→sidecar` runs Zod **v4** `toJSONSchema()`; throws on Zod **v3** schemas | **Why they dropped `toolKits`** and went host-tools-over-HTTP | 🟡 Open | accept v3 (or convert), or document the constraint |
| 8 | Native actor **silently drops** `onSessionEvent`/`onPermissionRequest`/`onBeforeConnect`/`toolKits` | Silent footgun — users think these are wired | 🟡 Open | throw a clear "unsupported across the native boundary" error (or wire through) |
| 9 | core `AgentOs.create()` ignores the `defaultSoftware` option it documents | Latent; auto-include of `common` is actor-only (`actor.ts:192-200`) | 🟡 Open | honor `defaultSoftware` in core, or fix the JSDoc |
| 10 | `withAutoAgentNodeModulesMount` is **actor-only** — no public helper for core-direct users | Core-direct users with a custom adapter get no node_modules-mount helper (relates to #2) | 🟡 Open | export a public `nodeModulesMount`-style helper / do it in core |
| 11 | Native actor has **no `mountFs`** action and rejects JS-driver mounts (static, serializable Native mounts only) | **Blocks the proxy-actor pattern** from hosting their session/skills JS-driver VFS | 🟡 Open | dynamic `mountFs` (incl. JS-driver) on the native actor |
| 12 | Engine `:6420` vs httpPort `:6421` (`/metadata` 404) | DX confusion they hit | ⚪ Open | document, or client auto-detect |
| 13 | Audit their 11 carried patches for dylib obsolescence — esp. the WASI **read-blocked-as-write** permission typo (their patch 1) | Some patches may be obsolete; the WASI one is a real correctness bug if it survived the move into secure-exec | ⚪ Not audited | audit + confirm against `0.3.0` |
| 14 | Misleading `_resolveModule returned non-string` error (really "not found") | Sent them down the bundling path (#2) | ✅ **secure-exec PR #114 (open)** | reworded + main-only regression test |

**Status key:** ✅ done/answered · ◐ partial · 🟡 our bug, identified, not fixed · ⚪ not started

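Row 8's proposed fix (fail loudly instead of silently dropping options) could look roughly like the guard below. This is a sketch under assumptions: `assertNativeCompatible` is a hypothetical helper, not an existing agentOS or RivetKit function, and the option names are the four listed in row 8.

```js
// Sketch only: hypothetical guard, not part of agentOS or RivetKit.
// Options that row 8 reports as dropped at the native boundary.
const NATIVE_UNSUPPORTED_OPTIONS = [
  "onSessionEvent",
  "onPermissionRequest",
  "onBeforeConnect",
  "toolKits",
];

// Throw if the caller passed any option the native actor cannot honor;
// otherwise return the options unchanged.
function assertNativeCompatible(options) {
  const offending = NATIVE_UNSUPPORTED_OPTIONS.filter(
    (name) => options[name] !== undefined
  );
  if (offending.length > 0) {
    throw new TypeError(
      `Unsupported across the native boundary: ${offending.join(", ")}. ` +
        "Drive the AgentOs core class directly to use these options."
    );
  }
  return options;
}

// Usage: the error names every offending option at once.
try {
  assertNativeCompatible({ onSessionEvent: () => {}, toolKits: [] });
} catch (err) {
  console.log(err.message);
}
```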
---

## Reproduction recipe
On native linux-x64 (NOT Rosetta), from public npm:
```sh
npm i @rivet-dev/agentos-core@0.0.0-integrate-dylib-into-main.815fcda \
@rivet-dev/agentos-pi@0.0.0-integrate-dylib-into-main.815fcda \
@agentos-software/common@0.3.0-rc.2
```
```js
import { AgentOs, createInMemoryFileSystem } from "@rivet-dev/agentos-core";
import common from "@agentos-software/common";
import pi from "@rivet-dev/agentos-pi";

const vm = await AgentOs.create({ software: [common, pi] });
// Q2/Q3: the base rootfs is provisioned and `sh` is present.
console.log((await vm.exec("ls -la / && sh -c 'echo SH_OK' && pwd")).stdout);
// Q4: a host write is visible to the guest with no mount.
await vm.writeFile("/home/user/x.txt", "hi");
console.log((await vm.exec("cat /home/user/x.txt")).stdout);

// JS-driver in-memory mounts, as in their flow.
for (const p of ["/home/user/.pi/agent/sessions", "/app/skills"]) {
  vm.mountFs(p, createInMemoryFileSystem());
}
// Session creation succeeds and returns a sessionId.
const session = await vm.createSession("pi", { cwd: "/workspace", env: { HOME: "/home/user" } });
console.log(session.sessionId);

await vm.dispose();
```
All of the above succeed → Q1–Q4 are not SDK bugs.