feat: PVC-per-agent workspace persistence#355
Conversation
Greptile SummaryThis PR introduces per-agent PVC-backed workspace persistence so that sandbox filesystem and git state survive across session restarts. A
Confidence Score: 2/5Not safe to merge as-is: the migration contains unreviewed destructive changes to unrelated columns, and the DELETE handler can permanently orphan PVC resources. Three concrete defects on the changed paths: the migration drops defaults on two columns outside this feature scope (a potentially breaking schema change silently bundled in); the agent DELETE handler removes the DB row before the PVC, leaving an unrecoverable orphaned resource if the K8s call fails; and the ReadWriteOnce access mode will cause warm-pool pods to stall indefinitely on any multi-node cluster where the new warm task and the active session pod are scheduled on different nodes. The migration SQL and the agent DELETE route handler need the most attention before this ships.
|
| Filename | Overview |
|---|---|
| prisma/migrations/20260528080608_add_agent_workspace_pvc/migration.sql | Adds workspace_pvc column to managed_agent, but also drops defaults on message_id and project_id in unrelated tables — likely Prisma schema drift bundled into this migration, potentially breaking insert paths that rely on those database-level defaults. |
| src/app/api/v1/managed_agents/agents/[agent_id]/route.ts | Adds deleteWorkspacePvc to the DELETE handler, but the agent record is removed from the DB before the PVC is deleted — a K8s API failure orphans the PVC permanently with no recovery path. |
| src/server/k8s.ts | Adds ensureWorkspacePvc and deleteWorkspacePvc with correct idempotency; PVC mounts wired correctly into runTask. ReadWriteOnce access mode can deadlock warm-pool pods on multi-node clusters, and no storageClassName is specified. |
| src/server/warmPool/index.ts | Integrates PVC provisioning into provisionWarmTask correctly with idempotent DB update pattern; concurrent provisioning for the same agent is safe at the Prisma level but exposed to the ReadWriteOnce issue in k8s.ts. |
| src/app/api/v1/managed_agents/agents/[agent_id]/session/route.ts | Integrates PVC provisioning into coldBringUp correctly, skipping the K8s path when LOCAL_SANDBOX_URL is set; same ReadWriteOnce concern as warmPool. |
| src/server/types.ts | Cleanly adds workspace_pvc_name to RunTaskOpts and workspace_pvc to ApiAgent/toApiAgent; no issues. |
| prisma/schema.prisma | Adds nullable workspace_pvc String field to Agent model correctly. |
| prisma/migrations/migration_lock.toml | Trivial wording fix in comment (i.e. → e.g.); no issues. |
Reviews (1): Last reviewed commit: "feat: PVC-per-agent workspace persistenc..." | Re-trigger Greptile
| await prisma.agent.delete({ where: { agent_id } }); | ||
| await deleteWorkspacePvc(agent_id); | ||
| return new Response(null, { status: 204 }); |
There was a problem hiding this comment.
Agent deleted before PVC — orphaned PVC on K8s error
prisma.agent.delete commits first; if deleteWorkspacePvc then throws a non-404 K8s error (e.g. API unavailable, PVC is still terminating), the agent row is gone from the database but the PVC remains in the cluster with no record pointing to it and no future cleanup path. Reversing the order — delete the PVC first, then the agent — means a K8s failure leaves the agent row intact so an operator can retry or investigate, rather than creating an irrecoverable resource leak.
| -- AlterTable | ||
| ALTER TABLE "managed_agent_session_message" ALTER COLUMN "message_id" DROP DEFAULT; | ||
|
|
||
| -- AlterTable | ||
| ALTER TABLE "project" ALTER COLUMN "project_id" DROP DEFAULT; |
There was a problem hiding this comment.
Unrelated
DROP DEFAULT statements could break existing insert paths
Lines 4–8 drop the database-level defaults on managed_agent_session_message.message_id and project.project_id. These columns are not touched by the rest of this PR. If either column had a gen_random_uuid() database default and any code inserts without supplying an explicit value, those inserts will start failing after this migration is applied. This looks like Prisma bundled uncommitted schema drift into this migration — it should be verified intentional and, if not, split out or reverted before deployment.
| metadata: { | ||
| name, | ||
| namespace: ns, | ||
| labels: { | ||
| [LABEL_AGENT_ID]: agent_id, | ||
| }, | ||
| }, | ||
| spec: { | ||
| accessModes: ["ReadWriteOnce"], | ||
| resources: { | ||
| requests: { | ||
| storage: "10Gi", | ||
| }, | ||
| }, | ||
| }, | ||
| }; | ||
|
|
||
| try { | ||
| await coreApi().createNamespacedPersistentVolumeClaim({ | ||
| namespace: ns, | ||
| body: pvc, | ||
| }); | ||
| } catch (err) { | ||
| if (!isAlreadyExists(err)) throw err; | ||
| } | ||
| return name; |
There was a problem hiding this comment.
ReadWriteOnce PVC may block warm-pool pods on multi-node clusters
A PVC with accessModes: ["ReadWriteOnce"] can only be mounted by pods on a single node at a time. When a warm task is claimed by a session, provisionWarmTask immediately runs to top up the pool (per PER_AGENT_TARGET = 1). The new warm-task pod and the active session pod both attempt to mount the same PVC concurrently. On a multi-node cluster, if the scheduler places them on different nodes, the second pod is stuck in ContainerCreating indefinitely with a volume-attachment error and no automatic recovery. Using ReadWriteMany (if the storage class supports it) or serialising PVC access so only one pod mounts it at a time would avoid this.
| spec: { | ||
| accessModes: ["ReadWriteOnce"], | ||
| resources: { | ||
| requests: { | ||
| storage: "10Gi", | ||
| }, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
No
storageClassName is set, so the PVC will use whatever the cluster's default storage class is. Different environments (dev kind cluster, staging, prod) may have different defaults with very different performance, cost, and ReadWriteMany support characteristics. Explicitly specifying the storage class — or sourcing it from an env var — makes the behaviour deterministic across environments.
| spec: { | |
| accessModes: ["ReadWriteOnce"], | |
| resources: { | |
| requests: { | |
| storage: "10Gi", | |
| }, | |
| }, | |
| }, | |
| spec: { | |
| accessModes: ["ReadWriteOnce"], | |
| ...(env.K8S_WORKSPACE_STORAGE_CLASS ? { storageClassName: env.K8S_WORKSPACE_STORAGE_CLASS } : {}), | |
| resources: { | |
| requests: { | |
| storage: "10Gi", | |
| }, | |
| }, | |
| }, |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Implemented PVC per agent workspace persistence to preserve sandboxes files and git state across restarts, timeouts, and follow up sessions.
Fixes #208