Skip to content

feat: PVC-per-agent workspace persistence#355

Open
dushyantzz wants to merge 2 commits into
BerriAI:mainfrom
dushyantzz:main
Open

feat: PVC-per-agent workspace persistence#355
dushyantzz wants to merge 2 commits into
BerriAI:mainfrom
dushyantzz:main

Conversation

@dushyantzz
Copy link
Copy Markdown

@dushyantzz dushyantzz commented May 28, 2026

Implemented PVC per agent workspace persistence to preserve sandboxes files and git state across restarts, timeouts, and follow up sessions.

  1. Database & Types: Added a workspace_pvc column to the Agent model, generated/applied a schema migration, and extended RunTaskOpts to pass the PVC name.
  2. Kubernetes (PVC Management): Created ensureWorkspacePvc and deleteWorkspacePvc functions in k8s.ts and extended runTask to mount the agent's PVC to /work/repo in the harness container.
  3. Lifecycle Integration: Integrated PVC provisioning into the coldBringUp and provisionWarmTask paths.
  4. Cleanup: Added automatic PVC deletion to the agent's DELETE API handler.
Screenshot 2026-05-28 140749

Fixes #208

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR introduces per-agent PVC-backed workspace persistence so that sandbox filesystem and git state survive across session restarts. A workspace_pvc column is added to the Agent model, ensureWorkspacePvc/deleteWorkspacePvc helpers manage the PVC lifecycle in Kubernetes, and both the cold bring-up and warm-pool paths are updated to provision and mount the PVC at /work/repo.

  • The schema migration bundles two unrelated ALTER COLUMN … DROP DEFAULT statements against managed_agent_session_message.message_id and project.project_id that are not part of this feature and may break insert paths relying on those database defaults.
  • The DELETE handler removes the agent DB row before deleting the PVC; if the K8s call fails, the PVC is orphaned with no cleanup path.
  • The ReadWriteOnce access mode can cause warm-pool pods to stall indefinitely on multi-node clusters when a newly provisioned warm task and an active session pod are scheduled on different nodes.

Confidence Score: 2/5

Not safe to merge as-is: the migration contains unreviewed destructive changes to unrelated columns, and the DELETE handler can permanently orphan PVC resources.

Three concrete defects on the changed paths: the migration drops defaults on two columns outside this feature scope (a potentially breaking schema change silently bundled in); the agent DELETE handler removes the DB row before the PVC, leaving an unrecoverable orphaned resource if the K8s call fails; and the ReadWriteOnce access mode will cause warm-pool pods to stall indefinitely on any multi-node cluster where the new warm task and the active session pod are scheduled on different nodes.

The migration SQL and the agent DELETE route handler need the most attention before this ships.

Important Files Changed

Filename Overview
prisma/migrations/20260528080608_add_agent_workspace_pvc/migration.sql Adds workspace_pvc column to managed_agent, but also drops defaults on message_id and project_id in unrelated tables — likely Prisma schema drift bundled into this migration, potentially breaking insert paths that rely on those database-level defaults.
src/app/api/v1/managed_agents/agents/[agent_id]/route.ts Adds deleteWorkspacePvc to the DELETE handler, but the agent record is removed from the DB before the PVC is deleted — a K8s API failure orphans the PVC permanently with no recovery path.
src/server/k8s.ts Adds ensureWorkspacePvc and deleteWorkspacePvc with correct idempotency; PVC mounts wired correctly into runTask. ReadWriteOnce access mode can deadlock warm-pool pods on multi-node clusters, and no storageClassName is specified.
src/server/warmPool/index.ts Integrates PVC provisioning into provisionWarmTask correctly with idempotent DB update pattern; concurrent provisioning for the same agent is safe at the Prisma level but exposed to the ReadWriteOnce issue in k8s.ts.
src/app/api/v1/managed_agents/agents/[agent_id]/session/route.ts Integrates PVC provisioning into coldBringUp correctly, skipping the K8s path when LOCAL_SANDBOX_URL is set; same ReadWriteOnce concern as warmPool.
src/server/types.ts Cleanly adds workspace_pvc_name to RunTaskOpts and workspace_pvc to ApiAgent/toApiAgent; no issues.
prisma/schema.prisma Adds nullable workspace_pvc String field to Agent model correctly.
prisma/migrations/migration_lock.toml Trivial wording fix in comment (i.e. → e.g.); no issues.

Reviews (1): Last reviewed commit: "feat: PVC-per-agent workspace persistenc..." | Re-trigger Greptile

Comment on lines 179 to 181
await prisma.agent.delete({ where: { agent_id } });
await deleteWorkspacePvc(agent_id);
return new Response(null, { status: 204 });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Agent deleted before PVC — orphaned PVC on K8s error

prisma.agent.delete commits first; if deleteWorkspacePvc then throws a non-404 K8s error (e.g. API unavailable, PVC is still terminating), the agent row is gone from the database but the PVC remains in the cluster with no record pointing to it and no future cleanup path. Reversing the order — delete the PVC first, then the agent — means a K8s failure leaves the agent row intact so an operator can retry or investigate, rather than creating an irrecoverable resource leak.

Comment on lines +4 to +8
-- AlterTable
ALTER TABLE "managed_agent_session_message" ALTER COLUMN "message_id" DROP DEFAULT;

-- AlterTable
ALTER TABLE "project" ALTER COLUMN "project_id" DROP DEFAULT;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unrelated DROP DEFAULT statements could break existing insert paths

Lines 4–8 drop the database-level defaults on managed_agent_session_message.message_id and project.project_id. These columns are not touched by the rest of this PR. If either column had a gen_random_uuid() database default and any code inserts without supplying an explicit value, those inserts will start failing after this migration is applied. This looks like Prisma bundled uncommitted schema drift into this migration — it should be verified intentional and, if not, split out or reverted before deployment.

Comment thread src/server/k8s.ts
Comment on lines +1725 to +1750
metadata: {
name,
namespace: ns,
labels: {
[LABEL_AGENT_ID]: agent_id,
},
},
spec: {
accessModes: ["ReadWriteOnce"],
resources: {
requests: {
storage: "10Gi",
},
},
},
};

try {
await coreApi().createNamespacedPersistentVolumeClaim({
namespace: ns,
body: pvc,
});
} catch (err) {
if (!isAlreadyExists(err)) throw err;
}
return name;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ReadWriteOnce PVC may block warm-pool pods on multi-node clusters

A PVC with accessModes: ["ReadWriteOnce"] can only be mounted by pods on a single node at a time. When a warm task is claimed by a session, provisionWarmTask immediately runs to top up the pool (per PER_AGENT_TARGET = 1). The new warm-task pod and the active session pod both attempt to mount the same PVC concurrently. On a multi-node cluster, if the scheduler places them on different nodes, the second pod is stuck in ContainerCreating indefinitely with a volume-attachment error and no automatic recovery. Using ReadWriteMany (if the storage class supports it) or serialising PVC access so only one pod mounts it at a time would avoid this.

Comment thread src/server/k8s.ts
Comment on lines +1732 to +1739
spec: {
accessModes: ["ReadWriteOnce"],
resources: {
requests: {
storage: "10Gi",
},
},
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No storageClassName is set, so the PVC will use whatever the cluster's default storage class is. Different environments (dev kind cluster, staging, prod) may have different defaults with very different performance, cost, and ReadWriteMany support characteristics. Explicitly specifying the storage class — or sourcing it from an env var — makes the behaviour deterministic across environments.

Suggested change
spec: {
accessModes: ["ReadWriteOnce"],
resources: {
requests: {
storage: "10Gi",
},
},
},
spec: {
accessModes: ["ReadWriteOnce"],
...(env.K8S_WORKSPACE_STORAGE_CLASS ? { storageClassName: env.K8S_WORKSPACE_STORAGE_CLASS } : {}),
resources: {
requests: {
storage: "10Gi",
},
},
},

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: PVC-per-agent workspace persistence (snapshot-on-pause, instant follow-up)

1 participant