feat: PVC-per-agent workspace persistence by dushyantzz · Pull Request #355 · BerriAI/litellm-agent-platform

dushyantzz · 2026-05-28T08:38:46Z

Implemented PVC per agent workspace persistence to preserve sandboxes files and git state across restarts, timeouts, and follow up sessions.

Database & Types: Added a workspace_pvc column to the Agent model, generated/applied a schema migration, and extended RunTaskOpts to pass the PVC name.
Kubernetes (PVC Management): Created ensureWorkspacePvc and deleteWorkspacePvc functions in k8s.ts and extended runTask to mount the agent's PVC to /work/repo in the harness container.
Lifecycle Integration: Integrated PVC provisioning into the coldBringUp and provisionWarmTask paths.
Cleanup: Added automatic PVC deletion to the agent's DELETE API handler.

Fixes #208

greptile-apps · 2026-05-28T08:42:49Z

Greptile Summary

This PR introduces per-agent PVC-backed workspace persistence so that sandbox filesystem and git state survive across session restarts. A workspace_pvc column is added to the Agent model, ensureWorkspacePvc/deleteWorkspacePvc helpers manage the PVC lifecycle in Kubernetes, and both the cold bring-up and warm-pool paths are updated to provision and mount the PVC at /work/repo.

The schema migration bundles two unrelated ALTER COLUMN … DROP DEFAULT statements against managed_agent_session_message.message_id and project.project_id that are not part of this feature and may break insert paths relying on those database defaults.
The DELETE handler removes the agent DB row before deleting the PVC; if the K8s call fails, the PVC is orphaned with no cleanup path.
The ReadWriteOnce access mode can cause warm-pool pods to stall indefinitely on multi-node clusters when a newly provisioned warm task and an active session pod are scheduled on different nodes.

Confidence Score: 2/5

Not safe to merge as-is: the migration contains unreviewed destructive changes to unrelated columns, and the DELETE handler can permanently orphan PVC resources.

Three concrete defects on the changed paths: the migration drops defaults on two columns outside this feature scope (a potentially breaking schema change silently bundled in); the agent DELETE handler removes the DB row before the PVC, leaving an unrecoverable orphaned resource if the K8s call fails; and the ReadWriteOnce access mode will cause warm-pool pods to stall indefinitely on any multi-node cluster where the new warm task and the active session pod are scheduled on different nodes.

The migration SQL and the agent DELETE route handler need the most attention before this ships.

Important Files Changed

Filename	Overview
prisma/migrations/20260528080608_add_agent_workspace_pvc/migration.sql	Adds workspace_pvc column to managed_agent, but also drops defaults on message_id and project_id in unrelated tables — likely Prisma schema drift bundled into this migration, potentially breaking insert paths that rely on those database-level defaults.
src/app/api/v1/managed_agents/agents/[agent_id]/route.ts	Adds deleteWorkspacePvc to the DELETE handler, but the agent record is removed from the DB before the PVC is deleted — a K8s API failure orphans the PVC permanently with no recovery path.
src/server/k8s.ts	Adds ensureWorkspacePvc and deleteWorkspacePvc with correct idempotency; PVC mounts wired correctly into runTask. ReadWriteOnce access mode can deadlock warm-pool pods on multi-node clusters, and no storageClassName is specified.
src/server/warmPool/index.ts	Integrates PVC provisioning into provisionWarmTask correctly with idempotent DB update pattern; concurrent provisioning for the same agent is safe at the Prisma level but exposed to the ReadWriteOnce issue in k8s.ts.
src/app/api/v1/managed_agents/agents/[agent_id]/session/route.ts	Integrates PVC provisioning into coldBringUp correctly, skipping the K8s path when LOCAL_SANDBOX_URL is set; same ReadWriteOnce concern as warmPool.
src/server/types.ts	Cleanly adds workspace_pvc_name to RunTaskOpts and workspace_pvc to ApiAgent/toApiAgent; no issues.
prisma/schema.prisma	Adds nullable workspace_pvc String field to Agent model correctly.
prisma/migrations/migration_lock.toml	Trivial wording fix in comment (i.e. → e.g.); no issues.

_{Reviews (1): Last reviewed commit: "feat: PVC-per-agent workspace persistenc..." | Re-trigger Greptile}

greptile-apps · 2026-05-28T08:42:54Z

  await prisma.agent.delete({ where: { agent_id } });
+  await deleteWorkspacePvc(agent_id);
  return new Response(null, { status: 204 });


Agent deleted before PVC — orphaned PVC on K8s error

prisma.agent.delete commits first; if deleteWorkspacePvc then throws a non-404 K8s error (e.g. API unavailable, PVC is still terminating), the agent row is gone from the database but the PVC remains in the cluster with no record pointing to it and no future cleanup path. Reversing the order — delete the PVC first, then the agent — means a K8s failure leaves the agent row intact so an operator can retry or investigate, rather than creating an irrecoverable resource leak.

greptile-apps · 2026-05-28T08:42:56Z

+-- AlterTable
+ALTER TABLE "managed_agent_session_message" ALTER COLUMN "message_id" DROP DEFAULT;
+
+-- AlterTable
+ALTER TABLE "project" ALTER COLUMN "project_id" DROP DEFAULT;


Unrelated DROP DEFAULT statements could break existing insert paths

Lines 4–8 drop the database-level defaults on managed_agent_session_message.message_id and project.project_id. These columns are not touched by the rest of this PR. If either column had a gen_random_uuid() database default and any code inserts without supplying an explicit value, those inserts will start failing after this migration is applied. This looks like Prisma bundled uncommitted schema drift into this migration — it should be verified intentional and, if not, split out or reverted before deployment.

greptile-apps · 2026-05-28T08:42:58Z

+    metadata: {
+      name,
+      namespace: ns,
+      labels: {
+        [LABEL_AGENT_ID]: agent_id,
+      },
+    },
+    spec: {
+      accessModes: ["ReadWriteOnce"],
+      resources: {
+        requests: {
+          storage: "10Gi",
+        },
+      },
+    },
+  };
+
+  try {
+    await coreApi().createNamespacedPersistentVolumeClaim({
+      namespace: ns,
+      body: pvc,
+    });
+  } catch (err) {
+    if (!isAlreadyExists(err)) throw err;
+  }
+  return name;


ReadWriteOnce PVC may block warm-pool pods on multi-node clusters

A PVC with accessModes: ["ReadWriteOnce"] can only be mounted by pods on a single node at a time. When a warm task is claimed by a session, provisionWarmTask immediately runs to top up the pool (per PER_AGENT_TARGET = 1). The new warm-task pod and the active session pod both attempt to mount the same PVC concurrently. On a multi-node cluster, if the scheduler places them on different nodes, the second pod is stuck in ContainerCreating indefinitely with a volume-attachment error and no automatic recovery. Using ReadWriteMany (if the storage class supports it) or serialising PVC access so only one pod mounts it at a time would avoid this.

greptile-apps · 2026-05-28T08:43:00Z

+    spec: {
+      accessModes: ["ReadWriteOnce"],
+      resources: {
+        requests: {
+          storage: "10Gi",
+        },
+      },
+    },


No storageClassName is set, so the PVC will use whatever the cluster's default storage class is. Different environments (dev kind cluster, staging, prod) may have different defaults with very different performance, cost, and ReadWriteMany support characteristics. Explicitly specifying the storage class — or sourcing it from an env var — makes the behaviour deterministic across environments.

Suggested change

spec: {

accessModes: ["ReadWriteOnce"],

resources: {

requests: {

storage: "10Gi",

},

},

},

spec: {

accessModes: ["ReadWriteOnce"],

...(env.K8S_WORKSPACE_STORAGE_CLASS ? { storageClassName: env.K8S_WORKSPACE_STORAGE_CLASS } : {}),

resources: {

requests: {

storage: "10Gi",

},

},

},

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

…t persistence

feat: PVC-per-agent workspace persistence

95baa1b

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

fix(persistence): resolve greptile PR review defects for PVC per-agen…

e83c1a7

…t persistence

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: PVC-per-agent workspace persistence#355

feat: PVC-per-agent workspace persistence#355
dushyantzz wants to merge 2 commits into
BerriAI:mainfrom
dushyantzz:main

dushyantzz commented May 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 28, 2026

Important Files Changed

Uh oh!

greptile-apps Bot May 28, 2026

Uh oh!

greptile-apps Bot May 28, 2026

Uh oh!

greptile-apps Bot May 28, 2026

Uh oh!

greptile-apps Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dushyantzz commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented May 28, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Uh oh!

greptile-apps Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dushyantzz commented May 28, 2026 •

edited

Loading