
[Bug]: Checkpoint capture on large monorepos retries a guaranteed 30s git-add timeout every turn — permanent CPU burn + tmp_pack disk litter #3646

Description

@mikeohuo

Before submitting

  • I searched existing issues and did not find a duplicate.
  • I included enough detail to reproduce or investigate the problem.

Area

apps/server

Steps to reproduce

  1. Open a workspace inside a large monorepo (in my case a Unity project — large pack file, very high file count; workspace cwd is a subdirectory of the repo root). The larger the repo, the worse this gets.
  2. Run agent threads normally. Every completed turn triggers CheckpointReactorGitVcsDriver.checkpoints.captureCheckpoint, which runs git read-tree HEAD + git add -A -- . against a temp GIT_INDEX_FILE.
  3. Watch Task Manager / .git/objects/pack/.
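For timing the capture step outside the app, the two git invocations from step 2 can be written down as data. This is a minimal sketch, not the server's actual code: `checkpointCaptureCommands` and `GitInvocation` are hypothetical names, and running `read-tree` with the same `-C` directory as `add` is an assumption (only the `add` command line appears in the observed process list).

```typescript
// Hypothetical description of one git invocation: argv plus extra environment.
interface GitInvocation {
  args: string[];
  env: Record<string, string>;
}

// Builds the two commands named in the report, both pointed at a temporary
// index file so the repository's real index is left untouched.
function checkpointCaptureCommands(
  workspaceDir: string,
  tempIndexPath: string,
): GitInvocation[] {
  const env = { GIT_INDEX_FILE: tempIndexPath };
  return [
    // Seed the temporary index from HEAD.
    { args: ["-C", workspaceDir, "read-tree", "HEAD"], env },
    // Stage everything under the workspace directory; this is the step that
    // exceeds the 30s timeout on a large repo.
    { args: ["-C", workspaceDir, "add", "-A", "--", "."], env },
  ];
}
```

Each entry could be handed to Node's `child_process.spawnSync("git", args, { env: { ...process.env, ...env } })` and timed with a stopwatch to confirm how far past 30 seconds the `add` step runs on a given repo.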

Expected behavior

Checkpoint capture should either succeed, or fail once and back off / disable itself for the workspace. A capture that cannot ever succeed should not be retried indefinitely, and killed git processes should not leave permanent garbage in .git.

Actual behavior

VcsProcess has DEFAULT_TIMEOUT_MS = 30_000 and captureCheckpoint passes no timeoutMs override. On a large monorepo a full-tree git add -A takes well over 30 seconds, so the process is killed at the timeout every single time. The checkpoint is recorded as missing (every thread.turn-diff-completed event in my state.sqlite has status="missing", across every thread; no refs/t3/checkpoints/* ref has ever been created), and the next turn simply retries. Net effect:

  • One CPU core pegged near-continuously by git for as long as any thread is in use — capture can never succeed on this repo, so the cost buys nothing.
  • Each kill that lands mid-stream of a blob larger than core.bigFileThreshold (files that size are common in Unity repos) leaves an orphaned tmp_pack_* in .git/objects/pack/. These accumulate over time — git count-objects -v flags them as garbage, along with a large number of unreferenced loose objects from the partial adds.
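A cleanup pass for the orphaned files could start from a pure selection step like the one below. This is a sketch under assumptions: `selectStaleTmpPacks`, `PackDirEntry`, and the age threshold are hypothetical, and the caller is assumed to have listed `.git/objects/pack/` itself. The age check is there so that a temporary pack still being written by a live git process is not selected.

```typescript
// Hypothetical shape for one directory entry in .git/objects/pack/.
interface PackDirEntry {
  name: string;
  mtimeMs: number; // last-modified time, in milliseconds since the epoch
}

const TMP_PACK_PREFIX = "tmp_pack_";

// Returns the names of tmp_pack_* files that have not been modified for at
// least minAgeMs. Regular pack and index files are never selected.
function selectStaleTmpPacks(
  entries: PackDirEntry[],
  nowMs: number,
  minAgeMs: number,
): string[] {
  return entries
    .filter((entry) => entry.name.startsWith(TMP_PACK_PREFIX))
    .filter((entry) => nowMs - entry.mtimeMs >= minAgeMs)
    .map((entry) => entry.name);
}
```

The entries could be gathered with Node's `fs.promises.readdir` and `fs.promises.stat` (using `stats.mtimeMs`). Deleting the selected files is deliberately left out of the sketch; it should only happen when no git process is running against the repository.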

Related issues — this seems to be the connecting root cause:

Suggested direction (any one of these would resolve the pathology):

  • a per-project or global setting to disable checkpointing;
  • skip-with-backoff after N consecutive capture failures instead of retrying forever;
  • an adaptive/longer timeout;
  • cleanup of tmp_pack_* left by killed VCS processes.
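The skip-with-backoff option could look roughly like this. It is a minimal sketch, not a proposal for the exact implementation: `CaptureBackoff`, `BackoffOptions`, and all thresholds are hypothetical, and the caller is assumed to keep one instance per workspace and to pass in the current time.

```typescript
// Hypothetical tuning knobs; none of these exist in the app today.
interface BackoffOptions {
  maxConsecutiveFailures: number; // N failures tolerated before backing off
  baseDelayMs: number; // first pause once N is reached
  maxDelayMs: number; // upper bound on the pause
}

class CaptureBackoff {
  private readonly options: BackoffOptions;
  private consecutiveFailures = 0;
  private nextAttemptAtMs = 0;

  constructor(options: BackoffOptions) {
    this.options = options;
  }

  // True when a capture may run now; false while a pause is in effect.
  shouldAttempt(nowMs: number): boolean {
    return nowMs >= this.nextAttemptAtMs;
  }

  // A successful capture clears the failure streak and any pending pause.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.nextAttemptAtMs = 0;
  }

  // Below N failures, keep retrying every turn. From the Nth failure on,
  // pause for baseDelayMs and double the pause per further failure, capped.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    const over = this.consecutiveFailures - this.options.maxConsecutiveFailures;
    if (over < 0) return;
    const delayMs = Math.min(
      this.options.baseDelayMs * Math.pow(2, over),
      this.options.maxDelayMs,
    );
    this.nextAttemptAtMs = nowMs + delayMs;
  }
}
```

In this shape, a turn would check `shouldAttempt(Date.now())` before spawning git and record the turn as skipped when it returns false, so a repo where capture can never succeed stops costing 30 seconds of CPU on every turn.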

Impact

Major degradation or frequent failure

Version or commit

0.0.29-nightly.20260701.697 (desktop, Windows installer)

Environment

Windows 11 Pro (10.0.26200), T3 Code Nightly desktop, git 2.54.0.windows.1, provider claudeAgent

Logs or stack traces

Observed process loop (spawned by the t3code server process, back-to-back):
  git.EXE -C <repo>/<workspace-subdir> add -A -- .    (killed at ~30s)
  git.EXE -C <repo>/<workspace-subdir> add -A -- .    (respawned on next turn, killed at ~30s)
  ... repeats indefinitely; ~100% of one core per run

state.sqlite orchestration_events: every thread.turn-diff-completed has status="missing"

git count-objects -v:
  warning: garbage found: .git/objects/pack/tmp_pack_* (many, accumulating)
  (plus a large count of unreferenced loose objects)

Workaround

None found: there is no setting that disables checkpointing. Locally I'm considering patching captureCheckpoint in the bundled bin.mjs to early-exit for this repo, but each nightly update would overwrite the patch.
