Skip to content

feat(controller): expose branch_concurrency via CLI + env var (closes #177)#179

Merged
WaylandYang merged 2 commits into
mainfrom
fix/177-branch-concurrency-override
May 28, 2026
Merged

feat(controller): expose branch_concurrency via CLI + env var (closes #177)#179
WaylandYang merged 2 commits into
mainfrom
fix/177-branch-concurrency-override

Conversation

@WaylandYang
Copy link
Copy Markdown
Contributor

Closes #177. Thanks to @cortsdine for the precise diagnosis and the suggested API shape.

Root cause

DEFAULT_BRANCH_CONCURRENCY = 4 at crates/forkd-controller/src/http.rs:72 is set once at compile time and used to construct the branch_sem semaphore in AppState. The comment ("Each BRANCH writes a full memory.bin, typically 256 MiB - 8 GiB, so the cap bounds peak transient disk usage") spells out exactly why the right value depends on the operator's environment — a constant that only changes via cargo build defeats the purpose.

The fix

  • New optional field on DaemonConfig: branch_concurrency: Option<usize>. None falls back to DEFAULT_BRANCH_CONCURRENCY (4), so behaviour for existing callers is unchanged.
  • New --branch-concurrency <N> CLI flag on forkd-controller serve, also readable from FORKD_BRANCH_CONCURRENCY (matches the existing FORKD_* env-var family pattern used for FORKD_TOKEN_FILE, FORKD_TLS_CERT, etc.).
  • Validation: a branch_concurrency = 0 is rejected at daemon startup rather than constructing a zero-permit Semaphore (which would deadlock the first BRANCH request).
  • Non-default values are logged at INFO during startup so operators see the override took effect.

Local verification

$ cargo fmt --all -- --check        # clean
$ cargo check -p forkd-controller   # ok

The flag follows the same clap pattern as the existing --token-file / --tls-cert / --prewarm-scratch-dir knobs; nothing new about how it's plumbed.

What's deliberately not in this PR

Your bonus suggestion of surfacing forkd_branches_in_flight on /metrics is a nice operational add but a separate concern (touches Prometheus registration, gauge state, possibly a registry trait extension). Filing as a follow-up rather than expanding this PR's review surface — happy to PR that next if you'd like it included.

DEFAULT_BRANCH_CONCURRENCY = 4 was a recompile-only knob. The right
value depends entirely on the operator's disk budget and snapshot-size
distribution: on a 2 TB NVMe with 512 MiB snapshots, 4 is wildly
conservative; on a 50 GiB EBS with 8 GiB snapshots, 4 is enough to
fill the disk.

This change plumbs the cap through DaemonConfig:

- New `branch_concurrency: Option<usize>` field on DaemonConfig. `None`
  falls back to the existing DEFAULT_BRANCH_CONCURRENCY (4), so no
  behaviour change for existing users.
- New `--branch-concurrency <N>` CLI flag on `forkd-controller serve`,
  also readable from the `FORKD_BRANCH_CONCURRENCY` env var (clap
  pattern matching the existing FORKD_* env-var family).
- Validation: `branch_concurrency = 0` bails at daemon startup rather
  than constructing a zero-permit Semaphore (would deadlock the first
  BRANCH request). Non-default values are logged at INFO at startup
  so operators see the override took effect.

Reported by @cortsdine in #177. Closes #177.

The bonus metrics suggestion (`forkd_branches_in_flight` gauge on
/metrics) is left for a follow-up PR to keep this diff focused on the
config plumbing.
@WaylandYang WaylandYang merged commit ebf6e2d into main May 28, 2026
2 checks passed
@WaylandYang WaylandYang deleted the fix/177-branch-concurrency-override branch May 28, 2026 01:38
WaylandYang added a commit that referenced this pull request May 28, 2026
…ency_cap (#183)

Follow-up to #179 — surfaces the in-flight BRANCH count and the
configured cap on /metrics, so operators can size
FORKD_BRANCH_CONCURRENCY empirically against their real workload
(per @cortsdine's suggestion in #177).

New gauges:
- forkd_branches_in_flight   (Mutex<HashSet<String>>.len() of BRANCHes
                              currently writing memory.bin)
- forkd_branch_concurrency_cap  (the value the Semaphore was
                                 constructed with; not exposed by
                                 tokio::Semaphore, so cached on
                                 AppState in a new branch_concurrency_cap
                                 field that mirrors the value from
                                 DaemonConfig::branch_concurrency)

The pair lets a Grafana panel show "5 / 16 in-flight" without external
knowledge of the cap.

Tests:
- Existing metrics_emits_prometheus_text extended to assert both new
  gauges appear with the default cap value.
- New metrics_branches_in_flight_tracks_slot_acquisitions verifies the
  in-flight gauge moves with BranchSlot acquire/drop — the actual
  Drop-recovery semantics @henliveira praised in #178, now visible to
  operators.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DEFAULT_BRANCH_CONCURRENCY = 4 has no env / CLI / config override

1 participant