feat(controller): expose branch_concurrency via CLI + env var (closes #177) #179
Merged
DEFAULT_BRANCH_CONCURRENCY = 4 was a recompile-only knob. The right value depends entirely on the operator's disk budget and snapshot-size distribution: on a 2 TB NVMe with 512 MiB snapshots, 4 is wildly conservative; on a 50 GiB EBS with 8 GiB snapshots, 4 is enough to fill the disk.

This change plumbs the cap through DaemonConfig:

- New `branch_concurrency: Option<usize>` field on DaemonConfig. `None` falls back to the existing DEFAULT_BRANCH_CONCURRENCY (4), so no behaviour change for existing users.
- New `--branch-concurrency <N>` CLI flag on `forkd-controller serve`, also readable from the `FORKD_BRANCH_CONCURRENCY` env var (clap pattern matching the existing FORKD_* env-var family).
- Validation: `branch_concurrency = 0` bails at daemon startup rather than constructing a zero-permit Semaphore (would deadlock the first BRANCH request).

Non-default values are logged at INFO at startup so operators see the override took effect.

Reported by @cortsdine in #177. Closes #177.

The bonus metrics suggestion (`forkd_branches_in_flight` gauge on /metrics) is left for a follow-up PR to keep this diff focused on the config plumbing.
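The fallback and zero-check described in this commit message can be sketched roughly as follows. This is a std-only illustration, not the code from the diff: the `DaemonConfig` struct here models only the new field, `resolve_branch_concurrency` is a hypothetical helper name, and the INFO log line is approximated with `eprintln!` rather than whatever logging facade the daemon uses.

```rust
/// Compile-time default named in the PR description.
const DEFAULT_BRANCH_CONCURRENCY: usize = 4;

/// Hypothetical stand-in for the daemon's config; only the new field is modelled.
#[derive(Debug, Default)]
struct DaemonConfig {
    /// `None` means "use DEFAULT_BRANCH_CONCURRENCY".
    branch_concurrency: Option<usize>,
}

/// Hypothetical helper: turn the optional setting into the permit count
/// that the BRANCH semaphore would be constructed with.
fn resolve_branch_concurrency(cfg: &DaemonConfig) -> Result<usize, String> {
    match cfg.branch_concurrency {
        // Unset: behaviour is unchanged for existing users.
        None => Ok(DEFAULT_BRANCH_CONCURRENCY),
        // Zero permits would leave every BRANCH request waiting forever,
        // so refuse to start instead.
        Some(0) => Err("branch_concurrency must be at least 1".to_string()),
        Some(n) => {
            if n != DEFAULT_BRANCH_CONCURRENCY {
                // Stand-in for the INFO log emitted at startup.
                eprintln!(
                    "INFO branch_concurrency overridden: {} (default {})",
                    n, DEFAULT_BRANCH_CONCURRENCY
                );
            }
            Ok(n)
        }
    }
}
```

Under these assumptions, a config with `branch_concurrency: None` resolves to 4, `Some(16)` resolves to 16, and `Some(0)` returns an error that startup code would surface before any semaphore is built.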
This was referenced May 28, 2026
WaylandYang added a commit that referenced this pull request May 28, 2026
…ency_cap (#183)

Follow-up to #179: surfaces the in-flight BRANCH count and the configured cap on /metrics, so operators can size FORKD_BRANCH_CONCURRENCY empirically against their real workload (per @cortsdine's suggestion in #177).

New gauges:

- forkd_branches_in_flight (Mutex<HashSet<String>>.len() of BRANCHes currently writing memory.bin)
- forkd_branch_concurrency_cap (the value the Semaphore was constructed with; not exposed by tokio::Semaphore, so cached on AppState in a new branch_concurrency_cap field that mirrors the value from DaemonConfig::branch_concurrency)

The pair lets a Grafana panel show "5 / 16 in-flight" without external knowledge of the cap.

Tests:

- Existing metrics_emits_prometheus_text extended to assert both new gauges appear with the default cap value.
- New metrics_branches_in_flight_tracks_slot_acquisitions verifies the in-flight gauge moves with BranchSlot acquire/drop, the actual Drop-recovery semantics @henliveira praised in #178, now visible to operators.
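A minimal sketch of how the two gauges could relate to slot acquisition and `Drop`, assuming a simplified `AppState`. Only the gauge names, the `Mutex<HashSet<String>>` shape, and the `branch_concurrency_cap` field come from the commit message; the `in_flight` field name, the `BranchSlot::acquire` signature, and `render_metrics` are hypothetical. The real semaphore is not modelled, so this shows the bookkeeping only, not the concurrency limit itself.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the relevant slice of AppState.
struct AppState {
    /// Ids of BRANCHes currently in flight (field name is an assumption).
    in_flight: Mutex<HashSet<String>>,
    /// Cached copy of the value the semaphore was built with.
    branch_concurrency_cap: usize,
}

/// Guard that marks a BRANCH as in flight for as long as it is alive.
struct BranchSlot {
    state: Arc<AppState>,
    id: String,
}

impl BranchSlot {
    /// Hypothetical constructor: records the id in the in-flight set.
    fn acquire(state: &Arc<AppState>, id: &str) -> BranchSlot {
        state
            .in_flight
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .insert(id.to_string());
        BranchSlot { state: Arc::clone(state), id: id.to_string() }
    }
}

impl Drop for BranchSlot {
    fn drop(&mut self) {
        // Recover the guard even if the lock is poisoned, so an early
        // return or panic in the BRANCH path still clears its entry.
        let mut set = self.state.in_flight.lock().unwrap_or_else(|e| e.into_inner());
        set.remove(&self.id);
    }
}

/// Render both gauges in Prometheus text exposition format.
fn render_metrics(state: &AppState) -> String {
    let in_flight = state.in_flight.lock().unwrap_or_else(|e| e.into_inner()).len();
    format!(
        "# TYPE forkd_branches_in_flight gauge\n\
         forkd_branches_in_flight {}\n\
         # TYPE forkd_branch_concurrency_cap gauge\n\
         forkd_branch_concurrency_cap {}\n",
        in_flight, state.branch_concurrency_cap
    )
}
```

In this sketch the in-flight gauge rises when a `BranchSlot` is acquired and falls when it is dropped, which is the behaviour the new test in the commit is described as checking.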
Closes #177. Thanks to @cortsdine for the precise diagnosis and the suggested API shape.
Root cause
`DEFAULT_BRANCH_CONCURRENCY = 4` at `crates/forkd-controller/src/http.rs:72` is set once at compile time and used to construct the `branch_sem` semaphore in `AppState`. The comment ("Each BRANCH writes a full memory.bin, typically 256 MiB - 8 GiB, so the cap bounds peak transient disk usage") spells out exactly why the right value depends on the operator's environment: a constant that only changes via `cargo build` defeats the purpose.

The fix
- `DaemonConfig`: `branch_concurrency: Option<usize>`. `None` falls back to `DEFAULT_BRANCH_CONCURRENCY` (4), so behaviour for existing callers is unchanged.
- `--branch-concurrency <N>` CLI flag on `forkd-controller serve`, also readable from `FORKD_BRANCH_CONCURRENCY` (matches the existing `FORKD_*` env-var family pattern used for `FORKD_TOKEN_FILE`, `FORKD_TLS_CERT`, etc.).
- `branch_concurrency = 0` is rejected at daemon startup rather than constructing a zero-permit `Semaphore` (which would deadlock the first BRANCH request).

Local verification
The flag follows the same clap pattern as the existing `--token-file` / `--tls-cert` / `--prewarm-scratch-dir` knobs; nothing new about how it's plumbed.

What's deliberately not in this PR
Your bonus suggestion of surfacing `forkd_branches_in_flight` on `/metrics` is a nice operational add but a separate concern (touches Prometheus registration, gauge state, possibly a registry trait extension). Filing as a follow-up rather than expanding this PR's review surface; happy to PR that next if you'd like it included.