docs(controller-manager): RFC for sharding projects across replicas #640

Open
ecv wants to merge 2 commits into main from docs/shard-projects-across-replicas

Conversation

@ecv ecv commented Jun 1, 2026

Contributor

Summary

RFC / design proposal (no production code changes) for sharding Project cluster engagements across N controller-manager replicas.

The controller-manager engages one full controller-runtime cluster.Cluster per Ready Project (cache + informers + reflectors). With leader election and replicas: 1, a single active replica holds all of them, so goroutine count and stack memory scale linearly with project count: ~1,100 goroutines/project and 1.51 GiB go_memstats_stack_inuse_bytes at ~395 projects (#631). #632's managedFields strip cut per-object heap but does not reduce this goroutine-stack floor.

The RFC proposes distributing per-project engagements across replicas (active/active) so each pod holds ~projects/N caches → ~1/N of the stack floor per pod.
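As a rough back-of-the-envelope illustration of the ~1/N claim, the sketch below divides the observed figures from #631 by the replica count. It assumes projects are spread evenly and that the whole 1.51 GiB is attributable to per-project engagements; both are simplifying assumptions, and `perPodEstimate` is a hypothetical helper, not code from the RFC.

```go
package main

import "fmt"

// Figures quoted in the summary above (all approximate).
const (
	observedProjects     = 395  // Ready Projects at the time of the measurement (#631)
	observedStackGiB     = 1.51 // go_memstats_stack_inuse_bytes, in GiB (#631)
	goroutinesPerProject = 1100 // approximate goroutines per engaged project
)

// perPodEstimate returns the estimated projects per pod and stack GiB per pod
// for the given replica count. Assumption: perfectly even distribution, and
// the entire stack figure scales with the number of engaged projects.
func perPodEstimate(projects int, stackGiB float64, replicas int) (float64, float64) {
	n := float64(replicas)
	return float64(projects) / n, stackGiB / n
}

func main() {
	for _, n := range []int{1, 2, 3, 4} {
		p, s := perPodEstimate(observedProjects, observedStackGiB, n)
		fmt.Printf("replicas=%d projects/pod~%.0f goroutines/pod~%.0f stack/pod~%.2f GiB\n",
			n, p, p*goroutinesPerProject, s)
	}
}
```

Real numbers will deviate from this: hash-based assignment is only approximately even, and each pod also carries a fixed baseline that does not shrink with N.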

Document

docs/proposals/shard-projects-across-replicas.md

Covers: the goroutine-stack floor problem, three sharding approaches (static label-selector / hash-based / lease-per-shard) with trade-offs and a recommendation (hash-based via StatefulSet ordinal, HRW hashing), leader-election → active/active migration and failover, a concrete change sketch in provider.go/controllermanager.go, a metrics-based rollout/validation plan, and open risks (rebalancing churn, lease scoping vs. singleton controllers, webhook traffic, and quota cross-cluster coordination correctness — flagged as highest risk).
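For readers unfamiliar with HRW (rendezvous) hashing, here is a minimal sketch of how a replica could decide which projects it owns from its StatefulSet ordinal. It is not taken from the proposal document: the function names, the SHA-256 based score, and the `controller-manager-2` pod name are illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strconv"
	"strings"
)

// ordinalFromHostname extracts the trailing ordinal from a StatefulSet pod
// name of the form "<name>-<ordinal>". Hypothetical helper.
func ordinalFromHostname(hostname string) (int, error) {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return 0, fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	return strconv.Atoi(hostname[i+1:])
}

// score hashes the (replica, project) pair into a 64-bit weight.
// SHA-256 is an assumption chosen for good mixing, not the RFC's choice.
func score(project string, replica int) uint64 {
	sum := sha256.Sum256([]byte(strconv.Itoa(replica) + "/" + project))
	return binary.BigEndian.Uint64(sum[:8])
}

// ownerOf returns the ordinal with the highest score for the project.
// Ties go to the lowest ordinal.
func ownerOf(project string, replicas int) int {
	best, bestScore := 0, uint64(0)
	for r := 0; r < replicas; r++ {
		if s := score(project, r); r == 0 || s > bestScore {
			best, bestScore = r, s
		}
	}
	return best
}

func main() {
	self, err := ordinalFromHostname("controller-manager-2") // example pod name
	if err != nil {
		panic(err)
	}
	const replicas = 3
	for _, p := range []string{"project-a", "project-b", "project-c", "project-d"} {
		owner := ownerOf(p, replicas)
		fmt.Printf("%s -> replica %d (engage here: %v)\n", p, owner, owner == self)
	}
}
```

The property that makes HRW attractive for the rebalancing-churn risk is that existing scores do not change when a replica is added: a project either stays with its current owner or moves to the new replica, so scaling from N to N+1 moves roughly 1/(N+1) of the projects.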

This is an RFC / design proposal only — no implementation.

Refs #631, follow-up to #632.

Follow-up to #631. Design proposal to distribute per-project cluster
engagements across N controller-manager replicas, cutting the
goroutine-stack floor (one reflector set per project, 1.51 GiB at ~395
projects) that #632's managedFields strip does not address. RFC only; no
code.

Refs #631

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ecv ecv requested a review from scotwells June 4, 2026 18:17