
Parallelise installing resources #3847

Open
ntnn wants to merge 8 commits into kcp-dev:main from ntnn:parallelise-installing-resources

Conversation

@ntnn
Member

@ntnn ntnn commented Feb 19, 2026

Summary

I've been fiddling with the startup because a 30s cold start for an instance is a bit much for my liking.
This cuts a bit off - not a lot though. On average, parallelizing this cuts ~5s off the startup time.

Other functions already parallelize in the same way, so this is more of a cleanup to adhere to the standards set in those functions.

What Type of PR Is This?

/kind cleanup

Related Issue(s)

Fixes #

Release Notes

NONE

@kcp-ci-bot kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from e79eb40 to aed85ef on February 19, 2026 17:27
@ntnn
Member Author

ntnn commented Feb 19, 2026

/test pull-kcp-test-integration

} else if err != nil {
return err
}
if len(bytes.TrimSpace(doc)) == 0 {
Contributor

@gman0 gman0 Feb 19, 2026


Why move the empty-doc check from here into the for ... := range docs loop below?

Member Author


True, that is better. At the time I was more focused on the startup in pkg/server, so I didn't think twice about it.

@gman0
Contributor

gman0 commented Feb 19, 2026

Otherwise LGTM.

@ntnn ntnn force-pushed the parallelise-installing-resources branch from aed85ef to b7c5af2 on February 19, 2026 19:13
@gman0
Contributor

gman0 commented Feb 20, 2026

/lgtm

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2026
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: 183666035c3ece9c747fd959ecbf556cb1770b17

if f.IsDir() {
return
}
if err := CreateResourceFromFS(ctx, client, mapper, batteriesIncluded, f.Name(), fs, transformers...); err != nil {
Contributor


I remember some bootstrap files relied on the in-file ordering. Can we maybe add some high-level ordering? Meaning - namespaces first, cluster objects second, and the rest follows?

This lib is heavily used outside kcp in our other repos, and I think it might break things if no ordering is done.

Member Author


Mhm, we could do weights, but I'd rather wrap createResourceFromFS in an eventually so ordering doesn't matter.
So basically CreateResourcesFromFS - multiplex -> CreateResourceFromFS - multiplex -> createRetryResourceFromFS - retry -> createResourceFromFS.
I was already thinking that wrapping CreateResourcesFromFS in an eventually in the startup was a bit too expensive, because on every failure it would apply everything again.

Contributor


But does this mean that if you have 100 files, it will fire 100 create requests, 99 will fail due to the missing namespace, 1 namespace create succeeds, and after that the 99 retries succeed?

vs: 1 namespace request, then the 99 following.

Basically, 199 requests vs 100 requests.

Member Author


True, but then we should redesign the resource installation: weight the resources being installed, group them by weight, and install in chunks, with resources within one chunk being parallelised.
I think @xrstf already made a similar weighting for the init-agent; we could reuse that code and go from there.
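The weight-group-chunk scheme might look roughly like the following stdlib-only sketch. The tiers in `weightFor` and the `chunkByWeight` helper are illustrative assumptions, not the actual weights from the init-agent or this PR.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// weightFor assigns a coarse install tier: namespaces first, CRDs next,
// everything else last. The kinds and tier values are illustrative.
func weightFor(kind string) int {
	switch kind {
	case "Namespace":
		return 0
	case "CustomResourceDefinition":
		return 1
	default:
		return 2
	}
}

// chunkByWeight groups objects into chunks ordered by ascending weight.
func chunkByWeight(kinds []string) [][]string {
	byWeight := map[int][]string{}
	for _, k := range kinds {
		byWeight[weightFor(k)] = append(byWeight[weightFor(k)], k)
	}
	weights := make([]int, 0, len(byWeight))
	for w := range byWeight {
		weights = append(weights, w)
	}
	sort.Ints(weights)
	chunks := make([][]string, 0, len(weights))
	for _, w := range weights {
		chunks = append(chunks, byWeight[w])
	}
	return chunks
}

func main() {
	chunks := chunkByWeight([]string{"Deployment", "Namespace", "CustomResourceDefinition", "Service"})
	// Chunks install sequentially; members of a chunk install in parallel.
	for _, chunk := range chunks {
		var wg sync.WaitGroup
		for _, k := range chunk {
			wg.Add(1)
			go func(k string) { defer wg.Done() /* create the resource here */ }(k)
		}
		wg.Wait()
		fmt.Println("installed:", chunk)
	}
}
```

Waiting on each chunk before starting the next is what makes per-resource retries mostly unnecessary within a bootstrap: dependencies land in an earlier chunk by construction.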


@mjudeikis-bot mjudeikis-bot left a comment


Review notes:

1. Data race in CreateResourceFromFS — transformer loop

The goroutine captures err from the outer function scope rather than declaring it locally:

doc, err = transformer(doc)  // ← outer err, data race across goroutines

Should be:

doc, err := transformer(doc)

Multiple goroutines writing to the same err variable is a data race. go test -race will catch this.
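For illustration, a stdlib-only sketch of the safe pattern: each goroutine declares its error locally with `:=`, so there is no shared write. The `transformAll` name and slice-based error collection are made up for this sketch, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// transformAll applies a transformer to every doc concurrently. The key
// detail: `out, err := transformer(doc)` declares err locally in each
// goroutine. Assigning to an err from the enclosing function scope with
// `=` instead would be a data race that `go test -race` flags.
func transformAll(docs []string, transformer func(string) (string, error)) ([]string, error) {
	results := make([]string, len(docs))
	errs := make([]error, len(docs)) // one slot per goroutine, no sharing
	var wg sync.WaitGroup
	for i, doc := range docs {
		wg.Add(1)
		go func(i int, doc string) {
			defer wg.Done()
			out, err := transformer(doc) // local err: no shared write
			if err != nil {
				errs[i] = err
				return
			}
			results[i] = out
		}(i, doc)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	out, _ := transformAll([]string{"a", "b"}, func(s string) (string, error) {
		return strings.ToUpper(s), nil
	})
	fmt.Println(out) // [A B]
}
```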


2. Dependency ordering — namespaces, CRDs, resources

Parallelizing within a file and across files changes the semantics when resources have install-time dependencies. For example:

  • A multi-doc YAML with a Namespace followed by resources in that namespace — both now land concurrently, so the resource creation may fail with namespace not found
  • Across files: CRDs and their CRs in separate files can now race

This is only safe if every call site already wraps CreateResourcesFromFS in a retry loop. Worth auditing call sites — if any are fire-once, this is a silent regression that converts a previously-reliable install into a flaky one.


3. Minor: inconsistent closure style

BindRootAPIs goroutine correctly passes exportName explicitly:

go func(exportName string) { ... }(exportName)

The other goroutines rely on Go 1.22+ per-iteration range variable semantics instead. Not wrong if the module requires 1.22+, but worth noting the inconsistency — explicit params are safer and clearer about intent.
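For illustration, the explicit-parameter style the note prefers, in a self-contained sketch (`collect` is a made-up helper, not code from the PR):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per name, passing the loop variable as
// an explicit parameter. This is correct on every Go version and makes
// the captured value obvious to the reader, whereas relying on Go 1.22+
// per-iteration loop variables is only safe when go.mod guarantees at
// least that toolchain.
func collect(names []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) { // explicit parameter, not a closure capture
			defer wg.Done()
			mu.Lock()
			seen = append(seen, name)
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	sort.Strings(seen) // goroutines finish in arbitrary order
	return seen
}

func main() {
	fmt.Println(collect([]string{"exports", "shards", "tenancy"}))
}
```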

@kcp-ci-bot kcp-ci-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@kcp-ci-bot kcp-ci-bot requested a review from gman0 February 25, 2026 01:13
@kcp-ci-bot
Contributor

New changes are detected. LGTM label has been removed.

@kcp-ci-bot kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch 3 times, most recently from 49da64b to 0816ec3 on February 25, 2026 01:18
@ntnn
Member Author

ntnn commented Feb 25, 2026

/hold

I've parallelised installing resources in a few more places. Startup etc. works fine initially, but now the batteries kcp-admin user can't access workspace types:

Error from server (Forbidden): workspacetypes.tenancy.kcp.io is forbidden: User "kcp-admin" cannot list resource "workspacetypes" in API group "tenancy.kcp.io" at the cluster scope: access denied

I'll revisit the changes later.

@kcp-ci-bot kcp-ci-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2026
@mjudeikis-bot

On it @ntnn — will open a follow-up PR with a utility function in pkg/ (new pkg/concurrency package or suitable existing) including the exitEarly param that cancels all goroutines on first error. Will link back here once the PR is up.

@ntnn ntnn mentioned this pull request Feb 25, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from 08ddecf to 4a82c7d on March 3, 2026 23:54
mjudeikis-bot pushed a commit to mjudeikis-bot/kcp that referenced this pull request Mar 4, 2026
Add ForEach[T] to pkg/errgroup as a convenience wrapper over Group that
runs a function concurrently for each item in a slice.

The exitEarly flag mirrors Group.FailFast: when true, the context
passed to fn is cancelled as soon as any invocation returns a non-nil
error so callers can abort in-flight work promptly.

Suggested-by: ntnn (kcp-dev#3847)

Co-authored-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
@mjudeikis-bot

Reviewed the changes. Overall the approach is solid — good separation of concerns across , , and the refactored .

A few observations:

** / **
The func (cache invalidation) is called on every retry of any resource, even if only one resource in a parallel chunk failed. This means a cache miss on one resource could trigger unnecessary invalidations for concurrent goroutines that are succeeding. The original code invalidated only at the top-level poll loop, which was coarser but intentional. Worth deciding whether per-resource invalidation is the right granularity here — it may cause extra API server round-trips in large bootstraps.

** capacity hint**
make([][]*unstructured.Unstructured, 0, len(weights)) pre-allocates capacity for 8 chunks (the number of weight tiers). In practice bootstraps rarely have all 8 tiers populated, so this is fine, but it might be slightly misleading — capacity here reflects tiers, not actual object count. A comment would help.

** with filter**
The filter walks the entire FS to find a single file. If callers are using this in hot paths (e.g., per-resource retry loops) this could be expensive. Not a correctness issue, just worth noting.

** — no **
The errgroup in doesn't set , so if one resource's fails (context cancelled), other goroutines in the chunk will continue until their contexts are also cancelled. The behaviour is correct since they share the same context, but makes the failure mode slightly slower than fail-fast.

Overall LGTM from a logic standpoint — the hierarchy-aware chunking correctly preserves the CRD → APIExport → APIBinding install ordering that was implicit before.

@ntnn ntnn force-pushed the parallelise-installing-resources branch from 1bf3bff to cccd052 on March 15, 2026 13:16
@kcp-ci-bot kcp-ci-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 15, 2026
@ntnn
Member Author

ntnn commented Mar 15, 2026

/hold cancel

Okay, I finally worked out all of the nits that were wrong. I also moved a bunch of code around so that not everything is in the one bootstrap.go file.

@kcp-ci-bot kcp-ci-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 15, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from cccd052 to 5614b73 on March 15, 2026 13:20
@ntnn ntnn requested a review from Copilot March 15, 2026 13:20

@mjudeikis-bot mjudeikis-bot left a comment


Thanks for the iteration — the refactoring is in much better shape now. The separation into binds.go, read.go, sort.go, template.go, transform.go, and yaml.go makes the package a lot easier to navigate. Good call extracting TemplateInput as an exported type too.

Previous concern addressed ✅ — The cache invalidation is now scoped to chunk-level failures via the reset function pointer, which is a clear improvement. The comment explaining why it is kept despite the new ordering is appreciated.

A few minor observations on the updated code:

read.go — fail-fast vs. aggregate errors
The old CreateResourcesFromFS used utilerrors.NewAggregate to collect all per-file errors and return them together. ReadResourcesFromFS now returns on the first file error. This is likely fine for bootstrap use-cases but is a subtle behavior change worth noting in case callers relied on seeing all errors at once.

binds.go — unconditional spec overwrite in BindRootAPI
In the AlreadyExists polling loop, existing.Spec = binding.Spec is always applied before calling Update, without first checking whether the spec is already correct. In practice this is harmless (the update becomes a no-op server-side if nothing changed), but it does generate unnecessary write traffic on every restart when the binding is already up-to-date. A quick equality check before the Update call would eliminate the churn.

sort.go — objectWeight partial view
I can see the weights slice but not the full objectWeight implementation in the diff. Assuming it handles the {Group: "apiextensions.k8s.io"} entry (no Kind) as a prefix/group-only match for CRDs — worth confirming that the matching logic does not accidentally over-match other groups.

None of these are blockers. The core parallelisation logic and the hierarchy-based chunking are well-implemented and well-tested. Happy to see this move forward.


Copilot AI left a comment


Pull request overview

This PR refactors the config bootstrap/resource installation helpers to reduce cold-start time by parallelizing resource creation and centralizing YAML parsing/templating/transform logic into reusable helper functions.

Changes:

  • Extracted YAML parsing, templating, and file transformation logic into new config/helpers/* helpers.
  • Introduced hierarchy-based sorting/chunking of unstructured objects to apply foundational resources earlier (CRDs, APIs, namespaces, etc.).
  • Updated bootstrap/apply logic to process chunks sequentially while creating resources within a chunk concurrently; added unit tests for sorting/chunking.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

File Description
config/helpers/bootstrap.go Refactors bootstrap/apply to read + chunk resources and create objects concurrently within each chunk.
config/helpers/read.go Adds helpers to read, transform, template, and filter resources from an embedded FS.
config/helpers/yaml.go Adds YAML-to-unstructured parsing helper for multi-doc YAML.
config/helpers/template.go Adds manifest templating helper and template input construction.
config/helpers/transform.go Extracts transform option types + replace transformer and applies transformer pipeline.
config/helpers/sort.go Adds object hierarchy weighting, sorting, and chunking.
config/helpers/sort_test.go Adds tests for sorting and chunking behavior.
config/helpers/binds.go Moves root API binding helpers into a dedicated file.


return raw, nil
}

// ReadResourceFromFS uses ReadRawFromFS to read a file and returns it
@ntnn
Member Author

ntnn commented Mar 15, 2026

@mjudeikis-bot Apply each commit in this PR and run the lint and e2e tests every time to verify that each commit is atomic.

@mjudeikis-bot

Took a look at the updated changes — the refactoring into read.go, binds.go, and the slimmed-down bootstrap.go is a clear improvement. Good split.

Re: Copilot loop-variable capture warnings — these are false positives. The project requires Go ≥ 1.25 (go.mod), and Go 1.22+ changed loop variable semantics so each iteration already gets its own copy. No shadowing needed.

CI failures — the integration and e2e suites are all red on the latest commit (1bf3bff). Before we can LGTM the refactored code those need to pass. Is the parallelised install order causing the timing sensitivity that tripped these tests, or is it a pre-existing flake? Worth checking the prow logs.

Minor: the comment in bootstrapChunk still reads "they resource dependencies" — looks like the suggestion landed but the typo remains.

@mjudeikis-bot

@ntnn @xrstf — picking up the mentions here.

Re: utility function request — I don't have direct push access to kcp-dev/kcp, so I can't open a PR from here. I'll flag this for @mjudeikis to decide whether to proceed via a fork. The function would live in something like pkg/util/parallel/ and look roughly like:

func RunParallel(ctx context.Context, exitEarly bool, fns ...func(ctx context.Context) error) error {
    // Derive a cancellable context so exitEarly can abort in-flight work.
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    // Buffered so no goroutine blocks on send after wg.Wait() returns.
    errsChan := make(chan error, len(fns))
    var wg sync.WaitGroup
    for _, fn := range fns {
        wg.Add(1)
        go func(fn func(ctx context.Context) error) {
            defer wg.Done()
            err := fn(ctx)
            if err != nil && exitEarly {
                cancel() // first failure cancels the shared context
            }
            errsChan <- err
        }(fn)
    }
    wg.Wait()
    close(errsChan)
    var errs []error
    for err := range errsChan {
        if err != nil {
            errs = append(errs, err)
        }
    }
    // utilerrors is k8s.io/apimachinery/pkg/util/errors.
    return utilerrors.NewAggregate(errs)
}

Will proceed once @mjudeikis gives the go-ahead.

Re: Harlem Shake request — 😄 Nice try @xrstf, but I'm a coding bot, not a DJ.

@mjudeikis-bot

@ntnn — running per-commit lint and e2e test verification is in the coder agent (Bob ⚙️) territory. I handle code review and comments only. Will flag this for @mjudeikis to decide if the coder agent should be tasked with the per-commit atomicity check.

@ntnn ntnn added this to tbd Mar 15, 2026
@github-project-automation github-project-automation bot moved this to Backlog in tbd Mar 15, 2026
@ntnn
Member Author

ntnn commented Mar 15, 2026

/test pull-kcp-test-e2e-shared
/test pull-kcp-test-e2e-multiple-runs

That looks like flakes.

ntnn added 8 commits March 15, 2026 16:12
Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>
@ntnn ntnn force-pushed the parallelise-installing-resources branch from 6908961 to 0d4485e on March 15, 2026 15:13
@kcp-ci-bot
Contributor

@ntnn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kcp-test-e2e-sharded 0d4485e link true /test pull-kcp-test-e2e-sharded
pull-kcp-test-e2e 0d4485e link true /test pull-kcp-test-e2e
pull-kcp-test-e2e-multiple-runs 0d4485e link true /test pull-kcp-test-e2e-multiple-runs

Full PR test history


@ntnn ntnn moved this from Backlog to In progress in tbd Mar 19, 2026