
Parallelise installing resources #3847

Open
ntnn wants to merge 8 commits into kcp-dev:main from ntnn:parallelise-installing-resources

Conversation

@ntnn
Member

@ntnn ntnn commented Feb 19, 2026

Summary

I've been fiddling with the startup because a 30s cold start for an instance is a bit much for my liking.
This cuts a bit off - not a lot though. On average, parallelizing this cuts ~5s off the startup time.

Other functions already parallelize in the same way, so this is more of a cleanup to adhere to the standards set in those functions.

What Type of PR Is This?

/kind cleanup

Related Issue(s)

Fixes #

Release Notes

NONE

@kcp-ci-bot kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from e79eb40 to aed85ef on February 19, 2026 17:27
@ntnn
Member Author

ntnn commented Feb 19, 2026

/test pull-kcp-test-integration

} else if err != nil {
return err
}
if len(bytes.TrimSpace(doc)) == 0 {
Contributor

@gman0 gman0 Feb 19, 2026


Why move the empty-doc check from here into the for ... := range docs loop below?

Member Author


True, that is better. At the time I was more focused on the startup in pkg/server, so I didn't think twice about it.

@gman0
Contributor

gman0 commented Feb 19, 2026

Otherwise LGTM.

@ntnn ntnn force-pushed the parallelise-installing-resources branch from aed85ef to b7c5af2 on February 19, 2026 19:13
@gman0
Contributor

gman0 commented Feb 20, 2026

/lgtm

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2026
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: 183666035c3ece9c747fd959ecbf556cb1770b17

if f.IsDir() {
return
}
if err := CreateResourceFromFS(ctx, client, mapper, batteriesIncluded, f.Name(), fs, transformers...); err != nil {
Contributor


I remember some bootstrap files relied on the in-file ordering. Can we maybe add some high-level ordering? Meaning - namespaces first, cluster objects second, and the rest follows?

This lib is heavily used outside kcp in our other repos, and I think it might break things if no ordering is done.

Member Author


Mhm, we could do weights, but I'd rather wrap createResourceFromFS in an eventually so ordering doesn't matter.
So basically CreateResourcesFromFS - multiplex -> CreateResourceFromFS - multiplex -> createRetryResourceFromFS - retry -> createResourceFromFS.
I was already thinking that wrapping CreateResourcesFromFS in an eventually in the startup was a bit too expensive, because on every failure it would apply everything again.

Contributor


But does this mean that if you have 100 files, it will fire 100 create requests, 99 will fail due to the missing namespace, 1 namespace create succeeds, and after that the 99 retries succeed?

vs: 1 namespace request, then the 99 following.

Basically, 199 requests vs 100 requests.

Member Author


True, but then we should redesign the resource installation: weight the resources being installed, group them by weight, and install in chunks, with resources within one chunk being parallelised.
I think @xrstf already made a similar weighting for the init-agent; we could reuse that code and go from there.
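The weight-group-chunk scheme might look roughly like the following stdlib-only sketch. The tiers in `weightFor` and the `chunkByWeight` helper are illustrative assumptions, not the actual weights from the init-agent or this PR.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// weightFor assigns a coarse install tier: namespaces first, CRDs next,
// everything else last. The kinds and tier values are illustrative.
func weightFor(kind string) int {
	switch kind {
	case "Namespace":
		return 0
	case "CustomResourceDefinition":
		return 1
	default:
		return 2
	}
}

// chunkByWeight groups objects into chunks ordered by ascending weight.
func chunkByWeight(kinds []string) [][]string {
	byWeight := map[int][]string{}
	for _, k := range kinds {
		byWeight[weightFor(k)] = append(byWeight[weightFor(k)], k)
	}
	weights := make([]int, 0, len(byWeight))
	for w := range byWeight {
		weights = append(weights, w)
	}
	sort.Ints(weights)
	chunks := make([][]string, 0, len(weights))
	for _, w := range weights {
		chunks = append(chunks, byWeight[w])
	}
	return chunks
}

func main() {
	chunks := chunkByWeight([]string{"Deployment", "Namespace", "CustomResourceDefinition", "Service"})
	// Chunks install sequentially; members of a chunk install in parallel.
	for _, chunk := range chunks {
		var wg sync.WaitGroup
		for _, k := range chunk {
			wg.Add(1)
			go func(k string) { defer wg.Done() /* create the resource here */ }(k)
		}
		wg.Wait()
		fmt.Println("installed:", chunk)
	}
}
```

Waiting on each chunk before starting the next is what makes per-resource retries mostly unnecessary within a bootstrap: dependencies land in an earlier chunk by construction.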


@mjudeikis-bot mjudeikis-bot left a comment


Review notes:

1. Data race in CreateResourceFromFS — transformer loop

The goroutine captures err from the outer function scope rather than declaring it locally:

doc, err = transformer(doc)  // ← outer err, data race across goroutines

Should be:

doc, err := transformer(doc)

Multiple goroutines writing to the same err variable is a data race. go test -race will catch this.
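For illustration, a stdlib-only sketch of the safe pattern: each goroutine declares its error locally with `:=`, so there is no shared write. The `transformAll` name and slice-based error collection are made up for this sketch, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// transformAll applies a transformer to every doc concurrently. The key
// detail: `out, err := transformer(doc)` declares err locally in each
// goroutine. Assigning to an err from the enclosing function scope with
// `=` instead would be a data race that `go test -race` flags.
func transformAll(docs []string, transformer func(string) (string, error)) ([]string, error) {
	results := make([]string, len(docs))
	errs := make([]error, len(docs)) // one slot per goroutine, no sharing
	var wg sync.WaitGroup
	for i, doc := range docs {
		wg.Add(1)
		go func(i int, doc string) {
			defer wg.Done()
			out, err := transformer(doc) // local err: no shared write
			if err != nil {
				errs[i] = err
				return
			}
			results[i] = out
		}(i, doc)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	out, _ := transformAll([]string{"a", "b"}, func(s string) (string, error) {
		return strings.ToUpper(s), nil
	})
	fmt.Println(out) // [A B]
}
```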


2. Dependency ordering — namespaces, CRDs, resources

Parallelizing within a file and across files changes the semantics when resources have install-time dependencies. For example:

  • A multi-doc YAML with a Namespace followed by resources in that namespace — both now land concurrently, so the resource creation may fail with namespace not found
  • Across files: CRDs and their CRs in separate files can now race

This is only safe if every call site already wraps CreateResourcesFromFS in a retry loop. Worth auditing call sites — if any are fire-once, this is a silent regression that converts a previously-reliable install into a flaky one.


3. Minor: inconsistent closure style

BindRootAPIs goroutine correctly passes exportName explicitly:

go func(exportName string) { ... }(exportName)

The other goroutines rely on Go 1.22+ per-iteration range variable semantics instead. Not wrong if the module requires 1.22+, but worth noting the inconsistency — explicit params are safer and clearer about intent.
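For illustration, the explicit-parameter style the note prefers, in a self-contained sketch (`collect` is a made-up helper, not code from the PR):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per name, passing the loop variable as
// an explicit parameter. This is correct on every Go version and makes
// the captured value obvious to the reader, whereas relying on Go 1.22+
// per-iteration loop variables is only safe when go.mod guarantees at
// least that toolchain.
func collect(names []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) { // explicit parameter, not a closure capture
			defer wg.Done()
			mu.Lock()
			seen = append(seen, name)
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	sort.Strings(seen) // goroutines finish in arbitrary order
	return seen
}

func main() {
	fmt.Println(collect([]string{"exports", "shards", "tenancy"}))
}
```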

@kcp-ci-bot kcp-ci-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@kcp-ci-bot kcp-ci-bot requested a review from gman0 February 25, 2026 01:13
@kcp-ci-bot
Contributor

New changes are detected. LGTM label has been removed.

@kcp-ci-bot kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch 3 times, most recently from 49da64b to 0816ec3 on February 25, 2026 01:18
@ntnn
Member Author

ntnn commented Feb 25, 2026

/hold

I've parallelised installing resources in a few more places. Startup etc. works fine initially, but now the batteries kcp-admin user can't access workspace types:

Error from server (Forbidden): workspacetypes.tenancy.kcp.io is forbidden: User "kcp-admin" cannot list resource "workspacetypes" in API group "tenancy.kcp.io" at the cluster scope: access denied

I'll revisit the changes later.

@kcp-ci-bot kcp-ci-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2026
@mjudeikis-bot

On it @ntnn — will open a follow-up PR with a utility function in pkg/ (new pkg/concurrency package or suitable existing) including the exitEarly param that cancels all goroutines on first error. Will link back here once the PR is up.

@ntnn ntnn mentioned this pull request Feb 25, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from 08ddecf to 4a82c7d on March 3, 2026 23:54
mjudeikis-bot pushed a commit to mjudeikis-bot/kcp that referenced this pull request Mar 4, 2026
Add ForEach[T] to pkg/errgroup as a convenience wrapper over Group that
runs a function concurrently for each item in a slice.

The exitEarly flag mirrors Group.FailFast: when true, the context
passed to fn is cancelled as soon as any invocation returns a non-nil
error so callers can abort in-flight work promptly.

Suggested-by: ntnn (kcp-dev#3847)

Co-authored-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
@mjudeikis-bot

Reviewed the changes. Overall the approach is solid — good separation of concerns across , , and the refactored .

A few observations:

** / **
The func (cache invalidation) is called on every retry of any resource, even if only one resource in a parallel chunk failed. This means a cache miss on one resource could trigger unnecessary invalidations for concurrent goroutines that are succeeding. The original code invalidated only at the top-level poll loop, which was coarser but intentional. Worth deciding whether per-resource invalidation is the right granularity here — it may cause extra API server round-trips in large bootstraps.

** capacity hint**
make([][]*unstructured.Unstructured, 0, len(weights)) pre-allocates capacity for 8 chunks (the number of weight tiers). In practice bootstraps rarely have all 8 tiers populated, so this is fine, but it might be slightly misleading — capacity here reflects tiers, not actual object count. A comment would help.

** with filter**
The filter walks the entire FS to find a single file. If callers are using this in hot paths (e.g., per-resource retry loops) this could be expensive. Not a correctness issue, just worth noting.

** — no **
The errgroup in doesn't set , so if one resource's fails (context cancelled), other goroutines in the chunk will continue until their contexts are also cancelled. The behaviour is correct since they share the same context, but makes the failure mode slightly slower than fail-fast.

Overall LGTM from a logic standpoint — the hierarchy-aware chunking correctly preserves the CRD → APIExport → APIBinding install ordering that was implicit before.

@ntnn ntnn force-pushed the parallelise-installing-resources branch from 1bf3bff to cccd052 on March 15, 2026 13:16
@kcp-ci-bot kcp-ci-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 15, 2026
@ntnn
Member Author

ntnn commented Mar 15, 2026

/hold cancel

Okay, I finally worked out all of the nits that were wrong. I also moved a bunch of code around so that not everything is in the one bootstrap.go file.

@kcp-ci-bot kcp-ci-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 15, 2026
@ntnn ntnn force-pushed the parallelise-installing-resources branch from cccd052 to 5614b73 on March 15, 2026 13:20
@ntnn ntnn requested a review from Copilot March 15, 2026 13:20

@mjudeikis-bot mjudeikis-bot left a comment


Thanks for the iteration — the refactoring is in much better shape now. The separation into binds.go, read.go, sort.go, template.go, transform.go, and yaml.go makes the package a lot easier to navigate. Good call extracting TemplateInput as an exported type too.

Previous concern addressed ✅ — The cache invalidation is now scoped to chunk-level failures via the reset function pointer, which is a clear improvement. The comment explaining why it is kept despite the new ordering is appreciated.

A few minor observations on the updated code:

read.go — fail-fast vs. aggregate errors
The old CreateResourcesFromFS used utilerrors.NewAggregate to collect all per-file errors and return them together. ReadResourcesFromFS now returns on the first file error. This is likely fine for bootstrap use-cases but is a subtle behavior change worth noting in case callers relied on seeing all errors at once.

binds.go — unconditional spec overwrite in BindRootAPI
In the AlreadyExists polling loop, existing.Spec = binding.Spec is always applied before calling Update, without first checking whether the spec is already correct. In practice this is harmless (the update becomes a no-op server-side if nothing changed), but it does generate unnecessary write traffic on every restart when the binding is already up-to-date. A quick equality check before the Update call would eliminate the churn.

sort.go — objectWeight partial view
I can see the weights slice but not the full objectWeight implementation in the diff. Assuming it handles the {Group: "apiextensions.k8s.io"} entry (no Kind) as a prefix/group-only match for CRDs — worth confirming that the matching logic does not accidentally over-match other groups.

None of these are blockers. The core parallelisation logic and the hierarchy-based chunking are well-implemented and well-tested. Happy to see this move forward.


Copilot AI left a comment


Pull request overview

This PR refactors the config bootstrap/resource installation helpers to reduce cold-start time by parallelizing resource creation and centralizing YAML parsing/templating/transform logic into reusable helper functions.

Changes:

  • Extracted YAML parsing, templating, and file transformation logic into new config/helpers/* helpers.
  • Introduced hierarchy-based sorting/chunking of unstructured objects to apply foundational resources earlier (CRDs, APIs, namespaces, etc.).
  • Updated bootstrap/apply logic to process chunks sequentially while creating resources within a chunk concurrently; added unit tests for sorting/chunking.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

File Description
config/helpers/bootstrap.go Refactors bootstrap/apply to read + chunk resources and create objects concurrently within each chunk.
config/helpers/read.go Adds helpers to read, transform, template, and filter resources from an embedded FS.
config/helpers/yaml.go Adds YAML-to-unstructured parsing helper for multi-doc YAML.
config/helpers/template.go Adds manifest templating helper and template input construction.
config/helpers/transform.go Extracts transform option types + replace transformer and applies transformer pipeline.
config/helpers/sort.go Adds object hierarchy weighting, sorting, and chunking.
config/helpers/sort_test.go Adds tests for sorting and chunking behavior.
config/helpers/binds.go Moves root API binding helpers into a dedicated file.


return raw, nil
}

// ReadResourceFromFS uses ReadRawFromFS to read a file and returns it
@ntnn
Member Author

ntnn commented Mar 15, 2026

@mjudeikis-bot Apply each commit in this PR and run the lint and e2e tests every time to verify that each commit is atomic.

@mjudeikis-bot

Took a look at the updated changes — the refactoring into read.go, binds.go, and the slimmed-down bootstrap.go is a clear improvement. Good split.

Re: Copilot loop-variable capture warnings — these are false positives. The project requires Go ≥ 1.25 (go.mod), and Go 1.22+ changed loop variable semantics so each iteration already gets its own copy. No shadowing needed.

CI failures — the integration and e2e suites are all red on the latest commit (1bf3bff). Before we can LGTM the refactored code those need to pass. Is the parallelised install order causing the timing sensitivity that tripped these tests, or is it a pre-existing flake? Worth checking the prow logs.

Minor: the comment in bootstrapChunk still reads "they resource dependencies" — looks like the suggestion landed but the typo remains.

@mjudeikis-bot

@ntnn @xrstf — picking up the mentions here.

Re: utility function request — I don't have direct push access to kcp-dev/kcp, so I can't open a PR from here. I'll flag this for @mjudeikis to decide whether to proceed via a fork. The function would live in something like pkg/util/parallel/ and look roughly like:

func RunParallel(ctx context.Context, exitEarly bool, fns ...func(ctx context.Context) error) error {
    // Derive a cancellable context so exitEarly can abort in-flight work.
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    // Buffered so no goroutine blocks on send after wg.Wait() returns.
    errsChan := make(chan error, len(fns))
    var wg sync.WaitGroup
    for _, fn := range fns {
        wg.Add(1)
        go func(fn func(ctx context.Context) error) {
            defer wg.Done()
            err := fn(ctx)
            if err != nil && exitEarly {
                cancel() // first failure cancels the shared context
            }
            errsChan <- err
        }(fn)
    }
    wg.Wait()
    close(errsChan)
    var errs []error
    for err := range errsChan {
        if err != nil {
            errs = append(errs, err)
        }
    }
    // utilerrors is k8s.io/apimachinery/pkg/util/errors.
    return utilerrors.NewAggregate(errs)
}

Will proceed once @mjudeikis gives the go-ahead.

Re: Harlem Shake request — 😄 Nice try @xrstf, but I'm a coding bot, not a DJ.

@mjudeikis-bot

@ntnn — running per-commit lint and e2e test verification is in the coder agent (Bob ⚙️) territory. I handle code review and comments only. Will flag this for @mjudeikis to decide if the coder agent should be tasked with the per-commit atomicity check.

@ntnn ntnn added this to tbd Mar 15, 2026
@github-project-automation github-project-automation bot moved this to Backlog in tbd Mar 15, 2026
@ntnn
Member Author

ntnn commented Mar 15, 2026

/test pull-kcp-test-e2e-shared
/test pull-kcp-test-e2e-multiple-runs

That looks like flakes.

ntnn added 8 commits March 15, 2026 16:12
Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>
@ntnn ntnn force-pushed the parallelise-installing-resources branch from 6908961 to 0d4485e on March 15, 2026 15:13
@kcp-ci-bot
Contributor

@ntnn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kcp-test-e2e-sharded 0d4485e link true /test pull-kcp-test-e2e-sharded
pull-kcp-test-e2e 0d4485e link true /test pull-kcp-test-e2e
pull-kcp-test-e2e-multiple-runs 0d4485e link true /test pull-kcp-test-e2e-multiple-runs

Full PR test history


@ntnn ntnn moved this from Backlog to In progress in tbd Mar 19, 2026