fix: Tiltfile consolidation of third-party operators, add readiness gates, and improve dev tooling #133

Merged
j7m4 merged 16 commits into v2 from j7m4/tilt-infra-operator-work on Feb 27, 2026

Conversation

@j7m4 j7m4 commented Feb 27, 2026

Summary

  • Consolidate third-party operator installs: Replace 7 individual helm_repo/helm_resource calls with a single helm_resource('third-party-operators') backed by the deploy/operator umbrella chart (wandb-operator.enabled=false), mirroring how operators are installed in production
  • Add explicit readiness gates: operator-crds-ready, vm-crds-ready, grafana-crds-ready, and vm-operator-ready gates prevent resources from being applied before their CRDs or operator webhooks are ready — eliminating fresh-cluster race conditions
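  • Tighten resource dependencies: Add the missing RBAC → codegen dep, document the intended Operator-Certs → cert-manager ordering (it cannot be declared as a dep because deploy_cert_manager() registers no named Tilt resource), and remove redundant transitive deps with explanatory comments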
  • Fix Helm umbrella chart: Guard webhooks, certificates, and Wandb CR behind wandb-operator.enabled using not (eq ... false) to avoid Helm's false | default true trap
  • Kind cluster utilities: Add hack/scripts/kind-images-manager.sh (scrape/pull/load subcommands for caching images across cluster recreations) and improve scripts/setup_kind.sh with single-node vs multi-node profile selection
  • Dependency graph docs: Add docs/design/wandb_v2/tilt.md with a Mermaid graph of all Tilt resource dependencies, plus agent instructions for keeping it in sync with the Tiltfile
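
The `false | default true` trap named in the Helm bullet above follows from how Sprig's `default` treats a boolean false as an empty value. The following is a minimal Python model of the two template expressions, not Helm code: the helper names are hypothetical, and only explicit true/false values are compared (unset values are deliberately left out of the second helper).

```python
# Illustrative model only: these helpers mimic template semantics,
# they are not part of Helm, Sprig, or the chart in this PR.

def sprig_default(fallback, value):
    """Model of `value | default fallback`.

    Sprig treats nil, false, 0, "" and empty collections as empty,
    so all of them are replaced by the fallback.
    """
    return value if value else fallback


def enabled_via_default(value):
    # Models: {{ if VALUE | default true }}
    return bool(sprig_default(True, value))


def enabled_via_not_eq_false(value):
    # Models: {{ if not (eq VALUE false) }} for an explicit boolean VALUE
    return not (value is False)


if __name__ == "__main__":
    for explicit in (True, False):
        print(explicit,
              enabled_via_default(explicit),
              enabled_via_not_eq_false(explicit))
```

With an explicit `false`, the `default` form still evaluates to enabled, while the `not (eq ... false)` form evaluates to disabled, which is the behavior the guard needs when `wandb-operator.enabled=false` is passed.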

Test plan

  • tilt up from a fresh kind cluster reaches a healthy state without manual intervention
  • Telemetry stack (installTelemetry: true) starts cleanly with all readiness gates passing in order
  • hack/scripts/kind-images-manager.sh scrape + pull + load round-trip works against a running cluster
  • scripts/setup_kind.sh creates a single-node cluster for dev CRDs and a multi-node cluster for non-dev CRDs
  • Mermaid diagram in docs/design/wandb_v2/tilt.md renders correctly on GitHub

🤖 Generated with Claude Code

j7m4 and others added 16 commits February 25, 2026 17:30
On a fresh cluster, victoria-metrics and grafana CRDs aren't established
when Tilt tries to apply telemetry resources, even with resource_deps set.
Add explicit kubectl wait gates (vm-crds-ready, grafana-crds-ready) so
the k8s_yaml resources are only applied after their CRDs are established.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
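
The CRD gate in the commit above relies on `kubectl wait` blocking until each CRD reports the `Established` condition. A minimal sketch of how such a command could be assembled is shown here; the function name, the default timeout, and the CRD name in the usage line are assumptions for illustration, and the actual Tiltfile wiring is not shown in this PR page.

```python
import shlex


def crd_wait_command(crd_names, timeout_seconds=120):
    """Build a `kubectl wait` command that blocks until every named CRD
    reports the Established condition (or the timeout expires)."""
    if not crd_names:
        raise ValueError("at least one CRD name is required")
    cmd = [
        "kubectl", "wait",
        "--for=condition=Established",
        f"--timeout={timeout_seconds}s",
    ]
    cmd += [f"crd/{name}" for name in crd_names]
    return cmd


if __name__ == "__main__":
    # Hypothetical CRD name, used only to show the resulting command line.
    print(shlex.join(crd_wait_command(["vmsingles.operator.victoriametrics.com"])))
```

A gate built this way only proves that the API server has accepted the CRD; it says nothing about whether the operator that serves the CRD is running.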
On a fresh cluster the Application and WeightsAndBiases CRDs applied by
kustomize may not be established by the time the controller manager tries
to start. Add operator-crds-ready gate using kubectl wait.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etrics

infra-metrics-dev.yaml contains a headless Service and VMServiceScrape for
ClickHouse that were not listed in the Infrastructure-Metrics k8s_resource
objects. Tilt was grouping them as unmatched resources alongside unrelated
kustomize cert-manager objects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
make manifests regenerates CRD YAML files, causing Tilt to re-apply them
concurrently with the initial apply. This produces a resourceVersion
conflict ("object has been modified"). Waiting for manifests and generate
to finish ensures CRDs are applied once with up-to-date content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vm-crds-ready only ensures CRDs are established, not that the operator
pod is running. The victoria-metrics-operator registers a validating
webhook for VMSingle/VLSingle/VTSingle — applying those resources while
the pod is still starting causes "connection refused" on the webhook.
Add vm-operator-ready gate using kubectl rollout status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
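
The operator gate in the commit above waits on the Deployment rollout rather than on CRDs, so the webhook has a running pod behind it before resources are applied. A small sketch of the corresponding `kubectl rollout status` invocation follows; the function name, the deployment name, and the namespace are hypothetical and not taken from the Tiltfile.

```python
def rollout_wait_command(deployment, namespace, timeout_seconds=120):
    """Build a `kubectl rollout status` command that blocks until the
    Deployment has finished rolling out (or the timeout expires)."""
    return [
        "kubectl", "rollout", "status",
        f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout_seconds}s",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(rollout_wait_command("vm-operator", "monitoring")))
```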
The namespace, certificates, and issuer from kustomize build were not
assigned to any k8s_resource, causing Tilt to display them unlabeled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/setup_kind.sh: add DEV_PROFILE flag; use single-node cluster
  when wandbCRD contains 'dev', multi-node otherwise
- hack/scripts/kind-images-manager.sh: scrape/pull/load utility for
  caching cluster images locally and loading into kind
- .gitignore: ignore .k8s-images artifact

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bgraph

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Operator-Certs now waits for cert-manager before applying certificates/issuer
- RBAC now waits for manifests/generate before applying generated roles
- Remove redundant direct manifests/generate dep from operator-controller-manager
  (transitively satisfied via operator-crds-ready)
- Remove redundant vm-crds-ready dep from metrics resources
  (transitively satisfied via Victoria-Metrics → vm-operator-ready)
- Add comment on Wandb CRD third-party-operators label explaining intent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
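
The "transitively satisfied" reasoning in the commit above can be checked mechanically: a direct dep is redundant when another direct dep of the same resource already depends on it. The sketch below is one way to find such edges, assuming the dependency graph is available as a plain dict; the function names are hypothetical, and the edges from operator-crds-ready to manifests and generate are assumed for the example rather than confirmed by the Tiltfile.

```python
def reachable(graph, start):
    """Return every resource that `start` depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def redundant_deps(graph):
    """Map each resource to the direct deps already implied by a sibling dep."""
    redundant = {}
    for resource, deps in graph.items():
        extra = {
            dep
            for dep in deps
            if any(dep in reachable(graph, other) for other in deps if other != dep)
        }
        if extra:
            redundant[resource] = extra
    return redundant


if __name__ == "__main__":
    # Resource names come from the commit message; the operator-crds-ready
    # edges are an assumption made for this example.
    graph = {
        "operator-controller-manager": ["operator-crds-ready", "manifests", "generate"],
        "operator-crds-ready": ["manifests", "generate"],
        "manifests": [],
        "generate": [],
    }
    print(redundant_deps(graph))
```

Dropping a redundant edge does not change the ordering Tilt enforces, which is why the removed deps were replaced with explanatory comments instead.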
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy_cert_manager() runs local() commands only and registers no named
Tilt resource, so the dependency cannot be declared.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add codegen ==> rbac (RBAC gained manifests/generate deps)
- Remove codegen ==> controller (dep dropped as transitive via operator-crds-ready)
- Remove vm_crds --> kube_metrics/op_metrics/infra_metrics (deps dropped as transitive)
- Fix Victoria Stack section comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 27, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@j7m4 j7m4 changed the title from "Tiltfile: consolidate third-party operators, add readiness gates, and improve dev tooling" to "fix: Tiltfile consolidation of third-party operators, add readiness gates, and improve dev tooling" Feb 27, 2026
@j7m4 j7m4 merged commit 075390f into v2 Feb 27, 2026
5 of 6 checks passed
@j7m4 j7m4 deleted the j7m4/tilt-infra-operator-work branch February 27, 2026 16:12