orkspace · iAlexeze · May 31, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,10 +16,10 @@
 - `SetStarted()` was unconditionally setting `pending=true` on every worker start, including resync-triggered restarts — overwriting the `pending=false` set by `RecordSuccess()`. Now only sets pending if the CRD has not yet successfully reconciled.
 
 **CRD showing "not started" or "degraded" under network lag**
-- `PostStartRetryInterval` was left at 3 seconds (a debug value) instead of the intended 30 seconds. This caused the retry loop to hit the API server every 3 seconds across all CRDs continuously.
+- `postStartRetryInterval` was left at 3 seconds (a debug value) instead of the intended 30 seconds. This caused the retry loop to hit the API server every 3 seconds across all CRDs continuously.
 - `crdExists()` collapsed all errors — including network timeouts and dial failures — into `false`, treating any transient API server hiccup as "CRD disappeared." Phase 1 (runtime disappearance check) and Phase 2 (missing CRD activation) would then call `SetMissingAtRuntime()` + `SetDegraded()`, flipping healthy CRDs to degraded and pending CRDs to "not started."
 - Fixed: `crdExists()` now returns a tri-state — `(true, nil)` exists, `(false, nil)` definitively absent, `(false, err)` transient. All three callers skip state changes when `err != nil`.
-- `PostStartRetryInterval` restored to 90 seconds with exponential backoff capped at 5 minutes.
+- `postStartRetryInterval` restored to 90 seconds with exponential backoff capped at 5 minutes.
 
 ### RBAC — Namespace-Scoped ClusterRole Names
 

diff --git a/charts/orkestra/Chart.yaml b/charts/orkestra/Chart.yaml
@@ -98,7 +98,7 @@ annotations:
     - kind: fixed
       description: "crdExists() now distinguishes transient network errors from genuine CRD absence"
     - kind: fixed
-      description: "PostStartRetryInterval restored from 3s (debug) to 90s with 5-minute capped backoff"
+      description: "postStartRetryInterval restored from 3s (debug) to 90s with 5-minute capped backoff"
     - kind: fixed
       description: "ClusterRole/ClusterRoleBinding names scoped to orkestra-<namespace> — no more last-write-wins collision"
     - kind: fixed

diff --git a/documentation/concepts/simulate/index.md b/documentation/concepts/simulate/index.md
@@ -0,0 +1,116 @@
+# ork simulate
+
+`ork simulate` runs the operator reconcile loop against a fake in-memory cluster. No Kubernetes cluster, no `kubectl`, no network.
+
+---
+
+## Why it exists
+
+Most operator frameworks give you two options for testing: write unit tests that mock the Kubernetes client (which do not reflect real merge-patch semantics), or spin up a kind cluster and apply real CRs (which is slow and requires environment setup).
+
+`ork simulate` takes a third path: it runs the same `GenericReconciler` that runs in production, but wires it to a fake in-memory Kubernetes store. This means:
+
+- The template engine executes exactly as it would in production
+- `when:` conditions are evaluated
+- Status propagation runs
+- `onCreate` and `onReconcile` blocks both execute, in order
+- Steady state is detected when two consecutive cycles produce identical operations
+
+The output tells you exactly what the operator would create, update, or delete — before you write a single CRD manifest or touch a cluster.
+
+---
+
+## The key property: same reconciler, no cluster
+
+What makes `ork simulate` more than a dry-run is that it does not approximate the reconciler. It *is* the reconciler. Every reconciler code path that runs when a real CR lands in a real cluster also runs in simulation; only the Kubernetes client underneath is replaced by the in-memory fake.
+
+This means simulation catches:
+
+- Template expression errors (bad field references, type mismatches)
+- Missing `when:` guards that cause unintended resource creation on first reconcile
+- Status field propagation that references children not yet created
+- `onReconcile` blocks that produce different output than `onCreate` when they should be identical
+
+It does not catch:
+
+- Admission webhook behavior (the fake cluster does not run webhooks)
+- Actual Kubernetes API server behavior (merge-patch edge cases, field ownership)
+- External HTTP responses — `external:` blocks produce empty responses in simulation
+
+Use `ork simulate` to verify your templates are correct. Use `ork e2e` to verify your operator behaves correctly end-to-end against a real cluster.
+
+---
+
+## Running it
+
+```bash
+# Standard layout — katalog.yaml + cr.yaml in current directory
+ork simulate
+
+# Non-standard filenames
+ork simulate -f my-operator.yaml --cr my-cr.yaml
+
+# Simulate one CRD from a multi-CRD Katalog
+ork simulate --crd website
+
+# Run exactly 5 cycles (default is 10)
+ork simulate --cycles 5
+```
+
+## Reading the output
+
+```
+Simulating website/my-site
+
+  Cycle 1:
+    + deployments/my-site
+    + services/my-site
+    ~ status/my-site
+
+  Cycle 2:
+    ~ status/my-site
+
+  (cycles 3–10: identical)
+
+  ✓ Steady state at cycle 3 in 189ms
+```
+
+`+` — resource created. `~` — resource updated. `-` — resource deleted.
+
+`status/...` appearing on every cycle is expected — status fields are re-evaluated on every reconcile. Everything else should reach steady state within 2–3 cycles. If a resource keeps appearing as `+` after cycle 1, the reconciler is re-creating it instead of finding it stable — this usually means a template expression produces a different value on each evaluation (e.g., `now()` in a timestamp field).
+
+Consecutive identical cycles are collapsed in the output. Steady state is noted but does not stop the run when `--cycles` is set explicitly.
+
+---
+
+## Workflow: write → simulate → cluster
+
+The recommended development loop for a new operator:
+
+1. Write the Katalog and a minimal CR
+2. Run `ork simulate` — verify the right resources appear in cycle 1
+3. Check that cycle 2 shows only `status/...` (no unexpected re-creation)
+4. Adjust `when:` conditions, field references, or template expressions as needed
+5. When simulation is clean, run against a real cluster with `ork run`
+
+For anything more complex — state machines, dependencies between CRDs, admission webhooks — write an `ork e2e` spec that tests the full lifecycle. Simulation covers the template logic; E2E covers the system behavior.
+
+---
+
+## Relation to ork e2e
+
+| | `ork simulate` | `ork e2e` |
+|---|---|---|
+| Requires cluster | No | Yes (kind or existing) |
+| Runs real reconciler | Yes | Yes |
+| Tests webhooks | No | Yes |
+| Tests external calls | No (empty responses) | Yes |
+| Tests health transitions | No | Yes |
+| Speed | Milliseconds | Minutes |
+| Best for | Template correctness | System correctness |
+
+Use both. Simulate is the fast inner loop. E2E is the outer verification gate that runs before `ork registry push`.
+
+---
+
+→ See also: [`ork simulate` CLI reference](../../reference/cli/05-simulate.md), [`ork e2e`](../../reference/cli/08-e2e.md)
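
The steady-state rule described in the new page (two consecutive cycles producing identical operations) can be sketched as below. This is an illustration only: `sameOps` and `steadyStateCycle` are hypothetical names, operations are modelled as plain strings, and comparing them in order is an assumption about how the simulator decides two cycles are identical.

```go
package main

import "fmt"

// sameOps reports whether two cycles recorded exactly the same operations
// in the same order.
func sameOps(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// steadyStateCycle returns the 1-based number of the first cycle whose
// operations match the previous cycle exactly, or 0 if no such cycle exists.
func steadyStateCycle(cycles [][]string) int {
	for i := 1; i < len(cycles); i++ {
		if sameOps(cycles[i-1], cycles[i]) {
			return i + 1
		}
	}
	return 0
}

func main() {
	// Mirrors the sample output in the page: cycle 1 creates resources,
	// cycles 2 and 3 only update status.
	cycles := [][]string{
		{"+ deployments/my-site", "+ services/my-site", "~ status/my-site"},
		{"~ status/my-site"},
		{"~ status/my-site"},
	}
	fmt.Println("steady state at cycle", steadyStateCycle(cycles))
}
```

With the sample data, cycle 3 is the first cycle that repeats its predecessor, which matches the "Steady state at cycle 3" line in the documented output.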
diff --git a/documentation/reference/cli/08-e2e.md b/documentation/reference/cli/08-e2e.md
@@ -107,4 +107,4 @@ FAIL  hello-website-e2e  (62s)
 
 ## E2E spec reference
 
-→ [documentation/reference/schema/04-e2e/](../schema/04-e2e/)
+→ [documentation/reference/schema/04-e2e/](../schema/04-e2e/index.md)
diff --git a/documentation/roadmap.md b/documentation/roadmap.md
@@ -45,7 +45,7 @@ Orkestra is a complete declarative operator runtime for Kubernetes. The core is
 
 **CLI**
 
-`ork init`, `ork run`, `ork gate`, `ork validate`, `ork template`, `ork simulate`, `ork plan`, `ork diff`, `ork generate`, `ork registry`, `ork control`, `ork notes`, `ork e2e`, `ork deploy`, `ork tunnel`, `ork version`
+`ork init`, `ork run`, `ork gate`, `ork validate`, `ork template`, `ork simulate`, `ork plan`, `ork diff`, `ork generate`, `ork registry`, `ork control`, `ork notes`, `ork e2e`, `ork version`
 
 **Distribution**
 
@@ -106,6 +106,32 @@ func main() {
 
 They get the full runtime, gateway, CLI, and webhook system. If they need a custom webhook, they know exactly where to plug it in. Two things needed: a version-pinned `go.mod` import and this entrypoint.
 
+### ork lint
+
+`ork validate` checks schema correctness — the document is well-formed. `ork lint` checks semantic correctness — the document is safe and sound for your deployment context.
+
+```bash
+ork lint -f katalog.yaml
+ork lint -f katalog.yaml --policy org-policy.yaml
+```
+
+Examples of what lint catches that validate cannot:
+
+- A Deployment with no resource requests (will be evicted under pressure)
+- A ServiceAccount bound to cluster-wide verbs (over-privileged)
+- A Secret with no rotation policy declared
+- A CRD with `condition: healthy` on a dependency that has a history of degradation
+
+Lint runs at CI time, not author time. It is a different gate — closer to `golangci-lint` than to `go vet`.
+
+### Namespaced katalogs
+
+Today, the merger merges all Katalog sources into one flat runtime Katalog. Under this proposal, a Katalog with `namespace: platform-team` would stay scoped: the merger would produce `map[namespace]*Katalog` instead of one merged output. Each namespaced Katalog would run in its own reconciler scope with independent health tracking, independent workers, and real isolation from other namespaces.
+
+The Control Center shows each namespace as a separate panel — from its perspective, namespaced Katalogs look like separate runtimes.
+
+This makes Orkestra usable as a shared platform primitive: one Orkestra instance, multiple teams with real isolation, no cross-contamination when one team's CRD degrades.
+
 ### Performance benchmarks
 
 Published numbers for reconcile throughput, queue latency, and informer memory usage at 50+ and 100+ CRDs. Stress test results with quality gates.
@@ -118,6 +144,26 @@ Target 2027. Prerequisite is production adoption at multiple organisations, with
 
 ## The longer horizon
 
+### Declarative canary rollouts
+
+A `rollout:` block in the Katalog gates how a template change propagates:
+
+```yaml
+operatorBox:
+  rollout:
+    strategy: canary
+    initialWeight: 10
+    increment: 20
+    interval: 5m
+    gate:
+      metric: error_rate
+      threshold: "< 1"
+```
+
+Orkestra manages the weight split, polls the gate condition (using the same expression engine as `when:`), and advances or rolls back automatically. The substrate already has all the pieces — template engine, health model, conditional evaluation. Canary is applying them to a new lifecycle concern.
+
+### Katalog and Komposer as native Kubernetes kinds
+
 Katalog and Komposer as native Kubernetes kinds — registered by the cluster itself, understood by `kube-controller-manager`, auditable through the standard Kubernetes audit log.
 
 ```bash
@@ -134,7 +180,7 @@ for the full argument.
 
 ## What we are not building
 
-**Multi-cluster federation.** Orkestra manages CRDs within one cluster. Cross-cluster operations belong to a different architectural layer.
+**Multi-cluster federation.** Orkestra manages CRDs within one cluster. Cross-cluster *composition* already works today: `cross:` reads sibling operator state over HTTP, and `external:` can gate a reconcile on a remote operator's health endpoint. Per-cluster Orkestra instances compose at runtime. What we are not building is a control plane deployed in one cluster that federates multiple clusters.
 
 **Replacing controller-runtime.** Orkestra is a higher-level abstraction. Custom constructors bridge to controller-runtime for use cases that need it. They are complementary, not competitive.
 

diff --git a/pkg/kordinator/constants.go b/pkg/kordinator/constants.go
@@ -9,7 +9,8 @@ const (
 	DefaultDependencyInterval = 10 * time.Second
 
 	// PostStart Retry loop
-	PostStartRetryInterval        = 90 * time.Second
+	postStartRetryInterval        = 90 * time.Second // in-cluster (prod)
+	postStartRetryIntervalDev     = 10 * time.Second // local dev (not in pod)
 	PostStartBackoff              = 5 * time.Second
 	PostStartBackoffMax           = 5 * time.Minute
 	DependencyHealthCheckInterval = 10 * time.Second

diff --git a/pkg/kordinator/dependency_kordinator.go b/pkg/kordinator/dependency_kordinator.go
@@ -17,7 +17,7 @@ Startup:
   - Kordinate continues (does NOT block) because missing CRDs are skipped
   - Retry loop starts in background
 
-Retry loop (every PostStartRetryInterval):
+Retry loop (every postStartRetryInterval):
   - Phase 1: checks missing map
     - finds A is missing
     - calls utils.WaitForCRD() → false

diff --git a/pkg/kordinator/docs/04-self-healing.md b/pkg/kordinator/docs/04-self-healing.md
@@ -9,7 +9,7 @@ Both run for the operator's lifetime and stop only when the context is cancelled
 
 ## retryMissingCRDs
 
-Runs on `PostStartRetryInterval`. Each tick executes four phases in order.
+Runs on `postStartRetryInterval` (90s) in-cluster, or on `postStartRetryIntervalDev` (10s) when running locally via `ork run`. The interval is selected once at startup using `utils.IsRunningInCluster()`; no configuration is required. Each tick executes four phases in order.
 
 ### Phase 1 — Detect runtime disappearances
 

diff --git a/pkg/kordinator/post_start_hooks.go b/pkg/kordinator/post_start_hooks.go
@@ -13,54 +13,14 @@ import (
 	"k8s.io/apimachinery/pkg/runtime/schema"
 )
 
-// retryMissingCRDs runs continuously to detect and activate CRDs that were missing at startup.
-//
-// It runs forever because CRDs can be installed after Orkestra starts.
-// The loop stops only when the context is cancelled (leadership lost or shutdown).
-//
-// Flow:
-//   - Periodically checks the missing map
-//   - When a missing CRD appears in the cluster, activateCRD is called
-//   - Uses exponential backoff to avoid API server pressure
-//
-// Note: This loop handles activation only. Deactivation is not implemented —
-//
-//	if a CRD is deleted after startup, the informer continues running.
-//	But workers are drained through deactivateCRD.
-// dependenciesReady returns true if all declared dependencies are currently
-// satisfied (i.e., the required channel is already closed).
-// This check is non‑blocking.
-// func (k *DependencyKordinator) dependenciesReady(crd orktypes.CRDEntry, nameToGVK map[string]string) bool {
-// 	for depName, depCond := range crd.DependsOn {
-// 		depGVK, ok := nameToGVK[depName]
-// 		if !ok {
-// 			logger.Error().Str("crd", crd.Name).Str("dependency", depName).Msg("dependency GVK not found")
-// 			return false
-// 		}
-// 		switch strings.ToLower(depCond.Condition) {
-// 		case string(orktypes.DependencyConditionHealthy):
-// 			select {
-// 			case <-k.healthyCh[depGVK]:
-// 				// channel closed → dependency healthy
-// 			default:
-// 				return false
-// 			}
-// 		default: // started
-// 			select {
-// 			case <-k.startedCh[depGVK]:
-// 				// channel closed → dependency started
-// 			default:
-// 				return false
-// 			}
-// 		}
-// 	}
-// 	return true
-// }
-
 // retryMissingCRDs runs continuously to detect and activate CRDs that were missing at startup
 // or deferred because dependencies were not ready.
 func (k *DependencyKordinator) retryMissingCRDs(ctx context.Context) {
-	ticker := time.NewTicker(PostStartRetryInterval)
+	retryInterval := postStartRetryInterval
+	if !utils.IsRunningInCluster() {
+		retryInterval = postStartRetryIntervalDev
+	}
+	ticker := time.NewTicker(retryInterval)
 	defer ticker.Stop()
 
 	backoff := PostStartBackoff

diff --git a/pkg/simulate/kubeclient.go b/pkg/simulate/kubeclient.go
@@ -215,7 +215,10 @@ func (f *FakeKubeclient) PatchFinalizers(_ context.Context, obj runtime.Object,
 	return nil
 }
 
-func (f *FakeKubeclient) PatchLabels(_ context.Context, obj runtime.Object, _ schema.GroupVersionResource, _, desired map[string]string) error {
+func (f *FakeKubeclient) PatchLabels(_ context.Context, obj runtime.Object, _ schema.GroupVersionResource, base, desired map[string]string) error {
+	if stringMapsEqual(base, desired) {
+		return nil
+	}
 	f.mu.Lock()
 	f.ops = append(f.ops, Op{
 		Cycle:    f.currentCycle,
@@ -225,8 +228,6 @@ func (f *FakeKubeclient) PatchLabels(_ context.Context, obj runtime.Object, _ sc
 		At:       time.Now(),
 	})
 	f.mu.Unlock()
-	// Persist to the in-memory object so subsequent cycles see the update
-	// and the idempotency guard in Ensure**Label skips the patch.
 	if mo, ok := obj.(metav1.Object); ok {
 		mo.SetLabels(desired)
 	}
@@ -264,6 +265,18 @@ func (f *FakeKubeclient) PatchStatus(_ context.Context, obj domain.Object, _ sch
 	return nil
 }
 
+func stringMapsEqual(a, b map[string]string) bool {
+	if len(a) != len(b) {
+		return false
+	}
+	for k, v := range a {
+		if bv, ok := b[k]; !ok || bv != v {
+			return false
+		}
+	}
+	return true
+}
+
 func nameFromRuntimeObject(obj runtime.Object) string {
 	if acc, ok := obj.(interface{ GetName() string }); ok {
 		return acc.GetName()
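
The no-op guard added to `PatchLabels` rests on a map comparison. A minimal sketch of such a comparison is shown below, using a presence check so that a key with an empty value is not confused with a missing key (Kubernetes label values may legitimately be empty). `labelsEqual` is a hypothetical name used for illustration and is not part of this PR.

```go
package main

import "fmt"

// labelsEqual reports whether two label maps hold the same keys and values.
// A nil map and an empty map compare equal. The presence check matters when
// a value is the empty string: indexing a missing key also yields "".
func labelsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		bv, ok := b[k]
		if !ok || bv != v {
			return false
		}
	}
	return true
}

func main() {
	base := map[string]string{"app": "my-site"}
	same := map[string]string{"app": "my-site"}
	changed := map[string]string{"app": "other"}

	fmt.Println(labelsEqual(base, same))    // identical content: patch can be skipped
	fmt.Println(labelsEqual(base, changed)) // value differs: patch is needed
	fmt.Println(labelsEqual(nil, map[string]string{}))
	fmt.Println(labelsEqual(
		map[string]string{"x": ""},
		map[string]string{"y": ""},
	)) // different keys, both with empty values
}
```

Skipping the patch when the maps are equal keeps repeated reconcile cycles from recording a label operation that changes nothing.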