feat: harden Kubernetes HA (PDB, probes, NATS tuning, migration locks) #3062
Three implementation plans covering:
- Helm chart HA (PDB, probes, anti-affinity, NATS cluster, NetworkPolicy, Ingress)
- Application code fixes (NATS reconnect bug, plugin migration locks, shutdown timeout)
- Infrastructure & advanced (production values overlay, ESO, cert-manager, Karpenter)
Key changes:
- Advisory lock: use `pg_advisory_xact_lock` on a pinned connection (pool-safe)
- Keep `ScheduleAnyway` as default (`DoNotSchedule` only in prod overlay)
- Keep `readinessProbe.failureThreshold` at 3 (not 2)
- Remove PriorityClass and NetworkPolicy from Helm chart (YAGNI)
- Slim production overlay to deltas only, rename to `.example.yaml`
- Fix ESO API `v1beta1` -> `v1`, reduce `refreshInterval` to 5m
- Fix test mock async iterator to terminate on unsubscribe
- Move NetworkPolicy to infrastructure examples with RFC1918 exclusions
The `start()` method had an early return when `this.sub` was non-null, preventing re-subscription after NATS reconnects. The old subscription object was stale, silently breaking the notify path and forcing all event delivery to fall back to 5s polling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
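The shape of the fix can be sketched as follows (illustrative types and names, not the actual `NatsNotifyStrategy` code): instead of returning early whenever `this.sub` is set, only skip when the subscription is still live, and otherwise discard the stale handle and re-subscribe.

```typescript
// Minimal sketch of the re-subscription fix; Subscription/Connection are
// stand-in types, not the real nats.js interfaces.
type Subscription = { closed: boolean; unsubscribe(): void };
type Connection = { subscribe(subject: string): Subscription };

class NotifyStrategy {
  private sub: Subscription | null = null;
  constructor(private conn: Connection, private subject: string) {}

  start(): void {
    // Bug was: `if (this.sub) return;` — kept a stale subscription
    // after a NATS reconnect. Only skip when the sub is still live.
    if (this.sub && !this.sub.closed) return;
    this.sub?.unsubscribe(); // drop the stale handle, if any
    this.sub = this.conn.subscribe(this.subject);
  }
}
```

Calling `start()` again after a reconnect now creates a fresh subscription instead of silently keeping the dead one.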
Uses pg_advisory_xact_lock on a pinned connection to prevent race conditions when multiple pods start simultaneously. Transaction-scoped lock auto-releases on connection return, safe with connection poolers.
…riod

The Helm chart's `terminationGracePeriodSeconds` increases from 60 to 65 to accommodate a 5s preStop hook. The app's internal hard timeout increases from 55s to 58s to stay within the grace period.
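The timing budget can be sketched as a pod-spec fragment (field names are from the standard Kubernetes pod spec; the chart's actual value keys may differ):

```yaml
# Shutdown budget: preStop(5s) + app hard timeout(58s) < grace period(65s)
terminationGracePeriodSeconds: 65   # total time after the pod enters Terminating
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]       # let load balancer endpoints drain before SIGTERM
# After preStop, the app has 60s until SIGKILL; its internal 58s hard
# timeout stays safely inside that window.
```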
- `pingInterval`: 20s (was 120s default) for faster dead connection detection
- `reconnectJitter`: 500ms to prevent thundering herd on reconnect
- connection `name` for debugging in NATS monitoring
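Together these map onto nats.js `ConnectionOptions` roughly as below (a sketch; the connection name is illustrative and the app's actual wiring may differ):

```typescript
// NATS client tuning matching the settings described above.
const opts = {
  name: "mesh-event-bus", // illustrative; appears in NATS server monitoring
  pingInterval: 20_000,   // ms: detect dead connections in ~20s vs the 120s default
  reconnectJitter: 500,   // ms: randomize reconnect delay to avoid thundering herd
};
```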
Includes External Secrets Operator (v1 API), cert-manager ClusterIssuer, and NetworkPolicy example for production HA deployment.
…nection)

`db.connection().execute()` pins the connection but does NOT wrap it in a transaction. `pg_advisory_xact_lock` releases at transaction end, so with auto-commit each statement gets its own implicit transaction and the lock releases immediately. `db.transaction().execute()` both pins the connection AND wraps it in a transaction, making the lock effective.
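The release-timing difference can be demonstrated with a toy driver (a simulation, not real PostgreSQL): an xact-scoped lock lives until the enclosing transaction commits, and under auto-commit every statement is its own transaction.

```typescript
// Toy connection that models pg_advisory_xact_lock release timing only.
class FakeConn {
  lockHeld = false;
  private inTx = false;

  execute(sql: string): void {
    if (sql.includes("pg_advisory_xact_lock")) this.lockHeld = true;
    // Auto-commit: the implicit transaction ends with the statement,
    // which releases any transaction-scoped lock immediately.
    if (!this.inTx) this.lockHeld = false;
  }
  begin(): void { this.inTx = true; }
  commit(): void { this.inTx = false; this.lockHeld = false; }
}
```

With auto-commit the lock is gone before the migration even starts; wrapped in an explicit transaction it is held until commit, which is exactly what the migration needs.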
2 issues found across 19 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="deploy/infrastructure/networkpolicy.yaml">
<violation number="1" location="deploy/infrastructure/networkpolicy.yaml:26">
P1: This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</violation>
</file>
<file name="deploy/helm/values.yaml">
<violation number="1" location="deploy/helm/values.yaml:177">
P2: Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
    ports:
      - port: 3000
        protocol: TCP
      - ports:
P1: This ingress rule allows port 3000 from any source, so the preceding ingress-nginx restriction no longer has any effect.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/infrastructure/networkpolicy.yaml, line 26:
<comment>This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</comment>
<file context>
@@ -0,0 +1,63 @@
+ ports:
+ - port: 3000
+ protocol: TCP
+ - ports:
+ - port: 3000
+ protocol: TCP
</file context>
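A hypothetical corrected rule (field names from the standard `networking.k8s.io/v1` NetworkPolicy API; the namespace label is an assumption): the port list must sit under the same rule as the `from:` selector, because a standalone `- ports:` entry with no `from:` matches traffic from any source.

```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ingress-nginx  # assumed namespace label
    ports:
      - port: 3000
        protocol: TCP
```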
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: chart-deco-studio
P2: Scope the new pod anti-affinity to the release instance as well; matching only app.kubernetes.io/name makes different releases of this chart interfere with each other’s scheduling.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/helm/values.yaml, line 177:
<comment>Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</comment>
<file context>
@@ -143,21 +165,21 @@ tolerations: []
+ podAffinityTerm:
+ labelSelector:
+ matchLabels:
+ app.kubernetes.io/name: chart-deco-studio
+ topologyKey: kubernetes.io/hostname
+
</file context>
Suggested change (add the instance label to the selector):

    app.kubernetes.io/name: chart-deco-studio
    app.kubernetes.io/instance: deco-studio
What is this contribution about?
Comprehensive Kubernetes high-availability hardening across Helm chart, application code, and infrastructure documentation. Addresses pod scheduling resilience, NATS reconnection reliability, database migration safety, and graceful shutdown alignment.
Helm chart:

- PodDisruptionBudget (`maxUnavailable: 1`, conditional on multi-replica)
- startupProbe, pod anti-affinity, and topology spread constraints
- optional Ingress and HPA scaling behavior
- right-sized NATS resources
- `s3Sync.image.tag` pinned from `latest` to `2.22.35`

Application code fixes:

- `NatsNotifyStrategy.start()` had an early return preventing re-subscription after reconnect, silently falling back to 5s polling
- Plugin migrations now take `pg_advisory_xact_lock` in a transaction on a pinned connection (pool-safe, auto-releases)

Infrastructure:

- Production values overlay example, External Secrets Operator (v1 API), cert-manager ClusterIssuer, and NetworkPolicy example
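The conditional PDB can be sketched as a Helm template (the `chart-deco-studio.fullname` helper and label values are assumptions, not taken from the actual chart):

```yaml
# Rendered only when running more than one replica, so single-replica
# dev installs are not blocked from voluntary disruptions.
{{- if gt (int .Values.replicaCount) 1 }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "chart-deco-studio.fullname" . }}
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: chart-deco-studio
      app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
```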
How to Test
- `bun run check` — all workspaces pass type checking
- `bun test apps/mesh/src/event-bus/nats-notify.test.ts` — 5 tests pass (covers reconnect bug fix)
- `helm template test deploy/helm/ --set database.engine=postgresql --set database.url=postgresql://x` — renders PDB, startupProbe, topology constraints
- `helm template prod deploy/helm/ -f deploy/helm/values-production.example.yaml --set database.url=postgresql://x` — renders production overlay with 3-node NATS, DoNotSchedule, Ingress

Migration Notes

- `terminationGracePeriodSeconds` increases from 60 to 65. A `lifecycle.preStop` hook with 5s sleep is now the default.
- `readinessProbe.failureThreshold` changes from 4 to 3.
- `livenessProbe.initialDelaySeconds` changes from 30 to 0 (startupProbe now handles startup).
- NATS `memoryStore.maxSize` reduced from 1Gi to 512Mi, `fileStore.pvc.size` from 10Gi to 5Gi. Existing PVCs cannot be shrunk; delete and recreate if downsizing.
- `s3Sync.image.tag` changes from `latest` to `2.22.35`.

Review Checklist
Summary by cubic
Hardened Kubernetes HA across the Helm chart and app: added PDB, startup probes, anti‑affinity/spread, preStop with a 65s grace period, optional Ingress, HPA behavior, and right‑sized NATS. Also fixed a NATS re‑subscription bug, added advisory locks for plugin migrations, tuned NATS client settings, and updated tests.
New Features
- PDB (`maxUnavailable: 1` when multi-replica/HPA), startupProbe, optional Ingress, HPA behavior, anti-affinity + zone/host spread, preStop 5s + 65s termination grace.

Migration
- `terminationGracePeriodSeconds` 60→65; default `lifecycle.preStop` added.
- `readinessProbe.failureThreshold` 4→3; `livenessProbe.initialDelaySeconds` 30→0 (startupProbe now handles startup).
- `s3Sync.image.tag` `latest`→`2.22.35`.

Written for commit b489fac. Summary will update on new commits.