
feat: harden Kubernetes HA (PDB, probes, NATS tuning, migration locks)#3062

Open
tlgimenes wants to merge 16 commits into main from tlgimenes/k8s-ha-brainstorm

Conversation


@tlgimenes tlgimenes commented Apr 8, 2026

What is this contribution about?

Comprehensive Kubernetes high-availability hardening across Helm chart, application code, and infrastructure documentation. Addresses pod scheduling resilience, NATS reconnection reliability, database migration safety, and graceful shutdown alignment.

Helm chart:

  • PodDisruptionBudget template (maxUnavailable: 1, conditional on multi-replica)
  • startupProbe support separating startup from liveness detection
  • Ingress template (optional, disabled by default)
  • HPA behavior rendering support for custom scale-up/down policies
  • Pod anti-affinity and dual topology spread constraints (zone + hostname)
  • preStop hook (5s sleep) + 65s termination grace period for zero-downtime deploys
  • NATS JetStream right-sizing (1Gi→512Mi memory, 10Gi→5Gi file)
  • s3Sync image pinned from latest to 2.22.35
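The conditional PodDisruptionBudget described above could be sketched roughly like this (a hedged sketch only — template helper names and value paths are assumptions, not necessarily the chart's actual ones):

```yaml
# deploy/helm/templates/pdb.yaml — sketch; rendered only when more than
# one replica can exist, so single-replica installs are not blocked from draining.
{{- if or (gt (int .Values.replicaCount) 1) .Values.autoscaling.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "chart.fullname" . }}
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "chart.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
```

`maxUnavailable: 1` lets node drains proceed one pod at a time instead of evicting all replicas at once.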

Application code fixes:

  • Fix NATS re-subscription bug: NatsNotifyStrategy.start() had an early return preventing re-subscription after reconnect, silently falling back to 5s polling
  • Advisory lock for plugin migrations using pg_advisory_xact_lock in a transaction on a pinned connection (pool-safe, auto-releases)
  • Shutdown force-exit timeout raised from 55s to 58s to fit the new 65s grace period
  • NATS client tuning: 20s ping interval (was 120s default), reconnect jitter, connection name
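For illustration, the client tuning above maps to connect options along these lines (a sketch, not the PR's exact code; option names follow nats.js `ConnectionOptions`, which takes intervals in milliseconds — the connection name and `maxReconnectAttempts` value are assumptions):

```typescript
// Sketch of the tuned NATS client options. Built as a plain object here so
// the shape is visible without a live server.
const natsOptions = {
  name: "mesh-event-bus", // assumed label; shows up in NATS monitoring for debugging
  pingInterval: 20_000,   // 20s instead of the 120s default: detect dead connections faster
  reconnectJitter: 500,   // up to 500ms of jitter to avoid a thundering herd on reconnect
  maxReconnectAttempts: -1, // retry forever (assumption; common for HA deployments)
};

// Usage (requires the `nats` package and a reachable server):
// import { connect } from "nats";
// const nc = await connect({ servers: "nats://nats:4222", ...natsOptions });
```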

Infrastructure:

  • Production values example overlay (deltas only, 3-node NATS cluster, DoNotSchedule zone spread, HPA behavior)
  • Reference manifests: External Secrets Operator (v1 API), cert-manager ClusterIssuer, NetworkPolicy example

How to Test

  1. bun run check — all workspaces pass type checking
  2. bun test apps/mesh/src/event-bus/nats-notify.test.ts — 5 tests pass (covers reconnect bug fix)
  3. helm template test deploy/helm/ --set database.engine=postgresql --set database.url=postgresql://x — renders PDB, startupProbe, topology constraints
  4. helm template prod deploy/helm/ -f deploy/helm/values-production.example.yaml --set database.url=postgresql://x — renders production overlay with 3-node NATS, DoNotSchedule, Ingress

Migration Notes

  • Helm values change: terminationGracePeriodSeconds increases from 60 to 65. A lifecycle.preStop hook with 5s sleep is now the default.
  • Helm values change: readinessProbe.failureThreshold changes from 4 to 3. livenessProbe.initialDelaySeconds changes from 30 to 0 (startupProbe now handles startup).
  • Helm values change: NATS JetStream memoryStore.maxSize reduced from 1Gi to 512Mi, fileStore.pvc.size from 10Gi to 5Gi. Existing PVCs cannot be shrunk — delete and recreate if downsizing.
  • Helm values change: s3Sync.image.tag changes from latest to 2.22.35.
  • App code: Plugin migrations now run inside a transaction with an advisory lock. No user action needed.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

Summary by cubic

Hardened Kubernetes HA across the Helm chart and app: added PDB, startup probes, anti‑affinity/spread, preStop with a 65s grace period, optional Ingress, HPA behavior, and right‑sized NATS. Also fixed a NATS re‑subscription bug, added advisory locks for plugin migrations, tuned NATS client settings, and updated tests.

  • New Features

    • Helm HA: PDB (maxUnavailable: 1 when multi‑replica/HPA), startupProbe, optional Ingress, HPA behavior, anti‑affinity + zone/host spread, preStop 5s + 65s termination grace.
    • NATS: JetStream resized (512Mi memory, 5Gi file); client ping interval 20s with reconnect jitter and named connection.
    • Examples: production values overlay (3‑node NATS, DoNotSchedule zone spread, HPA behavior) and reference manifests for External Secrets, cert‑manager, and NetworkPolicy.
  • Migration

    • terminationGracePeriodSeconds 60→65; default lifecycle.preStop added.
    • Probes: readinessProbe.failureThreshold 4→3; livenessProbe.initialDelaySeconds 30→0 (startupProbe now handles startup).
    • NATS JetStream: memory 1Gi→512Mi, file 10Gi→5Gi; shrinking needs PVC recreate.
    • s3Sync.image.tag latest → 2.22.35.

Written for commit b489fac. Summary will update on new commits.

tlgimenes and others added 16 commits April 7, 2026 21:02
Three implementation plans covering:
- Helm chart HA (PDB, probes, anti-affinity, NATS cluster, NetworkPolicy, Ingress)
- Application code fixes (NATS reconnect bug, plugin migration locks, shutdown timeout)
- Infrastructure & advanced (production values overlay, ESO, cert-manager, Karpenter)
Key changes:
- Advisory lock: use pg_advisory_xact_lock on pinned connection (pool-safe)
- Keep ScheduleAnyway as default (DoNotSchedule only in prod overlay)
- Keep readinessProbe failureThreshold at 3 (not 2)
- Remove PriorityClass and NetworkPolicy from Helm chart (YAGNI)
- Slim production overlay to deltas only, rename to .example.yaml
- Fix ESO API v1beta1 -> v1, reduce refreshInterval to 5m
- Fix test mock async iterator to terminate on unsubscribe
- Move NetworkPolicy to infrastructure examples with RFC1918 exclusions
The start() method had an early return when this.sub was non-null,
preventing re-subscription after NATS reconnects. The old subscription
object was stale, silently breaking the notify path and forcing all
event delivery to fall back to 5s polling.
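The fix can be sketched like this (class shape, subject name, and field names are assumptions reconstructed from the description above, not the actual source):

```typescript
// Hedged sketch of the re-subscription fix.
interface Subscription { unsubscribe(): void; }
interface NatsConnection { subscribe(subject: string): Subscription; }

class NatsNotifyStrategy {
  private sub: Subscription | null = null;

  start(nc: NatsConnection): void {
    // Before the fix: `if (this.sub) return;` — after a reconnect the old
    // subscription object was stale, so notifications silently stopped and
    // event delivery fell back to 5s polling.
    this.sub?.unsubscribe();                  // drop any stale subscription
    this.sub = nc.subscribe("events.notify"); // subject name is an assumption
  }
}
```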

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses pg_advisory_xact_lock on a pinned connection to prevent race
conditions when multiple pods start simultaneously. Transaction-scoped
lock auto-releases on connection return, safe with connection poolers.
…riod

The Helm chart's terminationGracePeriodSeconds increases from 60 to 65
to accommodate a 5s preStop hook. The app's internal hard timeout
increases from 55s to 58s to stay within the grace period.
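In pod-spec terms, the timing budget above might look like this (a sketch; the values come from the PR, the container name is an assumption):

```yaml
# 5s preStop + 58s app hard timeout + ~2s headroom for SIGKILL delivery
# fits inside the 65s grace period.
spec:
  terminationGracePeriodSeconds: 65
  containers:
    - name: app   # container name is an assumption
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let endpoint removal propagate before SIGTERM
```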
- pingInterval: 20s (was 120s default) for faster dead connection detection
- reconnectJitter: 500ms to prevent thundering herd on reconnect
- connection name for debugging in NATS monitoring
Includes External Secrets Operator (v1 API), cert-manager ClusterIssuer,
and NetworkPolicy example for production HA deployment.
…nection)

db.connection().execute() pins the connection but does NOT wrap in a
transaction. pg_advisory_xact_lock releases at transaction end, so with
auto-commit each statement gets its own implicit transaction and the
lock releases immediately. db.transaction().execute() both pins the
connection AND wraps in a transaction, making the lock effective.
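The distinction above can be sketched against a minimal interface (the `Db`/`Executor` types and the `withMigrationLock` helper are hypothetical, standing in for the project's actual query builder):

```typescript
// Hypothetical minimal interfaces standing in for the real query builder.
interface Executor { execute(sql: string, params?: unknown[]): Promise<void>; }
interface Db { transaction<T>(fn: (trx: Executor) => Promise<T>): Promise<T>; }

// pg_advisory_xact_lock blocks until acquired and releases at transaction end.
// Wrapping in db.transaction() (pin + BEGIN/COMMIT) makes the lock effective;
// a pinned connection alone would auto-commit each statement and release the
// lock immediately.
async function withMigrationLock(
  db: Db,
  lockKey: bigint,
  migrate: (trx: Executor) => Promise<void>,
): Promise<void> {
  await db.transaction(async (trx) => {
    await trx.execute("select pg_advisory_xact_lock($1)", [lockKey]);
    await migrate(trx); // only one pod at a time gets past the lock
  });
}
```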

github-actions bot commented Apr 8, 2026

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction  Action
👍        Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.


github-actions bot commented Apr 8, 2026

Release Options

Suggested: Minor (2.249.0) — based on feat: prefix

React with an emoji to override the release type:

Reaction  Type        Next Version
👍        Prerelease  2.248.9-alpha.1
🎉        Patch       2.248.9
❤️        Minor       2.249.0
🚀        Major       3.0.0

Current version: 2.248.8

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).


@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 19 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="deploy/infrastructure/networkpolicy.yaml">

<violation number="1" location="deploy/infrastructure/networkpolicy.yaml:26">
P1: This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</violation>
</file>

<file name="deploy/helm/values.yaml">

<violation number="1" location="deploy/helm/values.yaml:177">
P2: Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

      ports:
        - port: 3000
          protocol: TCP
    - ports:

@cubic-dev-ai cubic-dev-ai bot Apr 8, 2026


P1: This ingress rule allows port 3000 from any source, so the preceding ingress-nginx restriction no longer has any effect.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/infrastructure/networkpolicy.yaml, line 26:

<comment>This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</comment>

<file context>
@@ -0,0 +1,63 @@
+      ports:
+        - port: 3000
+          protocol: TCP
+    - ports:
+        - port: 3000
+          protocol: TCP
</file context>
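For reference, a NetworkPolicy ingress rule with no `from` clause matches traffic from all peers, which is why the second rule above defeats the first. A hedged sketch of a tightened version (the namespace label is an assumption):

```yaml
# Sketch only: a single rule keeping port 3000 restricted to ingress-nginx.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ingress-nginx   # assumed namespace label
    ports:
      - port: 3000
        protocol: TCP
```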

        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: chart-deco-studio

@cubic-dev-ai cubic-dev-ai bot Apr 8, 2026


P2: Scope the new pod anti-affinity to the release instance as well; matching only app.kubernetes.io/name makes different releases of this chart interfere with each other’s scheduling.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/helm/values.yaml, line 177:

<comment>Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</comment>

<file context>
@@ -143,21 +165,21 @@ tolerations: []
+        podAffinityTerm:
+          labelSelector:
+            matchLabels:
+              app.kubernetes.io/name: chart-deco-studio
+          topologyKey: kubernetes.io/hostname
+
</file context>
Suggested change
- app.kubernetes.io/name: chart-deco-studio
+ app.kubernetes.io/name: chart-deco-studio
+ app.kubernetes.io/instance: deco-studio
