
feat: harden Kubernetes HA (PDB, probes, NATS tuning, migration locks)#3062

Open
tlgimenes wants to merge 16 commits into main from tlgimenes/k8s-ha-brainstorm

Conversation


@tlgimenes tlgimenes commented Apr 8, 2026

What is this contribution about?

Comprehensive Kubernetes high-availability hardening across Helm chart, application code, and infrastructure documentation. Addresses pod scheduling resilience, NATS reconnection reliability, database migration safety, and graceful shutdown alignment.

Helm chart:

  • PodDisruptionBudget template (maxUnavailable: 1, conditional on multi-replica)
  • startupProbe support separating startup from liveness detection
  • Ingress template (optional, disabled by default)
  • HPA behavior rendering support for custom scale-up/down policies
  • Pod anti-affinity and dual topology spread constraints (zone + hostname)
  • preStop hook (5s sleep) + 65s termination grace period for zero-downtime deploys
  • NATS JetStream right-sizing (1Gi→512Mi memory, 10Gi→5Gi file)
  • s3Sync image pinned from latest to 2.22.35
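The conditional PodDisruptionBudget described above could be sketched roughly like this (a hedged sketch only — template helper names and value paths are assumptions, not necessarily the chart's actual ones):

```yaml
# deploy/helm/templates/pdb.yaml — sketch; rendered only when more than
# one replica can exist, so single-replica installs are not blocked from draining.
{{- if or (gt (int .Values.replicaCount) 1) .Values.autoscaling.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "chart.fullname" . }}
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "chart.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
```

`maxUnavailable: 1` lets node drains proceed one pod at a time instead of evicting all replicas at once.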

Application code fixes:

  • Fix NATS re-subscription bug: NatsNotifyStrategy.start() had an early return preventing re-subscription after reconnect, silently falling back to 5s polling
  • Advisory lock for plugin migrations using pg_advisory_xact_lock in a transaction on a pinned connection (pool-safe, auto-releases)
  • Shutdown force-exit timeout raised from 55s to 58s to fit the new 65s grace period
  • NATS client tuning: 20s ping interval (was 120s default), reconnect jitter, connection name
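For illustration, the client tuning above maps to connect options along these lines (a sketch, not the PR's exact code; option names follow nats.js `ConnectionOptions`, which takes intervals in milliseconds — the connection name and `maxReconnectAttempts` value are assumptions):

```typescript
// Sketch of the tuned NATS client options. Built as a plain object here so
// the shape is visible without a live server.
const natsOptions = {
  name: "mesh-event-bus", // assumed label; shows up in NATS monitoring for debugging
  pingInterval: 20_000,   // 20s instead of the 120s default: detect dead connections faster
  reconnectJitter: 500,   // up to 500ms of jitter to avoid a thundering herd on reconnect
  maxReconnectAttempts: -1, // retry forever (assumption; common for HA deployments)
};

// Usage (requires the `nats` package and a reachable server):
// import { connect } from "nats";
// const nc = await connect({ servers: "nats://nats:4222", ...natsOptions });
```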

Infrastructure:

  • Production values example overlay (deltas only, 3-node NATS cluster, DoNotSchedule zone spread, HPA behavior)
  • Reference manifests: External Secrets Operator (v1 API), cert-manager ClusterIssuer, NetworkPolicy example

How to Test

  1. bun run check — all workspaces pass type checking
  2. bun test apps/mesh/src/event-bus/nats-notify.test.ts — 5 tests pass (covers reconnect bug fix)
  3. helm template test deploy/helm/ --set database.engine=postgresql --set database.url=postgresql://x — renders PDB, startupProbe, topology constraints
  4. helm template prod deploy/helm/ -f deploy/helm/values-production.example.yaml --set database.url=postgresql://x — renders production overlay with 3-node NATS, DoNotSchedule, Ingress

Migration Notes

  • Helm values change: terminationGracePeriodSeconds increases from 60 to 65. A lifecycle.preStop hook with 5s sleep is now the default.
  • Helm values change: readinessProbe.failureThreshold changes from 4 to 3. livenessProbe.initialDelaySeconds changes from 30 to 0 (startupProbe now handles startup).
  • Helm values change: NATS JetStream memoryStore.maxSize reduced from 1Gi to 512Mi, fileStore.pvc.size from 10Gi to 5Gi. Existing PVCs cannot be shrunk — delete and recreate if downsizing.
  • Helm values change: s3Sync.image.tag changes from latest to 2.22.35.
  • App code: Plugin migrations now run inside a transaction with an advisory lock. No user action needed.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

Summary by cubic

Hardened Kubernetes HA across the Helm chart and app: added PDB, startup probes, anti‑affinity/spread, preStop with a 65s grace period, optional Ingress, HPA behavior, and right‑sized NATS. Also fixed a NATS re‑subscription bug, added advisory locks for plugin migrations, tuned NATS client settings, and updated tests.

  • New Features

    • Helm HA: PDB (maxUnavailable: 1 when multi‑replica/HPA), startupProbe, optional Ingress, HPA behavior, anti‑affinity + zone/host spread, preStop 5s + 65s termination grace.
    • NATS: JetStream resized (512Mi memory, 5Gi file); client ping interval 20s with reconnect jitter and named connection.
    • Examples: production values overlay (3‑node NATS, DoNotSchedule zone spread, HPA behavior) and reference manifests for External Secrets, cert‑manager, and NetworkPolicy.
  • Migration

    • terminationGracePeriodSeconds 60→65; default lifecycle.preStop added.
    • Probes: readinessProbe.failureThreshold 4→3; livenessProbe.initialDelaySeconds 30→0 (startupProbe now handles startup).
    • NATS JetStream: memory 1Gi→512Mi, file 10Gi→5Gi; shrinking needs PVC recreate.
    • s3Sync.image.tag latest → 2.22.35.

Written for commit b489fac. Summary will update on new commits.

tlgimenes and others added 16 commits April 7, 2026 21:02
Three implementation plans covering:
- Helm chart HA (PDB, probes, anti-affinity, NATS cluster, NetworkPolicy, Ingress)
- Application code fixes (NATS reconnect bug, plugin migration locks, shutdown timeout)
- Infrastructure & advanced (production values overlay, ESO, cert-manager, Karpenter)
Key changes:
- Advisory lock: use pg_advisory_xact_lock on pinned connection (pool-safe)
- Keep ScheduleAnyway as default (DoNotSchedule only in prod overlay)
- Keep readinessProbe failureThreshold at 3 (not 2)
- Remove PriorityClass and NetworkPolicy from Helm chart (YAGNI)
- Slim production overlay to deltas only, rename to .example.yaml
- Fix ESO API v1beta1 -> v1, reduce refreshInterval to 5m
- Fix test mock async iterator to terminate on unsubscribe
- Move NetworkPolicy to infrastructure examples with RFC1918 exclusions
The start() method had an early return when this.sub was non-null,
preventing re-subscription after NATS reconnects. The old subscription
object was stale, silently breaking the notify path and forcing all
event delivery to fall back to 5s polling.
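The fix can be sketched like this (class shape, subject name, and field names are assumptions reconstructed from the description above, not the actual source):

```typescript
// Hedged sketch of the re-subscription fix.
interface Subscription { unsubscribe(): void; }
interface NatsConnection { subscribe(subject: string): Subscription; }

class NatsNotifyStrategy {
  private sub: Subscription | null = null;

  start(nc: NatsConnection): void {
    // Before the fix: `if (this.sub) return;` — after a reconnect the old
    // subscription object was stale, so notifications silently stopped and
    // event delivery fell back to 5s polling.
    this.sub?.unsubscribe();                  // drop any stale subscription
    this.sub = nc.subscribe("events.notify"); // subject name is an assumption
  }
}
```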

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses pg_advisory_xact_lock on a pinned connection to prevent race
conditions when multiple pods start simultaneously. Transaction-scoped
lock auto-releases on connection return, safe with connection poolers.
…riod

The Helm chart's terminationGracePeriodSeconds increases from 60 to 65
to accommodate a 5s preStop hook. The app's internal hard timeout
increases from 55s to 58s to stay within the grace period.
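In pod-spec terms, the timing budget above might look like this (a sketch; the values come from the PR, the container name is an assumption):

```yaml
# 5s preStop + 58s app hard timeout + ~2s headroom for SIGKILL delivery
# fits inside the 65s grace period.
spec:
  terminationGracePeriodSeconds: 65
  containers:
    - name: app   # container name is an assumption
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let endpoint removal propagate before SIGTERM
```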
- pingInterval: 20s (was 120s default) for faster dead connection detection
- reconnectJitter: 500ms to prevent thundering herd on reconnect
- connection name for debugging in NATS monitoring
Includes External Secrets Operator (v1 API), cert-manager ClusterIssuer,
and NetworkPolicy example for production HA deployment.
…nection)

db.connection().execute() pins the connection but does NOT wrap in a
transaction. pg_advisory_xact_lock releases at transaction end, so with
auto-commit each statement gets its own implicit transaction and the
lock releases immediately. db.transaction().execute() both pins the
connection AND wraps in a transaction, making the lock effective.
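The distinction above can be sketched against a minimal interface (the `Db`/`Executor` types and the `withMigrationLock` helper are hypothetical, standing in for the project's actual query builder):

```typescript
// Hypothetical minimal interfaces standing in for the real query builder.
interface Executor { execute(sql: string, params?: unknown[]): Promise<void>; }
interface Db { transaction<T>(fn: (trx: Executor) => Promise<T>): Promise<T>; }

// pg_advisory_xact_lock blocks until acquired and releases at transaction end.
// Wrapping in db.transaction() (pin + BEGIN/COMMIT) makes the lock effective;
// a pinned connection alone would auto-commit each statement and release the
// lock immediately.
async function withMigrationLock(
  db: Db,
  lockKey: bigint,
  migrate: (trx: Executor) => Promise<void>,
): Promise<void> {
  await db.transaction(async (trx) => {
    await trx.execute("select pg_advisory_xact_lock($1)", [lockKey]);
    await migrate(trx); // only one pod at a time gets past the lock
  });
}
```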

github-actions bot commented Apr 8, 2026

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction  Action
👍        Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.


github-actions bot commented Apr 8, 2026

Release Options

Suggested: Minor (2.249.0) — based on feat: prefix

React with an emoji to override the release type:

Reaction  Type        Next Version
👍        Prerelease  2.248.9-alpha.1
🎉        Patch       2.248.9
❤️        Minor       2.249.0
🚀        Major       3.0.0

Current version: 2.248.8

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).


@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 19 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="deploy/infrastructure/networkpolicy.yaml">

<violation number="1" location="deploy/infrastructure/networkpolicy.yaml:26">
P1: This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</violation>
</file>

<file name="deploy/helm/values.yaml">

<violation number="1" location="deploy/helm/values.yaml:177">
P2: Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

      ports:
        - port: 3000
          protocol: TCP
    - ports:

@cubic-dev-ai cubic-dev-ai bot Apr 8, 2026


P1: This ingress rule allows port 3000 from any source, so the preceding ingress-nginx restriction no longer has any effect.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/infrastructure/networkpolicy.yaml, line 26:

<comment>This ingress rule allows port 3000 from any source, so the preceding `ingress-nginx` restriction no longer has any effect.</comment>

<file context>
@@ -0,0 +1,63 @@
+      ports:
+        - port: 3000
+          protocol: TCP
+    - ports:
+        - port: 3000
+          protocol: TCP
</file context>
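For reference, a NetworkPolicy ingress rule with no `from` clause matches traffic from all peers, which is why the second rule above defeats the first. A hedged sketch of a tightened version (the namespace label is an assumption):

```yaml
# Sketch only: a single rule keeping port 3000 restricted to ingress-nginx.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ingress-nginx   # assumed namespace label
    ports:
      - port: 3000
        protocol: TCP
```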

        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: chart-deco-studio

@cubic-dev-ai cubic-dev-ai bot Apr 8, 2026


P2: Scope the new pod anti-affinity to the release instance as well; matching only app.kubernetes.io/name makes different releases of this chart interfere with each other’s scheduling.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At deploy/helm/values.yaml, line 177:

<comment>Scope the new pod anti-affinity to the release instance as well; matching only `app.kubernetes.io/name` makes different releases of this chart interfere with each other’s scheduling.</comment>

<file context>
@@ -143,21 +165,21 @@ tolerations: []
+        podAffinityTerm:
+          labelSelector:
+            matchLabels:
+              app.kubernetes.io/name: chart-deco-studio
+          topologyKey: kubernetes.io/hostname
+
</file context>
Suggested change
- app.kubernetes.io/name: chart-deco-studio
+ app.kubernetes.io/name: chart-deco-studio
+ app.kubernetes.io/instance: deco-studio
