
test(integration): add four HA scenario phases to kind-based CI #7

Merged
robfrank merged 11 commits into main from feat/tests-improvement
May 10, 2026

Conversation

@robfrank
Contributor

@robfrank robfrank commented May 9, 2026

Summary

Extend ci/integration-test.sh with four new phases derived from a recent support email Q&A on Raft HA behavior:

  • Phase 5 — assert peers[].status is HEALTHY/CATCHING_UP (graceful skip on older images that predate the STATUS column).
  • Phase 6 — runtime leadership transfer via POST /api/v1/cluster/leader, then verify writes via the new leader.
  • Phase 7 — helm upgrade --set replicaCount=5, then re-verify quorum and STATUS across 5 pods.
  • Phase 8 — delete a non-leader pod, verify it recovers via the snapshot-install path (best-effort SnapshotInstaller log grep).

Refactor lifts the per-pod consensus loop out of phase 2 into assert_quorum_n N so phases 2 and 7 share it. CI install gets arcadedb.ha.snapshotThreshold=50 so phase 8 can exercise the snapshot path without writing 100k rows.
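A minimal sketch of what that shared helper might look like. The helper names (assert_quorum_n, pf_start, pf_wait, pf_stop, api) and the jq leaderId query come from the script; the function bodies, 120-second deadline, and quorum test shown here are assumptions, not the merged implementation:

```shell
# Sketch: poll every pod until all N report the same non-empty leaderId.
# pf_start/pf_wait/pf_stop manage a kubectl port-forward to pod $i;
# api issues an HTTP request through the forwarded local port.
assert_quorum_n() {
  local n="$1" i
  local deadline=$(( SECONDS + 120 ))
  while (( SECONDS < deadline )); do
    local leaders=()
    for (( i = 0; i < n; i++ )); do
      local port=$(( HTTP_PORT + 10 + i )) pid
      pid=$(pf_start "$i" "$port")
      if pf_wait "$port"; then
        leaders+=("$(api "$port" GET /api/v1/cluster \
          | jq -r '.leaderId // empty' 2>/dev/null || echo "")")
      fi
      pf_stop "$pid"
    done
    # quorum: every pod answered, and all reported one non-empty leader
    if (( ${#leaders[@]} == n )); then
      local uniq
      uniq=$(printf '%s\n' "${leaders[@]}" | sort -u)
      if [[ $(printf '%s\n' "$uniq" | wc -l) -eq 1 && -n "$uniq" ]]; then
        return 0
      fi
    fi
    sleep 2
  done
  echo "quorum not reached before deadline" >&2
  return 1
}
```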

Spec: docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md
Plan: docs/superpowers/plans/2026-05-09-ha-integration-tests.md

Test plan

  • CI integration job completes inside the 20-minute timeout
  • All 8 [N/8] phase headers appear in the log
  • ==> All checks passed. appears as the final log line
  • If the pinned image predates the STATUS column, phases 5/7 emit the WARNING graceful-skip line and the run still passes
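The log checks in this test plan could be automated with something like the sketch below. The function name and log-file argument are hypothetical; the [N/8]-style phase headers and the ==> All checks passed. marker are the strings listed above:

```shell
# Sketch: verify every phase header and the final success marker
# appear in a CI log. Phase count is a parameter so the same check
# works if phases are added or dropped.
check_ci_log() {
  local log="$1" phases="$2" i
  for (( i = 1; i <= phases; i++ )); do
    grep -q "\[$i/$phases\]" "$log" \
      || { echo "missing phase $i header"; return 1; }
  done
  grep -q '==> All checks passed.' "$log" \
    || { echo "missing final success line"; return 1; }
  echo "log OK"
}
```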

🤖 Generated with Claude Code

robfrank and others added 8 commits May 9, 2026 18:25
Capture scenarios derived from a support email Q&A on Raft HA behavior:
STATUS column assertion, runtime leadership transfer, scale-up via helm
upgrade, and snapshot-install recovery. Discards serverRole=replica,
1->3 helm upgrade, and >1GB import scenarios as not feasibly testable
in the kind-based CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven-task plan covering the four scenarios in the design spec:
snapshotThreshold CI flag, assert_quorum_n refactor, STATUS assertion,
leadership transfer, scale-up via helm upgrade, snapshot-install
recovery, and CI verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly expands the HA integration test suite for ArcadeDB. It introduces new test phases for peer health monitoring, runtime leadership transfer, cluster scaling from 3 to 5 replicas, and snapshot-based recovery. The test script is refactored to use generalized helpers for quorum and health checks, and detailed design and implementation plans are provided. Review feedback suggests optimizing the quorum helper by reducing port-forwarding overhead and improving the robustness of the recovery wait loop by handling transient API failures.

Comment thread ci/integration-test.sh
Comment on lines +57 to +68
while true; do
LEADERS=()
for (( i=0; i<n; i++ )); do
local_port=$(( HTTP_PORT + 10 + i ))
pid=$(pf_start "$i" "$local_port")
if pf_wait "$local_port"; then
l=$(api "$local_port" GET /api/v1/cluster \
| jq -r '.leaderId // empty' 2>/dev/null || echo "")
LEADERS+=("$l")
fi
pf_stop "$pid"
done

medium

The current implementation of assert_quorum_n starts and stops a kubectl port-forward process for every pod in every iteration of the polling loop. Establishing a port-forward connection can be relatively slow and resource-intensive. Consider starting the port-forwarding processes for all pods once before entering the while loop and stopping them after the loop finishes. This would significantly reduce the overhead and total execution time of the test, especially as the cluster scales to 5 pods.
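The reviewer's proposal could be sketched as follows: establish all port-forwards once, poll inside the loop, and tear them down at the end. Helper names follow the script; the bookkeeping arrays and the abbreviated loop body are illustrative assumptions:

```shell
# Sketch: hoist port-forwards out of the polling loop so each pod's
# kubectl port-forward is started once, not once per iteration.
assert_quorum_n_persistent() {
  local n="$1" i
  local pids=() ports=()
  for (( i = 0; i < n; i++ )); do
    ports[i]=$(( HTTP_PORT + 10 + i ))
    pids[i]=$(pf_start "$i" "${ports[i]}")
  done
  while true; do
    local leaders=()
    for (( i = 0; i < n; i++ )); do
      if pf_wait "${ports[i]}"; then
        leaders+=("$(api "${ports[i]}" GET /api/v1/cluster \
          | jq -r '.leaderId // empty' 2>/dev/null || echo "")")
      fi
    done
    # ... same quorum test as the original; abbreviated here ...
    break
  done
  # tear down all forwards once, after polling completes
  for (( i = 0; i < n; i++ )); do pf_stop "${pids[i]}"; done
}
```

The trade-off the author cites below is real: persistent forwards need lifecycle tracking and per-forward failure handling inside the loop, which the per-iteration version sidesteps.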

Contributor Author


Real overhead, but YAGNI at current scope. After dropping phase 7 in fa18fb0, assert_quorum_n only runs against the freshly-formed 3-pod cluster in phase 2; the latest CI run converges in a single iteration in ~1.6s. Three short-lived port-forwards per iteration is comfortably below the cost of managing N persistent ones (lifecycle tracking, individual pf-failure handling within the polling loop, port allocation, interaction with the existing cleanup EXIT trap). Keeping as-is; revisit if we ever re-introduce a higher-N variant.

Comment thread ci/integration-test.sh Outdated
RECOVERED=0
LAST_STATUS=""
while (( SECONDS < DEADLINE )); do
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)

medium

In the Phase 8 recovery wait loop, the api call is made without error handling. Since set -e is active, any transient failure in the API call (e.g., a network blip or the leader being temporarily busy) will cause the entire script to exit immediately. It would be more robust to handle potential failures gracefully within the loop, allowing it to retry until the deadline is reached, similar to how it's handled in the Phase 6 wait loop (line 221). Using a fallback empty structure like {"peers":[]} ensures subsequent jq calls don't fail on empty input.

Suggested change
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster 2>/dev/null || echo '{"peers":[]}')
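The fallback pattern from the suggestion, shown in isolation. Wrapping it in a function is my addition; the `api` helper and the `{"peers":[]}` fallback come from the suggestion itself:

```shell
# Sketch: tolerate transient API failures under `set -e`. On failure,
# substitute an empty peers structure so downstream jq queries still
# receive valid JSON and the wait loop can retry until its deadline.
fetch_cluster_status() {
  api "$HTTP_PORT" GET /api/v1/cluster 2>/dev/null || echo '{"peers":[]}'
}
```

A caller can then safely do, e.g., `jq -r '.peers[].status' <<<"$(fetch_cluster_status)"` inside the loop without risking an immediate script exit on a network blip.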

Contributor Author


Moot — phase 8 was dropped in fa18fb0. The line referenced no longer exists. The post-implementation discovery section in docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md records why: a helm upgrade --set replicaCount=5 rolling-restarts the StatefulSet (serverList grows from 3 to 5) and the current ArcadeDB image does not auto-vote new peers into the Raft configuration — per the support email, that requires an explicit POST /api/v1/cluster/peer call from the leader. The cluster falls below quorum during the restart and never re-converges, making the snapshot-install phase unrunnable.
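For context, the explicit peer-vote step the thread describes would look something like the following. Only the endpoint POST /api/v1/cluster/peer is named in the discussion; the payload shape, credentials, host, and ports below are all assumptions for illustration, not a documented API:

```shell
# Hypothetical sketch of explicitly voting a new peer into the Raft
# configuration from the leader. Payload, credentials, and ports are
# assumed; only the endpoint path comes from the support-email Q&A.
curl -u root:example-password -X POST \
  -H 'Content-Type: application/json' \
  -d '{"peer":"arcadedb-3.arcadedb:2424"}' \
  http://localhost:2480/api/v1/cluster/peer
```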

robfrank and others added 3 commits May 10, 2026 01:15
helm --set 'list[N]=value' replaces the whole list, leaving index 0
as null and producing a literal "null" arg in the rendered command —
the JVM rejected it and pods never became Ready. Use the {a,b} array
literal syntax to set both elements explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
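The pitfall from this commit message, illustrated. The chart key name and values here are hypothetical stand-ins, not the actual chart values:

```shell
# Broken: an indexed --set on a list value replaces the whole list,
# leaving index 0 as null — which rendered a literal "null" argument
# that the JVM rejected, so pods never became Ready.
helm upgrade arcadedb ./chart --set 'extraArgs[1]=second-arg'

# Fixed: the {a,b} array-literal syntax sets every element explicitly.
helm upgrade arcadedb ./chart --set 'extraArgs={first-arg,second-arg}'
```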
Phase 7's helm upgrade triggers a rolling restart of the StatefulSet
because the serverList env var grows from 3 to 5 entries. With
persistence.enabled=false (CI default), in-memory database state is
lost. Recreate integration-test + TestDoc at the top of phase 8
before the 100-row write so the INSERTs target a real schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced a fundamental limitation of the deployed ArcadeDB image:
`helm upgrade --set replicaCount=5` rolling-restarts the StatefulSet
(serverList grows from 3 to 5 entries) and Raft does not auto-vote in
the new peers — per the support email, that requires an explicit
POST /api/v1/cluster/peer call. The cluster falls below quorum during
the restart and never re-converges, so phase 7 times out.

Phase 8 depended on a 5-pod cluster with the integration-test database,
so it goes with phase 7.

Drop the snapshotThreshold install flag too — only phase 8 needed it.

Phases 1-6 retain full coverage of the feasible scenarios from the
support email Q&A: rollout, Raft formation, write/read via leader,
STATUS=HEALTHY assertion (graceful skip on older images), and runtime
leadership transfer.

Spec updated with a post-implementation discovery section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@robfrank robfrank merged commit 6221034 into main May 10, 2026
3 checks passed
@robfrank robfrank deleted the feat/tests-improvement branch May 10, 2026 16:08