test(integration): add four HA scenario phases to kind-based CI #7
Conversation
Capture scenarios derived from a support email Q&A on Raft HA behavior: STATUS column assertion, runtime leadership transfer, scale-up via helm upgrade, and snapshot-install recovery. Discard serverRole=replica, 1->3 helm upgrade, and >1GB import scenarios as not feasibly testable in the kind-based CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven-task plan covering the four scenarios in the design spec: snapshotThreshold CI flag, assert_quorum_n refactor, STATUS assertion, leadership transfer, scale-up via helm upgrade, snapshot-install recovery, and CI verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request significantly expands the HA integration test suite for ArcadeDB. It introduces new test phases for peer health monitoring, runtime leadership transfer, cluster scaling from 3 to 5 replicas, and snapshot-based recovery. The test script is refactored to use generalized helpers for quorum and health checks, and detailed design and implementation plans are provided. Review feedback suggests optimizing the quorum helper by reducing port-forwarding overhead and improving the robustness of the recovery wait loop by handling transient API failures.
```bash
while true; do
  LEADERS=()
  for (( i=0; i<n; i++ )); do
    local_port=$(( HTTP_PORT + 10 + i ))
    pid=$(pf_start "$i" "$local_port")
    if pf_wait "$local_port"; then
      l=$(api "$local_port" GET /api/v1/cluster \
        | jq -r '.leaderId // empty' 2>/dev/null || echo "")
      LEADERS+=("$l")
    fi
    pf_stop "$pid"
  done
```
The current implementation of assert_quorum_n starts and stops a kubectl port-forward process for every pod in every iteration of the polling loop. Establishing a port-forward connection can be relatively slow and resource-intensive. Consider starting the port-forwarding processes for all pods once before entering the while loop and stopping them after the loop finishes. This would significantly reduce the overhead and total execution time of the test, especially as the cluster scales to 5 pods.
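One way the suggested restructuring could look, as a rough sketch only: it reuses the `pf_start`/`pf_wait`/`pf_stop` and `api` helpers already in the script, while the `PF_PIDS` array, the warning message, and the placeholder convergence check are illustrative additions rather than the script's actual code.

```bash
# Hypothetical restructuring: open one port-forward per pod up front,
# poll every pod inside the loop, and tear the forwards down once at the end.
PF_PIDS=()
for (( i=0; i<n; i++ )); do
  local_port=$(( HTTP_PORT + 10 + i ))
  PF_PIDS+=("$(pf_start "$i" "$local_port")")
  pf_wait "$local_port" || echo "WARN: port-forward for pod $i not ready" >&2
done

while true; do
  LEADERS=()
  for (( i=0; i<n; i++ )); do
    local_port=$(( HTTP_PORT + 10 + i ))
    l=$(api "$local_port" GET /api/v1/cluster \
      | jq -r '.leaderId // empty' 2>/dev/null || echo "")
    LEADERS+=("$l")
  done
  break  # stand-in for the existing convergence check / sleep / deadline logic
done

# One teardown pass after polling instead of a stop/start per iteration.
for pid in "${PF_PIDS[@]}"; do
  pf_stop "$pid"
done
```

Whether the saved reconnect time outweighs the extra lifecycle handling is the trade-off weighed in the reply below.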
Real overhead, but YAGNI at current scope. After dropping phase 7 in fa18fb0, assert_quorum_n only runs against the freshly-formed 3-pod cluster in phase 2; the latest CI run converges in a single iteration in ~1.6s. Three short-lived port-forwards per iteration cost comfortably less than managing N persistent ones (lifecycle tracking, individual pf-failure handling within the polling loop, port allocation, interaction with the existing cleanup EXIT trap). Keeping as-is; revisit if we ever re-introduce a higher-N variant.
```bash
RECOVERED=0
LAST_STATUS=""
while (( SECONDS < DEADLINE )); do
  STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)
```
In the Phase 8 recovery wait loop, the api call is made without error handling. Since set -e is active, any transient failure in the API call (e.g., a network blip or the leader being temporarily busy) will cause the entire script to exit immediately. It would be more robust to handle potential failures gracefully within the loop, allowing it to retry until the deadline is reached, similar to how it's handled in the Phase 6 wait loop (line 221). Using a fallback empty structure like {"peers":[]} ensures subsequent jq calls don't fail on empty input.
Suggested change:

```diff
-STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)
+STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster 2>/dev/null || echo '{"peers":[]}')
```
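As a quick illustration of why the empty structure helps (not a line from the script): with the fallback in place a transient failure still yields valid JSON, so downstream parsing sees "no peers" and the loop simply retries.

```bash
# The fallback document keeps jq pipelines alive after a failed API call:
echo '{"peers":[]}' | jq -r '.peers | length'   # -> 0, loop retries until DEADLINE
```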
Moot — phase 8 was dropped in fa18fb0. The line referenced no longer exists. The post-implementation discovery section in docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md records why: a helm upgrade --set replicaCount=5 rolling-restarts the StatefulSet (serverList grows from 3 to 5) and the current ArcadeDB image does not auto-vote new peers into the Raft configuration — per the support email, that requires an explicit POST /api/v1/cluster/peer call from the leader. The cluster falls below quorum during the restart and never re-converges, making the snapshot-install phase unrunnable.
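For anyone who does want to drive this manually, a minimal sketch of the explicit step, assuming the leader is reachable on the forwarded `HTTP_PORT`; the endpoint is the one named in the support email, but the request body and the `arcadedb-3`/`arcadedb-4` peer identifiers are assumptions, not a documented contract.

```bash
# Hypothetical: after scaling the StatefulSet, ask the current leader to vote
# each new pod into the Raft configuration (payload shape is an assumption).
for i in 3 4; do
  curl -sf -X POST "http://127.0.0.1:${HTTP_PORT}/api/v1/cluster/peer" \
    -H 'Content-Type: application/json' \
    -d "{\"id\": \"arcadedb-${i}\"}"
done
```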
helm --set 'list[N]=value' replaces the whole list, leaving index 0 as null and producing a literal "null" arg in the rendered command — the JVM rejected it and pods never became Ready. Use the {a,b} array literal syntax to set both elements explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
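To make the failure mode concrete (release, chart, and key names below are placeholders, not the chart's real values):

```bash
# Broken: setting a single index replaces the whole list; index 0 becomes
# null and the rendered container command picks up a literal "null" argument.
helm upgrade "$RELEASE" "$CHART" --set 'someList[1]=second-value'

# Works: the {a,b} curly-brace literal sets both elements explicitly.
helm upgrade "$RELEASE" "$CHART" --set 'someList={first-value,second-value}'
```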
Phase 7's helm upgrade triggers a rolling restart of the StatefulSet because the serverList env var grows from 3 to 5 entries. With persistence.enabled=false (CI default), in-memory database state is lost. Recreate integration-test + TestDoc at the top of phase 8 before the 100-row write so the INSERTs target a real schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
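Sketched out, the phase 8 preamble described above would look roughly like this; the `recreate_database`/`run_sql` helpers are hypothetical stand-ins for however the script issues commands to the leader, though `CREATE DOCUMENT TYPE TestDoc` is ordinary ArcadeDB SQL.

```bash
# Illustrative only: phase 7's rolling restart wiped in-memory state
# (persistence.enabled=false), so rebuild the schema before the 100-row write.
recreate_database "integration-test"    # hypothetical helper
run_sql "CREATE DOCUMENT TYPE TestDoc"  # hypothetical helper wrapping the SQL endpoint
```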
CI surfaced a fundamental limitation of the deployed ArcadeDB image: `helm upgrade --set replicaCount=5` rolling-restarts the StatefulSet (serverList grows from 3 to 5 entries) and Raft does not auto-vote in the new peers — per the support email, that requires an explicit POST /api/v1/cluster/peer call. The cluster falls below quorum during the restart and never re-converges, so phase 7 times out. Phase 8 depended on a 5-pod cluster with the integration-test database, so it goes with phase 7. Drop the snapshotThreshold install flag too — only phase 8 needed it. Phases 1-6 retain full coverage of the feasible scenarios from the support email Q&A: rollout, Raft formation, write/read via leader, STATUS=HEALTHY assertion (graceful skip on older images), and runtime leadership transfer. Spec updated with a post-implementation discovery section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Extend `ci/integration-test.sh` with four new phases derived from a recent support email Q&A on Raft HA behavior:

- Assert `peers[].status` is `HEALTHY`/`CATCHING_UP` (graceful skip on older images that predate the STATUS column).
- Transfer leadership at runtime via `POST /api/v1/cluster/leader`, then verify writes via the new leader.
- Scale up via `helm upgrade --set replicaCount=5`, re-verify quorum and STATUS across 5 pods.
- Snapshot-install recovery (`SnapshotInstaller` log grep).

Refactor lifts the per-pod consensus loop out of phase 2 into `assert_quorum_n N` so phases 2 and 7 share it. CI install gets `arcadedb.ha.snapshotThreshold=50` so phase 8 can exercise the snapshot path without writing 100k rows.
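Call-site sketch of the shared helper (how the two phases are expected to use it, not a verbatim excerpt):

```bash
# Phase 2: the freshly formed 3-pod cluster must converge on a single leader.
assert_quorum_n 3

# Phase 7: after `helm upgrade --set replicaCount=5`, re-check across 5 pods.
assert_quorum_n 5
```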
Spec: `docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md`
Plan: `docs/superpowers/plans/2026-05-09-ha-integration-tests.md`

Test plan
- `integration` job completes inside the 20-minute timeout
- `[N/8]` phase headers appear in the log
- `==> All checks passed.` in the final log line

🤖 Generated with Claude Code