
test(integration): add four HA scenario phases to kind-based CI #7

Merged
robfrank merged 11 commits into main from feat/tests-improvement
May 10, 2026

Conversation

@robfrank
Contributor

@robfrank robfrank commented May 9, 2026

Summary

Extend ci/integration-test.sh with four new phases derived from a recent support email Q&A on Raft HA behavior:

  • Phase 5 — assert peers[].status is HEALTHY/CATCHING_UP (graceful skip on older images that predate the STATUS column).
  • Phase 6 — runtime leadership transfer via POST /api/v1/cluster/leader, then verify writes via the new leader.
  • Phase 7 — helm upgrade --set replicaCount=5, then re-verify quorum and STATUS across 5 pods.
  • Phase 8 — delete a non-leader pod, verify it recovers via the snapshot-install path (best-effort SnapshotInstaller log grep).

Refactor lifts the per-pod consensus loop out of phase 2 into assert_quorum_n N so phases 2 and 7 share it. CI install gets arcadedb.ha.snapshotThreshold=50 so phase 8 can exercise the snapshot path without writing 100k rows.
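A minimal sketch of what that shared helper might look like. The helper names (assert_quorum_n, pf_start, pf_wait, pf_stop, api) and the jq leaderId query come from the script; the function bodies, 120-second deadline, and quorum test shown here are assumptions, not the merged implementation:

```shell
# Sketch: poll every pod until all N report the same non-empty leaderId.
# pf_start/pf_wait/pf_stop manage a kubectl port-forward to pod $i;
# api issues an HTTP request through the forwarded local port.
assert_quorum_n() {
  local n="$1" i
  local deadline=$(( SECONDS + 120 ))
  while (( SECONDS < deadline )); do
    local leaders=()
    for (( i = 0; i < n; i++ )); do
      local port=$(( HTTP_PORT + 10 + i )) pid
      pid=$(pf_start "$i" "$port")
      if pf_wait "$port"; then
        leaders+=("$(api "$port" GET /api/v1/cluster \
          | jq -r '.leaderId // empty' 2>/dev/null || echo "")")
      fi
      pf_stop "$pid"
    done
    # quorum: every pod answered, and all reported one non-empty leader
    if (( ${#leaders[@]} == n )); then
      local uniq
      uniq=$(printf '%s\n' "${leaders[@]}" | sort -u)
      if [[ $(printf '%s\n' "$uniq" | wc -l) -eq 1 && -n "$uniq" ]]; then
        return 0
      fi
    fi
    sleep 2
  done
  echo "quorum not reached before deadline" >&2
  return 1
}
```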

Spec: docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md
Plan: docs/superpowers/plans/2026-05-09-ha-integration-tests.md

Test plan

  • CI integration job completes inside the 20-minute timeout
  • All 8 [N/8] phase headers appear in the log
  • ==> All checks passed. appears as the final log line
  • If the pinned image predates the STATUS column, phases 5/7 emit the WARNING graceful-skip line and the run still passes
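The log checks in this test plan could be automated with something like the sketch below. The function name and log-file argument are hypothetical; the [N/8]-style phase headers and the ==> All checks passed. marker are the strings listed above:

```shell
# Sketch: verify every phase header and the final success marker
# appear in a CI log. Phase count is a parameter so the same check
# works if phases are added or dropped.
check_ci_log() {
  local log="$1" phases="$2" i
  for (( i = 1; i <= phases; i++ )); do
    grep -q "\[$i/$phases\]" "$log" \
      || { echo "missing phase $i header"; return 1; }
  done
  grep -q '==> All checks passed.' "$log" \
    || { echo "missing final success line"; return 1; }
  echo "log OK"
}
```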

🤖 Generated with Claude Code

robfrank and others added 8 commits May 9, 2026 18:25
Capture scenarios derived from a support email Q&A on Raft HA behavior:
STATUS column assertion, runtime leadership transfer, scale-up via helm
upgrade, and snapshot-install recovery. Discards serverRole=replica,
1->3 helm upgrade, and >1GB import scenarios as not feasibly testable
in the kind-based CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven-task plan covering the four scenarios in the design spec:
snapshotThreshold CI flag, assert_quorum_n refactor, STATUS assertion,
leadership transfer, scale-up via helm upgrade, snapshot-install
recovery, and CI verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly expands the HA integration test suite for ArcadeDB. It introduces new test phases for peer health monitoring, runtime leadership transfer, cluster scaling from 3 to 5 replicas, and snapshot-based recovery. The test script is refactored to use generalized helpers for quorum and health checks, and detailed design and implementation plans are provided. Review feedback suggests optimizing the quorum helper by reducing port-forwarding overhead and improving the robustness of the recovery wait loop by handling transient API failures.

Comment thread ci/integration-test.sh
Comment on lines +57 to +68
while true; do
LEADERS=()
for (( i=0; i<n; i++ )); do
local_port=$(( HTTP_PORT + 10 + i ))
pid=$(pf_start "$i" "$local_port")
if pf_wait "$local_port"; then
l=$(api "$local_port" GET /api/v1/cluster \
| jq -r '.leaderId // empty' 2>/dev/null || echo "")
LEADERS+=("$l")
fi
pf_stop "$pid"
done

medium

The current implementation of assert_quorum_n starts and stops a kubectl port-forward process for every pod in every iteration of the polling loop. Establishing a port-forward connection can be relatively slow and resource-intensive. Consider starting the port-forwarding processes for all pods once before entering the while loop and stopping them after the loop finishes. This would significantly reduce the overhead and total execution time of the test, especially as the cluster scales to 5 pods.
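The reviewer's proposal could be sketched as follows: establish all port-forwards once, poll inside the loop, and tear them down at the end. Helper names follow the script; the bookkeeping arrays and the abbreviated loop body are illustrative assumptions:

```shell
# Sketch: hoist port-forwards out of the polling loop so each pod's
# kubectl port-forward is started once, not once per iteration.
assert_quorum_n_persistent() {
  local n="$1" i
  local pids=() ports=()
  for (( i = 0; i < n; i++ )); do
    ports[i]=$(( HTTP_PORT + 10 + i ))
    pids[i]=$(pf_start "$i" "${ports[i]}")
  done
  while true; do
    local leaders=()
    for (( i = 0; i < n; i++ )); do
      if pf_wait "${ports[i]}"; then
        leaders+=("$(api "${ports[i]}" GET /api/v1/cluster \
          | jq -r '.leaderId // empty' 2>/dev/null || echo "")")
      fi
    done
    # ... same quorum test as the original; abbreviated here ...
    break
  done
  # tear down all forwards once, after polling completes
  for (( i = 0; i < n; i++ )); do pf_stop "${pids[i]}"; done
}
```

The trade-off the author cites below is real: persistent forwards need lifecycle tracking and per-forward failure handling inside the loop, which the per-iteration version sidesteps.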

Contributor Author


Real overhead, but YAGNI at current scope. After dropping phase 7 in fa18fb0, assert_quorum_n only runs against the freshly-formed 3-pod cluster in phase 2; the latest CI run converges in a single iteration in ~1.6s. Three short-lived port-forwards per iteration is comfortably below the cost of managing N persistent ones (lifecycle tracking, individual pf-failure handling within the polling loop, port allocation, interaction with the existing cleanup EXIT trap). Keeping as-is; revisit if we ever re-introduce a higher-N variant.

Comment thread ci/integration-test.sh Outdated
RECOVERED=0
LAST_STATUS=""
while (( SECONDS < DEADLINE )); do
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)

medium

In the Phase 8 recovery wait loop, the api call is made without error handling. Since set -e is active, any transient failure in the API call (e.g., a network blip or the leader being temporarily busy) will cause the entire script to exit immediately. It would be more robust to handle potential failures gracefully within the loop, allowing it to retry until the deadline is reached, similar to how it's handled in the Phase 6 wait loop (line 221). Using a fallback empty structure like {"peers":[]} ensures subsequent jq calls don't fail on empty input.

Suggested change
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster)
STATUS_JSON=$(api "$HTTP_PORT" GET /api/v1/cluster 2>/dev/null || echo '{"peers":[]}')
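The fallback pattern from the suggestion, shown in isolation. Wrapping it in a function is my addition; the `api` helper and the `{"peers":[]}` fallback come from the suggestion itself:

```shell
# Sketch: tolerate transient API failures under `set -e`. On failure,
# substitute an empty peers structure so downstream jq queries still
# receive valid JSON and the wait loop can retry until its deadline.
fetch_cluster_status() {
  api "$HTTP_PORT" GET /api/v1/cluster 2>/dev/null || echo '{"peers":[]}'
}
```

A caller can then safely do, e.g., `jq -r '.peers[].status' <<<"$(fetch_cluster_status)"` inside the loop without risking an immediate script exit on a network blip.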

Contributor Author


Moot — phase 8 was dropped in fa18fb0. The line referenced no longer exists. The post-implementation discovery section in docs/superpowers/specs/2026-05-09-ha-integration-tests-design.md records why: a helm upgrade --set replicaCount=5 rolling-restarts the StatefulSet (serverList grows from 3 to 5) and the current ArcadeDB image does not auto-vote new peers into the Raft configuration — per the support email, that requires an explicit POST /api/v1/cluster/peer call from the leader. The cluster falls below quorum during the restart and never re-converges, making the snapshot-install phase unrunnable.
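For context, the explicit peer-vote step the thread describes would look something like the following. Only the endpoint POST /api/v1/cluster/peer is named in the discussion; the payload shape, credentials, host, and ports below are all assumptions for illustration, not a documented API:

```shell
# Hypothetical sketch of explicitly voting a new peer into the Raft
# configuration from the leader. Payload, credentials, and ports are
# assumed; only the endpoint path comes from the support-email Q&A.
curl -u root:example-password -X POST \
  -H 'Content-Type: application/json' \
  -d '{"peer":"arcadedb-3.arcadedb:2424"}' \
  http://localhost:2480/api/v1/cluster/peer
```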

robfrank and others added 3 commits May 10, 2026 01:15
helm --set 'list[N]=value' replaces the whole list, leaving index 0
as null and producing a literal "null" arg in the rendered command —
the JVM rejected it and pods never became Ready. Use the {a,b} array
literal syntax to set both elements explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
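The pitfall from this commit message, illustrated. The chart key name and values here are hypothetical stand-ins, not the actual chart values:

```shell
# Broken: an indexed --set on a list value replaces the whole list,
# leaving index 0 as null — which rendered a literal "null" argument
# that the JVM rejected, so pods never became Ready.
helm upgrade arcadedb ./chart --set 'extraArgs[1]=second-arg'

# Fixed: the {a,b} array-literal syntax sets every element explicitly.
helm upgrade arcadedb ./chart --set 'extraArgs={first-arg,second-arg}'
```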
Phase 7's helm upgrade triggers a rolling restart of the StatefulSet
because the serverList env var grows from 3 to 5 entries. With
persistence.enabled=false (CI default), in-memory database state is
lost. Recreate integration-test + TestDoc at the top of phase 8
before the 100-row write so the INSERTs target a real schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced a fundamental limitation of the deployed ArcadeDB image:
`helm upgrade --set replicaCount=5` rolling-restarts the StatefulSet
(serverList grows from 3 to 5 entries) and Raft does not auto-vote in
the new peers — per the support email, that requires an explicit
POST /api/v1/cluster/peer call. The cluster falls below quorum during
the restart and never re-converges, so phase 7 times out.

Phase 8 depended on a 5-pod cluster with the integration-test database,
so it goes with phase 7.

Drop the snapshotThreshold install flag too — only phase 8 needed it.

Phases 1-6 retain full coverage of the feasible scenarios from the
support email Q&A: rollout, Raft formation, write/read via leader,
STATUS=HEALTHY assertion (graceful skip on older images), and runtime
leadership transfer.

Spec updated with a post-implementation discovery section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@robfrank robfrank merged commit 6221034 into main May 10, 2026
3 checks passed
@robfrank robfrank deleted the feat/tests-improvement branch May 10, 2026 16:08