Skip to content

refactor: use aleph CLI for operations that were ad-hoc#19

Merged
odesenfans merged 5 commits into
mainfrom
od/refactor-cli-migrations
May 28, 2026
Merged

refactor: use aleph CLI for operations that were ad-hoc#19
odesenfans merged 5 commits into
mainfrom
od/refactor-cli-migrations

Conversation

@odesenfans
Copy link
Copy Markdown
Contributor

The v0.10.3 CLI now subsumes plumbing that we'd previously hand-rolled. This PR replaces it where the CLI gives the same answer in fewer moving parts.

Migrations

Before After
scripts/crn-up.sh: 3× curl /api/v0/aggregates/<addr>.json + jq to count CCNs by owner / find a CRN hash by URL aleph node list --type {ccn,crn} --address | --all --corechannel-address X --json
scripts/fund-test-accounts.sh: 2× curl /api/v0/addresses/X/balance + jq .credit_balance aleph account balance X --json (.credits)
conftest.ccn_aggregates: urllib + manual data.X.Y parsing aleph aggregate get <key> --address X --json
conftest.crn_nodes: scrape corechannel.resource_nodes from the aggregate dict aleph node list --type crn --all --corechannel-address X --json
test_instances/test_migration: poll scheduler-api /api/v0/allocation/{hash}, then poll CRN /v2/about/executions/list to find mapped_ports.22.host, then ssh subprocess with -o StrictHostKeyChecking=no -p <port> -i <key> One aleph instance show <hash> --verbose --json (rolls up CCN message + scheduler placement + CRN networking + mapped ports), then aleph instance ssh <hash> --identity <key> -- <cmd> for SSH

Infrastructure in conftest.py

Two new session-scoped fixtures support the instance-test migration:

  • aleph_cli_config — creates a tmp XDG_CONFIG_HOME and registers a testnet network whose scheduler-url points at $ALEPH_TESTNET_SCHEDULER_API_URL. Required because aleph instance show/ssh resolve the scheduler from the current network's config; without isolation we'd either hit the CLI's builtin mainnet scheduler or mutate the user's ~/.config/aleph/config.toml.
  • _ssh_home — minimal $HOME with a permissive ~/.ssh/config (StrictHostKeyChecking no, ephemeral known_hosts). The CLI doesn't expose ssh -o passthrough, and aleph instance ssh runs ssh inheriting parent env, so this is the cleanest knob. $HOME is overridden only for instance ssh invocations.

The aleph_cli fixture also now returns None (instead of raising JSONDecodeError) when parse_json=True meets empty stdout — that's the documented contract for aleph aggregate get on a missing key, and the new ccn_aggregates relies on it to keep "returns None on miss" semantics.

Caveat

aleph instance ssh connects to the VM's IPv6 (the CRN's /124 pool). The previous code went over the CRN's IPv4 + a host-side mapped port. If the CI runner can't reach the CRN droplet's IPv6 we'll see it on the first run; falling back to keeping just the manual ssh subprocess for that one call is straightforward.

Net diff

194 insertions / 263 deletions (–69 net). test_migration.py shrinks ~45%. The test bodies now read like "show this instance / ssh into it" rather than URL+field plumbing.

Verified offline

  • aleph aggregate get on missing key → exit 0, empty stdout (matches fixture semantics).
  • aleph node list --all --type crn --json against mainnet → array of {type, hash, name, owner, address, ...} — keys match what crn_nodes / _find_crn_hash-replacement read.
  • aleph node list --address X --type ccn ... works without ALEPH_PRIVATE_KEY in env (no --chain required; the script's curl-replaced calls had no signing env).
  • bash -n / py_compile pass.

Replaces hand-rolled CCN/scheduler/CRN plumbing with the v0.10.3 CLI
where it now subsumes the same work end-to-end.

- crn-up.sh: 3 corechannel `curl + jq` queries → `aleph node list`
  (filter by owner / type / corechannel-address).
- fund-test-accounts.sh: 2 balance `curl + jq` queries →
  `aleph account balance --json` (.credits).
- conftest: `ccn_aggregates` and `crn_nodes` fixtures now run through
  `aleph aggregate get` and `aleph node list` instead of urllib +
  manual aggregate parsing.
- test_instances / test_migration: drop the scheduler-api allocation
  poll and the CRN `/about/executions/list` poll in favor of a single
  `aleph instance show --verbose --json` (placement.allocated_node for
  CRN identity, mapped_ports for SSH readiness). The marker-write/read
  and "echo hello" SSH go through `aleph instance ssh` (IPv6).

To make `instance show`/`instance ssh` reach the local scheduler, a
new session-scoped `aleph_cli_config` fixture seeds an isolated CLI
config (XDG_CONFIG_HOME) with a `testnet` network whose scheduler URL
points at $ALEPH_TESTNET_SCHEDULER_API_URL. A `_ssh_home` fixture
provides a permissive `~/.ssh/config` ($HOME override scoped to
`instance ssh` calls) because the CLI does not expose `ssh -o`
passthrough.

Net: -69 lines in the tests/scripts touched (test_migration.py shrinks
by ~45%) and the test bodies now express intent ("show this instance")
rather than URL/field plumbing.

Caveat: `aleph instance ssh` connects to the VM's IPv6, not the
IPv4+port-forward path used previously. If the CI runner can't reach
the CRN's IPv6 pool we'll see it on the first run.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor replacing ad-hoc HTTP/curl calls with the aleph CLI across shell scripts and pytest fixtures. The new aleph_cli_config and _ssh_home fixtures correctly isolate CLI state and SSH config for test environments. Test bodies shrink ~45% and read much more declaratively. The only note is the duplicated _poll() helper in both test files, which is pre-existing and out of scope for this PR. The IPv6 caveat for SSH connections is honestly documented. No correctness or security issues found.

tests/test_instances.py (line 6): The _poll function here is identical to the one in test_migration.py:22. Consider extracting to a shared helper (e.g. tests/conftest.py) in a follow-up.

tests/test_migration.py (line 22): Same _poll function as in test_instances.py:6 — worth deduplicating in a follow-up.

Two CI-revealed bugs from the previous commit:

1. `ccn_aggregates` returned the wrong shape. `aleph aggregate get
   <key>` emits `{"<key>": <value>}` (matching the SDK's
   AggregateResponse.data layout); the urllib fixture had unwrapped that.
   Drop the wrapping in the new fixture too — `test_corechannel_aggregate_exists`
   and `test_create_node` both rely on `agg.get("nodes"/"resource_nodes")`.

2. `aleph instance ssh` connects over the VM's /124 IPv6, and the CI
   runner has no route to the CRN droplet's IPv6 pool — both
   `test_instance_create_and_ssh` and `test_instance_migration` timed
   out at 60s with no SSH progress. Revert just the SSH transport to a
   manual ssh subprocess over the CRN's IPv4 + the host-side mapped
   port. Discovery still goes through the CLI: `aleph instance show
   --verbose` yields placement.allocated_node + mapped_ports["22"] in
   one call, and `aleph node list --type crn --all` resolves the
   allocated node hash to its registered URL. Drop the now-unused
   `_ssh_home` fixture and HOME override on `aleph_cli`.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured refactor that replaces ad-hoc HTTP+JSON-plumbing with the aleph CLI. The new conftest fixtures (aleph_cli_config, aleph_cli) correctly isolate the CLI configuration and prevent cross-test pollution. Test logic in both test_instances.py and test_migration.py is significantly simpler — instance show --verbose collapses what was previously 3-4 polling loops into one call, and SSH goes through the CRN IPv4+mapped-port path with the same reliability as before. The only notable code quality issue is the duplication of _resolve_crn_host across both test files.

tests/test_instances.py (line 26): This helper is identical to test_migration.py:43. Consider extracting into conftest.py as a shared utility (or making it a fixture) to avoid maintenance duplication.

The combined `instance show` poll has to wait for the scheduler to
reallocate AND for the new CRN to boot the VM + map SSH. The original
test split these into 300s + 180s + 60s polls; my single poll was set
to 300s, which let migration time out in CI. Bump to 540s to match the
original headroom.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-structured refactoring. The mapped_ports format assumption needs verification against actual CLI output to avoid a runtime TypeError. Also address the missing running state check (minor regression), and deduplicate _resolve_crn_host/NODESTATUS_ADDR into conftest.py. The SSH auth failure is environment-related, not a PR regression.

tests/test_migration.py (line 113): mapped_ports format risk: the old code polled the raw CRN API which returned {"22": {"host": 12345}}mapped.get("22") returned a dict, not a port number. If aleph instance show --verbose passes this through without flattening, int(data["mapped_ports"]["22"]) will raise TypeError. Verify the actual CLI output format and add handling for the nested case, or test against the real CLI.

tests/test_migration.py (line 96): Missing running state check: old _wait_for_vm_ssh_port required execution.get("running") before returning. The replacement only checks allocated_node + mapped_ports. If these get reported before the VM is booted, SSH connections arrive before sshd is ready. In practice _wait_for_ssh's polling masks this, but it's a minor behavioral regression.

tests/test_instances.py (line 26): Code duplication: _resolve_crn_host and NODESTATUS_ADDR are identical in both test files (test_instances.py:26-38, test_migration.py:43-53). Extract to conftest.py as a shared helper function.

Root cause of the migration timeout (found via CI scheduler-rs logs):
the VM never migrated because the unlink was a silent no-op, so
scheduler-rs kept seeing the CRN as linked ("Plan updated: 2 nodes",
never re-planned) and never re-dispatched.

Node operations embed the *current network's name* as the corechannel
`--network-tag`. crn-up.sh links the CRNs with the CLI's builtin
default network → tag `mainnet`. The aleph_cli_config fixture had run
`network add testnet` + `network use testnet`, so every `node unlink`
issued via aleph_cli was tagged `testnet` and no longer matched the
`mainnet`-tagged link — the network accepted the POST (exit 0) but it
referenced a non-existent link, so nothing changed downstream. This is
why bumping the timeout didn't help: migration was never going to
happen.

Fix: set the scheduler URL on the builtin `mainnet` network in place
(`network set --scheduler-url`) instead of creating/switching to a
differently-named network. instance show/ssh still resolve the local
scheduler, and node-op tags stay `mainnet`, matching crn-up.sh.

Verified by dry-run: under the new fixture config, `node unlink`
embeds tags ["unlink","mainnet"] (was ["unlink","testnet"]).
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid refactor replacing ad-hoc HTTP/curl/urllib calls with the aleph CLI across shell scripts and Python tests. The key architectural win is the aleph_cli_config fixture that correctly isolates CLI config while preserving network-tag parity with crn-up.sh — fixing the silent migration breakage that was the root cause of CI timeouts. Code is clean, well-commented, and the net -69 line diff confirms genuine simplification.

tests/test_instances.py (line 81): mapped.get("22") only checks string keys, but the CRN API may return integer keys ({22: port}). The old code had mapped_ports.get("22") or mapped_ports.get(22). If the CLI doesn't normalize keys to strings, this poll will silently time out. Same issue at test_migration.py:114.

tests/test_migration.py (line 43): _resolve_crn_host is an exact duplicate of the function in test_instances.py:26. Consider extracting to conftest.py as a module-level helper or a small fixture.

scripts/fund-test-accounts.sh (line 103): Uses bare aleph (resolved from PATH) while crn-up.sh uses $aleph_cli resolved from $BIN_DIR/aleph. If aleph isn't on PATH in the CI runner's compose-exec context, this could fail. Consider using the same indirection pattern as crn-up.sh.

Real root cause of the migration failure (confirmed from CI scheduler-rs
logs: it stayed at "2 nodes" and never re-dispatched): nodestatus only
processes corechannel operations whose tag matches its FILTER_TAG, which
was `mainnet`. The CRNs were therefore linked under the `mainnet` tag,
and any unlink had to carry the same tag to take effect.

Rather than keep masquerading as mainnet, make the whole testnet use the
`testnet` tag consistently, and pin it explicitly so we can never fall
back to mainnet by accident:

- nodestatus: FILTER_TAG mainnet -> testnet (both the corechannel and
  balances services, deploy/docker-compose.yml). This is the setting that
  selects which network's corechannel ops nodestatus ingests
  (aleph_nodestatus Settings.filter_tag, env FILTER_TAG).
- The CLI derives the corechannel --network-tag from the *current
  network's name*, so every invocation now passes `--network testnet`
  against a `testnet` network:
  - tests/conftest.py: aleph_cli_config registers a `testnet` network
    (scheduler URL -> local) and aleph_cli passes `--network testnet` on
    every call.
  - scripts/crn-up.sh: registers `testnet` and passes `--network testnet`
    on create-ccn / create-crn / link / node list.
  - scripts/fund-test-accounts.sh: passes `--network testnet` on the
    account-balance reads (raw --ccn, so the network is never contacted,
    but explicit for safety).

The raw `--ccn <url>` always wins over the network's configured CCN
(resolve_ccn_url early-returns on a URL), so pointing at the local CCN is
unaffected. Balance/credit tests query by post type + sender, not by tag,
so flipping the balances service is safe.

Verified by dry-run: create-crn/link/unlink now embed the `testnet` tag
under this config (were `mainnet`).
@odesenfans odesenfans deployed to digitalocean May 28, 2026 21:33 — with GitHub Actions Active
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor replacing ad-hoc HTTP/shell plumbing with aleph CLI calls. Reduces test_migration.py by ~45% while maintaining correctness. The corechannel network-tag invariant (testnet) is consistently applied across scripts, docker-compose, and test fixtures. Well-documented design decisions and clear reasoning throughout.

tests/conftest.py (line 54): This TESTNET_NETWORK constant is the linchpin tying together scripts, nodestatus FILTER_TAG, and test operations. Consider adding a comment noting that if this is ever changed, the matching FILTER_TAG in deploy/docker-compose.yml and --network in scripts/*.sh must also change — currently a silent no-op failure mode exists if they drift.

tests/test_instances.py (line 26): _resolve_crn_host and _poll are duplicated verbatim in test_migration.py. Consider extracting to conftest.py for a future cleanup — non-blocking.

tests/test_migration.py (line 43): _resolve_crn_host is duplicated from test_instances.py. Same suggestion as above.

@odesenfans odesenfans merged commit 18d9b39 into main May 28, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants