refactor: use aleph CLI for operations that were ad-hoc#19
Conversation
Replaces hand-rolled CCN/scheduler/CRN plumbing with the v0.10.3 CLI
where it now subsumes the same work end-to-end.
- crn-up.sh: 3 corechannel `curl + jq` queries → `aleph node list`
(filter by owner / type / corechannel-address).
- fund-test-accounts.sh: 2 balance `curl + jq` queries →
`aleph account balance --json` (.credits).
- conftest: `ccn_aggregates` and `crn_nodes` fixtures now run through
`aleph aggregate get` and `aleph node list` instead of urllib +
manual aggregate parsing.
- test_instances / test_migration: drop the scheduler-api allocation
poll and the CRN `/about/executions/list` poll in favor of a single
`aleph instance show --verbose --json` (placement.allocated_node for
CRN identity, mapped_ports for SSH readiness). The marker-write/read
and "echo hello" SSH go through `aleph instance ssh` (IPv6).
To make `instance show`/`instance ssh` reach the local scheduler, a
new session-scoped `aleph_cli_config` fixture seeds an isolated CLI
config (XDG_CONFIG_HOME) with a `testnet` network whose scheduler URL
points at $ALEPH_TESTNET_SCHEDULER_API_URL. A `_ssh_home` fixture
provides a permissive `~/.ssh/config` ($HOME override scoped to
`instance ssh` calls) because the CLI does not expose `ssh -o`
passthrough.
Net: -69 lines in the tests/scripts touched (test_migration.py shrinks
by ~45%) and the test bodies now express intent ("show this instance")
rather than URL/field plumbing.
Caveat: `aleph instance ssh` connects to the VM's IPv6, not the
IPv4+port-forward path used previously. If the CI runner can't reach
the CRN's IPv6 pool we'll see it on the first run.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Clean refactor replacing ad-hoc HTTP/curl calls with the aleph CLI across shell scripts and pytest fixtures. The new aleph_cli_config and _ssh_home fixtures correctly isolate CLI state and SSH config for test environments. Test bodies shrink ~45% and read much more declaratively. The only note is the duplicated _poll() helper in both test files, which is pre-existing and out of scope for this PR. The IPv6 caveat for SSH connections is honestly documented. No correctness or security issues found.
tests/test_instances.py (line 6): The _poll function here is identical to the one in test_migration.py:22. Consider extracting to a shared helper (e.g. tests/conftest.py) in a follow-up.
tests/test_migration.py (line 22): Same _poll function as in test_instances.py:6 — worth deduplicating in a follow-up.
Two CI-revealed bugs from the previous commit:
1. `ccn_aggregates` returned the wrong shape. `aleph aggregate get
<key>` emits `{"<key>": <value>}` (matching the SDK's
AggregateResponse.data layout); the urllib fixture had unwrapped that.
Drop the wrapping in the new fixture too — `test_corechannel_aggregate_exists`
and `test_create_node` both rely on `agg.get("nodes"/"resource_nodes")`.
2. `aleph instance ssh` connects over the VM's /124 IPv6, and the CI
runner has no route to the CRN droplet's IPv6 pool — both
`test_instance_create_and_ssh` and `test_instance_migration` timed
out at 60s with no SSH progress. Revert just the SSH transport to a
manual ssh subprocess over the CRN's IPv4 + the host-side mapped
port. Discovery still goes through the CLI: `aleph instance show
--verbose` yields placement.allocated_node + mapped_ports["22"] in
one call, and `aleph node list --type crn --all` resolves the
allocated node hash to its registered URL. Drop the now-unused
`_ssh_home` fixture and HOME override on `aleph_cli`.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Well-structured refactor that replaces ad-hoc HTTP+JSON-plumbing with the aleph CLI. The new conftest fixtures (aleph_cli_config, aleph_cli) correctly isolate the CLI configuration and prevent cross-test pollution. Test logic in both test_instances.py and test_migration.py is significantly simpler — instance show --verbose collapses what was previously 3-4 polling loops into one call, and SSH goes through the CRN IPv4+mapped-port path with the same reliability as before. The only notable code quality issue is the duplication of _resolve_crn_host across both test files.
tests/test_instances.py (line 26): This helper is identical to test_migration.py:43. Consider extracting into conftest.py as a shared utility (or making it a fixture) to avoid maintenance duplication.
The combined `instance show` poll has to wait for the scheduler to reallocate AND for the new CRN to boot the VM + map SSH. The original test split these into 300s + 180s + 60s polls; my single poll was set to 300s, which let migration time out in CI. Bump to 540s to match the original headroom.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Clean, well-structured refactoring. The mapped_ports format assumption needs verification against actual CLI output to avoid a runtime TypeError. Also address the missing running state check (minor regression), and deduplicate _resolve_crn_host/NODESTATUS_ADDR into conftest.py. The SSH auth failure is environment-related, not a PR regression.
tests/test_migration.py (line 113): mapped_ports format risk: the old code polled the raw CRN API which returned {"22": {"host": 12345}} — mapped.get("22") returned a dict, not a port number. If aleph instance show --verbose passes this through without flattening, int(data["mapped_ports"]["22"]) will raise TypeError. Verify the actual CLI output format and add handling for the nested case, or test against the real CLI.
tests/test_migration.py (line 96): Missing running state check: old _wait_for_vm_ssh_port required execution.get("running") before returning. The replacement only checks allocated_node + mapped_ports. If these get reported before the VM is booted, SSH connections arrive before sshd is ready. In practice _wait_for_ssh's polling masks this, but it's a minor behavioral regression.
tests/test_instances.py (line 26): Code duplication: _resolve_crn_host and NODESTATUS_ADDR are identical in both test files (test_instances.py:26-38, test_migration.py:43-53). Extract to conftest.py as a shared helper function.
Root cause of the migration timeout (found via CI scheduler-rs logs):
the VM never migrated because the unlink was a silent no-op, so
scheduler-rs kept seeing the CRN as linked ("Plan updated: 2 nodes",
never re-planned) and never re-dispatched.
Node operations embed the *current network's name* as the corechannel
`--network-tag`. crn-up.sh links the CRNs with the CLI's builtin
default network → tag `mainnet`. The aleph_cli_config fixture had run
`network add testnet` + `network use testnet`, so every `node unlink`
issued via aleph_cli was tagged `testnet` and no longer matched the
`mainnet`-tagged link — the network accepted the POST (exit 0) but it
referenced a non-existent link, so nothing changed downstream. This is
why bumping the timeout didn't help: migration was never going to
happen.
Fix: set the scheduler URL on the builtin `mainnet` network in place
(`network set --scheduler-url`) instead of creating/switching to a
differently-named network. instance show/ssh still resolve the local
scheduler, and node-op tags stay `mainnet`, matching crn-up.sh.
Verified by dry-run: under the new fixture config, `node unlink`
embeds tags ["unlink","mainnet"] (was ["unlink","testnet"]).
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Solid refactor replacing ad-hoc HTTP/curl/urllib calls with the aleph CLI across shell scripts and Python tests. The key architectural win is the aleph_cli_config fixture that correctly isolates CLI config while preserving network-tag parity with crn-up.sh — fixing the silent migration breakage that was the root cause of CI timeouts. Code is clean, well-commented, and the net -69 line diff confirms genuine simplification.
tests/test_instances.py (line 81): mapped.get("22") only checks string keys, but the CRN API may return integer keys ({22: port}). The old code had mapped_ports.get("22") or mapped_ports.get(22). If the CLI doesn't normalize keys to strings, this poll will silently time out. Same issue at test_migration.py:114.
tests/test_migration.py (line 43): _resolve_crn_host is an exact duplicate of the function in test_instances.py:26. Consider extracting to conftest.py as a module-level helper or a small fixture.
scripts/fund-test-accounts.sh (line 103): Uses bare aleph (resolved from PATH) while crn-up.sh uses $aleph_cli resolved from $BIN_DIR/aleph. If aleph isn't on PATH in the CI runner's compose-exec context, this could fail. Consider using the same indirection pattern as crn-up.sh.
Real root cause of the migration failure (confirmed from CI scheduler-rs
logs: it stayed at "2 nodes" and never re-dispatched): nodestatus only
processes corechannel operations whose tag matches its FILTER_TAG, which
was `mainnet`. The CRNs were therefore linked under the `mainnet` tag,
and any unlink had to carry the same tag to take effect.
Rather than keep masquerading as mainnet, make the whole testnet use the
`testnet` tag consistently, and pin it explicitly so we can never fall
back to mainnet by accident:
- nodestatus: FILTER_TAG mainnet -> testnet (both the corechannel and
balances services, deploy/docker-compose.yml). This is the setting that
selects which network's corechannel ops nodestatus ingests
(aleph_nodestatus Settings.filter_tag, env FILTER_TAG).
- The CLI derives the corechannel --network-tag from the *current
network's name*, so every invocation now passes `--network testnet`
against a `testnet` network:
- tests/conftest.py: aleph_cli_config registers a `testnet` network
(scheduler URL -> local) and aleph_cli passes `--network testnet` on
every call.
- scripts/crn-up.sh: registers `testnet` and passes `--network testnet`
on create-ccn / create-crn / link / node list.
- scripts/fund-test-accounts.sh: passes `--network testnet` on the
account-balance reads (raw --ccn, so the network is never contacted,
but explicit for safety).
The raw `--ccn <url>` always wins over the network's configured CCN
(resolve_ccn_url early-returns on a URL), so pointing at the local CCN is
unaffected. Balance/credit tests query by post type + sender, not by tag,
so flipping the balances service is safe.
Verified by dry-run: create-crn/link/unlink now embed the `testnet` tag
under this config (were `mainnet`).
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Clean refactor replacing ad-hoc HTTP/shell plumbing with aleph CLI calls. Reduces test_migration.py by ~45% while maintaining correctness. The corechannel network-tag invariant (testnet) is consistently applied across scripts, docker-compose, and test fixtures. Well-documented design decisions and clear reasoning throughout.
tests/conftest.py (line 54): This TESTNET_NETWORK constant is the linchpin tying together scripts, nodestatus FILTER_TAG, and test operations. Consider adding a comment noting that if this is ever changed, the matching FILTER_TAG in deploy/docker-compose.yml and --network in scripts/*.sh must also change — currently a silent no-op failure mode exists if they drift.
tests/test_instances.py (line 26): _resolve_crn_host and _poll are duplicated verbatim in test_migration.py. Consider extracting to conftest.py for a future cleanup — non-blocking.
tests/test_migration.py (line 43): _resolve_crn_host is duplicated from test_instances.py. Same suggestion as above.
The v0.10.3 CLI now subsumes plumbing that we'd previously hand-rolled. This PR replaces it where the CLI gives the same answer in fewer moving parts.
Migrations
scripts/crn-up.sh: 3×curl /api/v0/aggregates/<addr>.json + jqto count CCNs by owner / find a CRN hash by URLaleph node list --type {ccn,crn} --address | --all --corechannel-address X --jsonscripts/fund-test-accounts.sh: 2×curl /api/v0/addresses/X/balance + jq .credit_balancealeph account balance X --json(.credits)conftest.ccn_aggregates: urllib + manualdata.X.Yparsingaleph aggregate get <key> --address X --jsonconftest.crn_nodes: scrape corechannel.resource_nodes from the aggregate dictaleph node list --type crn --all --corechannel-address X --jsontest_instances/test_migration: poll scheduler-api/api/v0/allocation/{hash}, then poll CRN/v2/about/executions/listto findmapped_ports.22.host, then ssh subprocess with-o StrictHostKeyChecking=no -p <port> -i <key>aleph instance show <hash> --verbose --json(rolls up CCN message + scheduler placement + CRN networking + mapped ports), thenaleph instance ssh <hash> --identity <key> -- <cmd>for SSHInfrastructure in
conftest.pyTwo new session-scoped fixtures support the instance-test migration:
aleph_cli_config— creates a tmpXDG_CONFIG_HOMEand registers atestnetnetwork whosescheduler-urlpoints at$ALEPH_TESTNET_SCHEDULER_API_URL. Required becausealeph instance show/sshresolve the scheduler from the current network's config; without isolation we'd either hit the CLI's builtin mainnet scheduler or mutate the user's~/.config/aleph/config.toml._ssh_home— minimal$HOMEwith a permissive~/.ssh/config(StrictHostKeyChecking no, ephemeral known_hosts). The CLI doesn't exposessh -opassthrough, andaleph instance sshrunssshinheriting parent env, so this is the cleanest knob.$HOMEis overridden only forinstance sshinvocations.The
aleph_clifixture also now returnsNone(instead of raisingJSONDecodeError) whenparse_json=Truemeets empty stdout — that's the documented contract foraleph aggregate geton a missing key, and the newccn_aggregatesrelies on it to keep "returns None on miss" semantics.Caveat
aleph instance sshconnects to the VM's IPv6 (the CRN's/124pool). The previous code went over the CRN's IPv4 + a host-side mapped port. If the CI runner can't reach the CRN droplet's IPv6 we'll see it on the first run; falling back to keeping just the manual ssh subprocess for that one call is straightforward.Net diff
194 insertions / 263 deletions (–69 net).
test_migration.pyshrinks ~45%. The test bodies now read like "show this instance / ssh into it" rather than URL+field plumbing.Verified offline
aleph aggregate geton missing key → exit 0, empty stdout (matches fixture semantics).aleph node list --all --type crn --jsonagainst mainnet → array of{type, hash, name, owner, address, ...}— keys match whatcrn_nodes/_find_crn_hash-replacement read.aleph node list --address X --type ccn ...works withoutALEPH_PRIVATE_KEYin env (no--chainrequired; the script's curl-replaced calls had no signing env).