feat: [MR-144] Reuse unchanged canister digests across HashTrees#10532
feat: [MR-144] Reuse unchanged canister digests across HashTrees#10532alin-at-dfinity wants to merge 78 commits into
HashTrees#10532Conversation
* Use precomputed empty leaf hash instead of computing it from scratch every time. * Pass the key to `MapTransformFork::mk_tree` by reference, don't clone it (especially since it's only used in a couple of cases). * Simplidy CanisterFork.
Looks like it was actually dm-2 (store-crypt) and/or dm-4 (store-shared--data, more likely) suffering from read amplification, not vda. They're probably all the same (virtual?) device underneath, which is why reads appear to be spiking on all 3 at once. But it's likely the shared data partition that is actually being pounded. Reduce the read-ahead for all these devices similar to what was done for vda, just to be safe.
…10422) ## Summary `ic-admin get-elected-guestos-versions --json` (alias `get-blessed-replica-versions`) stopped emitting JSON. The handler used to print JSON via the generic `print_and_get_last_value` helper, but #10301 rewrote it to enumerate `ReplicaVersionRecord` entries with a plain `println!` loop and never re-wired `opts.json`. Since `--json` is a global flag, it is still accepted but silently ignored, so the command always prints plaintext. `get-elected-hostos-versions` had the same plaintext-only behavior. This broke downstream automation in `dre-airflow`, which parsed the JSON output and started failing with `JSONDecodeError: Extra data: line 1 column 7 (char 6)` (the first version hash's leading digits parsed as an integer). This PR restores `--json` for both subcommands: - with `--json`: print the elected version IDs as a JSON array (`["<hash>", ...]`) - without `--json`: unchanged, newline-separated version IDs ## How was this tested? `bazel build //rs/registry/admin:ic-admin` builds successfully. Manual shape (with `--json`): ```json [ "abc...def", "012...789" ] ``` --------- Co-authored-by: IDX GitHub Automation <infra+github-automation@dfinity.org>
Update mainnet revisions file to include the latest version released on the mainnet. This PR is created automatically using [`mainnet_revisions.py`](https://github.com/dfinity/ic/blob/master/ci/src/mainnet_revisions/mainnet_revisions.py) Co-authored-by: CI Automation <infra+github-automation@dfinity.org>
This PR adds a PocketIC operation to delete a subnet: - the subnet's `StateMachine` is properly dropped from the PocketIC state (including state directory removal); - the registry is updated accordingly (including registry changes in the registry canister if the registry canister has been bootstrapped by PocketIC, i.e., not manually). Note. This PR does not implement subnet deletion in DSM! Stale artifacts (such as stale streams and hanging unbounded-wait calls) might persist. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…0426) Following the introduction of structured `CanisterStates` (hot vs cold canisters), almost half of the remaining DSM overhead is not covered by `mr_process_batch_phase_duration_seconds`, so it can only be bundled under "other" (the difference to `mr_process_batch_duration_seconds`). Explicitly track `read_registry()` duration (including blockmaker metrics updates, likely insignificant); and schedule/queues garbage collection plus (heavy) `canister_state_bytes` computation.
…gine (#10427) This PR extends the migration canister to reject canister migration requests in which the migrated or replaced canister is on a cloud engine. Currently, such a canister migration is rejected because xnet calls to a cloud engine are not allowed. This PR makes the requirement explicit so that canister migration from/to a cloud engine keeps being rejected once xnet calls to a cloud engine are allowed.
## Background Currently, the price for an HTTP outcall is calculated and charged upfront. In the future, we want to implement a "pay-as-you-go" pricing model. This model assigns a fraction (budget) of the total attached cycles to each participating replica (per-replica allowance). As the request passes through the target server and the transform function, the HTTP adapter of each replica subtracts cycles from its own budget, based on the amount of resources it consumed (download time, downloaded bytes, transform instructions). In the end, the amount of remaining cycles (refund) is passed back to the replica together with the HTTP response. ## Proposed Changes With this PR, we let the replica include (and sign) this refund as part of the gossiped response share. Before, each replica only signed the response metadata which should be equal across all responses in order to reach consensus. Now, the signed messages may be different even across agreeing replicas, as each of them could sign a different refund share. Therefore, we change the structure of the aggregated HTTP response proof. Previously, this was just the metadata of the response and a collection of individual signatures over it. Now, it is the metadata and a collection of individual signatures and refund receipts of each replica. In order to verify the proof, the individual messages consisting of both the metadata and the refund share have to be reconstructed, such that the signature can be verified. For that reason we additionally switch signature verification of aggregated responses to use the new batch verification API for multiple messages, similar to what was done in #10345. As a side effect, this comes with some performance improvements for payload validation: ``` canister_http_payload_verification/mixed_subnet34 time: [146.72 ms 148.46 ms 150.26 ms] change: [-14.887% -13.555% -12.243%] (p = 0.00 < 0.05) Performance has improved. canister_http_payload_verification/many_replicated_responses_subnet34 time: [225.27 ms 227.44 ms 229.63 ms] change: [-17.525% -16.435% -15.417%] (p = 0.00 < 0.05) Performance has improved. canister_http_payload_verification/many_non_replicated_responses_subnet34 time: [11.124 ms 11.263 ms 11.406 ms] change: [-61.100% -60.382% -59.642%] (p = 0.00 < 0.05) Performance has improved. canister_http_payload_verification/many_divergence_responses_subnet34 time: [116.12 ms 117.52 ms 118.94 ms] change: [-4.0609% -2.4873% -0.9067%] (p = 0.00 < 0.05) Change within noise threshold. canister_http_payload_verification/many_flexible_responses_subnet34 time: [253.98 ms 256.09 ms 258.22 ms] change: [-3.1565% -2.0740% -0.8815%] (p = 0.00 < 0.05) Change within noise threshold. ``` Additionally, during payload validation, we verify that none of the refund shares exceed the per-replica allowance. Note that in the current (legacy) pricing model, nothing is refunded yet, so the allowance, and the returned refund is always 0. --------- Co-authored-by: IDX GitHub Automation <infra+github-automation@dfinity.org>
This PR enables the canister log memory store feature on all subnets.
It seems that Ubuntu 26 upgrade changed the way the network is initialized by `systemd` which breaks our intermediate DHCP step. It's hard to debug especially in combination with SELinux policies. * Use `dhcpcd` instead of `systemd-networkd` to obtain a temporary DHCP address. This avoids problems with new Ubuntu 26 init order (`systemd-networkd` is started now by `systemd` early it seems and if we launch a 2nd instance - it falls apart). * Remove `console=tty0` from kernel boot args. It was added to support logging on serial-port-less clouds, but the problem is that when both `ttyX` and `ttySX` are present - the kernel makes the normal (graphical / ttyX) console the main one (regardless of their order) and serial console doesn't get any recovery shell or proper logging. We can try later to improve that somehow to get the best of the both worlds, though it's not easy. Also see systemd/systemd#9899 about that. * Alter the `icos_logging` so that it logs to both `stderr` and `journald` by default to ease debugging * Remove timeout for `init-config` service since it anyway is critical and it makes not much sense to kill it if it hangs around longer * Add timeouts to cloud-related HTTP calls, add more verbose logging
…10435) ## Summary - Raises `MAX_ENVIRONMENT_VARIABLES` in `rs/config/src/execution_environment.rs` from 20 to 32. ## Test plan - [x] `rustfmt` clean - [x] `cargo clippy -p ic-config` clean - [x] `bazel build //rs/config:config` - [x] Ran directly-affected bazel tests (depth 2); env-var specific test (`test_environment_variable_system_api`) passes. Unrelated darwin-local flakes were observed in stable-memory / NNS / state_manager / consensus integration tests; none reference env vars. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Bjoern Tackmann <bjoern@dfinity.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This bumps our bazel version to the latest available version. Some dependencies had to be bumped for 9.X support, but overall very little changed.
## Root Cause
The `//rs/tests/networking:canister_http_*` tests fail in `setup` with:
```
Orchestrator rule did not appear in time.: Timed out waiting for orchestrator rule.
...
Output: Err: Error: No such file or directory
list chain ip6 filter OUTPUT
```
`canister_http::setup` → `start_httpbin_on_uvm` →
`wait_for_orchestrator_fw_rules` polls every node for the
orchestrator-applied `ip6 filter OUTPUT` chain. In the analyzed failure
the **entire `ip6 filter` table** was missing on one node and never
appeared during the whole 60s retry window — long after all replicas
reported healthy. That's a *permanent* condition, not a slow one.
The GuestOS firewall application pipeline is:
1. The orchestrator generates the nftables ruleset and writes it to
`/run/ic-node/nftables-ruleset/nftables.conf` — but **only when its
content changes** (or on the very first check after start).
2. `reload_nftables.path` (`PathChanged=`) triggers
`reload_nftables.service`, which ran:
- `ExecStartPre=/usr/sbin/nft flush ruleset`
- `ExecStart=systemctl reload nftables.service` (→ `nft -f
/etc/nftables.conf`)
This sequence is **neither atomic nor retried**:
- The flush runs *outside* the nft transaction. If the subsequent load
fails for any transient reason, the node is left with an **empty
ruleset**.
- The canonical transient failure is hostname resolution of `hostos` in
the template rule `ip6 saddr { hostos } ct state { new } tcp dport {
42372 } accept`, which is resolved at `nft` load time (via the
`nss_icos` NSS plugin, falling through to DNS) and can fail with `Error:
Could not resolve hostname: Temporary failure in name resolution` early
after boot.
- Nothing retries: the service is `Type=oneshot` without restart, and
the path unit only re-fires when the orchestrator rewrites the file —
which it effectively never does again on a short-lived testnet.
A single transient `nft` load failure therefore permanently disables the
firewall on that node, and the test times out waiting for a chain that
will never exist. The `canister_http_*` family is the canary because
it's the only test family that explicitly waits for this rule on
**every** node of its testnet.
PR #9857 previously narrowed this race by adding
`After=nss-lookup.target` ordering to `reload_nftables.service`, but
`After=` is pure start ordering — by the time the path unit fires, those
targets are long active, so it doesn't guarantee that resolving `hostos`
succeeds. The race window was narrowed, not closed.
## Fix
Make the firewall apply **atomic** and **retried**:
- Prepend `flush ruleset` to the orchestrator-generated nftables config
(both the replica and the boundary-node firewall templates in
`ic.json5.template`), so flushing the old ruleset and loading the new
one happen in a single `nft` transaction. A failed load now leaves the
previously active ruleset in effect instead of leaving the node without
a firewall.
- In `reload_nftables.service`, drop the non-atomic `ExecStartPre`
flush, run `nft -f /etc/nftables.conf` directly, and add
`Restart=on-failure` / `RestartSec=2` (bounded by
`StartLimitIntervalSec=300` / `StartLimitBurst=30`, i.e. ~60s of
retries). Since the load is atomic, retrying is safe. If it keeps
failing, the unit ends up in a failed state, making the problem visible
(e.g. to `//rs/tests/node:guestos_no_failed_systemd_units`).
The 6 nftables golden files are updated accordingly.
## Verification
- `bazel test //rs/orchestrator:orchestrator_test` (nftables golden
tests) passes.
- `bazel test --runs_per_test=//rs/tests/networking:canister_http_test@3
//rs/tests/networking:canister_http_test
//rs/tests/node:guestos_no_failed_systemd_units`
---
This PR was created following the steps in
`.claude/skills/fix-flaky-tests/SKILL.md`.
Updating base container image references. Run URL: https://github.com/dfinity/ic/actions/runs/27337008974 Co-authored-by: github-merge-queue <118344674+github-merge-queue@users.noreply.github.com>
This moves two further IC-OS tools, dosfstools and mtools, out of the Dockerfile and into the Bazel build. This is similar to what we did with e2fsprogs. This makes the build more portable. See also: #10369
…#10445) ## Root Cause On PRs where testnet allocation is pinned to the local DC (#10436), `//rs/tests/networking:canister_http_socks_test` fails with every outcall attempt erroring as: ``` direct connect ... ConnectionRefused and connect through socks "... Connector(ConnectError(\"tcp connect error\", ... ConnectionRefused))" ``` The *direct* refusal is intentional (the test injects an egress-reject rule for httpbin first). The real failure is the SOCKS leg: the nested `Connector(ConnectError(...))` means the TCP connection **to the SOCKS proxy itself** — API boundary node port 1080 — was refused, on every attempt for the entire run, on **both** API BNs. A "connection refused" (RST) pins this down precisely: - The API BN firewall uses `policy drop` with an explicit `accept` for node IPs on port 1080, so a firewall problem would cause *timeouts*, not RSTs. - An RST means the BN was up, its global IPv6 was configured in the kernel, the packet passed the firewall — and **nothing was listening on :1080**. `danted.conf` configured the listener as: ``` internal: enp1s0 port = 1080 ``` Dante resolves an interface name to its addresses **once at startup** and never re-binds. GuestOS receives its global IPv6 via SLAAC. If `danted.service` starts while `enp1s0` only has its link-local address (router advertisement not yet processed), danted binds the link-local scope only and the global `[...]:1080` endpoint stays closed forever — `Restart=always` never kicks in because danted keeps running happily. This is confirmed directly by the journald logs of one of the failing API BNs: ``` 13:57:58.543 enp1s0: Gained IPv6LL 13:57:58.547 Finished systemd-networkd-wait-online.service - Wait for Network to be Online. 13:57:58.549 Reached target network-online.target - Network is Online. 13:57:58.553 Started danted.service - SOCKS (v4 and v5) proxy daemon (danted). ... 13:57:58.637 danted[1030]: info: Dante/server[1/1] v1.4.4 running ``` `systemd-networkd-wait-online` completed 4 ms after the interface gained only its **link-local** address — `network-online.target` does not guarantee a global SLAAC address — and danted started 6 ms later, binding the link-local address only. This race got amplified by DC pinning: in the failing run all nine VMs of the testnet (5 replicas, 2 API BNs, 2 UVMs) were packed onto a single host, slowing down boot and RA/SLAAC delivery enough to hit the race on both API BNs at once. The bug is pre-existing on `master`; the DC pinning only widened the window. The same fragility was previously patched around in #4658 (`PartOf=systemd-networkd.service` to restart danted when networkd restarts). ## Fix Bind the wildcard address instead of an interface name: ``` internal: :: port = 1080 ``` A wildcard bind does not depend on address assignment timing — connections to the global address succeed as soon as the address exists. Access to the SOCKS proxy remains restricted through the firewall, which only whitelists node IPs on port 1080 (the config already noted "Allow everyone - this is already restricted through the firewall"). `external: enp1s0` (the outgoing side) is unchanged. ## Verification - `bazel test --runs_per_test=3 //rs/tests/networking:canister_http_socks_test` passes 3/3 (avg 186 s) on the DC-pinned branch that previously reproduced the failure.
This commit introduces XNet reject signals that will fire when messages are sent to/from engines that are not allowed. The current PR only introduces the ability to read these signals, without writing them themselves. The actual implementation of XNet with engines will come in a follow-up, and will start sending these signals.
…tem-tests (#10436) # What Force Farm testnet allocation to the same DC as the GitHub runner executing the test (which is also the DC holding the just-built images) — for **every** system-test. This generalizes the opt-in `.allocate_testnet_to_local_dc()` mechanism introduced in #10122. # Why #10122 showed that cross-DC transfers of large images (e.g. 2.6G SetupOS images from `dm1` to `zh1` or vice versa) cause download timeouts and flaky tests. The same applies to all system-tests, so testnets should always be allocated in the DC where the test runs and the images live. # How * Replace the `allocate_testnet_to_local_dc` bool on `SystemTestGroup` (and its builder method) with the `ALLOCATE_TESTNET_TO_LOCAL_DC` environment variable, read by the test driver in `create_group_setup` (accepted values: `1`/`true`/`0`/`false`). * Set `ALLOCATE_TESTNET_TO_LOCAL_DC=1` unconditionally from the `system_test` macro in `rs/tests/system_tests.bzl`, covering both the plain and the `_colocate` targets. This remains a no-op when the `DC` volatile status variable is unknown (e.g. local runs without `NODE_NAME`). * Drop the now-redundant `.allocate_testnet_to_local_dc()` calls from the 7 nested system-tests. * Replace `dep_download_url` in `rs/tests/upload_systest_dep.sh` to no longer go via the dc_http_proxy but point directly at the DC-local bazel cache: `https://artifacts.$cluster.dfinity.network/cas/$dep_sha256`. * Force the `release-system-tests` job in `release-testing.yml` and the `system-tests-benchmarks-nightly` job to run in runner group `dm1` (the `&dind-large-setup` anchor moved to `setup-guest-os-qualification`, whose jobs keep their current runners). # Notes for reviewers * Pinning all tests to the runner's DC concentrates Farm load in `dm1` where most runners live; watch for allocation failures after rollout. Extra charts have been added to the [Farm Dashboard](https://grafana.dm1-idx1.dfinity.network/d/uwEFG_yGk/farm-dashboard?from=now-3h&to=now&timezone=utc&refresh=10s) for monitoring dm1 and zh1. * Farm hosts/UVMs now fetch deps from `artifacts.<cluster>.dfinity.network` over HTTPS (previously plain http via proxy-global:8080); the redirect server already returns URLs of this form.
…ly (#10431) The users should be able to control their own settings for the engines they use. For now we are scoping that to only `subnet_admins` and `replica_version_id` but in the future we might want to allow more things. --------- Co-authored-by: IDX GitHub Automation <infra+github-automation@dfinity.org> Co-authored-by: pietrodimarco-dfinity <124565147+pietrodimarco-dfinity@users.noreply.github.com>
This introduces a skill to fix determinism issues, as well as two other skills that can be used by agents and humans alike to perform IC builds on non-DFINITY infrastructure.
…10455) ## Problem Since the migration of the XNet payload builder to HTTP/2 (#1506), a dead or stalled pooled connection to an XNet endpoint is reused **indefinitely**: - Cancelling a request (e.g. due to the 5-second query timeout in `XNetClientImpl::query`) only resets the respective h2 *stream*; it does not affect the underlying connection. - The connection pool only drops connections that report themselves as closed; a stalled-but-established h2 connection never does. A connection that ends up dead/stalled (e.g. due to packet loss or CPU starvation while under load) therefore black-holes **all** pulls from the respective subnet pair until the replica restarts. Under the previous HTTP/1.1 client this could not happen, because cancelling a request closes its connection. ### Observed impact `//rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test` failed reproducibly: the synchronized startup of 120×119 TLS connections (while replicas were CPU-starved) wedged 191 directed subnet pairs. Each wedged pair timed out on every pull (at the 5s timeout, ~2.3/s) for the entire run with zero recovery, pushing best-effort calls on 20 subnets past their 30s deadline (5.7%–13.7% failed calls vs. the 5% threshold). Per-subnet error rate was predicted by the number of wedged pairs involving that subnet with correlation 0.997. ## Fix Enable HTTP/2 keep-alive pings on the XNet client: 10s interval, 5s timeout, also while idle (`pool_max_idle_per_host(1)` keeps an idle connection pooled for up to 600s). A dead connection is now detected and closed within ~15s — well below the 30s best-effort deadline — after which the next query establishes a fresh connection. Note: this requires supplying an h2 timer via `.timer(TokioTimer::new())` in addition to the existing `.pool_timer(...)`; without it hyper panics at runtime once keep-alive is enabled. As defense in depth, also set a 5s TCP connect timeout on the `HttpConnector` underlying `TlsConnector` (both `new` and `new_for_tests`), so a stuck handshake fails fast rather than relying solely on the per-query timeout. This does not address the observed failure on its own — the wedged connections had completed TCP+TLS and stalled at the h2 layer — but it bounds connection-setup stalls. Cost: at most ~119 connections per node → ~12 PING frames/s/node; both endpoints are replica-controlled (hyper's h2 server ACKs pings natively). ## Verification On CI the `Release System Tests` job [succeeds](https://github.com/dfinity/ic/actions/runs/27438248671/job/81105591015) including `//rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate` ([BuildBuddy](https://dash.dm1-idx1.dfinity.network/invocation/2d53507f-fe7d-45a8-a006-1fecfd20666e?target=%2F%2Frs%2Ftests%2Fmessage_routing%2Fxnet%3Axnet_slo_120_subnets_staging_test_colocate&targetStatus=5)). Manual runs (from `dm1`) succeed as well: ``` bazel test //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test \ --cache_test_results=no --test_output=errors \ --test_arg=--set-required-host-features=dc=dm1 ``` - Before: FAILED — 20/120 subnets above the 5% failed-call threshold (5.7%–13.7%). - After: **PASSED in 745.8s — 0 failed calls on all 120 subnets** (worst subnet: 0/56049). - `//rs/xnet/payload_builder:payload_builder_test` and `:payload_builder_integration` pass (the latter exercises real client queries and would catch the missing-timer panic). After reducing the ping frequency to a 10s interval and adding the connect timeout (review follow-up), `bazel test //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test --cache_test_results=no` still succeeds, and the unit/integration tests above still pass.
…istry versions are empty (#10408) The peer manager maintains connections to all peers that are in the subnet record in any registry version between the "oldest version in use" and the latest locally available registry version. The "oldest registry version in use" is determined by the CUP. In case of a subnet creation, this CUP is a registry CUP that contains the registry version just _before_ the subnet was created (see [here](https://sourcegraph.com/r/github.com/dfinity/ic@8cf338f5ef95432dc61233df889d98837d780466/-/blob/rs/registry/canister/src/mutations/do_create_subnet.rs?L63)). This means that when the peer manager attempts to make connections to peers according to the "oldest registry version in use", the subnet doesn't actually exist yet, which then triggers the `empty_list_of_node_records` error. This is fine however, because it will find the correct peers in the subsequent registry version (which is guaranteed to exist, otherwise we would not have the registry CUP). To improve the error signal of this metric, we change it such that it is only incremented if _all_ registry versions do not contain transport infos, meaning there are no peers shared with the new subnet topology.
…v4 for UVMs (#10464) All `//rs/tests/networking:canister_http...` tests are [very flaky](https://dash.dm1-idx1.dfinity.network/invocation/34f3e50b-0713-40e2-a983-b71373683092) due to their `httpbin` UVMs failing to acquire an IPv4 address in time in the `dm1` DC. To get a bit more insight into why this is the case we log `networkctl status enp2s0` on timeout.
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
mbjorkqvist
left a comment
There was a problem hiding this comment.
The DeFi-related changes seem to be coming from a sub-optimal merge of master, but approving nonetheless!
Does not touch any Governance code. This was simply a bad merge (or rather a git pull of changes from GitHub without an explicit git merge).
Lesson learned.
|
FWIW, in But it improves the hot path (reusing subtree digests from an earlier |
| impl PartialEq for SubtreeSource { | ||
| /// A conservative, false-negative-only reuse-gate; see the type-level note. | ||
| fn eq(&self, other: &Self) -> bool { | ||
| self.addr() == other.addr() && std::ptr::fn_addr_eq(self.expander, other.expander) |
There was a problem hiding this comment.
Why do we need && std::ptr::fn_addr_eq(self.expander, other.expander)? Isn't self.addr() == other.addr() sufficient?
There was a problem hiding this comment.
It's not. We need the same data AND the same certification version. Equal function pointers implies the same certification version. (The reverse is not true, you could have two functions that produce output with the same certification version, but we don't have that and we wouldn't care anyway.)
It's described in the SubtreeSource doc comment. If that's not clear or it's too far from the spot, I can move it / duplicate it here.
There was a problem hiding this comment.
What if
// The bare address, used for identity comparison.
fn addr(&self) -> *const () {
Arc::as_ptr(&self.) as *const ()
}
(I thought this is what self.addr() returns when writing this comment.)
| // stand-alone subtree is just `hash_lazy_tree` on that subtree's root, which | ||
| // is in turn materialized for the same reason. |
There was a problem hiding this comment.
hash_lazy_tree is not called recursively though so not sure what this means to say
There was a problem hiding this comment.
It refers to (temporarily) building stand-alone subtrees in order to compute the root hash of stub nodes. Let me know if you have a better wording suggestion.
There was a problem hiding this comment.
So you want to point out that we use build_tree as opposed to build_child here? If so, I'm not actually sure why we couldn't use build_child here, too.
Or you want to point out that we need to assign the root field? If so, I'd justify it by the fact that the recursive build_child/build_tree functions cannot set the root (because they are recursive).
| let children_range = if root.kind() == NodeKind::Stub { | ||
| NodeIndexRange::default() | ||
| } else { | ||
| ht.root_labels_range.clone() |
There was a problem hiding this comment.
this is a "hack", right? we set the parent as "root" (via NodeId::empty()) in the above call to build_child so that its labels range is set as ht.root_labels_range (over-writen for every new child which is fine since splice_subtree ignores root_labels_range anyway)
There was a problem hiding this comment.
I suppose that on the one hand it's just how you build an independent HashTree (you start with the root); and on the other hand, you then need to record the range of its children. This code was already here, see a couple of lines below.
There was a problem hiding this comment.
This code was already here, see a couple of lines below.
Indeed, and I believe your change makes sense, I just wanted to confirm my understanding of using ht.root_labels_range.clone() repeatedly on a shared tree.
|
AFAICT, Governance approval is not needed here and I'm being emailed in error. If that's not right, please, contact me in Slack. |
Summary
hash_lazy_treere-hashes every canister's subtree each round, even though almost all canisters are untouched. This PR lets a newHashTreereuse subtree digests from a previous one, keyed by the canister's backingArc<CanisterState>(shared copy-on-write, replaced only on mutation).What changed
NodeKind::Stubcollapses a reusable subtree to a digest-only node. ALazyForkopts in viasubtree_source();CanisterForkreturns aSubtreeSourceover itsArc<CanisterState>. Other forks are still materialized inline.SubtreeSource. A type-erased, owned handle (Arc<dyn Any>+ expander fn pointer) that keeps the source alive (no ABA on the address) and re-expands the stub for witnesses.Eqis a conservative, false-negative-only reuse gate: same source allocation and same expander.expand_canister::<const V: u32>bakes the certification version into the function pointer, so it isn't stored per stub and reuse can't mix versions.hash_lazy_tree_with_baseline. Lockstep-joins the new tree against a baseline; matchingSubtreeSources reuse the baseline digest, otherwise rebuild. Witnesses expand stubs viaSubtreeSource::expand(), so the tree stays self-contained.PARALLEL_MIN_CHILDRENrebuilds. The common reuse-heavy path stays sequential; large/from-scratch forks parallelize.Tests
New
tests/subtree.rs: stubbing, witness/absence-proof expansion from source, baseline builds (full and partial change) matching from-scratch, andreuse_is_by_identity_not_by_value(sharedArcs reuse, replacedArcs rebuild).Follow-up
Actually invoke
hash_lazy_tree_with_baselinefrom production code, with the latest availableHashTree. For now, it is only invoked from the benchmark and tests.Stubbing is implemented for
/canistersonly; ingress/messagesare left for later.