Allow Datil restore test to work with unequal node counts#83
Conversation
Pull request overview
This PR modifies the Datil restore test to support scenarios where the number of nodes differs from the recovery party size, allowing more flexible test configurations.
Changes:
- Added graceful handling for missing backup files and blinders when node count exceeds recovery party size
- Modified assertions to verify successful uploads based on recovery party count rather than expecting all nodes to succeed
- Added logging for initial node count in version upgrade tests
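The count-based pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the PR's actual code; the names `count_successful_uploads`, `backup_exists`, and `recovery_party_size` are hypothetical stand-ins for the test's real identifiers.

```rust
/// Hypothetical sketch: simulate uploading backup shares when the node
/// count exceeds the recovery party size. Nodes without a backup file are
/// skipped gracefully instead of failing the whole test.
fn count_successful_uploads(node_count: usize, recovery_party_size: usize) -> usize {
    let mut success_count = 0;
    for node_index in 0..node_count {
        // Assumption for illustration: only the first `recovery_party_size`
        // nodes hold a backup share.
        let backup_exists = node_index < recovery_party_size;
        if !backup_exists {
            // Previously a missing file was a hard failure; now it is tolerated.
            continue;
        }
        success_count += 1;
    }
    success_count
}

fn main() {
    // 6 nodes but only 4 recovery parties: the test should still pass.
    let successes = count_successful_uploads(6, 4);
    // Count-based assertion instead of one assertion per node.
    assert_eq!(successes, 4);
    println!("successes = {}", successes);
}
```

The key design change is asserting on the aggregate count rather than expecting every node to succeed individually.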
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rust/lit-node/lit-node/tests/upgrades/version_upgrades.rs | Added logging statement to display initial node count for better test observability |
| rust/lit-node/lit-node/tests/integration/backup_datil_long.rs | Modified backup and blinder upload functions to handle unequal node counts, added error handling for missing files, changed assertions from individual to count-based verification |
| rust/lit-node/lit-node/tests/common/lit_actions.rs | Prefixed unused variables with underscores to suppress compiler warnings |
```rust
let file = match file {
    Ok(file) => file,
    Err(e) => {
        error!("No file for: {}", e);
```
The error message "No file for: {}" is unclear: the variable `e` is the file-open error, not the file itself. Consider something more descriptive, such as "Failed to open backup file: {}", to accurately reflect what went wrong.
```diff
- error!("No file for: {}", e);
+ error!("Failed to open backup file {}: {}", tar_file.display(), e);
```
```rust
        success_count += 1;
    }
}
assert!(success_count == recovery_party_size);
```
The assertion `assert!(success_count == recovery_party_size)` should use `assert_eq!(success_count, recovery_party_size)` for better error messages. When the assertion fails, `assert_eq!` prints the actual values of both sides, making debugging easier.
```diff
- assert!(success_count == recovery_party_size);
+ assert_eq!(success_count, recovery_party_size);
```
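The difference in failure output can be demonstrated with a small self-contained example (not part of the PR's code): a failing `assert!` panics with only the stringified condition, while a failing `assert_eq!` panics with a message that includes both the left and right values.

```rust
use std::panic;

fn main() {
    let success_count = 3usize;
    let recovery_party_size = 5usize;

    // A failing `assert!` panics, but the payload only names the condition.
    let plain = panic::catch_unwind(|| {
        assert!(success_count == recovery_party_size);
    });
    assert!(plain.is_err());

    // A failing `assert_eq!` panics with a formatted message that includes
    // both operands, which is what makes debugging easier.
    let detailed = panic::catch_unwind(|| {
        assert_eq!(success_count, recovery_party_size);
    });
    let msg = *detailed.unwrap_err().downcast::<String>().unwrap();
    assert!(msg.contains("3"));
    assert!(msg.contains("5"));
    println!("{}", msg);
}
```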
```rust
info!("Response: {}", response);
public_address
});
};
```
The semicolon after the closing brace on this line is unnecessary. An 'if' statement does not require a semicolon after its closing brace unless it's part of an expression context. Remove the semicolon for cleaner code.
```diff
- };
+ }
```
```
PASS [ 43.215s] (3/3) lit_node::test toxiproxy::perf_tests::load_with_no_latency
```
```rust
client: &Client,
validator_collection: &ValidatorCollection,
backup_directory: &PathBuf,
recovery_party_size: usize,
```
Is this the number of key shares being restored or the number of parties being restored to? Can we restore 4 key shares to 6 parties? If so is recovery_party_size here 4 or 6?
Yes and no. Each backup share (i.e., the ones downloaded by the node ops) can only go to a single new node. In this test, things are a little convoluted since the backup parties and the node ops with backups are equivalent.
In your example, if we have 4 backup shares and 6 nodes, and run this test it will pass. But only because the network receiving the shares will kick the 2 nodes that don't have them - because those 2 nodes can't participate in the DKG. It's all relatively clever. ;-)
But we don't have the test where the backups < recovery parties, so no one gets kicked in the existing test. Can we add this additional test as well?
For Naga Prod we only have 6 backup shares, from what I understand. This means we'll need to reduce the network size, regardless of the number of backup parties.
However, I can write a test that uses 5 backup shares, 5 new nodes, and 3 recovery parties. This should work, as I believe we only need a threshold of the recovery parties to authorize the recovery to proceed. It won't be a short task though - most of these tests took Mike and Ege days to write, and they had prior knowledge of the backup system. I'm going in cold ;-)
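The threshold rule mentioned above can be sketched as follows. This is a hypothetical illustration only: the function names and the majority threshold are assumptions, not the actual lit-node authorization logic.

```rust
// Assumption for illustration: a simple majority of recovery parties
// is enough to authorize the recovery.
fn threshold(recovery_party_count: usize) -> usize {
    recovery_party_count / 2 + 1
}

/// Recovery proceeds once the number of approvals reaches the threshold,
/// independently of how many backup shares or new nodes there are.
fn recovery_authorized(approvals: usize, recovery_party_count: usize) -> bool {
    approvals >= threshold(recovery_party_count)
}

fn main() {
    // The scenario from the comment: 3 recovery parties.
    // Under a majority rule, 2 approvals suffice and 1 does not.
    assert!(recovery_authorized(2, 3));
    assert!(!recovery_authorized(1, 3));
    println!("threshold for 3 parties = {}", threshold(3));
}
```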
DashKash54 left a comment
Just to confirm: we already supported restoring to a larger set of nodes than the backup keys? So we're not doing any actual node or contracts code changes?
```rust
info!("Response: {}", response);
public_address
});
if downloaded_blinders.contains_key(&public_address) {
```
Where is the test to simulate a 4 -> 6 restore?
These are all pre-existing tests. I just made them work when the backup counts don't match the restore node counts.
Currently we don't have enough blinders - I'm trying to generate more tomorrow, and can update the test, but there isn't any good reason to do so. 3 -> 5 is no different than 4 -> 6 in terms of what the software does - it's just slower.
Yeah, but we should still have a test that validates the kicking of the remaining nodes and the eventual success of the DKG, which seems to be missing right now?
Yes, this test does just that... though you'd have to change the parameters. I flipped them back to "normal" since it's more efficient to run in CI.
Sort of? We restore to a larger set by uploading what we have and kicking the remaining nodes, because the DKG can't be processed by nodes without backups. This test proves that this works.