remote-ext: fix state download stall on slow connections and reduce memory usage#1295
Merged
liamaharon merged 16 commits intomasterfrom Oct 10, 2023
Merged
remote-ext: fix state download stall on slow connections and reduce memory usage#1295liamaharon merged 16 commits intomasterfrom
liamaharon merged 16 commits intomasterfrom
Conversation
bkchr
reviewed
Aug 30, 2023
…/remote-ext-target-request-time
…tytech/polkadot-sdk into liam/remote-ext-target-request-time
…/remote-ext-target-request-time
…tytech/polkadot-sdk into liam/remote-ext-target-request-time
…tytech/polkadot-sdk into liam/remote-ext-target-request-time
Contributor
Author
|
@bkchr thanks for your suggestions. I've refactored the recursive function to be iterative and implemented your suggestions, and the memory usage is orders of magnitude lower now. |
niklasad1
reviewed
Sep 27, 2023
| retries = 0; | ||
| batch_response | ||
| }, | ||
| Err(e) => { |
Contributor
There was a problem hiding this comment.
Ideally you should check the kind of error you get back here...
If it' just Errror::Transport(Transport::RequestTooBig) or Error::RequestTimeout then it shouldn't be a problem to continue but unfortunately it's not a concrete type you would need to downcast which is PITA... https://docs.rs/jsonrpsee-http-client/0.20.1/src/jsonrpsee_http_client/transport.rs.html#212
niklasad1
approved these changes
Sep 27, 2023
bkchr
approved these changes
Sep 29, 2023
…/remote-ext-target-request-time
ordian
added a commit
that referenced
this pull request
Oct 12, 2023
* master: (33 commits) ci: set CI_IMAGE back to (now updated) .ci-unified (#1854) ci: bump ci image to rust 1.73.0 (#1830) Refactor Identity to benchmark v2 (#1838) PVF worker: bump landlock, update ABI docs (#1850) Xcm emulator nits (#1649) Fixes path issue in derive-impl (#1823) upgrade to macro_magic 0.4.3 (#1832) Use safe math when pruning statuses (#1835) remote-ext: fix state download stall on slow connections and reduce memory usage (#1295) Update testnet bootnode dns name (#1712) [FRAME] Warn on unchecked weight witness (#1818) [xcm] Use `Weight::MAX` for `reserve_asset_deposited`, `receive_teleported_asset` benchmarks (#1726) Update bridges subtree (#1803) Check for parent of first ready block being on chain (#1812) Make CheckNonce refuse transactions signed by accounts with no providers (#1578) Fix Asset Hub collator crashing when starting from genesis (#1788) Mixnet integration (#1346) [xcm-emulator] Decouple the `AccountId` type from `AccountId32` (#1458) Treasury spends various asset kinds (#1333) chore: bump zombienter version (#1806) ...
bgallois
pushed a commit
to duniter/duniter-polkadot-sdk
that referenced
this pull request
Mar 25, 2024
…emory usage (paritytech#1295) Original PR paritytech/substrate#14746 --- ## Fixing stall ### Introduction I experienced an apparent stall downloading state from `https://rococo-try-runtime-node.parity-chains.parity.io:443` which was having networking difficulties only responding to my JSONRPC requests with 50-200KB/s of bandwidth. This PR fixes the issue causing the stall, and generally improves performance remote-ext when it downloads state by greatly reducing the chances of a timeout occuring. ### Description Introduces a new `REQUEST_DURATION_TARGET` constant and modifies `get_storage_data_dynamic_batch_size` to - Increase or decrease the batch size of the next request depending on whether the elapsed time of the last request was gt or lt the target - Reset the batch size to 1 if the request times out This fixes an issue on slow connections that can otherwise cause multiple timeouts and a stalled download when: 1. The batch size increases rapidly as remote-ext downloads keys with small associated storage values 2. remote-ext tries to process a large series of subsequent keys all with extremely large associated storage values (Rococo has a series of keys 1-5MB large) 3. The huge storage values download for 5 minutes until the request times out 4. The partially downloaded keys are thrown out and remote-ext tries again with a smaller batch size, but the batch size is still far too large and takes 5 minutes to be reduced again 5. The download will be essentially stalled for many hours while the above step cycles After this PR, the request size will - Not grow as large to begin with, as it is regulated downwards as the request duration exceeds the target - Drop immediately to 1 if the request times out. A timeout indicates the keys next in line to download have extremely large storage values compared to previously downloaded keys, and we need to reset the batch size to figure out what our new ideal batch size is. By not resetting down to 1, we risk the next request timing out again. ## Reducing memory As suggested by @bkchr, I adjusted `get_storage_data_dynamic_batch_size` from being recursive to a loop which allows removing a bunch of clones that were chewing through a lot of memory. I noticed actually it was using up to 50GB swap previously when downloading Polkadot keys on a slow connection, because it needed to recurse and clone a lot. After this change it uses only ~1.5GB memory.
bkchr
pushed a commit
that referenced
this pull request
Apr 10, 2024
This was referenced Jun 5, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Original PR paritytech/substrate#14746
Fixing stall
Introduction
I experienced an apparent stall downloading state from
https://rococo-try-runtime-node.parity-chains.parity.io:443which was having networking difficulties only responding to my JSONRPC requests with 50-200KB/s of bandwidth.This PR fixes the issue causing the stall, and generally improves performance remote-ext when it downloads state by greatly reducing the chances of a timeout occuring.
Description
Introduces a new
REQUEST_DURATION_TARGETconstant and modifiesget_storage_data_dynamic_batch_sizetoThis fixes an issue on slow connections that can otherwise cause multiple timeouts and a stalled download when:
After this PR, the request size will
Reducing memory
As suggested by @bkchr, I adjusted
get_storage_data_dynamic_batch_sizefrom being recursive to a loop which allows removing a bunch of clones that were chewing through a lot of memory. I noticed actually it was using up to 50GB swap previously when downloading Polkadot keys on a slow connection, because it needed to recurse and clone a lot.After this change it uses only ~1.5GB memory.