fix(massive/ftp): emit progress while walking the FTPS tree#108
fix(massive/ftp): emit progress while walking the FTPS tree#108ypriverol wants to merge 1 commit into
Conversation
Large MassIVE timsTOF deposits are spread over thousands of .d directories, each requiring its own PASV+TLS data connection to list. _walk_ftp_tree enumerates the entire tree before any download and was silent throughout, so a multi-minute enumeration looked like a hang (e.g. ~1719 .d in MSV000098940 ≈ 62 min of no output -> users Ctrl-C). Add an INFO progress heartbeat every 100 directories plus a final summary. Pure instrumentation: the returned file list and recursion are unchanged (covered by a mock-tree unit test); existing tests pass.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Problem
For large MassIVE deposits,
pridepy download-all-public-raw-filesappears tohang after connecting. Example with
MSV000098940(timsTOF, 1719.d):This is not a TLS/connection problem (FTPS connects fine) and not a
missing-protocol problem. The MassIVE provider calls
list_files→_resolve_and_walk_ftp_dataset→_walk_ftp_tree, which enumerates theentire tree before any download begins: it lists
raw/(1719.d) and thendescends into every
.dand its nested.m/. Each listing is a freshPASV+TLS data connection (~2.2 s each), so the pre-walk alone takes:
With no output during those ~62 minutes, the run looks dead.
Fix
Emit an INFO progress heartbeat from
_walk_ftp_treeevery 100 directories,plus a final summary. This is pure instrumentation — the returned file
list and recursion are unchanged. The recursion now threads an internal
_progresscounter (callers still invoke_walk_ftp_tree(ftp, remote_dir)).New output during a large walk:
Testing
heartbeats/summary fire.
test_download_resilience.py+test_review_fixes.py: 37 passed.Follow-up (not in this PR)
The ~62 min is still spent enumerating before the first byte transfers. A
larger, separate change could overlap enumeration with download for MassIVE
(download each top-level
.drecursively as it is discovered, mirror-style,like
wget -r), eliminating the up-front wait entirely. Happy to do that as asecond PR if desired.