fix(massive/ftp): emit progress while walking the FTPS tree #108

Open
ypriverol wants to merge 1 commit into dev from feat/ftps-walk-progress

@ypriverol
Contributor

Problem

For large MassIVE deposits, pridepy download-all-public-raw-files appears to
hang after connecting. Example with MSV000098940 (timsTOF, 1719 .d):

INFO - Data will be downloaded from ftp
INFO - Connected to FTP host: massive-ftp.ucsd.edu (tls=True)
INFO - Found MSV000098940 under /v10/MSV000098940 on massive-ftp.ucsd.edu
<… long silence, no output … user assumes it hung and Ctrl-C's …>

This is not a TLS/connection problem (FTPS connects fine) and not a
missing-protocol problem. The MassIVE provider calls list_files ->
_resolve_and_walk_ftp_dataset -> _walk_ftp_tree, which enumerates the
entire tree before any download begins: it lists raw/ (1719 .d) and then
descends into every .d and its nested .m/. Each listing is a fresh
PASV+TLS data connection (~1.1 s each, i.e. ~2.2 s per .d), so the pre-walk alone takes:

top-level LIST: 1719 .d dirs in 1.8s
recursed 20 .d (40 LISTs) in 43.3s  -> 2.17s per .d
=> full walk of 1719 .d ≈ 62 min of silent enumeration BEFORE any download

With no output during those ~62 minutes, the run looks dead.
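
The ~62 min figure can be reproduced from the sample timings above. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes the per-.d cost measured on the 20-directory sample holds for the whole deposit.

```python
# Extrapolate the full-walk time from the 20-directory sample quoted above.
sample_dirs = 20          # .d directories recursed in the sample (40 LISTs)
sample_seconds = 43.3     # wall time for that sample
total_dirs = 1719         # .d directories under raw/ in MSV000098940
top_level_seconds = 1.8   # the single top-level LIST

seconds_per_dir = sample_seconds / sample_dirs
estimated_seconds = top_level_seconds + total_dirs * seconds_per_dir
estimated_minutes = estimated_seconds / 60

print(f"{seconds_per_dir:.3f} s per .d, about {estimated_minutes:.0f} min in total")
```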

Fix

Emit an INFO progress heartbeat from _walk_ftp_tree every 100 directories,
plus a final summary. This is pure instrumentation — the returned file
list and recursion are unchanged. The recursion now threads an internal
_progress counter (callers still invoke _walk_ftp_tree(ftp, remote_dir)).

New output during a large walk:

INFO - Connected to FTP host: massive-ftp.ucsd.edu (tls=True)
INFO - Found MSV000098940 under /v10/MSV000098940 on massive-ftp.ucsd.edu
INFO - Listing remote tree: 100 directories scanned, 95 files found so far...
INFO - Listing remote tree: 200 directories scanned, 195 files found so far...
...
INFO - Listing remote tree complete: 1719 directories, 8500 files.
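
For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the pattern. It is not the code in this PR: walk_tree, list_dir and the "walk_sketch" logger are hypothetical names, and list_dir stands in for one FTP listing round trip. Only the 100-directory interval and the message wording come from the description above.

```python
import logging

logger = logging.getLogger("walk_sketch")  # hypothetical logger name
HEARTBEAT_EVERY = 100  # directories between progress lines


def walk_tree(list_dir, remote_dir, _progress=None):
    """Recursively collect file paths under remote_dir.

    list_dir(path) is a hypothetical callable returning (subdirs, files)
    as lists of names. _progress is internal: external callers omit it.
    """
    is_root = _progress is None
    if is_root:
        _progress = {"dirs": 0, "files": 0}

    subdirs, files = list_dir(remote_dir)
    _progress["dirs"] += 1
    _progress["files"] += len(files)
    if _progress["dirs"] % HEARTBEAT_EVERY == 0:
        logger.info(
            "Listing remote tree: %d directories scanned, %d files found so far...",
            _progress["dirs"], _progress["files"],
        )

    # The returned list is built exactly as it would be without the counter.
    found = [f"{remote_dir}/{name}" for name in files]
    for name in subdirs:
        found.extend(walk_tree(list_dir, f"{remote_dir}/{name}", _progress))

    if is_root:
        logger.info(
            "Listing remote tree complete: %d directories, %d files.",
            _progress["dirs"], _progress["files"],
        )
    return found
```

Because the counter defaults to None and is created only at the root call, existing call sites keep their two-argument form and the summary line fires exactly once.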

Testing

  • New mock-tree unit check: identical file list returned (recursion intact) and
    heartbeats/summary fire.
  • Existing test_download_resilience.py + test_review_fixes.py: 37 passed.

Follow-up (not in this PR)

The ~62 min is still spent enumerating before the first byte transfers. A
larger, separate change could overlap enumeration with download for MassIVE
(download each top-level .d recursively as it is discovered, mirror-style,
like wget -r), eliminating the up-front wait entirely. Happy to do that as a
second PR if desired.
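
One possible shape for that follow-up is sketched below: a generator yields each file as soon as its directory has been listed, so the caller can start transferring before the walk finishes. This is an assumption about how the second PR might look, not existing pridepy code. iter_files, mirror, list_dir and download are all hypothetical names, and the sketch interleaves listing and download sequentially rather than concurrently.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical stand-in for one FTP listing: path -> (subdir names, file names)
ListDir = Callable[[str], Tuple[List[str], List[str]]]


def iter_files(list_dir: ListDir, remote_dir: str) -> Iterator[str]:
    """Yield file paths depth-first, as soon as each directory is listed."""
    subdirs, files = list_dir(remote_dir)
    for name in files:
        yield f"{remote_dir}/{name}"
    for name in subdirs:
        # The next listing only runs when the consumer asks for more paths.
        yield from iter_files(list_dir, f"{remote_dir}/{name}")


def mirror(list_dir: ListDir, download: Callable[[str], None], remote_dir: str) -> int:
    """Download each file as it is discovered; return the number fetched."""
    count = 0
    for path in iter_files(list_dir, remote_dir):
        download(path)
        count += 1
    return count
```

With this layout the first transfer begins after two listings (the top level and the first .d) instead of after the full enumeration.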

Large MassIVE timsTOF deposits are spread over thousands of .d
directories, each requiring its own PASV+TLS data connection to list.
_walk_ftp_tree enumerates the entire tree before any download and was
silent throughout, so a multi-minute enumeration looked like a hang
(e.g. ~1719 .d in MSV000098940 ≈ 62 min of no output -> users Ctrl-C).

Add an INFO progress heartbeat every 100 directories plus a final
summary. Pure instrumentation: the returned file list and recursion
are unchanged (covered by a mock-tree unit test); existing tests pass.
@qodo-code-review
Contributor

Qodo reviews are paused for this user.

@coderabbitai
Contributor

coderabbitai Bot commented May 30, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fafdc05c-02d4-46c5-bb08-8bd635569e26

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.