Skip to content

Major development to include MassIVE; iPROX and PRoteomeXchange#106

Merged
ypriverol merged 55 commits into
masterfrom
dev
May 29, 2026
Merged

Major development to include MassIVE; iPROX and PRoteomeXchange#106
ypriverol merged 55 commits into
masterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Contributor

@ypriverol ypriverol commented May 28, 2026

This pull request introduces several significant improvements to documentation, developer tooling, and internal code organization. The most notable change is a comprehensive rewrite and expansion of the README.md to provide clearer, more structured usage instructions for all major features, including new repositories (MassIVE, JPOST, iProX), command overviews, and detailed option tables. Additionally, a new Codacy CLI bootstrap script and configuration are added, and the pridepy.download package now includes a docstring overview of its architecture.

Documentation improvements and feature explanations:

  • Major rewrite and restructuring of README.md:
    • Adds a detailed command overview table and groups usage examples into expandable sections for clarity.
    • Documents support for MassIVE (MSV...), JPOST (JPST...), and iProX (IPX...) repositories, including protocol fallbacks and repository-specific behaviors.
    • Expands and clarifies download options, parallelization, resume, checksum validation, and category/collection selection.
    • Provides comprehensive metadata, search, and private file access examples, with option tables for each command.
    • Updates Python API usage examples, including new repository support and best practices. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Developer tooling and configuration:

  • Adds Codacy CLI v2 bootstrap script (.codacy/cli.sh) for automated code analysis, with OS/arch detection, caching, and update handling.
  • Introduces .codacy/codacy.yaml specifying language runtimes and static analysis tools for consistent CI checks.

Internal code documentation:

  • Adds a high-level docstring to pridepy/download/__init__.py describing the download subsystem's architecture, including repository adapters, registry, transport, and client facade.

Summary by CodeRabbit

  • New Features

    • Unified download client with new providers for MassIVE, JPOST, iProX, PRIDE, ProteomeXchange; URL-based downloader; multi-protocol transfers and optional preserve-structure output.
  • Documentation

    • Expanded README and new docs/usage.md with command overview, examples, repository-specific download guidance, CLI options, and breaking-change note.
  • Bug Fixes

    • Improved resume/retry behavior, checksum validation, and FTP size-mismatch handling.
  • Chores

    • Download-layer refactor and package version bumped to 0.0.16 (deps updated).
  • Tests

    • Added and updated unit/integration tests; live-API outage tolerance.

Review Change Stack

ypriverol added 30 commits May 27, 2026 11:03
Extends the direct-download support introduced for MassIVE in PR #98 to two
more proteomics repositories whose datasets are often standalone (no
ProteomeXchange accession):

  - JPOST (Japan ProteOme STandard Repository): JPST\d{6} accessions,
    listed and downloaded from ftp.jpostdb.org.
  - iProX (Integrated Proteome resources): IPX\d{7,10} accessions, listed
    and downloaded from ftp.iprox.cn.

Refactor:

  - Add is_direct_download_accession() unifying the MSV/JPST/IPX checks,
    plus _list_direct_download_files() and _download_direct_download_records()
    dispatchers. All call sites (get_all_raw_file_list, download_all_raw_files,
    download_all_category_files, get_file_from_api, download_file_by_name,
    download_files_by_list, get_all_category_file_list) now go through the
    unified entry points.
  - Extract _list_ftp_repo_files() helper so the FTP connection lifecycle
    (connect / login / passive / walk / quit) lives in one place. As part
    of that, fix the FTP-constructor-outside-try issue flagged in the PR #98
    review: a connect failure no longer triggers NameError in finally.
  - Keep is_massive_accession, _list_massive_public_files, and
    _download_massive_file_records as thin backward-compatible wrappers
    so existing tests and external callers continue to work.

Tests: add test_jpost_files.py and test_iprox_files.py mirroring the
MassIVE coverage (regex match, record building, raw-only filtering, and
the download_file_by_name happy path). All 19 direct-download tests pass.

Version: 0.0.16.
Address feedback that direct downloads should match the PRIDE feature
set (resume, parallel, retries) and that MassIVE's actual FTP server
requires TLS.

Live findings:

  - massive-ftp.ucsd.edu now rejects plain anonymous FTP with
    421 TLS is required. The merged PR #98 code was effectively broken
    against the live server. Switch to FTP_TLS + PROT P.
  - ftp.jpostdb.org accepts plain anonymous FTP; keep as-is.
  - ftp.iprox.cn does not resolve (DNS fail) and no other iProX FTP
    host responds. iProX is HTTPS-only and needs a different transport
    (REST API). Defer iProX support; user to provide the endpoint.

Implementation:

  - New static _open_ftp_connection(host, use_tls) opens FTP or FTP_TLS
    with the right TLS setup, and transparently falls back to FTPS if
    a plain FTP server replies 'TLS is required'.
  - _list_ftp_repo_files() and download_ftp_urls() both grow a use_tls
    flag. _repo_uses_tls(accession) wires this from the repo type
    (MassIVE = True, JPOST = False).
  - download_ftp_urls() grows parallel_files: when >1, a ThreadPoolExecutor
    runs that many FTP workers per host, each with its own connection.
    Existing serial single-connection-per-host path is preserved for
    parallel_files <= 1.
  - Extracted _download_one_ftp_path() with REST-based resume + per-file
    retries; _download_ftp_paths_serial() / _download_ftp_paths_parallel()
    pick the right scheduling. REST resume verified live (3 KB pre-stage
    -> 10 KB final, MD5 matches full file).
  - _download_direct_download_records() now accepts parallel_files and
    forwards it (along with use_tls derived from the accession) to
    download_ftp_urls. All call sites (download_all_raw_files,
    download_all_category_files, download_files_by_list) thread the
    user-supplied -w/--parallel-files through.

Tests:

  - test_jpost_files / test_massive_files updated to assert the new
    kwargs (use_tls, parallel_files).
  - New test_repo_uses_tls_true_for_massive_false_for_jpost and
    test_download_all_raw_files_threads_parallel_files_for_massive.
  - test_iprox_files.py removed (iProX is deferred).
  - All 15 unit tests pass.

Live testing:

  - MassIVE: listed MSV000080175 (44 files), single-file download of
    params.xml (10315 B, MD5 43d87368d705c3f380c1d030b14850c4),
    REST resume from a 3000 B partial, and 3-worker parallel download
    of files from MSV000080175 + MSV000078335 all succeeded.
  - JPOST: rate-limited from this IP ('421 too many connections').
    Code path is structurally identical to MassIVE (same FTP helper),
    routing covered by unit tests; deferred to a follow-up live check.
Address deferred items from PR review:

1. JPOST PROXI listing
   ftp.jpostdb.org rate-limits aggressively per source IP (sticky 421 on
   any retry within ~10 min). For listing this is fatal because every
   pridepy invocation needs a fresh tree walk. JPOST publishes a JSON
   PROXI endpoint at
     https://repository.jpostdb.org/proxi/datasets/<JPSTxxxxxx>
   that returns datasetFiles[*].value as ftp:// URIs alongside CV labels
   (Associated raw file URI, Search engine output file URI, Result file
   URI, Peak list file URI, ...). _list_jpost_public_files now hits
   PROXI first, builds file records with categories derived from the CV
   name (mapped via JPOST_PROXI_CATEGORY_MAP), and falls back to the
   FTP tree walk only if PROXI is unreachable or returns no records.
   Live-tested against JPST002311 -> 160 files (88 RAW, 72 SEARCH).

2. Post-transfer size check
   Neither MassIVE nor JPOST publishes per-file checksum manifests in
   a standard location, so md5 verification isn't an option for these
   datasets. As a lighter-weight integrity signal, _download_one_ftp_path
   now compares the local file size against ftp.size() after retrbinary
   returns and treats a mismatch as a retryable failure (next attempt
   resumes via REST from the current partial). This catches half-finished
   transfers where the data channel was closed early without retrbinary
   raising.

3. iProX accession guard
   Probing showed iProX REST endpoints (PMD009Controller/findByProjectId.jsonp,
   findFilesBySubProjectID.jsonp) all redirect unauthenticated callers to
   a CAS login page, and downloads use faspe:// URLs with per-session
   tokens. Native support is therefore blocked until iProX exposes an
   anonymous JSON API or pridepy carries iProX credentials. Until then,
   add is_iprox_accession() and _raise_if_iprox() so every public entry
   point (get_all_raw_file_list, download_all_raw_files,
   download_all_category_files, download_file_by_name, download_files_by_list,
   get_file_from_api) emits a clear NotImplementedError instead of
   silently falling through to the PRIDE API and 404-ing.

Tests
   - test_jpost_files: PROXI listing maps CV names to PRIDE categories;
     PROXI failure falls back to FTP walk.
   - test_iprox_guard: regex coverage; assert each entry point raises
     NotImplementedError for IPX accessions; assert
     is_direct_download_accession returns False for IPX.
   - test_ftp_download_validation: size mismatch retries until success;
     repeated mismatch raises after max_download_retries; correct size
     skips retries.
   - 25 tests total (10 MassIVE + 8 JPOST + 5 iProX guard + 3 size check).

Live verification (same MassIVE FTPS path as before)
   - MSV000080175 listing + params.xml download still produces
     10315 B, MD5 43d87368d705c3f380c1d030b14850c4.
   - JPST002311 PROXI listing returns 160 files with correct categories.
     Actual file transfer is still blocked from this IP by JPOST's
     421-too-many-connections rate limit; the code path is shared with
     MassIVE (same _download_one_ftp_path / download_ftp_urls), so a
     fresh IP should succeed.
….org

User pointed out the iProX dataset XML endpoint:
https://www.iprox.cn/FAF016Controller/readXml.jsonp?fileId=file_<id>_xml

Probing showed something simpler works: iProX publishes the
ProteomeXchange XML for every public dataset at a deterministic,
anonymous-accessible path on download.iprox.org. No CAS auth, no Aspera
tokens, no fileId discovery needed:

  http://download.iprox.org/<accession>/PX_<accession>.xml

The XML embeds Associated raw file URI / Search engine output file URI
cvParams pointing at HTTPS file URLs on the same host (Accept-Ranges:
bytes, so we get resume for free). The earlier 'iProX is auth-gated'
finding was specifically about the iProX UI's CAS-protected JSON
endpoints (PMD009Controller/findBySubProjectId.jsonp etc.); the public
download server is anonymous.

Changes
-------
- IPROX_DOWNLOAD_BASE_URL, IPROX_PX_XML_URL_TEMPLATE, IPROX_PX_CATEGORY_MAP
  added (the latter is the same CV map JPOST PROXI uses).
- _build_iprox_file_record() and _list_iprox_public_files() added,
  mirroring the JPOST helpers but using the PX XML schema.
- is_direct_download_accession() now returns True for IPX accessions.
- _raise_if_iprox() removed entirely. All entry points that previously
  called it no longer do, so IPX accessions flow through the unified
  direct-download dispatcher.
- _list_direct_download_files() dispatches IPX -> _list_iprox_public_files.
- _download_direct_download_records() now partitions URLs by scheme:
  ftp:// records go through download_ftp_urls (MassIVE/JPOST), http(s)://
  records through download_http_urls (iProX). A dataset whose records
  somehow contain both flows through both paths correctly.
- download_http_urls() grew parallel_files + max_retries kwargs and a
  ThreadPoolExecutor path. The per-file worker _http_download_one()
  wraps _parallel_download() (reused from the globus codepath) with
  retry, so iProX gets the same HEAD-then-Range resume and restart-on-
  non-206 behaviour we already use for PRIDE HTTPS.

Tests (26 pass)
---------------
- test_iprox_guard.py renamed to test_iprox_files.py and rewritten as
  positive-path tests: regex coverage, IPX is a direct-download
  accession, PX XML parsing maps cvParam name -> PRIDE category, RAW
  filtering ignores non-FTP/HTTPS URIs, download_file_by_name routes
  iProX URLs to download_http_urls (not download_ftp_urls) with
  parallel_files=1.
- All previously-added MassIVE / JPOST / size-validation tests still
  pass (20 + 6 iProX = 26 total).

Live verification (this branch, against download.iprox.org)
-----------------------------------------------------------
- IPX0017413000 listing: 7 files, correctly categorised (RAW + SEARCH).
- download_file_by_name(IPX0017413000, protein_annotation_profile.xlsx):
  3,253,230 B downloaded, MD5 c17baf230ffde1e2837ec4eb32dcea68, valid
  XLSX (PK header).
- Range-based resume: pre-staged 1,000,000 B partial, completed to
  3,253,230 B with the same MD5.
- Parallel HTTPS: 3 worker downloads from IPX0017413000 (xlsx + 2 RAW)
  ran concurrently; cancelled mid-flight after observing all 3 files
  growing in parallel (xlsx finished at 3.25 MB, Tumor_NK1.raw was at
  142 MB and Control_NK3.raw at 419 MB before cancellation).
…Registry

Empty scaffolding for the per-provider refactor (spec:
docs/specs/2026-05-27-files-py-provider-refactor-design.md). Introduces
the Provider abstract base class, BaseDirectDownloadProvider with a
shared download_files() that partitions URLs by scheme and routes
through Files.download_ftp_urls / Files.download_http_urls (preserving
test patches), and a Registry for accession -> provider resolution.

No behaviour change. No providers registered yet.
…t.py

Pulled download_ftp_urls, download_http_urls, and their helpers
(_open_ftp_connection, _walk_ftp_tree, _list_ftp_repo_files,
_download_one_ftp_path, _download_ftp_paths_serial/_parallel,
_http_download_one, _parallel_download, _local_path_for_url) out of the
Files class verbatim into providers/transport.py.

Files keeps a shim staticmethod for each function that does a lazy
import and delegates, so existing test patches like
patch.object(Files, 'download_ftp_urls') keep intercepting calls.

No behaviour change. Test suite green.
Moved Progress, _find_tsv_columns, _is_md5_checksum, read_checksum_file,
compute_md5, validate_download, _remove_if_exists, _get_download_url,
_resolve_local_path from Files into providers/util.py. Files keeps shim
re-exports so existing references keep working.

No behaviour change. Test suite green.
Moved MassIVE listing + record building from Files into
providers/massive.py as MassiveProvider(BaseDirectDownloadProvider).
Provider is registered with the Registry. Files keeps shim methods
(is_massive_accession, _list_massive_public_files,
_build_massive_file_record, _get_massive_public_root,
_get_massive_public_ftp_url, _map_massive_collection_to_category) that
delegate to the provider. MASSIVE_CATEGORY_MAP / MASSIVE_ARCHIVE_FTP
constants remain on Files as class-attribute re-exports.

All 10 MassIVE tests pass without modification. Full suite green.
Moved JPOST listing (PROXI JSON primary + FTP tree walk fallback) and
record building from Files into providers/jpost.py as
JpostProvider(BaseDirectDownloadProvider). Provider registered with the
Registry. Files keeps shim methods (is_jpost_accession,
_list_jpost_public_files, _list_jpost_public_files_via_proxi,
_build_jpost_file_record, _get_jpost_public_root,
_get_jpost_public_ftp_url) that delegate to the provider.
JPOST_ARCHIVE_FTP / JPOST_PROXI_BASE_URL / JPOST_PROXI_CATEGORY_MAP
constants remain on Files as class-attribute re-exports.

All JPOST tests pass without modification. Full suite green.
Moved iProX listing (PX XML from download.iprox.org) and record building
from Files into providers/iprox.py as IproxProvider(BaseDirectDownloadProvider).
Provider registered with the Registry. Files keeps shim methods
(is_iprox_accession, _list_iprox_public_files, _build_iprox_file_record,
_get_iprox_public_root, _get_iprox_public_ftp_url) that delegate to the
provider. IPROX_DOWNLOAD_BASE_URL / IPROX_PX_XML_URL_TEMPLATE /
IPROX_PX_CATEGORY_MAP constants remain on Files as class-attribute
re-exports.

_download_direct_download_records on Files now dispatches via the registry
instead of branching manually. _list_direct_download_files keeps shim
dispatching so existing test patches on _list_massive_public_files and
_list_jpost_public_files continue to intercept the calls.

All iProX tests pass without modification. Full suite green.
…oning

Mixed ftp:// + http(s):// records partition correctly: ftp URLs go to
Files.download_ftp_urls with use_tls=True (the MassIVE setting), http
URLs go to Files.download_http_urls. Both calls intercepted by
patch.object(Files, ...) — proving the provider routes back through
Files rather than calling transport directly (preserves test patches).
Moved PRIDE-specific logic (V3 API listing, multi-protocol batch
downloader with aspera/s3/ftp/globus fallback, private-dataset path,
submitter helpers, Globus/S3 per-protocol downloaders, legacy
single-connection FTP) from Files into providers/pride.py as
PrideProvider(Provider). ~510 LOC moved out of files.py.

Files keeps shim methods for every patched helper
(_batch_download_by_protocol, _download_with_fallback,
_globus_download_one, download_files_from_globus, download_files_from_s3,
download_files_from_ftp, download_private_file_name,
get_ascp_binary, save_checksum_file, stream_all_files_by_project,
stream_all_files_metadata, get_submitted_file_path_prefix,
_protocol_sequence, download_files).

PRIDE class constants (V3_API_BASE_URL, API_BASE_URL, API_PRIVATE_URL,
PRIDE_ARCHIVE_FTP, *_URL_PREFIX, S3_URL, S3_BUCKET, PROTOCOL_ORDER)
remain on Files as re-exports.

PrideProvider's internal calls to patch-sensitive helpers go through
Files.X (lazy import) so existing test patches in
test_download_resilience.py keep working without modification.

is_direct_download_accession updated to exclude PRIDE (returns True
only for MSV/JPST/IPX) now that PrideProvider is registered too.

All 8 patch-sensitive tests in test_download_resilience.py pass.
Full suite green at 67 passed, 4 skipped.
Files public methods (get_all_raw_file_list, download_all_raw_files,
download_all_category_files, get_all_category_file_list,
download_file_by_name, get_file_from_api, download_files_by_list) now
dispatch via registry.resolve(accession).{list,download}_files(...).

Removed dead helpers: _list_direct_download_files,
_download_direct_download_records. Kept _repo_uses_tls as a registry
shim. Fixed _download_massive_file_records to use registry.

PrideProvider.download_files refactored: the old static method is now
_download_files_batch; a proper Provider-interface instance method
download_files(self, accession, records, ...) wraps it so PRIDE routes
uniformly through the registry like other providers.

Tests updated: patches on Files._list_massive_public_files,
Files._list_jpost_public_files, files_obj.stream_all_files_by_project,
and files_obj.download_files now target the provider classes
(MassiveProvider, JpostProvider, PrideProvider) directly.

Full suite green. files.py size: 1254 LOC (was 1352 before Task 9).
Verifies that Files().download_all_raw_files for a PXD accession flows
through Registry.resolve -> PrideProvider.download_files ->
_batch_download_by_protocol (patched via Files), and that
_download_with_fallback is only called when batch returns failed files.

Also mocks validate_download to return success so the happy-path test
correctly asserts that fallback is not invoked when all files pass
validation after the primary-protocol batch run.

Spec acceptance criterion #5.
Empty scaffold for the follow-up refactor that extracts cross-cutting
commands (download_files_by_url, download_files_by_list,
download_px_raw_files) from Files into their own modules. No code moved
yet. No behaviour change. Test suite green at 68 passed, 4 skipped.
…roteomexchange.py

Moved download_px_raw_files, _normalize_px_xml_url,
_parse_px_xml_for_raw_file_urls from Files into
commands/proteomexchange.py. Files keeps shim re-exports.
Also removed now-unused xml.etree.ElementTree import from files.py.

No behaviour change. Test suite green.
Moved download_files_by_list from Files into commands/by_list.py.
Files keeps a shim re-export.

No behaviour change. Test suite green.
Moved download_files_by_url and its 6 helpers (_extract_pride_accession,
_validate_urls_checksums, _http_download_url, _ftp_download_url,
_dispatch_url_scheme, _download_single_url) from Files into
commands/by_url.py. Files keeps shim re-exports for each.

Internal calls to patch-sensitive helpers (_http_download_url,
_ftp_download_url, _dispatch_url_scheme, _download_single_url) go
through Files.X (lazy import) so existing test patches like
patch.object(Files, '_http_download_url') keep intercepting.

files.py drops below 1000 LOC.

No behaviour change. Test suite green at 68 passed, 4 skipped.
…/ as a class

ProteomeXchange behaves more like a provider than a command: it takes a
PXD/PRD accession (or a ProteomeCentral URL) and returns file records,
the same shape as the four other providers. Moving it to providers/
aligns the architecture.

Changes:

  - New: pridepy/providers/proteomexchange.py with
    ProteomeXchangeProvider(Provider). It implements the full Provider
    interface (matches, list_files, download_files) plus a convenience
    method download_from_accession_or_url() for the
    download-px-raw-files CLI command's existing behaviour
    (skip_if_downloaded_already defaults to True, no parallel workers).

  - Deleted: pridepy/commands/proteomexchange.py. Its three functions
    (_normalize_px_xml_url, _parse_px_xml_for_raw_file_urls,
    download_px_raw_files) are now static/instance methods on the
    provider class.

  - Updated: Files shims for _normalize_px_xml_url,
    _parse_px_xml_for_raw_file_urls, and download_px_raw_files now
    delegate to ProteomeXchangeProvider.

  - Updated: commands/__init__.py docstring notes the move.

Important: ProteomeXchangeProvider is NOT auto-registered with
pridepy.providers.registry. PXD/PRD accessions continue to route through
PrideProvider's V3 API path by default. ProteomeXchangeProvider is the
explicit gateway for the cross-repository XML view, invoked via the
download-px-raw-files CLI command and Files.download_px_raw_files.

Full suite green at 68 passed, 4 skipped. No behaviour change for
existing PXD downloads; download-px-raw-files keeps its XML-based
listing flow exactly as before.
Two related fixes:

1. CI lint failure: files.py used importlib.resources, subprocess, and
   time but never imported them. The Task 8 refactor removed these
   top-level imports along with the PRIDE-specific code that used them
   in PrideProvider, but two methods on Files (download_files_from_aspera
   and the now-dead _download_range) still referenced the undefined
   names. Flake8 with --select=F82 fails on 4 F821 undefined-name errors.
   Fix: re-add the three stdlib imports at the top of files.py. Also
   delete the orphaned _download_range method (no callers).

2. Hoist lazy provider/command imports out of Files method bodies.
   The shim pattern previously did 'from pridepy.providers import X'
   inside every method body — ~75 occurrences. Since providers do not
   import Files at module load (only inside method bodies), the reverse
   direction is safe to hoist: Files now imports
   {registry, transport, util, IproxProvider, JpostProvider,
   MassiveProvider, PrideProvider, ProteomeXchangeProvider, by_list,
   by_url} at module top, and the shims reference these names directly.

   Lazy imports inside provider/command method bodies that go back to
   Files (e.g. BaseDirectDownloadProvider.download_files doing
   'from pridepy.files.files import Files') are kept lazy — they are
   genuinely cyclic and required for backward-compat test patching.

   Also: commands/by_list.py's 'from pridepy.providers import registry'
   hoisted to module top (no Files dependency, no cycle risk).

   Note: 'import requests' is kept in files.py (noqa: F401) because
   test suites patch 'pridepy.files.files.requests.get' directly.

Tests: 68 passed, 4 skipped. flake8 --select=E9,F63,F7,F82 now clean.
Migrated ~17 patch.object(Files, "X") test targets to canonical locations:
- PRIDE methods now patched on PrideProvider
- transport/util helpers patched on transport / util modules
- by_url helpers patched on commands.by_url

Moved PRIDE protocol workers permanently into PrideProvider:
- download_files_from_{aspera,globus,s3,ftp}
- _batch_download_by_protocol, _download_with_fallback, _globus_download_one
- _protocol_sequence, download_private_file_name, stream_all_files_by_project
- get_submitted_file_path_prefix, save_checksum_file, get_ascp_binary,
  get_output_file_name

Eliminated every from pridepy.files.files import Files lazy import inside
providers/ and commands/. Providers call self.X / transport.X / util.X
directly. Zero back-into-Files coupling.

Files dropped from 912 to 382 LOC: only public CLI methods + class-attribute
constant re-exports. Kept ~5 one-line @staticmethod shims for likely
downstream imports (compute_md5, validate_download, read_checksum_file,
download_ftp_urls, download_http_urls) plus the accession-matcher helpers
(is_massive_accession etc. and is_direct_download_accession) as useful API.

No behaviour change. No test assertion changes. 68 passed, 4 skipped.
Flake8 (--select=E9,F63,F7,F82) clean.
Pure move: pridepy/providers/* -> pridepy/download/*, plus
commands/by_url.py and commands/by_list.py -> download/. All imports
updated from pridepy.providers / pridepy.commands to pridepy.download.
download/__init__.py docstring rewritten to describe the download
subsystem. No logic change. Test suite green (68 passed, 4 skipped).
The facade moved to pridepy/download/client.py and the class renamed
Files -> Client. pridepy/files/files.py is now a back-compat shim
(Files = Client, plus Progress re-export). CLI imports Client (aliased
as Files internally to keep the diff minimal). Test imports keep working
via the shim. No behaviour change; 68 passed, 4 skipped.
Fold BaseDirectDownloadProvider into a single rich Provider(ABC) that
owns the shared download workflow (get_raw_files / get_category_files /
find_file / download_all_raw / download_category / download_by_name /
download_by_filenames) plus a default scheme-partitioning download_files.
Adapters (MassIVE / JPOST / iProX / ProteomeXchange) now subclass
Provider directly and drop their redundant download_files overrides.
Align PrideProvider.download_files signature with the Provider base and
add a download_by_name override implementing PRIDE's public/private
split (V2 private API for private datasets, inherited public path
otherwise). Moves this logic out of the Client facade.
Reduce the public listing/download methods to one-line
registry.resolve(...).<workflow>(...) dispatches now that the Provider
base owns the workflow. Drops the in-facade PRIDE public/private split
and by_list delegation; removes newly-unused imports.
Tier 1 + Tier 2 of the provider-specific helper audit:

by_url.py (Tier 1) -> PrideProvider:
  - _extract_pride_accession -> PrideProvider.extract_accession_from_url
  - _validate_urls_checksums -> PrideProvider.validate_urls_checksums
  download_files_by_url now calls PrideProvider.validate_urls_checksums when
  checksum_check is set. by_url keeps only generic URL-download mechanics.

util.py (Tier 2) -> PrideProvider:
  - _get_download_url (the aspera/ftp/globus/s3 protocol resolution +
    PRIDE archive HTTPS rewrite for globus) -> PrideProvider._get_download_url
  - _resolve_local_path -> PrideProvider._resolve_local_path
  The base Provider.get_download_url is now generic (returns the
  'FTP Protocol' location value, which is what MassIVE/JPOST/iProX need);
  PrideProvider.get_download_url overrides it with the multi-protocol
  resolution. PRIDE's static download methods call
  PrideProvider._get_download_url directly.

util.py now holds only genuinely generic helpers (Progress, compute_md5,
validate_download, read_checksum_file, _find_tsv_columns, _is_md5_checksum,
_remove_if_exists). Removed the lazy 'from pridepy.download.pride import
PrideProvider' imports that util needed for the moved functions.

Test: the one test referencing util._get_download_url now references
PrideProvider._get_download_url (PrideProvider already imported).

No behaviour change. 68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean.
Smoke MD5 unchanged.
The pridepy/files/files.py shim (Files = Client re-export) was load-bearing
only for the test suite — the CLI already imports from pridepy.download.client.
Updated the 9 test files to 'from pridepy.download.client import Client as Files'
(keeps the local Files alias so test bodies are otherwise untouched) and
deleted the pridepy/files/ package entirely.

Breaking change for any downstream code doing 'from pridepy.files.files
import Files' — use 'from pridepy.download.client import Client' instead.
Pre-1.0 cleanup; the class was already renamed Files -> Client in this PR.

Also updated two stale docstrings (transport.py, client.py) that referenced
the old pridepy.files.files path.

68 passed, 4 skipped. flake8 (E9,F63,F7,F82) clean. Smoke MD5 unchanged.
ypriverol added 2 commits May 28, 2026 20:49
…ests

The dev->master build (#106) failed because test_raw_files hits the live
PRIDE API and a CI read timeout made Util.read_json_stream return None, which
then crashed with "TypeError: 'NoneType' object is not iterable" in
base.Provider.get_category_files.

- base.py: add _list_files_checked() and use it in get_raw_files,
  get_category_files, find_file, and download_by_filenames so an unreachable
  API raises a clear RuntimeError instead of an opaque TypeError.
- test_raw_files.py: these are live integration tests; skip them when the
  PRIDE API is unreachable so a transient timeout no longer fails the build.
- flake8 hygiene (informational warnings): drop unused imports
  (registry in test_download_resilience; keep __init__ re-export with noqa),
  unused `e` in api_handling, `== True` -> truthy in pride.py, and wrap two
  over-long docstring lines in pridepy.py.
The import-time reachability probe wasn't enough: it hit the fast API root
while the tests call slower data endpoints, so the build still flaked when a
data call timed out (RuntimeError "returned no data").

- Add pridepy/tests/_live_api.tolerate_api_outage: a context manager that
  skipTest()s on the signals an outage produces (requests.RequestException
  from get_api_call, the listing guard's RuntimeError, and len(None)/iterate
  TypeError from helpers that return None). AssertionError is not caught, so
  real regressions still fail.
- Wrap the live calls in test_raw_files and test_search with it (replaces the
  import-time probe).
- pride.get_submitted_file_path_prefix now routes through _list_files_checked
  so an outage raises the catchable RuntimeError instead of a TypeError.

Verified: with the API up all six live tests run and pass; with the API
helpers forced to fail, all six skip cleanly (no failures).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors pridepy’s download implementation into a modular provider-based architecture (PRIDE + direct-download adapters for MassIVE/JPOST/iProX), updates the CLI/tests to use the new Client facade, and substantially expands the README to document the new repositories and options. It also adds Codacy CLI bootstrap tooling and bumps the package version.

Changes:

  • Introduces pridepy.download provider/registry/transport architecture with repository adapters (PRIDE, MassIVE, JPOST, iProX, ProteomeXchange) and a Client facade.
  • Updates CLI entry points and test suite to use pridepy.download.client.Client, adds resilience/integrity tests (size checks, safe path joining, truncated-download detection).
  • Adds expanded documentation plus Codacy CLI bootstrap/config, and bumps version to 0.0.16.

Reviewed changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
README.md Major documentation rewrite covering CLI commands, new repositories, options, and Python API examples.
pyproject.toml Version bump to 0.0.16.
pridepy/util/api_handling.py Minor exception handler cleanup in streaming helper.
pridepy/pridepy.py CLI now uses the new download Client; updates some option defaults/help text.
pridepy/__init__.py Keeps main re-export with a linter-friendly comment.
pridepy/download/__init__.py Adds high-level architecture docstring for the download subsystem.
pridepy/download/base.py Adds Provider base class implementing shared listing/download workflow and URL-scheme dispatch.
pridepy/download/registry.py Adds provider registration + accession-to-provider resolution.
pridepy/download/transport.py Adds shared FTP/FTPS/HTTP transport with resume/retry/parallelization and safety checks.
pridepy/download/util.py Moves checksum/progress utilities into a shared module for providers.
pridepy/download/pride.py Implements PRIDE-specific behavior including multi-protocol fallback + private download path.
pridepy/download/massive.py Adds MassIVE provider with FTPS listing and HTTPS fallback via GNPS2 index/ProteoSAFe endpoint.
pridepy/download/jpost.py Adds JPOST provider with PROXI listing and FTP-walk fallback.
pridepy/download/iprox.py Adds iProX provider that parses PX XML and downloads over HTTP(S) with resume support.
pridepy/download/proteomexchange.py Adds explicit ProteomeXchange XML provider used by download-px-raw-files.
pridepy/download/by_url.py Adds URL-based downloader with scheme dispatch and optional PRIDE checksum validation.
pridepy/download/client.py Adds Client facade with compatibility shims and provider dispatch.
pridepy/files/files.py Removes legacy monolithic Files implementation in favor of the new download subsystem.
pridepy/tests/_live_api.py Adds helper to skip (not fail) live-API tests during transient PRIDE outages.
pridepy/tests/test_search.py Updates tests to use Client and wraps live API calls with outage tolerance.
pridepy/tests/test_raw_files.py Updates tests to use Client and wraps live API calls with outage tolerance.
pridepy/tests/test_massive_files.py Updates MassIVE tests for new provider/transport structure and adds new coverage.
pridepy/tests/test_jpost_files.py Adds new JPOST provider tests.
pridepy/tests/test_iprox_files.py Adds new iProX provider tests.
pridepy/tests/test_ftp_download_validation.py Adds coverage for FTP size-mismatch retry behavior.
pridepy/tests/test_download_resilience.py Updates and expands resilience tests for new provider/transport behavior.
pridepy/tests/test_download_by_url.py Updates URL-download tests to patch the new by_url module.
pridepy/tests/test_download_by_list.py Updates manifest-download tests to patch PrideProvider listing/download.
pridepy/tests/test_authentication.py Updates imports to use the new Client.
.codacy/codacy.yaml Adds Codacy configuration for runtimes/tools.
.codacy/cli.sh Adds a Codacy CLI v2 bootstrap/download script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .codacy/cli.sh Outdated
Comment thread .codacy/cli.sh Outdated
Comment on lines +3 to +10
iProX publishes the ProteomeXchange XML for each dataset at a deterministic
path on its anonymous HTTPS download server::

http://download.iprox.org/<accession>/PX_<accession>.xml

The referenced files are served from the same host over HTTPS with byte-range
support, so resume and parallel downloads use the same plumbing as PRIDE
HTTP(S) transfers.
Comment thread .codacy/cli.sh Outdated
Comment thread .codacy/cli.sh Outdated
Comment on lines +3 to +10
iProX publishes the ProteomeXchange XML for each dataset at a deterministic
path on its anonymous HTTPS download server::

http://download.iprox.org/<accession>/PX_<accession>.xml

The referenced files are served from the same host over HTTPS with byte-range
support, so resume and parallel downloads use the same plumbing as PRIDE
HTTP(S) transfers.
Comment thread .codacy/cli.sh Outdated
Comment thread .codacy/cli.sh Outdated
Comment on lines +3 to +10
iProX publishes the ProteomeXchange XML for each dataset at a deterministic
path on its anonymous HTTPS download server::

http://download.iprox.org/<accession>/PX_<accession>.xml

The referenced files are served from the same host over HTTPS with byte-range
support, so resume and parallel downloads use the same plumbing as PRIDE
HTTP(S) transfers.
Comment on lines +27 to +33
class Client:
"""High-level facade over the per-repository adapters."""

# PRIDE class-attribute re-exports (kept here for back-compat).
V3_API_BASE_URL = PrideProvider.V3_API_BASE_URL
API_BASE_URL = PrideProvider.API_BASE_URL
API_PRIVATE_URL = PrideProvider.API_PRIVATE_URL
ypriverol and others added 3 commits May 29, 2026 07:53
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…ings; Files migration note)

- Remove .codacy/cli.sh (resolves missing fatal() and unsafe eval findings)
- Correct iProX test docstring/comments: transport is HTTP, not HTTPS
- Document Files -> Client removal as a breaking change in README
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pridepy/download/pride.py (2)

115-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard empty RAW listing and the regex match.

_list_files_checked now protects against a None listing, but two reachable crashes remain: if the dataset has no RAW files, raw_files[0] raises IndexError; and if the first RAW URL doesn't match the path pattern, re.search(...) returns None and .group() raises AttributeError.

🛡️ Proposed guard
         records = self._list_files_checked(accession)
         raw_files = [r for r in records if r["fileCategory"]["value"] == "RAW"]
+        if not raw_files:
+            raise ValueError(f"No RAW files found for accession {accession}")
         first_file = raw_files[0]["publicFileLocations"][0]["value"]
-        path_fragment = re.search(r"\d{4}/\d{2}/PXD\d*", first_file).group()
-        return path_fragment
+        match = re.search(r"\d{4}/\d{2}/PXD\d*", first_file)
+        if not match:
+            raise ValueError(
+                f"Could not derive submitted-path prefix from {first_file}"
+            )
+        return match.group()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/pride.py` around lines 115 - 119, The code that extracts a
path fragment from RAW file records can crash when no RAW files exist or when
the URL doesn't match the expected pattern; update the block that uses
_list_files_checked to first check if raw_files is non-empty and raise a clear
ValueError (including the accession) if empty, then compute first_file and call
re.search but check the returned match is not None before calling .group() and
raise another ValueError (including accession and the offending first_file URL)
if the regex fails; reference the variables/functions raw_files, first_file,
_list_files_checked and the re.search call so the guards are added immediately
around them.

293-301: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a timeout to urlopen for the checksum fetch.

urllib.request.urlopen defaults to no timeout, so a stalled connection can hang the checksum download indefinitely on a request thread.

🛡️ Proposed fix
-        with urllib.request.urlopen(request) as response:
+        with urllib.request.urlopen(request, timeout=60) as response:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/pride.py` around lines 293 - 301, The checksum download uses
urllib.request.urlopen(request) without a timeout which can hang; update the
fetch in pride.py to pass a sensible timeout (e.g., timeout seconds or a
configurable constant) to urllib.request.urlopen so the call fails fast on
stalled connections. Locate the block that builds the Request (variable request)
and the urlopen call and add a timeout parameter (make it a named constant or
function parameter if needed), and ensure any timeout exception is propagated or
handled consistently with existing error handling around writing output_path for
accession into output_folder.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/tests/test_search.py`:
- Around line 62-69: The variable public_project is inverted: currently it sets
public_project = True when project_status.text == "PRIVATE" and False when
"PUBLIC"; update the assignments so public_project = False for "PRIVATE" and
public_project = True for "PUBLIC" (in the block that checks
project_status.status_code == 200), and keep the existing exception for other
values; ensure logging.debug(f"Public project: {public_project}") reflects the
corrected boolean.

---

Outside diff comments:
In `@pridepy/download/pride.py`:
- Around line 115-119: The code that extracts a path fragment from RAW file
records can crash when no RAW files exist or when the URL doesn't match the
expected pattern; update the block that uses _list_files_checked to first check
if raw_files is non-empty and raise a clear ValueError (including the accession)
if empty, then compute first_file and call re.search but check the returned
match is not None before calling .group() and raise another ValueError
(including accession and the offending first_file URL) if the regex fails;
reference the variables/functions raw_files, first_file, _list_files_checked and
the re.search call so the guards are added immediately around them.
- Around line 293-301: The checksum download uses
urllib.request.urlopen(request) without a timeout which can hang; update the
fetch in pride.py to pass a sensible timeout (e.g., timeout seconds or a
configurable constant) to urllib.request.urlopen so the call fails fast on
stalled connections. Locate the block that builds the Request (variable request)
and the urlopen call and add a timeout parameter (make it a named constant or
function parameter if needed), and ensure any timeout exception is propagated or
handled consistently with existing error handling around writing output_path for
accession into output_folder.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b82081e-8aaf-40be-8f19-38af0a7e8dc6

📥 Commits

Reviewing files that changed from the base of the PR and between cb7e58a and 9609456.

📒 Files selected for processing (11)
  • README.md
  • pridepy/__init__.py
  • pridepy/download/base.py
  • pridepy/download/pride.py
  • pridepy/pridepy.py
  • pridepy/tests/_live_api.py
  • pridepy/tests/test_download_resilience.py
  • pridepy/tests/test_iprox_files.py
  • pridepy/tests/test_raw_files.py
  • pridepy/tests/test_search.py
  • pridepy/util/api_handling.py
💤 Files with no reviewable changes (1)
  • pridepy/tests/test_download_resilience.py
✅ Files skipped from review due to trivial changes (2)
  • pridepy/init.py
  • pridepy/util/api_handling.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pridepy/tests/test_raw_files.py
  • pridepy/pridepy.py
  • README.md

Comment thread pridepy/tests/test_search.py
@ypriverol
Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

ypriverol added 5 commits May 29, 2026 10:42
MassIVE spreads datasets across versioned roots (/v01../vNN), so a fixed
/v01 prefix made FTPS listing fail with "550 CD issue" for datasets stored
elsewhere (e.g. MSV000088302 under /v04), forcing the HTTPS fallback.

- Add transport._resolve_and_walk_ftp_dataset: probe top-level roots for
  <root>/<accession> over one connection, then walk the match.
- Prefer versioned (vNN) roots over auxiliary x01/z01 roots, which can hold
  only a partial peak-only copy (MSV000088302: 234 files under v04 vs 36
  peak-only under z01).
- Make _get_public_ftp_url / _build_file_record version-agnostic so download
  URLs and relative paths preserve the discovered root.
- Tests for non-v01 listing, version-root preference, and URL building.
…ucture to opt out

Files now download directly into the output folder by basename instead of
recreating the dataset's subdirectory tree (raw/.../*.raw). Colliding
basenames are de-duplicated deterministically (run.raw, run_1.raw) so
skip-if-downloaded and resume stay stable across runs.

- Add util.flatten_relative_paths: pure, deterministic basename dedup.
- base.Provider.download_files(flatten=True): compute flat dedup destination
  names over the whole set; flatten=False preserves relativePath layout.
- Thread flatten through Client + provider download_by_* and the PX provider.
- Add --preserve-structure CLI flag to download-all-public-raw-files,
  download-all-public-category-files, download-files-by-list,
  download-px-raw-files.
- PRIDE accepts flatten for parity but already writes flat (see #107).
…#107)

The legacy PrideProvider.download_files_from_ftp used one shared connection
with no resume: a single large-file read timeout (hardcoded 30s) left the
control socket broken, then every remaining file in the batch failed
instantly with "cannot read from timed out object", and partial files
restarted from byte 0 on retry.

Route the FTP batch through transport.download_ftp_urls instead, which:
- downloads each file with REST-based resume (no restart-from-zero),
- reconnects per file so one timeout no longer cascades to the rest,
- verifies post-transfer size and retries truncated files.

Files still land flat by basename, matching _resolve_local_path used by the
Phase-2 validation/fallback. Remove the legacy download_files_from_ftp and
its now-unused ftplib/socket/FTP imports.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/base.py`:
- Around line 268-277: The loop collecting entries calls
self.get_download_url(record) without forwarding the requested protocol, so
provider implementations never see the selected protocol; update that call to
pass the protocol argument (e.g., self.get_download_url(record, protocol)) when
building entries, keep using the returned url and relpath as before, and ensure
the variable name matches the protocol parameter used by the caller so
protocol-aware URL resolution in get_download_url(record, protocol) is
exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8297bf14-1a16-44ae-8db6-8dedd668e8cd

📥 Commits

Reviewing files that changed from the base of the PR and between 8b2e8f2 and de05b06.

📒 Files selected for processing (12)
  • README.md
  • pridepy/download/base.py
  • pridepy/download/client.py
  • pridepy/download/pride.py
  • pridepy/download/proteomexchange.py
  • pridepy/download/util.py
  • pridepy/pridepy.py
  • pridepy/tests/test_cli_flatten.py
  • pridepy/tests/test_download_resilience.py
  • pridepy/tests/test_flatten_paths.py
  • pridepy/tests/test_jpost_files.py
  • pridepy/tests/test_massive_files.py
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • pridepy/download/util.py
  • pridepy/tests/test_jpost_files.py
  • pridepy/download/client.py

Comment thread pridepy/download/base.py
@ypriverol
Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

- client.get_file_from_api: chain re-raised exception with `from e`.
- base.Provider: tolerate partial/None records in get_raw_files /
  get_category_files / find_file via defensive .get() access.
- base.Provider.download_files: forward the requested protocol to
  get_download_url so protocol-aware adapters receive it.
- pride.get_submitted_file_path_prefix: raise a clear error when no RAW
  file has a public location, or the path layout doesn't match (no bare
  IndexError / AttributeError).
- pride: add timeouts to the private-file requests.get and the checksum
  urllib.urlopen so a stalled server can't hang indefinitely.
- transport.download_ftp_urls: reject FTP URLs with no host (fail fast).
- iprox / proteomexchange: parse remote XML with defusedxml (XXE/entity
  hardening); add defusedxml dependency.
- Tests for the new guards and protocol forwarding.
@ypriverol
Copy link
Copy Markdown
Contributor Author

Addressed the outstanding review comments in commit 1494462 (on dev):

  • client.pyget_file_from_api now chains the original exception (raise ... from e).
  • base.pyget_raw_files / get_category_files / find_file tolerate partial/None records via defensive .get() access (the None-listing case was already guarded by _list_files_checked).
  • base.pydownload_files forwards the requested protocol to get_download_url so protocol-aware adapters receive it.
  • pride.pyget_submitted_file_path_prefix raises a clear error instead of a bare IndexError/AttributeError when there is no public RAW file or the path layout doesn't match.
  • pride.py — added timeouts to the private-file requests.get and the checksum urllib.urlopen.
  • transport.pydownload_ftp_urls rejects FTP URLs with no host (fail fast).
  • iprox.py / proteomexchange.py — remote XML now parsed with defusedxml (added as a dependency).

Added tests in pridepy/tests/test_review_fixes.py; full suite: 110 passed, 4 skipped (live).

Already resolved earlier in this PR: the iProX test docstring (HTTP, not HTTPS). The removal of pridepy.files.files.Files is an intentional breaking change for 0.0.x, documented in the README and usage guide.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/pride.py`:
- Around line 125-131: The regex that extracts the submitted-path prefix
currently hardcodes "PXD" and fails for PRD accessions; update the pattern used
in re.search (the line with match = re.search(..., first_file)) to accept both
PXD and PRD (e.g., use (?:PXD|PRD)\d+ instead of PXD\d*) so match.group()
returns either form; keep the existing ValueError behavior if no match is found.

In `@requirements.txt`:
- Around line 10-11: Update the requirements entry for the httpx dependency to
pin it to a safe version by replacing the bare "httpx" line in requirements.txt
with a minimum/pinned version (for example "httpx>=0.23.0" or "httpx==0.23.3");
this ensures installs will not resolve to vulnerable pre-0.23.0 releases—edit
the httpx line only and keep other entries like defusedxml unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 645d9ce4-37ac-43a2-bd66-340bc761dc30

📥 Commits

Reviewing files that changed from the base of the PR and between de05b06 and 1494462.

📒 Files selected for processing (11)
  • README.md
  • docs/usage.md
  • pridepy/download/base.py
  • pridepy/download/client.py
  • pridepy/download/iprox.py
  • pridepy/download/pride.py
  • pridepy/download/proteomexchange.py
  • pridepy/download/transport.py
  • pridepy/tests/test_review_fixes.py
  • pyproject.toml
  • requirements.txt
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • pyproject.toml
  • pridepy/download/proteomexchange.py
  • pridepy/download/iprox.py
  • pridepy/download/transport.py
  • pridepy/download/base.py
  • pridepy/download/client.py

Comment thread pridepy/download/pride.py Outdated
Comment thread requirements.txt Outdated
- download_files_from_aspera/globus/s3 now collect per-file failures and
  raise RuntimeError after attempting every file, instead of logging and
  returning success — so a fully failed batch no longer exits 0 (the
  multi-protocol orchestrator still catches this and falls back per file).
- Use file.get("fileName") in the S3 helper to avoid KeyError on partial
  records.
- get_submitted_file_path_prefix now matches PRD as well as PXD accessions.
- Tests for batch-failure propagation and PRD support.
@ypriverol
Copy link
Copy Markdown
Contributor Author

Follow-up in commit 858a118 (the two remaining items):

  • pride.pydownload_files_from_aspera / download_files_from_globus / download_files_from_s3 now collect per-file failures and raise a RuntimeError after attempting every file, instead of logging and returning success. A fully failed batch no longer exits 0; the multi-protocol orchestrator still catches this and falls back per file. (Also use .get("fileName") in the S3 helper to avoid KeyError on partial records.)
  • pride.pyget_submitted_file_path_prefix now matches legacy PRD accessions as well as PXD.

Tests added for batch-failure propagation and PRD support. Full suite: 114 passed, 4 skipped (live).

@ypriverol ypriverol merged commit 59c55bf into master May 29, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants