Merged
55 commits
d471e0b
Add direct downloads for JPOST and iProX accessions; bump to 0.0.16
ypriverol May 27, 2026
3f8ed3f
Direct downloads: FTPS for MassIVE, parallelism, defer iProX
ypriverol May 27, 2026
a360378
JPOST PROXI listing, post-transfer size check, iProX accession guard
ypriverol May 27, 2026
df165ab
iProX direct downloads via PX XML + anonymous HTTPS at download.iprox…
ypriverol May 27, 2026
eedbef1
refactor(providers): scaffold providers/ package with Provider ABC + …
ypriverol May 27, 2026
912287d
refactor(providers): move FTP/HTTPS transport into providers/transpor…
ypriverol May 27, 2026
e16c870
refactor(providers): move cross-cutting utilities into providers/util.py
ypriverol May 27, 2026
41539e7
refactor(providers): extract MassiveProvider
ypriverol May 27, 2026
2010f63
refactor(providers): extract JpostProvider with PROXI + FTP listing
ypriverol May 27, 2026
e128a16
refactor(providers): extract IproxProvider with PX XML listing
ypriverol May 27, 2026
835d77b
test(providers): verify BaseDirectDownloadProvider URL-scheme partiti…
ypriverol May 27, 2026
3e358ae
refactor(providers): extract PrideProvider with multi-protocol fallback
ypriverol May 27, 2026
6fd743d
refactor(providers): rewire Files facade through Registry
ypriverol May 27, 2026
6d5637b
test: integration test for PRIDE multi-protocol fallback through facade
ypriverol May 27, 2026
f60a69c
chore(release): bump version to 0.0.17
ypriverol May 27, 2026
02de401
refactor(commands): scaffold commands/ package
ypriverol May 27, 2026
c165345
refactor(commands): move ProteomeXchange XML download into commands/p…
ypriverol May 27, 2026
1a0a9b8
refactor(commands): move download_files_by_list into commands/by_list.py
ypriverol May 27, 2026
ced7415
refactor(commands): move download_files_by_url into commands/by_url.py
ypriverol May 27, 2026
e3d694b
refactor(providers): move ProteomeXchange from commands/ to providers…
ypriverol May 27, 2026
6bc8e5a
fix(files): restore missing imports and hoist lazy imports to module top
ypriverol May 27, 2026
c441355
refactor: collapse Files shim layer; providers own their logic (0.0.18)
ypriverol May 27, 2026
0740c4b
refactor(download): rename providers/ to download/, fold commands/ in
ypriverol May 28, 2026
10ddb70
refactor(download): move files.py to download/client.py; Files -> Client
ypriverol May 28, 2026
bad78dd
refactor(download): Template Method workflow in Provider base
ypriverol May 28, 2026
bfe06d2
refactor(download): PrideProvider owns public/private download_by_name
ypriverol May 28, 2026
1990611
refactor(download): slim Client facade to registry dispatches
ypriverol May 28, 2026
4746408
chore(release): bump version to 0.0.19
ypriverol May 28, 2026
d7ea808
refactor(download): move PRIDE-specific helpers into PrideProvider
ypriverol May 28, 2026
121b06b
refactor(download): remove the pridepy/files/ back-compat shim
ypriverol May 28, 2026
b0776b0
docs+refactor(download): fix review findings and restructure README
ypriverol May 28, 2026
3188719
docs+fix: correct README/CLI accuracy issues from re-review
ypriverol May 28, 2026
0957ec6
fix(download): preserve dataset layout and verify HTTP transfer size
ypriverol May 28, 2026
9b2a61f
feat(massive): HTTPS fallback when FTP/FTPS is blocked
ypriverol May 28, 2026
9c31f15
chore(release): set version to 0.0.16
ypriverol May 28, 2026
cb02d51
fix(download): propagate direct-download failures; px relativePath
ypriverol May 28, 2026
15452c8
fix(download): by_url size verification; correct stale docs
ypriverol May 28, 2026
c224f9a
fix(download): avoid gzip false-positive size check; px root-prefix p…
ypriverol May 28, 2026
1d5b50b
chore(download): align by-url -w help; exercise size check in tests
ypriverol May 28, 2026
cb7e58a
Merge pull request #105 from PRIDE-Archive/refactor/provider-template…
ypriverol May 28, 2026
0595f5a
fix(ci): make listing robust to API outages; deterministic raw-file t…
ypriverol May 28, 2026
f087d95
test: make live PRIDE-API tests skip (not fail) on API outage
ypriverol May 28, 2026
3202993
Potential fix for pull request finding
ypriverol May 29, 2026
dff420f
Potential fix for pull request finding
ypriverol May 29, 2026
9609456
fix: address Copilot PR review (remove codacy bootstrap; iProX docstr…
ypriverol May 29, 2026
8b2e8f2
fix(massive): discover dataset version root instead of assuming /v01
ypriverol May 29, 2026
41c183e
docs: design spec for flatten-by-default downloads
ypriverol May 29, 2026
aca3228
feat(download): flatten into output folder by default, --preserve-str…
ypriverol May 29, 2026
e58a34c
docs: document GitHub-branch install and flat-by-default downloads (-…
ypriverol May 29, 2026
d73de3e
fix(pride): route FTP batch downloads through shared transport (close…
ypriverol May 29, 2026
de05b06
chore: remove internal design doc from repo
ypriverol May 29, 2026
3a3e3e8
docs: add docs/usage.md guide; slim README to install + overview
ypriverol May 29, 2026
1494462
fix(download): address PR #106 review comments (hardening)
ypriverol May 29, 2026
858a118
fix(pride): propagate batch failures and accept PRD accessions
ypriverol May 29, 2026
2ebc99c
build: pin httpx>=0.27.0 in requirements.txt (GHSA-h8pj-cxx2-jfg2)
ypriverol May 29, 2026
15 changes: 15 additions & 0 deletions .codacy/codacy.yaml
@@ -0,0 +1,15 @@
runtimes:
- dart@3.7.2
- go@1.22.3
- java@17.0.10
- node@22.2.0
- python@3.11.11
tools:
- dartanalyzer@3.7.2
- eslint@8.57.0
- lizard@1.17.31
- pmd@7.11.0
- pylint@3.3.6
- revive@1.7.0
- semgrep@1.78.0
- trivy@0.66.0
243 changes: 47 additions & 196 deletions README.md
@@ -8,7 +8,7 @@

You can:
- download public and private PRIDE files
- download public MassIVE datasets directly from `MSV...` accessions
- download public MassIVE (`MSV...`), JPOST (`JPST...`), and iProX (`IPX...`) datasets directly. MassIVE goes through FTPS at `massive-ftp.ucsd.edu`, with an automatic HTTPS fallback (via the GNPS2 file index and the `massive.ucsd.edu` ProteoSAFe endpoint) for networks that block FTP/FTPS; JPOST uses the JSON PROXI endpoint at `repository.jpostdb.org` for listings and `ftp.jpostdb.org` for transfers; iProX fetches the dataset's ProteomeXchange XML from `download.iprox.org` and downloads files over anonymous HTTP
- download by category (`RAW`, `SEARCH`, `RESULT`, etc.)
- stream project and file metadata
- search projects by keyword and filters
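
The MassIVE bullet above describes an ordered fallback: try FTPS first, then switch to HTTPS when the network blocks FTP/FTPS. A minimal sketch of that pattern follows. It is not pridepy's implementation: `download_with_fallback`, `blocked_ftps`, `fake_https`, and the file path are hypothetical stand-ins, and the transports are simulated so the example runs offline.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical transport signature: takes a remote path, returns bytes written.
Transport = Callable[[str], int]


def download_with_fallback(
    remote_path: str,
    transports: Iterable[Tuple[str, Transport]],
) -> Tuple[str, int]:
    """Try each transport in order; return (name, size) for the first success."""
    errors = []
    for name, fetch in transports:
        try:
            return name, fetch(remote_path)
        except OSError as exc:  # refused connection, timeout, blocked port, ...
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all transports failed: " + "; ".join(errors))


def blocked_ftps(path: str) -> int:
    # Stand-in for an FTPS transfer on a network that blocks FTP/FTPS.
    raise ConnectionRefusedError("port 21 blocked")


def fake_https(path: str) -> int:
    # Stand-in for an HTTPS transfer; pretends 1024 bytes were written.
    return 1024


used, size = download_with_fallback(
    "MSV000082297/raw/sample.raw",  # hypothetical file path
    [("ftps", blocked_ftps), ("https", fake_https)],
)
print(used, size)  # https 1024
```

Collecting the per-transport errors before raising keeps the final message useful when every protocol is blocked.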
@@ -45,231 +45,82 @@ pip install --upgrade pridepy
pridepy --help
```

### Option 3: Install from source (development)
### Option 3: Install the latest code directly from GitHub

```bash
git clone https://github.com/PRIDE-Archive/pridepy
cd pridepy
uv sync --extra dev
uv run pridepy --help
```

## Quick Start (New Users)
To get features that have not been released to PyPI yet, install straight from a
branch. `master` holds the latest stable code; `dev` holds the newest (and
potentially unstable) development work.

### 1) Download all raw files for a project (robust mode)
With `uv`:

```bash
pridepy download-all-public-raw-files \
-a PXD008644 \
-o ./downloads/PXD008644 \
--checksum-check
```

What this does:
- default `ftp` starts with FTP and falls back (`ftp -> aspera -> s3 -> globus`)
- `--checksum-check` downloads project checksums and validates files
- empty/corrupt files are retried automatically
# Latest stable (master)
uv tool install "git+https://github.com/PRIDE-Archive/pridepy@master"

### 2) Continue interrupted downloads safely

```bash
pridepy download-all-public-raw-files \
-a PXD008644 \
-o ./downloads/PXD008644 \
--skip-if-downloaded-already \
--checksum-check
# Bleeding edge (dev)
uv tool install "git+https://github.com/PRIDE-Archive/pridepy@dev"
```

### 3) Download a public MassIVE dataset directly
Or with `pip`:

```bash
pridepy download-all-public-raw-files \
-a MSV000082297 \
-o ./downloads/MSV000082297
```

For direct `MSV...` downloads, `pridepy` enumerates the dataset from MassIVE's public FTP tree. Raw downloads follow MassIVE's own collection layout, so `download-all-public-raw-files` downloads the files stored under the dataset's `raw/` collection.
# Latest stable (master)
pip install --upgrade "git+https://github.com/PRIDE-Archive/pridepy@master"

### 4) Download only selected categories

```bash
pridepy download-all-public-category-files \
-a PXD022105 \
-o ./downloads/PXD022105 \
-c RAW,SEARCH
# Bleeding edge (dev)
pip install --upgrade "git+https://github.com/PRIDE-Archive/pridepy@dev"
```

You can also request a specific MassIVE collection through the same category interface:

```bash
pridepy download-all-public-category-files \
-a MSV000082297 \
-o ./downloads/MSV000082297-results \
-c RESULT
```

### 5) Download one file by name

```bash
pridepy download-file-by-name \
-a PXD022105 \
-f checksum.txt \
-o ./downloads/PXD022105 \
--checksum-check
```

### 6) Download raw files from ProteomeXchange

```bash
pridepy download-px-raw-files \
-a PXD039236 \
-o ./downloads/PXD039236
```

### 6) Download a named subset of files (manifest)

```bash
pridepy download-files-by-list \
-a PXD001819 \
-F files.txt \
-o ./downloads/PXD001819 \
--checksum-check
```

`files.txt` is one filename per line (blank lines and `#` comments are
ignored). Internally each filename is resolved against the project metadata
API and downloaded via the same batch + protocol-fallback engine as
`download-all-public-raw-files`. Use `-f a.raw,b.raw,c.raw` instead of
`-F` for a small inline list.
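
The manifest rules described above (one filename per line, blank lines and `#` comments ignored, or a comma-separated inline list) can be sketched as follows. This is an illustration only: `parse_manifest` and `parse_inline` are hypothetical names rather than pridepy functions, and the sketch assumes that only whole-line comments are skipped.

```python
from typing import Iterable, List


def parse_manifest(lines: Iterable[str]) -> List[str]:
    """Return filenames, skipping blank lines and whole-line '#' comments."""
    names = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        names.append(entry)
    return names


def parse_inline(value: str) -> List[str]:
    """Split a comma-separated inline list such as 'a.raw,b.raw,c.raw'."""
    return [part.strip() for part in value.split(",") if part.strip()]


manifest_text = """# files for PXD001819
a.raw

b.raw
"""
print(parse_manifest(manifest_text.splitlines()))  # ['a.raw', 'b.raw']
print(parse_inline("a.raw,b.raw,c.raw"))           # ['a.raw', 'b.raw', 'c.raw']
```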

Useful options:

- `-p globus` — use the globus download strategy (HTTP Range + resume)
- `-w 3` — download up to 3 files in parallel (globus only, max 3)
- `--checksum-check` — validate files against PRIDE checksums after download
You can pin to any branch, tag, or commit by changing the part after `@` (e.g.
`@v0.0.16` or `@<commit-sha>`).

### 7) Download files from raw URLs
### Option 4: Install from source (development)

```bash
pridepy download-files-by-url \
-F urls.txt \
-o ./downloads/urls
git clone https://github.com/PRIDE-Archive/pridepy
cd pridepy
uv sync --extra dev
uv run pridepy --help
```

`urls.txt` is one fully-qualified URL per line. Schemes `http`, `https`, and
`ftp` are dispatched to the matching downloader. Use `-u/--urls` for one or
more comma-separated URLs, e.g. `--urls https://a.com/x.raw,ftp://b.com/y.raw`.
Note: URLs containing literal commas are not supported with `--urls`; use a
manifest file (`-F`) instead.
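
A rough sketch of the scheme dispatch described above, assuming URLs are grouped by scheme before being handed to a downloader. `partition_by_scheme` and `SUPPORTED_SCHEMES` are hypothetical names, not pridepy's API. The example also shows why a literal comma in a URL breaks `--urls`: a naive split leaves a fragment with no scheme.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse

# Schemes named in the text above; anything else is rejected in this sketch.
SUPPORTED_SCHEMES = {"http", "https", "ftp"}


def partition_by_scheme(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs by scheme so each group can go to its matching downloader."""
    groups = defaultdict(list)
    for url in urls:
        scheme = urlparse(url).scheme.lower()
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported or missing URL scheme: {url!r}")
        groups[scheme].append(url)
    return dict(groups)


inline = "https://a.com/x.raw,ftp://b.com/y.raw"
groups = partition_by_scheme(inline.split(","))
print(groups)  # {'https': ['https://a.com/x.raw'], 'ftp': ['ftp://b.com/y.raw']}

# A URL with a literal comma splits into a fragment that has no scheme.
try:
    partition_by_scheme("https://a.com/x,1.raw".split(","))
except ValueError as exc:
    print("rejected:", exc)
```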
## Usage

Useful options:

- `-p globus` — use globus download strategy for http/https URLs (resume-capable)
- `-w 3` — download up to 3 files in parallel (globus only, max 3)
- `--checksum-check` — validate against PRIDE checksums (accession inferred
from PRIDE URL paths; only PRIDE archive URLs are supported)

## CLI Command Overview
See the **[usage guide](docs/usage.md)** for detailed instructions and examples:
downloading data (PRIDE, MassIVE, JPOST, iProX, ProteomeXchange), category and
manifest downloads, private files, streaming metadata, searching projects, and
the Python API.

```bash
pridepy --help
```

Main commands:
- `download-all-public-raw-files`
- `download-all-public-category-files`
- `download-file-by-name`
- `download-files-by-list`
- `download-files-by-url`
- `download-px-raw-files`
- `list-private-files`
- `stream-files-metadata`
- `stream-projects-metadata`
- `search-projects-by-keywords-and-filters`

## More CLI Examples

### Search projects

```bash
pridepy search-projects-by-keywords-and-filters \
-k human \
-f projectTags==ProteomeTools,organismsPart==Pancreas \
-sd DESC \
-sf accession \
-sf submissionDate
```

### Stream all project metadata to JSON

```bash
pridepy stream-projects-metadata -o all_pride_projects.json
```

### Stream all file metadata for one accession
| Command | Purpose |
| --- | --- |
| `download-all-public-raw-files` | Download every public RAW file of a dataset |
| `download-all-public-category-files` | Download files of one or more categories (RAW, SEARCH, …) |
| `download-file-by-name` | Download a single file (public or private) |
| `download-files-by-list` | Download a named subset of files from a manifest/CSV |
| `download-files-by-url` | Download files from raw `http`/`https`/`ftp` URLs |
| `download-px-raw-files` | Download RAW files resolved from a ProteomeXchange accession |
| `list-private-files` | List files of a private project (needs credentials) |
| `stream-files-metadata` | Stream file metadata (one project or all) to JSON |
| `stream-projects-metadata` | Stream all project metadata to JSON |
| `search-projects-by-keywords-and-filters` | Search projects by keyword and filters |

```bash
pridepy stream-files-metadata -a PXD005011 -o PXD005011_files.json
```

### Download private files

List files:

```bash
pridepy list-private-files -a PXD022105 -u YOUR_USER -p YOUR_PASSWORD
```

Download a private file:
Quick examples:

```bash
pridepy download-file-by-name \
-a PXD022105 \
-f checksum.txt \
-o ./downloads/private \
--username YOUR_USER \
--password YOUR_PASSWORD
```

## Python API Examples

### Example: get raw files for a project

```python
from pridepy.files.files import Files
# Download all public RAW files of a dataset (any repository)
pridepy download-all-public-raw-files -a PXD008644 -o ./downloads/PXD008644 --checksum-check

files = Files()
raw_files = files.get_all_raw_file_list("PXD008644")
print(f"RAW files: {len(raw_files)}")
print(raw_files[0]["fileName"])
```

For MassIVE accessions, the same method returns the files found under the dataset's `raw/` collection:
# Download a ProteomeXchange dataset by its PXD accession
pridepy download-px-raw-files -a PXD039236 -o ./downloads/PXD039236

```python
from pridepy.files.files import Files

files = Files()
raw_files = files.get_all_raw_file_list("MSV000082297")
print(f"MassIVE raw files: {len(raw_files)}")
# Download a native MassIVE / JPOST / iProX dataset
pridepy download-all-public-raw-files -a MSV000082297 -o ./downloads/MSV000082297
```

### Example: search projects

```python
from pridepy.project.project import Project

project = Project()
results = project.search_by_keywords_and_filters(
keyword="PXD009476",
query_filter="",
page_size=25,
page=0,
sort_direction="DESC",
sort_fields="accession",
)
print(f"Hits: {len(results)}")
```
Full option tables and more examples are in [docs/usage.md](docs/usage.md).

## Development and Release (uv)
