
Reduce container overhead #267

@kim-fehl

Description of the bug

Currently the pipeline uses more than 20 container images, totaling ~60 GB of disk usage. I remember that for nf-core modules the recommended principle is "one tool – one container", but here we mostly have local modules, and I think some of the redundancy (caused by incremental development) can be reduced.

| IMAGE | IMAGE ID | DISK USAGE | CONTENT SIZE |
| --- | --- | --- | --- |
| SEQERA: anndata2ri_bioconductor-singlecellexperiment_anndata_r-seurat:5fae42aabf7a1c5f | 291a48658716 | 3.38GB | 846MB |
| SEQERA: anndata:0.10.9--1eab54e300e1e584 | 3471efcc8b48 | 936MB | 231MB |
| SEQERA: anndata_pyyaml:82c6914e861435f7 | 7946ce8a97db | 1.06GB | 261MB |
| SEQERA: anndata_upsetplot:784e0f450da10178 | 766c2dff4b54 | 1.18GB | 306MB |
| SEQERA: bbknn_pyyaml_scanpy:4cf2984722da607f | 453b5e1f8972 | 1.6GB | 382MB |
| SEQERA: bioconductor-celldex_bioconductor-hdf5array_bioconductor-singlecellexperiment_r-yaml:13bf33457e3e7490 | fdb9aa052292 | 2.33GB | 634MB |
| SEQERA: celltypist_scanpy:44b604b24dd4cf33 | bfe009b0a96c | 1.78GB | 431MB |
| SEQERA: harmonypy_pyyaml_scanpy:f6cc57196369fb1e | 0c62b23a31d6 | 1.63GB | 392MB |
| SEQERA: leidenalg_python-igraph_pyyaml_scanpy:4936fa196b5f4340 | 8644a451da2a | 1.66GB | 401MB |
| SEQERA: liana_pyyaml:776fdd7103df146d | 131e6bd9dccb | 2.24GB | 507MB |
| SEQERA: multiqc:1.33--ee7739d47738383b | abd5751768f8 | 2.01GB | 432MB |
| SEQERA: pandas:2.2.3--9b034ee33172d809 | 50da2ef5f060 | 765MB | 190MB |
| SEQERA: python-igraph_pyyaml_scanpy:cc0304f4731f72f9 | 8f65ff8a2191 | 1.66GB | 401MB |
| SEQERA: python_pyyaml_scanpy:b5509a698e9aae25 | e0dac9eda4d7 | 1.85GB | 461MB |
| SEQERA: python_pyyaml_scanpy_scikit-image:750e7b74b6d036e4 | e2816307a73f | 2.04GB | 509MB |
| SEQERA: pyyaml_scanpy:3c9e9f631f45553d | 7ed2839670f9 | 1.63GB | 392MB |
| SEQERA: pyyaml_scanpy:a3a797e09552fddc | 228c2994c5f4 | 1.86GB | 466MB |
| SEQERA: scanpy_upsetplot:1ce883f3ff369ca8 | a91e0a660553 | 1.67GB | 414MB |
| SEQERA: scvi-tools:1.3.3--df115aabdccb7d6b | 551e3b44c383 | 4.66GB | 1.08GB |
| SEQERA: scvi-tools:1.4.1--47f5b0e6b70fd131 | 0ac460cb48b1 | 3.47GB | 797MB |
| nicotru/celda:1d48a68e9d534b2b | 3a4f38d26238 | 2.95GB | 759MB |
| nicotru/scds:7788dbeb87bc7eec | e6aac618e327 | 2.48GB | 651MB |
| nicotru/seurat:b3b12d17271014d9 | 22f891364efc | 3.35GB | 853MB |
| nicotru/soupx:f6297681695fbfcf | 222d79287a15 | 2.82GB | 700MB |
| saditya88/singler:0.0.1 | cb267ab7d826 | 9.13GB | 2.64GB |
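A quick sanity check of the total: summing the DISK USAGE column of the table above confirms the ~60 GB figure (the list of sizes is copied verbatim from the table):

```python
# Sum the DISK USAGE column (in GB) from the table above.
sizes_gb = [
    3.38, 0.936, 1.06, 1.18, 1.6, 2.33, 1.78, 1.63, 1.66, 2.24,
    2.01, 0.765, 1.66, 1.85, 2.04, 1.63, 1.86, 1.67, 4.66, 3.47,
    2.95, 2.48, 3.35, 2.82, 9.13,
]
total = sum(sizes_gb)
print(f"{len(sizes_gb)} images, {total:.1f} GB total")  # 25 images, 60.1 GB total
```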

(This issue has been brought to my attention as I rent the server and also pay for disk space 😃)

I asked Codex to analyze the repo structure and find ways to optimize container usage, without touching nf-core/modules and accounting for the Python version pinning you mentioned. Here's the output:

Implementation Plan

  • Use nf-core module containers as the canonical baseline for overlapping local tool families.
    • Align local SCVITOOLS_SCVI and SCVITOOLS_SCANVI to the same scvi-tools=1.3.3 container/env family already used by vendored SCVITOOLS_SOLO and SCVITOOLS_SCAR.
  • Collapse the local generic scanpy 1.11.5 / 1.11.2 split onto one pinned local baseline that is compatible with the nf-core scrublet stack.
    • Standardize local scanpy modules that only need core scanpy functionality on python=3.12.11, scanpy=1.11.2, pyyaml=6.0.2.
    • Apply the same base version to additive local scanpy envs (neighbors, paga, leiden, harmony, bbknn) while keeping their extra packages.
  • Collapse the local upsetplot fork.
    • Change ADATA_UPSETGENES to use anndata directly instead of scanpy for reading .h5ad.
    • Move ADATA_UPSETGENES and DOUBLET_REMOVAL onto one shared pinned local env: python=3.12.11, anndata=0.12.7, upsetplot=0.9.0.
  • Replace docker.io/saditya88/singler:0.0.1 with a local Dockerfile built from a minimal R/Wave-compatible base containing the actual R dependencies used by singleR.R, including bioconductor-hdf5array and anndataR.
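The shared scanpy baseline from the plan above could live in a single conda env reused by all the local scanpy modules, with the additive modules (leiden, harmony, bbknn, etc.) appending their extra packages on top. A sketch only — the channel layout is an assumption, the pins come straight from the plan:

```yaml
# Hypothetical shared environment.yml baseline for local scanpy modules.
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.12.11
  - scanpy=1.11.2
  - pyyaml=6.0.2
```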

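For the singleR replacement, a minimal Dockerfile could look like the sketch below. The package names are assumptions to be verified against what singleR.R actually loads, and whether anndataR is installable from a conda channel (versus needing a GitHub install) should be checked:

```dockerfile
# Hypothetical minimal replacement for docker.io/saditya88/singler:0.0.1.
# Derive the real package list from the imports in singleR.R.
FROM condaforge/miniforge3:latest
RUN conda install -y -c conda-forge -c bioconda \
        bioconductor-singler \
        bioconductor-hdf5array \
        r-yaml \
        r-remotes \
    && conda clean -afy
# anndataR may not be packaged on bioconda; if so, install it from GitHub:
# RUN R -e 'remotes::install_github("scverse/anndataR")'
```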
Test Plan

  • Run nf-tests for affected local modules and subworkflows:
    • integrate
    • quality_control
    • doublet_detection
    • celltype_assignment
    • affected local modules under scanpy, scvitools, adata/upsetgenes, doublet_detection/doublet_removal, and celltypes/singler
  • Run pipeline tests with -profile test,docker and -profile test_full,docker.

Does this plan make sense?
You also mentioned that the private Docker Hub images can be replaced with Seqera ones, but Codex thought that was too much for one conservative pass :)

