Template for wrapping bioinformatics tools as CTS (CDM Task Service) jobs.
Copy this repo to create a new tool: kbaseincubator/cdm_{toolname}.
- Copy this repo → rename to kbaseincubator/cdm_{toolname}
- Edit Dockerfile → swap in the real tool image and entrypoint
- Push / tag a release → GitHub Actions builds and pushes to GHCR automatically
- Ask a CTS admin to register the image (see docs/pattern.md; regular users cannot register images)
- Create a demo notebook at global_share/{your_username}/{toolname}_demo.ipynb on hub.berdl.kbase.us
- Submit a job and verify output lands in MinIO
- Write an importer (PR to kbase/cdm-spark-events-importers) to load results into Delta Lake
See docs/pattern.md for the full pattern with examples.
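The naming conventions used in the steps above can be collected in one place. The following is a minimal sketch, not part of the template itself: the `ToolNames` class and `tool_names` function are hypothetical helpers, and the `:{version}` image tag format is an assumption based on the checkm2 reference image name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolNames:
    """Names derived from a tool name (hypothetical helper, for illustration)."""
    repo: str
    image: str
    demo_notebook: str


def tool_names(toolname: str, username: str, version: str) -> ToolNames:
    """Derive the repo, image, and demo notebook names for a new tool.

    Patterns are taken from this README; the ':{version}' tag suffix is an
    assumption modeled on ghcr.io/kbasetest/cdm_checkm2:0.3.0.
    """
    if not toolname or "/" in toolname:
        raise ValueError(f"invalid tool name: {toolname!r}")
    slug = f"cdm_{toolname}"
    return ToolNames(
        repo=f"kbaseincubator/{slug}",
        image=f"ghcr.io/kbaseincubator/{slug}:{version}",
        demo_notebook=f"global_share/{username}/{toolname}_demo.ipynb",
    )


if __name__ == "__main__":
    names = tool_names("mmseqs2", "your_username", "0.1.0")
    print(names.repo)
    print(names.image)
    print(names.demo_notebook)
```

Running the module prints the three derived names for mmseqs2, which can be checked against the repo and image columns of the status table.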
cdm_{toolname}/
├── Dockerfile # Wraps the tool — the only thing that changes per tool
├── .github/workflows/
│   └── docker-publish.yaml   # CI/CD: builds + pushes to ghcr.io/kbaseincubator/cdm_{toolname}
├── docs/
│   └── pattern.md            # Full CTS tool pattern documentation
├── README.md
└── LICENSE.md
Demo notebooks and importers live in separate repos (see docs/pattern.md).
Status legend:
- Live: image registered in CTS, tested end-to-end, importer deployed
- Image registered: image registered in CTS, demo notebook works, importer in progress
- Awaiting registration: repo + image built and public on GHCR, refdata staged in MinIO if needed, waiting on CTS admin to register
- Repo built: GitHub repo + GHCR image done, no refdata work yet
- Planned: not started
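The legend can be read as a small decision procedure. The sketch below is one possible encoding, assuming four boolean facts about a tool; the `Status` enum, the `classify` function, and its parameter names are hypothetical. It simplifies the legend by folding "tested end-to-end" into the importer check and by treating "no refdata needed" as refdata being ready.

```python
from enum import Enum


class Status(Enum):
    LIVE = "Live"
    IMAGE_REGISTERED = "Image registered"
    AWAITING_REGISTRATION = "Awaiting registration"
    REPO_BUILT = "Repo built"
    PLANNED = "Planned"


def classify(
    *,
    image_built: bool = False,
    refdata_ready: bool = False,
    image_registered: bool = False,
    importer_deployed: bool = False,
) -> Status:
    """Map tool facts to a status label (simplified reading of the legend).

    refdata_ready should be True when refdata is staged in MinIO or when the
    tool needs no refdata at all.
    """
    if not image_built:
        return Status.PLANNED
    if image_registered:
        # Registered images are Live only once the importer is deployed.
        return Status.LIVE if importer_deployed else Status.IMAGE_REGISTERED
    if refdata_ready:
        return Status.AWAITING_REGISTRATION
    return Status.REPO_BUILT


if __name__ == "__main__":
    print(classify(image_built=True, refdata_ready=True).value)
```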
| Tool | Repo | Image | Refdata | Status |
|---|---|---|---|---|
| mmseqs2 | cdm_mmseqs2 | 0.1.0 | no | Live, importer pending merge (PR #35) |
| kofamscan | cdm_kofamscan | 0.1.0 | cts-refdata/kofam/2025-04-30/kofam_refdata.tar.gz (UUID 84b31af0-…) | Image registered, demo notebook + importer next |
| bakta | cdm_bakta | 0.1.0 | cts-refdata/bakta/v6.0/bakta_db.tar.gz (UUID 663783c1-…) | Image registered, demo notebook + importer next |
| psortb | cdm_psortb | 0.1.0 | none (bundled in image) | Image registered, demo notebook + importer next |
| gtdbtk | — | — | ~100GB taxonomy DB | Planned |
| eggNOG | — | — | eggNOG DB | Planned |
| RAST | — | — | none | Planned (custom container, needs upstream coordination) |
| transyt | — | — | none | Planned (custom container) |
| modelseedpy | — | — | none | Planned (custom container from upstream maintainer) |
| skani | — | — | optional (refdata for query mode only) | Planned (deferred per scientific priorities) |
External / not built via this skeleton:
- checkm2: ghcr.io/kbasetest/cdm_checkm2:0.3.0 (existing reference example, predates this skeleton)
- InterProScan: external container (deployed to dev only, currently broken)
Things waiting on others. Update as items move.
| PR | Tool | Description | Waiting on |
|---|---|---|---|
| kbase/cdm-spark-events-importers#35 | mmseqs2 | First importer for the cluster TSV output. CI green, ready to merge. | CTS admin (review + merge + redeploy event processor) |
Handoffs to the CTS admin are tracked as GitHub issues with task list checkboxes, which the admin ticks off as each step completes. Templates and the archive live in handoffs/.
Currently open: none. All previously open handoffs (kofamscan, bakta, psortb) were registered on 2026-05-07 and are pending verification and issue close.
Recently completed (templates kept for reference):
Refdata path convention: cts-refdata/{toolname}/{refdata_version}/{filename}. The path version is the refdata version, not the tool version. See handoffs/README.md for full conventions and process for adding new handoffs.
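As an illustration of the path convention, here is a minimal sketch of a helper that assembles a refdata path. The function name `refdata_path` is hypothetical; the prefix and component order come from the convention above, and the example values are taken from the status table.

```python
REFDATA_PREFIX = "cts-refdata"


def refdata_path(toolname: str, refdata_version: str, filename: str) -> str:
    """Build cts-refdata/{toolname}/{refdata_version}/{filename}.

    refdata_version is the version of the reference data, not of the tool.
    """
    parts = (toolname, refdata_version, filename)
    for part in parts:
        # Reject empty components and embedded slashes so the path keeps
        # exactly the four segments the convention describes.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/".join((REFDATA_PREFIX, *parts))


if __name__ == "__main__":
    print(refdata_path("bakta", "v6.0", "bakta_db.tar.gz"))
```

Note that the version segment (here v6.0) tracks the database release, so a new tool image can keep pointing at the same refdata path.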
Source of truth for what is actually registered: query the CTS API directly at GET /refdata/ and GET /images/{image_id}. The tables above describe intent; the API describes reality.
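A check against the API might look like the sketch below. Only the two endpoint paths come from this README; everything else is an assumption: the base URL is a placeholder, the helper names are hypothetical, the image ID is assumed to be the full image name and to be accepted in the URL path with its slashes and colon left unescaped, and any required authentication headers must be supplied by the caller.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request, urlopen


def cts_refdata_url(base_url: str) -> str:
    """URL for GET /refdata/ on a CTS deployment."""
    return base_url.rstrip("/") + "/refdata/"


def cts_image_url(base_url: str, image_id: str) -> str:
    """URL for GET /images/{image_id}.

    Assumption: '/' and ':' in the image ID are passed through unescaped.
    """
    return base_url.rstrip("/") + "/images/" + quote(image_id, safe="/:")


def fetch_json(url: str, headers: Optional[dict] = None):
    """GET a URL and decode the JSON body; auth headers are caller-supplied."""
    request = Request(url, headers=headers or {})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    base = "https://cts.example.org"  # placeholder, not a real deployment
    print(cts_refdata_url(base))
    print(cts_image_url(base, "ghcr.io/kbasetest/cdm_checkm2:0.3.0"))
```

The URL builders are kept separate from `fetch_json` so the request targets can be inspected without network access.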