80 commits
f7fc9d5
draft ncbi scripts
mattldawson Apr 16, 2026
fa5fa97
add local minio testing
mattldawson Apr 16, 2026
f4d838a
add NCBI end-to-end testing instructions
mattldawson Apr 17, 2026
5c6c547
debug and cleanup
mattldawson Apr 17, 2026
e5bc2f6
cleanup and formatting
mattldawson Apr 17, 2026
de8d325
formatting
mattldawson Apr 17, 2026
3f0e604
Potential fix for pull request finding 'Unused import'
mattldawson Apr 20, 2026
4f93d5f
Potential fix for pull request finding 'Unused import'
mattldawson Apr 20, 2026
399a00d
Potential fix for pull request finding
mattldawson Apr 20, 2026
fee2ab4
Potential fix for pull request finding
mattldawson Apr 20, 2026
5233d9d
Potential fix for pull request finding
mattldawson Apr 20, 2026
e5b58f2
Potential fix for pull request finding
mattldawson Apr 20, 2026
64ead4c
Potential fix for pull request finding
mattldawson Apr 20, 2026
9fba2c5
Potential fix for pull request finding
mattldawson Apr 20, 2026
c89ad37
address copilot comments
mattldawson Apr 20, 2026
9a5e2fe
increased timeout for trivy action
mattldawson Apr 20, 2026
de77b14
Potential fix for pull request finding 'Unused import'
mattldawson Apr 20, 2026
7d9ced1
Potential fix for pull request finding 'Unused import'
mattldawson Apr 20, 2026
16dfda3
Potential fix for pull request finding
mattldawson Apr 20, 2026
521978d
Potential fix for pull request finding
mattldawson Apr 20, 2026
8f73674
Potential fix for pull request finding
mattldawson Apr 20, 2026
1e8f363
add progress bar to manifest verification
mattldawson Apr 20, 2026
7326641
add synthetic assembly summary generation option
mattldawson Apr 20, 2026
4cf60d6
debug synthetic manifest
mattldawson Apr 20, 2026
9a1d1e5
add custom release date to synthetic manifest generation
mattldawson Apr 21, 2026
a4ee186
fix style
mattldawson Apr 21, 2026
e351288
Merge branch 'main' into develop-add-ncbi-ftp
mattldawson Apr 21, 2026
6e45879
update docs and progress bar
mattldawson Apr 21, 2026
adfda18
add checks for synthetic summary creation
mattldawson Apr 21, 2026
fdd67c4
add connection check to S3 store
mattldawson Apr 21, 2026
d4f868d
add s3 connection check
mattldawson Apr 21, 2026
55cbf9e
reduce progress bar update rate
mattldawson Apr 21, 2026
c1ee7e3
select database for synthetic summary
mattldawson Apr 21, 2026
76349c1
formatting
mattldawson Apr 22, 2026
3369e11
add frictionless descriptors
mattldawson Apr 22, 2026
21dda1c
Potential fix for pull request finding 'CodeQL / Incomplete URL subst…
mattldawson Apr 22, 2026
803d8c4
update url parsing test
mattldawson Apr 22, 2026
280bb63
Potential fix for pull request finding 'CodeQL / Incomplete URL subst…
mattldawson Apr 22, 2026
81a0821
Merge branch 'develop' into develop-add-ncbi-ftp
mattldawson Apr 23, 2026
09df496
merge file upload functions
mattldawson Apr 23, 2026
c56c3c9
formatting
mattldawson Apr 23, 2026
6d6415b
merge copy object functions
mattldawson Apr 23, 2026
bf0c01e
use tenacity
mattldawson Apr 23, 2026
de041dc
address reviewer comments
mattldawson Apr 23, 2026
0fefd29
remove defaults for minio test env vars
mattldawson Apr 23, 2026
f9bcaee
cleanup
mattldawson Apr 23, 2026
f88df1d
add docker compose integration test option and draft new action
mattldawson Apr 23, 2026
c9aa9ff
try again for the integration action
mattldawson Apr 23, 2026
36369ca
Potential fix for pull request finding 'CodeQL / Workflow does not co…
mattldawson Apr 23, 2026
f2f8a78
add labels to tests the send requests to the NCBI FTP server
mattldawson Apr 23, 2026
1bc43c7
Merge branch 'develop' into develop-add-ncbi-ftp
mattldawson Apr 27, 2026
6d923ce
address reviewer comments
mattldawson Apr 27, 2026
60d18f4
Merge branch 'develop' into develop-add-ncbi-ftp
mattldawson Apr 27, 2026
cc1042b
Potential fix for pull request finding 'Unused import'
mattldawson Apr 27, 2026
3216403
add notebook option for download phase
mattldawson Apr 28, 2026
3952655
formatting
mattldawson Apr 28, 2026
991b40e
Potential fix for pull request finding 'Unused import'
mattldawson Apr 28, 2026
359901f
Potential fix for pull request finding 'Unused import'
mattldawson Apr 28, 2026
51a3680
update notebooks
mattldawson Apr 29, 2026
8a41164
allow separate staging and destination buckets
mattldawson Apr 29, 2026
1da3d05
consolidate progress bars
mattldawson Apr 29, 2026
0d3aa68
consolidate download with staging and promotion with deletion from st…
mattldawson Apr 29, 2026
c823283
Potential fix for pull request finding 'Unused global variable'
mattldawson Apr 29, 2026
cf93013
Potential fix for pull request finding 'Unused global variable'
mattldawson Apr 29, 2026
2b45416
fix tests
mattldawson Apr 29, 2026
f3c0984
keep ftp connection alive
mattldawson Apr 29, 2026
b6f0bdd
Potential fix for pull request finding 'Unused import'
mattldawson Apr 29, 2026
5fcada9
smooth progress bars
mattldawson Apr 29, 2026
6b78cb0
add promote progress bar and reduce logger output on dry runs
mattldawson Apr 30, 2026
afcae5c
print removed manifest path
mattldawson Apr 30, 2026
2e8da0c
optimize array creation
mattldawson Apr 30, 2026
ae92916
remove checksum type from copy headers
mattldawson Apr 30, 2026
1a8f54a
use tags for metadata on copy to avoid boto errors
mattldawson Apr 30, 2026
fa3df73
remove tagging option on s3 copy
mattldawson Apr 30, 2026
4c9c1c8
parallelize copies and batch deletes
mattldawson Apr 30, 2026
409626d
optimize promotion step
mattldawson Apr 30, 2026
fdb7026
Potential fix for pull request finding 'Unused local variable'
mattldawson Apr 30, 2026
4d690a8
Potential fix for pull request finding 'Unused import'
mattldawson Apr 30, 2026
e305d74
Potential fix for pull request finding 'Unused import'
mattldawson Apr 30, 2026
c01dbdc
capture missing return values
mattldawson Apr 30, 2026
48 changes: 48 additions & 0 deletions .github/workflows/integration_tests.yaml
@@ -0,0 +1,48 @@
name: Integration tests
Collaborator

This is excellent -- thank you!

Collaborator Author

sure thing!


on:
  workflow_call:

  push:
    branches:
      - main
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - ready_for_review

permissions:
  contents: read

jobs:
  integration_tests:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v6

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build integration test image
        uses: docker/build-push-action@v6
        with:
          context: .
          load: true
          tags: cdm-data-loaders-integration-tests:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run integration tests
        run: |
          docker compose up \
            --no-build \
            --abort-on-container-exit \
            --exit-code-from integration-tests

      - name: Tear down
        if: always()
        run: docker compose down --volumes
github-advanced-security[bot] marked this conversation as resolved.
89 changes: 89 additions & 0 deletions README.md
@@ -10,6 +10,7 @@ Repo for CDM input data loading and wrangling
- [Development](#development)
- [Spark and other non-python dependencies](#spark-and-other-non-python-dependencies)
- [Tests](#tests)
- [Integration tests (MinIO + NCBI FTP)](#integration-tests-minio--ncbi-ftp)
- [Loading genomes, contigs, and features](#loading-genomes-contigs-and-features)
- [Running bbmap stats and checkm2 on genome or contigset files](#running-bbmap-stats-and-checkm2-on-genome-or-contigset-files)

@@ -70,6 +71,35 @@ uv run python -m ipykernel install --user --name cdm-data-loaders --display-name

The `cdm-data-loaders` kernel should now be available from the dropdown list of kernels in the Jupyter notebook interface.

#### Jupyter Kernel Environment Variables

A custom Jupyter kernel does not automatically inherit the environment variables that are set in the
default Lakehouse Jupyter environment. If your code needs them, first identify the variables and their
values, then add them to your new kernel configuration with the following steps:

Open a new Jupyter Notebook __with the default kernel__ and run this in a new cell:
```python
import os
for k, v in sorted(os.environ.items()):
    if "AWS" in k or "S3" in k or "MINIO" in k:  # replace with whatever keys you're interested in
        print(f"{k}={v}")
```
Copy the variables you need from the output into the `env` block of the `kernel.json` for your new kernel (e.g., in `cdm-data-loaders/.venv/share/jupyter/kernels/python3/kernel.json`; run `jupyter kernelspec list` to see where each kernel is installed):
```json
{
  "argv": ["..."],
  "display_name": "cdm-data-loaders",
  "language": "python",
  "env": {
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_DEFAULT_REGION": "...",
    ...
  }
}
```
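
The manual edit can also be scripted. The following is a minimal sketch, not part of this repository: the helper name `add_env_to_kernel` is hypothetical, and it assumes the kernel spec is a plain JSON file with an optional `env` object, as in the example above.

```python
import json
from pathlib import Path


def add_env_to_kernel(kernel_json: Path, env_vars: dict[str, str]) -> dict:
    """Merge env_vars into the "env" block of a Jupyter kernel.json file.

    Hypothetical helper for illustration; not provided by cdm-data-loaders.
    """
    spec = json.loads(kernel_json.read_text())
    # Keep variables that are already present; new values win on conflict.
    spec["env"] = {**spec.get("env", {}), **env_vars}
    kernel_json.write_text(json.dumps(spec, indent=2) + "\n")
    return spec
```

Call it from the default kernel with the path to your kernel spec and a dict of the variables you selected, for example `{k: v for k, v in os.environ.items() if "AWS" in k}`.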


## Running import pipelines

@@ -146,6 +176,65 @@ To generate coverage for the tests, run
The standard python `coverage` package is used and coverage can be generated as html or other formats by changing the parameters.


#### Integration tests (MinIO + NCBI FTP)

End-to-end integration tests for the NCBI assembly pipeline live in `tests/integration/`. They exercise the full flow — manifest diffing, FTP download, S3 promote/archive — against a locally running [MinIO](https://min.io/) container and the real NCBI FTP server.

**Requirements:**
- Docker (for MinIO)
- Network access to `ftp.ncbi.nlm.nih.gov`

**Running with Docker Compose (easiest)**

The [docker-compose.yml](docker-compose.yml) at the repo root defines both a MinIO service and the integration test runner. To build the image, start MinIO, and run the integration tests in one command:

```sh
docker compose up --build --abort-on-container-exit
```

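Compose streams test output to the terminal and stops all services once the test container exits. To make the command's exit status explicitly match the pytest exit code, add `--exit-code-from integration-tests`, as the CI workflow does. To clean up afterwards: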

```sh
docker compose down --volumes
```

**Running manually**

If you prefer to run the tests directly against a local MinIO instance (e.g. for faster iteration during development), follow the steps below.

**1. Start MinIO locally:**

```sh
docker run -d \
  --name minio \
  -p 9000:9000 \
  -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio:RELEASE.2025-02-28T09-55-16Z server /data --console-address ":9001"
```
```

**2. Run the integration tests:**

```sh
> uv run pytest tests/integration/ -m integration -v
```

Tests are automatically skipped when MinIO is not reachable, so the default `uv run pytest` will never fail due to a missing MinIO instance.
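
How the tests detect MinIO is not shown in this README. As a rough sketch of one way such a check could work, the hypothetical helper below probes MinIO's `/minio/health/live` endpoint (the same endpoint the Compose entrypoint polls) and treats any connection error as "not reachable":

```python
import urllib.error
import urllib.request


def minio_is_reachable(endpoint_url: str, timeout: float = 1.0) -> bool:
    """Return True if MinIO's liveness endpoint answers at endpoint_url.

    Hypothetical helper for illustration; the real tests may differ.
    """
    try:
        with urllib.request.urlopen(f"{endpoint_url}/minio/health/live", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

In a test module, the result could feed a skip condition such as `pytest.mark.skipif(not minio_is_reachable(url), reason="MinIO not reachable")`, where `url` is read from an environment variable like `MINIO_ENDPOINT_URL`.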

**3. Inspect results:**

Buckets are **not** cleaned up after tests. Browse the MinIO console at [http://localhost:9001](http://localhost:9001) (login: `minioadmin` / `minioadmin`) to inspect the final state of each test bucket. Each test method creates its own bucket (e.g. `integ-test-promote-dry-run`).
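
The bucket names appear to follow the test method names. A minimal sketch of such a mapping, assuming an `integ-` prefix and a hypothetical helper `bucket_name_for` (the actual naming code in `tests/integration/` may differ):

```python
import re


def bucket_name_for(test_name: str, prefix: str = "integ-") -> str:
    """Derive a bucket name from a test method name.

    Hypothetical helper for illustration; not taken from the test suite.
    """
    # This sketch keeps only lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-")
    # S3 bucket names are limited to 63 characters.
    return (prefix + slug)[:63].rstrip("-")
```

Under these assumptions, a test method named `test_promote_dry_run` maps to the bucket `integ-test-promote-dry-run`.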

**4. Stop MinIO when done:**

```sh
docker stop minio && docker rm minio
```

> **Note:** These tests download real assemblies from NCBI FTP and are inherently slow (~30–60s per assembly). They are also marked `slow_test` so you can exclude them independently: `uv run pytest -m "not slow_test"`.


## Loading genomes, contigs, and features

The [genome loader](src/cdm_data_loaders/parsers/genome_loader.py) can be used to load and integrate data from related GFF and FASTA files. Currently, the loader requires a GFF file and two FASTA files (one for amino acid seqs, one for nucleic acid seqs) for each genome. The list of files to be processed should be specified in the genome paths file, which has the following format:
44 changes: 44 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,44 @@
services:
  minio:
    image: quay.io/minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    healthcheck:
      test: [ "CMD", "mc", "ready", "local" ]
      interval: 5s
      timeout: 5s
      retries: 5

  integration-tests:
    image: cdm-data-loaders-integration-tests:latest
    build:
      context: .
    depends_on:
      minio:
        condition: service_healthy
    environment:
      MINIO_ENDPOINT_URL: http://minio:9000
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    entrypoint:
      - /bin/sh
      - -c
      - |
        attempts=0
        until python3 -c "
        import urllib.request, os
        urllib.request.urlopen(os.environ['MINIO_ENDPOINT_URL'] + '/minio/health/live', timeout=1)
        " 2>/dev/null; do
          attempts=$$((attempts + 1))
          if [ "$$attempts" -ge 30 ]; then
            echo 'Timed out waiting for MinIO.' && exit 1
          fi
          echo 'Waiting for MinIO...' && sleep 1
        done
        exec /app/scripts/entrypoint.sh integration-test
    command: []
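
The entrypoint's wait loop (up to 30 checks, one second apart) can be expressed in Python as well. This is only an illustrative sketch; `wait_until` is a hypothetical name and is not part of this PR:

```python
import time
from collections.abc import Callable


def wait_until(check: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Call check() until it returns True, giving up after `attempts` tries.

    Hypothetical helper mirroring the shell loop in the entrypoint.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Sleep between tries, but not after the final failed one.
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Passing a function that probes `MINIO_ENDPOINT_URL + "/minio/health/live"` as `check` would give the same behaviour as the shell version: return as soon as MinIO answers, or report failure after 30 tries.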