Merged
20 commits
8c2ae28
Add in pattern file for easier loading specification
ialarmedalien Apr 21, 2026
9f66a51
alter command line letter
ialarmedalien Apr 21, 2026
6e1c58b
Updating ATB to add longer paths when downloading files
ialarmedalien Apr 21, 2026
5d65660
Bump lxml from 6.0.4 to 6.1.0 in the uv group across 1 directory
dependabot[bot] Apr 21, 2026
2cf0cf5
More updates for atb pipeline: upload files directly to s3 instead of…
ialarmedalien Apr 23, 2026
ff9c14f
minor tweaking
ialarmedalien Apr 23, 2026
7d592a1
Few more bits of tinkering to add in extra logging, etc., where needed.
ialarmedalien Apr 23, 2026
51e89dd
fix accidentally broken tests
ialarmedalien Apr 23, 2026
32cd8ab
Merge pull request #138 from kbase/pattern_file
ialarmedalien Apr 23, 2026
a98fdb9
Merge branch 'develop' into dependabot/uv/uv-2fce4a7a35
ialarmedalien Apr 23, 2026
c480350
Merge pull request #137 from kbase/dependabot/uv/uv-2fce4a7a35
ialarmedalien Apr 23, 2026
77c49b0
add file handler logging ext
ialarmedalien Apr 25, 2026
0cbe4be
Logging fix...?
ialarmedalien Apr 25, 2026
17f3b35
Turn down VCR logger
ialarmedalien Apr 25, 2026
9414497
Merge pull request #139 from kbase/file_logger
ialarmedalien Apr 25, 2026
4d13eeb
Set Trivy scan timeout to 15 minutes
ialarmedalien Apr 25, 2026
0702832
Merge pull request #140 from kbase/privy-timeout-fix
ialarmedalien Apr 25, 2026
e379dcb
Version bump and a couple of minor fixes
ialarmedalien Apr 25, 2026
d922205
Merge pull request #141 from kbase/version_bump
ialarmedalien Apr 25, 2026
21d3a6f
Merge branch 'main' into develop
ialarmedalien Apr 25, 2026
1 change: 1 addition & 0 deletions .github/workflows/trivy.yaml
@@ -49,6 +49,7 @@ jobs:
           template: "@/contrib/sarif.tpl"
           output: "trivy-results.sarif"
           severity: "CRITICAL,HIGH"
+          timeout: 15m

       - name: Upload Trivy scan results to GitHub Security tab
         uses: github/codeql-action/upload-sarif@v3
52 changes: 52 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,52 @@
+# CDM Data Loaders Changelog
+
+- [CDM Data Loaders Changelog](#cdm-data-loaders-changelog)
+  - [v0.1.8](#v018)
+  - [v0.1.7](#v017)
+  - [v0.1.6](#v016)
+  - [v0.1.5](#v015)
+  - [v0.1.4](#v014)
+  - [v0.1.3](#v013)
+  - [v0.1.2](#v012)
+  - [v0.1.1](#v011)
+  - [v0.1.0](#v010)
+
+
+### v0.1.8
+
+- Add rotating file log handler for easier debugging.
+
+### v0.1.7
+
+- Add in AllTheBacteria file download client.
+
+### v0.1.6
+
+- Make NCBI REST API client more resilient to errors and ensure existing imports are not lost.
+
+### v0.1.5
+
+- Add batch size parameter to the NCBI REST API interface.
+
+
+### v0.1.4
+
+- Add in NCBI REST API interface.
+
+
+### v0.1.3
+
+- Add in file batcher for use with file-based importers.
+
+
+### v0.1.2
+
+- Update XML File Splitter to use the latest version, which includes the `gzip` parameter.
+
+### v0.1.1
+
+- Add [XML File Splitter](https://github.com/ialarmedalien/xml_file_splitter) to the container.
+
+### v0.1.0
+
+- Initial release.
4 changes: 2 additions & 2 deletions Dockerfile
@@ -54,7 +54,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 # Place executables in the environment at the front of the path
 ENV PATH="/app/.venv/bin:$PATH"

-COPY --chmod=+x ./scripts/entrypoint.sh /app/
+RUN chmod +x ./scripts/entrypoint.sh
 # Use the non-root user to run our application
 USER nonroot
-ENTRYPOINT ["./entrypoint.sh"]
+ENTRYPOINT ["./scripts/entrypoint.sh"]
47 changes: 0 additions & 47 deletions README.md
@@ -11,15 +11,6 @@ Repo for CDM input data loading and wrangling
 - [Tests](#tests)
 - [Loading genomes, contigs, and features](#loading-genomes-contigs-and-features)
 - [Running bbmap stats and checkm2 on genome or contigset files](#running-bbmap-stats-and-checkm2-on-genome-or-contigset-files)
-- [Changelog](#changelog)
-- [v0.1.7](#v017)
-- [v0.1.6](#v016)
-- [v0.1.5](#v015)
-- [v0.1.4](#v014)
-- [v0.1.3](#v013)
-- [v0.1.2](#v012)
-- [v0.1.1](#v011)
-- [v0.1.0](#v010)



@@ -168,41 +159,3 @@ Run the stats and checkm2 tools with the following command:
bash scripts/run_tools.sh path/to/genome_paths_file.json output_dir
```
where `path/to/genome_paths_file.json` specifies the path to the genome paths file (format specified above) and `output_dir` is the directory for the results.
-
-
-## Changelog
-
-### v0.1.7
-
-- Add in AllTheBacteria file download client.
-
-### v0.1.6
-
-- Make NCBI REST API client more resilient to errors and ensure existing imports are not lost.
-
-### v0.1.5
-
-- Add batch size parameter to the NCBI REST API interface.
-
-
-### v0.1.4
-
-- Add in NCBI REST API interface.
-
-
-### v0.1.3
-
-- Add in file batcher for use with file-based importers.
-
-
-### v0.1.2
-
-- Update XML File Splitter to use the latest version, which includes the `gzip` parameter.
-
-### v0.1.1
-
-- Add [XML File Splitter](https://github.com/ialarmedalien/xml_file_splitter) to the container.
-
-### v0.1.0
-
-- Initial release.
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "cdm-data-loaders"
-version = "0.1.7"
+version = "0.1.8"
 description = "Data loaders and wranglers for the CDM."
 requires-python = ">= 3.13"
 readme = "README.md"
@@ -17,7 +17,7 @@ dependencies = [
     "dlt[deltalake,duckdb,filesystem,parquet]>=1.22.2",
     "frictionless[aws]>=5.18.1",
     "frozendict>=2.4.7",
-    "lxml>=6.0.2",
+    "lxml>=6.1.0",
     "pydantic>=2.12.5",
     "pydantic-settings>=2.12.0",
     "tqdm>=4.67.3",
20 changes: 11 additions & 9 deletions scripts/entrypoint.sh
@@ -1,9 +1,16 @@
 #!/usr/bin/env bash
 set -euo pipefail

-# Ensure at least one argument is provided
+VALID_COMMANDS=(all_the_bacteria ncbi_rest_api uniprot uniref xml_split test bash)
+
+usage() {
+    local joined
+    joined=$(IFS='|'; echo "${VALID_COMMANDS[*]}")
+    echo "Usage: $0 {${joined}} [args...]" >&2
+}
+
 if [ "$#" -eq 0 ]; then
-    echo "Usage: $0 {all_the_bacteria|ncbi_rest_api|uniprot|uniref|xml_split|test} [args...]"
+    usage
     exit 1
 fi

@@ -12,34 +19,29 @@ shift

 case "$cmd" in
     all_the_bacteria)
-        # All the Bacteria file importer
         exec /usr/bin/tini -- uv run --no-sync all_the_bacteria "$@"
         ;;
     ncbi_rest_api)
-        # Run the NCBI datasets API importer
         exec /usr/bin/tini -- uv run --no-sync ncbi_rest_api "$@"
         ;;
     uniprot)
-        # Run the uniprot pipeline with any additional arguments
         exec /usr/bin/tini -- uv run --no-sync uniprot "$@"
         ;;
     uniref)
-        # Run the uniref pipeline with any additional arguments
         exec /usr/bin/tini -- uv run --no-sync uniref "$@"
         ;;
     xml_split)
-        # Run the xml_file_splitter app
         exec /usr/bin/tini -- xml_file_splitter "$@"
         ;;
     test)
-        # run the tests
         exec /usr/bin/tini -- uv run --no-sync pytest -m "not requires_spark"
         ;;
     bash)
         exec /usr/bin/tini -- /bin/bash
         ;;
     *)
-        echo "Error: unknown command '$cmd'; valid commands are 'all_the_bacteria', 'ncbi_rest_api', 'uniprot', 'uniref', or 'xml_split'." >&2
+        echo "Error: unknown command '$cmd'." >&2
+        usage
         exit 1
         ;;
 esac
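
Note on the `usage()` helper in `scripts/entrypoint.sh`: it builds the command list by setting `IFS` inside a command substitution and expanding the array with `[*]`. The sketch below isolates that idiom so it can be run on its own. The array contents (`alpha beta gamma`), the program name `demo`, and the function names `join_commands` and `usage_line` are hypothetical and are not part of this PR.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical command list for illustration only; the real list is
# VALID_COMMANDS in scripts/entrypoint.sh.
COMMANDS=(alpha beta gamma)

# Join the array elements with '|'.
# "${COMMANDS[*]}" expands to one word whose elements are separated by the
# first character of IFS. Assigning IFS inside $( ... ) keeps the change
# local to that subshell, so the rest of the script still sees the default IFS.
join_commands() {
    local joined
    joined=$(IFS='|'; echo "${COMMANDS[*]}")
    echo "$joined"
}

# Build a usage line in the same shape as the one in the entrypoint script.
usage_line() {
    echo "Usage: demo {$(join_commands)} [args...]"
}

usage_line
# prints: Usage: demo {alpha|beta|gamma} [args...]
```

Deriving the usage text from a single array means a newly added command only has to be listed once, which is presumably why the hard-coded list in the old error message was dropped.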