Add 'only run new IDs' behaviour by ainefairbrother · Pull Request #1200 · Ensembl/ensembl-variation

ainefairbrother · 2026-01-12T18:00:31Z

#1188 needs to be merged first.

Aim

This PR adds behaviour that takes in a list of URN IDs. Any IDs that have been run before are skipped by the pipeline, which then merges the previous output file with the current output file.
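The skip-then-merge idea described above can be sketched in plain shell. This is only an illustration under stated assumptions, not the code in this PR: `new_urns` and `merge_outputs` are hypothetical helper names, URN files are assumed to hold one ID per line, and both output files are assumed to be gzipped TSVs with an identical header row. The pipeline's actual planning and merge logic may differ.

```shell
# Hypothetical helper: print IDs from the current URN file ($2) that do not
# appear in the previous URN file ($1). Input order is kept, blank lines and
# repeated IDs are dropped.
new_urns() {
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
         $0 != "" && !seen[$0]++' "$1" "$2"
}

# Hypothetical helper: concatenate two gzipped TSVs into a third, keeping the
# header from the previous file ($1) and dropping it from the current one ($2).
# Assumes both files have the same columns in the same order.
merge_outputs() (
    previous="$1"; current="$2"; merged="$3"
    {
        gzip -dc "$previous"
        gzip -dc "$current" | tail -n +2
    } | gzip -c > "$merged"
)

# Example usage (file names are placeholders):
#   new_urns previous_urns.txt current_urns.txt > to_run.txt
#   merge_outputs previous.tsv.gz current.tsv.gz merged.tsv.gz
```

Comparing the URN lists rather than the output files means an ID that was requested previously but produced no rows is still treated as already run.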

Testing

Input files for this test can be found in /hps/nobackup/flicek/ensembl/variation/fairbrot/MaveDB/testing. The script below assumes it is run from this directory.

Test script:

source ~/.bashrc
module load nextflow/24.04.3
export NXF_JVM_ARGS="-Xms5g -Xmx60g"
export NXF_SINGULARITY_CACHEDIR="work/singularity"

# generate mavedb output for URN IDs in urn_1.txt, yielding MaveDB_test_1.tsv.gz
# urn_1.txt contains one real URN ID and one fake one, to demonstrate that the pipeline uses the contents of
# the file supplied to --urn to determine which IDs to skip
nextflow run ./ensembl-variation/nextflow/MaveDB/main.nf \
    -profile slurm \
    -with-report reports/report.html \
    --registry registry.pm \
    --from-files true \
    --mappings_path mappings \
    --scores_path csv \
    --metadata_file main.json \
    --licences CC0 \
    --urn urn_1.txt \
    --output MaveDB_test_1.tsv.gz \
    --ensembl /hps/software/users/ensembl/variation/fairbrot/

# check the URN IDs in the output file
gzip -dc MaveDB_test_1.tsv.gz \
    | awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)if($i=="urn"){c=i;break};if(!c)exit 1;next}{if($c&&$c!="urn")print $c}' \
    | sort -u

# Now test the new behaviour: supplying --previous_urn and --previous_output means that the pipeline first compares
# the IDs in the previous URN file (urn_1.txt) with those in the current file (urn_2.txt) and only runs the new IDs.
# The pipeline then merges the previous output file (MaveDB_test_1.tsv.gz) with the current output file
# (MaveDB_test_2.tsv.gz), giving a final output file with all of the IDs from urn_1.txt and urn_2.txt,
# minus any that the pipeline skips
nextflow run ./ensembl-variation/nextflow/MaveDB/main.nf \
    -profile slurm \
    -with-report reports/report.html \
    --registry registry.pm \
    --from-files true \
    --mappings_path mappings \
    --scores_path csv \
    --metadata_file main.json \
    --licences CC0 \
    --urn urn_2.txt \
    --output MaveDB_test_2.tsv.gz \
    --previous_urn urn_1.txt \
    --previous_output MaveDB_test_1.tsv.gz \
    --ensembl <user_dir>

# check the URN IDs in the output file
gzip -dc MaveDB_test_2.tsv.gz \
    | awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)if($i=="urn"){c=i;break};if(!c)exit 1;next}{if($c&&$c!="urn")print $c}' \
    | sort -u
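
The two manual checks above could also be automated. The following is a minimal sketch, not part of this PR: `urns_in_output` wraps the same header-lookup awk used in the test script, and `missing_urns` (a hypothetical helper) lists IDs that were requested in one or more URN files but are absent from an output file. It assumes one URN per line in the URN files and a `urn` column in the gzipped TSV. IDs that the pipeline legitimately drops, such as the fake ID in urn_1.txt, will be listed too, so a non-empty result is not necessarily a failure.

```shell
# Print the unique values of the "urn" column of a gzipped TSV ($1).
urns_in_output() {
    gzip -dc "$1" \
        | awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)if($i=="urn"){c=i;break};if(!c)exit 1;next}{if($c&&$c!="urn")print $c}' \
        | sort -u
}

# Hypothetical helper: print IDs listed in the URN files ($2...) that do not
# occur in the output file ($1). At least one URN file must be given.
missing_urns() (
    output="$1"; shift
    seen=$(mktemp) || exit 1
    urns_in_output "$output" > "$seen"
    # Load the IDs found in the output, then print each requested ID that is
    # not among them (once, skipping blank lines).
    awk -v seen="$seen" '
        BEGIN { while ((getline id < seen) > 0) have[id] = 1 }
        $0 != "" && !have[$0]++
    ' "$@"
    rm -f "$seen"
)

# Example usage with the files from the test above:
#   missing_urns MaveDB_test_2.tsv.gz urn_1.txt urn_2.txt
```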

feat: add 'only run new URN IDs' feature

bdd7e72

ainefairbrother changed the title ~~Add only run new IDs behaviour~~ Add 'only run new IDs' behaviour Jan 13, 2026

ainefairbrother added 3 commits March 13, 2026 11:59

Merge branch 'main' of github.com:Ensembl/ensembl-variation into only-run-new-ids

acfeb14

fix: bug fixes

e5f1104

feat: create planning module

335a939

ainefairbrother marked this pull request as ready for review March 20, 2026 17:11

ainefairbrother wants to merge 4 commits into Ensembl:main from ainefairbrother:only-run-new-ids
