Add 'only run new IDs' behaviour #1200

Open
ainefairbrother wants to merge 4 commits into Ensembl:main from ainefairbrother:only-run-new-ids

@ainefairbrother ainefairbrother commented Jan 12, 2026

#1188 needs to be merged first.

Aim

This PR adds behaviour that takes in a list of URN IDs from a previous run (--previous_urn) together with that run's output file (--previous_output). IDs that have been run before are skipped, and the pipeline then merges the previous output file with the current output file.
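As a rough illustration of the skip step, the set of IDs left to run is the current URN list minus the previous one. The sketch below assumes one URN ID per line in each file; `new_urns` is a hypothetical helper name, not something taken from the pipeline, and the pipeline's own implementation may differ.

```shell
# Hypothetical helper (not part of the pipeline): print the IDs that appear in
# the current URN file but not in the previous one, keeping the current order.
# Assumes one URN ID per line in both files.
new_urns() {
    # $1 = previous URN file, $2 = current URN file
    # Lines from the first file are remembered; lines from the second file are
    # printed only if they were not seen in the first.
    awk 'FILENAME == ARGV[1] { seen[$0]; next } !($0 in seen)' "$1" "$2"
}

# Example call with the file names used in the test script:
# new_urns urn_1.txt urn_2.txt
```

Matching on `FILENAME` rather than the common `NR == FNR` idiom keeps the result correct when the previous file is empty, in which case every current ID is printed.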

Testing

Input files for this test can be found in /hps/nobackup/flicek/ensembl/variation/fairbrot/MaveDB/testing. The script below assumes it is run from that directory.

Test script:

source ~/.bashrc
module load nextflow/24.04.3
export NXF_JVM_ARGS="-Xms5g -Xmx60g"
export NXF_SINGULARITY_CACHEDIR="work/singularity"

# Generate MaveDB output for the URN IDs in urn_1.txt, yielding MaveDB_test_1.tsv.gz.
# urn_1.txt contains one real URN ID and one fake one. This demonstrates that the pipeline uses the contents
# of the file supplied to --urn to determine which IDs to skip.
nextflow run ./ensembl-variation/nextflow/MaveDB/main.nf \
    -profile slurm \
    -with-report reports/report.html \
    --registry registry.pm \
    --from-files true \
    --mappings_path mappings \
    --scores_path csv \
    --metadata_file main.json \
    --licences CC0 \
    --urn urn_1.txt \
    --output MaveDB_test_1.tsv.gz \
    --ensembl /hps/software/users/ensembl/variation/fairbrot/

# check the URN IDs in the output file
gzip -dc MaveDB_test_1.tsv.gz \
    | awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)if($i=="urn"){c=i;break};if(!c)exit 1;next}{if($c&&$c!="urn")print $c}' \
    | sort -u

# Now test the new behaviour. Supplying --previous_urn and --previous_output makes the pipeline compare the IDs in
# the previous URN file (urn_1.txt) with those in the current file (urn_2.txt) and run only the new IDs.
# The pipeline then takes the previous output file (MaveDB_test_1.tsv.gz) and the current output file
# (MaveDB_test_2.tsv.gz) and merges them, resulting in a final output file with all of the IDs from urn_1.txt
# and urn_2.txt, minus any that the pipeline skips.
nextflow run ./ensembl-variation/nextflow/MaveDB/main.nf \
    -profile slurm \
    -with-report reports/report.html \
    --registry registry.pm \
    --from-files true \
    --mappings_path mappings \
    --scores_path csv \
    --metadata_file main.json \
    --licences CC0 \
    --urn urn_2.txt \
    --output MaveDB_test_2.tsv.gz \
    --previous_urn urn_1.txt \
    --previous_output MaveDB_test_1.tsv.gz \
    --ensembl <user_dir>

# check the URN IDs in the output file
gzip -dc MaveDB_test_2.tsv.gz \
    | awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)if($i=="urn"){c=i;break};if(!c)exit 1;next}{if($c&&$c!="urn")print $c}' \
    | sort -u
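To cross-check the merged result, one option is to build a reference merge by hand and compare its URN list with the pipeline's. The sketch below assumes both gzipped TSV files share an identical header line and that no de-duplication or re-sorting is needed. `merge_tsv_gz` is a hypothetical helper name, and the pipeline's actual merge logic may differ.

```shell
# Hypothetical helper (not part of the pipeline): concatenate two gzipped TSV
# files into one, keeping a single header line.
# Assumes both inputs start with the same header row.
merge_tsv_gz() {
    # $1 = previous output, $2 = current output, $3 = merged output to write
    {
        gzip -dc "$1"                # header plus all rows from the previous output
        gzip -dc "$2" | tail -n +2   # rows from the current output, header dropped
    } | gzip -c > "$3"
}

# Example call with hypothetical input and output names:
# merge_tsv_gz previous.tsv.gz current.tsv.gz reference_merge.tsv.gz
```

Running the URN-listing command from the test script on the reference file and on the pipeline's output should then give the same sorted ID list, provided the assumptions above hold.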

@ainefairbrother ainefairbrother changed the title from "Add only run new IDs behaviour" to "Add 'only run new IDs' behaviour" Jan 13, 2026
@ainefairbrother ainefairbrother marked this pull request as ready for review March 20, 2026 17:11