Add Workflow Management Fundamentals side quest #679
base: master
This side quest demonstrates the benefits of workflow management systems by having users progressively build a bash script, experience its limitations firsthand, then see how Nextflow solves each problem.

Key features:
- Progressive hands-on approach: users build bash scripts incrementally
- Experience problems before seeing solutions (sequential processing, resource conflicts, crashes, environment issues, no visibility)
- Clear before/after comparisons with concrete time savings
- Interactive exercises converting bash to Nextflow
- Uses bacterial genome analysis as single exemplar throughout
- Follows groovy_essentials conventions (tabs, console examples, etc.)

Files added:
- docs/side_quests/workflow_management_fundamentals.md (main tutorial)
- side-quests/workflow_management_fundamentals/ (working examples)
  - process_samples.sh (bash script version)
  - main.nf (Nextflow workflow)
  - modules/ (modular process definitions)
  - data/samples.csv (sample metadata)
  - nextflow.config (configuration examples)

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
✅ Deploy Preview for nextflow-training ready!
- Move solution files to solutions/ subdirectory so learners start fresh
- Add placeholder FASTQ files in data/reads/ (referenced in samples.csv)
- Add Exercise 5 for QUAST module (was used but undocumented)
- Fix hl_lines in Exercise 3: "3 17" → "4 20" (highlight FASTP additions)
- Fix hl_lines in Exercise 4: "4 21" → "5 22" (highlight SPADES additions)
- Update tree output to show actual directory structure
- Add tip about solutions directory for stuck learners
- Update README with new structure and usage instructions
- Remove extra emit:graph from spades.nf to match docs
- Remove singularity.enabled=false from config to match docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Nextflow linting complete! ❌ 4 files had 21 errors
- Replace placeholder FASTQ files with valid 3-read test data
- Convert FASTQC module to generate mock HTML/ZIP reports
- Convert FASTP module to pass-through reads with mock QC reports
- Convert SPADES module to generate mock assembly contigs
- Convert QUAST module to generate mock quality reports
- Update nextflow.config to disable Docker by default for mocks
- Add Docker profile for running with real bioinformatics containers

The tutorial now executes end-to-end without requiring real bioinformatics tools, while still demonstrating all Nextflow workflow management concepts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major restructure to match standard tutorial format:
- Add starter files (main.nf, fastqc.nf, nextflow.config, process_one.sh)
- Rewrite documentation with Before/After code blocks
- Remove "Your Task/Solution" exercise format
- Add progressive hands-on sections:
  1. Bash script problem demonstration
  2. Nextflow introduction with starter workflow
  3. Add FASTP, SPADES, QUAST incrementally
  4. Configuration profiles
- All code is executable with mock bioinformatics processes
- Solutions directory contains completed versions

Tested walkthrough end-to-end with verified caching and parallelization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Focus on persuading users WHY workflow management matters, not teaching Nextflow syntax.

Tutorial now covers six key problems:
1. Sequential Processing → Automatic parallelization
2. No Resume After Failure → Built-in caching
3. "Works on My Machine" → Per-process containers
4. Scaling Requires Rewriting → Declarative scaling
5. No Audit Trail → Automatic provenance
6. Tied to One Environment → Configuration profiles

Changes:
- Reorganize docs around problem-solution pairs
- Create bash_pipeline.sh that demonstrates limitations
- Create nextflow_pipeline/ with complete workflow
- Add container directives to show software isolation
- Add profiles for laptop/slurm/aws/gcp execution
- Remove hands-on "build from scratch" approach
- Position Nextflow as example of workflow management tools

Tested both pipelines:
- Bash: ~34 seconds sequential
- Nextflow: ~12 seconds parallel, all 12 tasks cache on resume

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add "Think about it" sections before each solution to prompt critical thinking
- Add Section 3.6 on Effortless Collaboration (sharing workflows, nf-core)
- Inspired by Seqera Academy teaching approaches

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace mock bacterial genome assembly with real RNA-seq workflow
- Use nf-core test-datasets (GSE110004 yeast study) for real data
- Workflow: FastQC → fastp → Salmon → MultiQC
- Bash script runs actual tools (requires conda installation)
- Nextflow runs same tools in containers (no installation needed)
- Add installation instructions for bash script prerequisites
- Update documentation to reflect RNA-seq scenario
- Add reflection questions and collaboration sections

Co-Authored-By: Claude <noreply@anthropic.com>
The key difference is not bash=conda vs nextflow=docker, but rather:
- Bash: YOU manage environments (install, activate, document, troubleshoot)
- Nextflow: WORKFLOW manages environments declaratively (just run it)

Co-Authored-By: Claude <noreply@anthropic.com>
Testing identified two critical bugs that are now fixed:
- Bash parallelization: tail | while creates subshell preventing wait from tracking background jobs. Fixed with process substitution.
- Salmon index: queue channel consumed after first sample. Fixed with .first() to convert to value channel that broadcasts.

Added new content emphasizing key workflow management benefits:
- Separation of concerns (process definitions vs workflow logic)
- Automatic task retries with dynamic resource allocation
- Cluster/cloud execution without code changes
- Transparent remote file handling (URLs as local files)
- Per-process resource management (cpus, memory, maxForks)

Restructured directory layout:
- bash/ for learner bash script progression
- nextflow/ for learner Nextflow pipeline
- solutions/ for reference implementations

Simplified solution to match tutorial content (removed MultiQC).
Updated all summary tables throughout documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
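The bash parallelization bug described above can be reproduced with a minimal sketch. This is not the tutorial's actual script: the sample sheet contents, the `sleep`-based stand-in for real work, and the variable names are assumptions made for illustration, and it assumes bash rather than a plain POSIX `sh`.

```bash
#!/usr/bin/env bash
# Sketch: why `tail | while` hides background jobs from `wait`,
# and how process substitution avoids that. All names are illustrative.
set -u

workdir=$(mktemp -d)
printf 'sample,fastq\nA,a.fq\nB,b.fq\nC,c.fq\n' > "$workdir/samples.csv"

# Variant 1: the loop is the last element of a pipeline, so bash runs it
# in a subshell. The background jobs belong to that subshell, and the
# `wait` in the parent shell has no children to wait for.
tail -n +2 "$workdir/samples.csv" | while IFS=, read -r sample fastq; do
    (sleep 1; touch "$workdir/piped_${sample}.done") &
done
wait
piped_done=$(find "$workdir" -name 'piped_*.done' | wc -l | tr -d ' ')

# Variant 2: process substitution feeds the loop through a redirect, so
# the loop runs in the current shell and `wait` sees every background job.
while IFS=, read -r sample fastq; do
    (sleep 1; touch "$workdir/procsub_${sample}.done") &
done < <(tail -n +2 "$workdir/samples.csv")
wait
procsub_done=$(find "$workdir" -name 'procsub_*.done' | wc -l | tr -d ' ')

echo "finished when wait returned (pipe):                 $piped_done"
echo "finished when wait returned (process substitution): $procsub_done"

rm -rf "$workdir"
```

With default bash settings, the first count is taken before any of the piped jobs finish, while the second reflects all three samples. A script that relies on the first pattern can move on to downstream steps while its inputs are still being written.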
Restructured to match training conventions:
- ## 1. Building a Bash Pipeline (Part 1 container)
- ### 1.1. through ### 1.10. for sections
- ## 2. Building a Nextflow Pipeline (Part 2 container)
- ### 2.1. through ### 2.8. for sections
- Sub-subsections converted to #### (unnumbered)
- Fixed code block comment that triggered false positive

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert documentation to use tabbed Before/After blocks for each change
- Create starter files with TODO placeholders (bash) and ??? (Nextflow)
- Learners only write code for learning outcomes (tool commands, I/O, wiring)
- Provide all boilerplate (script structure, process headers, loop structure)
- Fix hl_lines attributes to correctly highlight changed lines
- Same granularity for both bash and Nextflow sections

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The solution files use relative paths (../data/samples.csv) which require a data directory to exist. This symlink points to the shared data directory so solutions can be run without extra parameters.

Tested with Docker:
- Nextflow solution: All 10 tasks complete (UNTAR, FASTQC, FASTP, SALMON_QUANT)
- Bash starters: Valid syntax verified
- Nextflow starters: Expected syntax errors from ??? placeholders

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
After tab now appears first, consistent with other training materials. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Explain what learners are doing and why before each code block
- Add "Understanding the Starter Script" section for bash part
- Explain val() vs path(), emit:, and .first() in Nextflow sections
- Connect each step to learning objectives
- Clarify the two-input pattern for Salmon (per-sample + shared reference)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bash sections:
- Emphasize limitations without forward references to Nextflow
- Stress reproducibility issues, resource limits, cluster rewrites
- Let the problems speak for themselves

Nextflow sections:
- Explicitly contrast with bash pain points
- "Remember in bash...? Here's how Nextflow handles it"
- Connect each feature to a specific bash limitation solved

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 7 production-quality standards upfront as bullet points
- Clarify this applies to all scripting approaches, not just bash
- Emphasize software management is about per-task isolation, not just version locking (conda installs everything in one env)
- Add note explaining bash is used as example but issues apply to Python, R, etc.
- Update summaries with scorecard bullets referencing standards
- Change "Contrast with bash" tips to "Contrast with scripts"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Scripts hit resource limits on single machine and require complete rewrite for cluster/cloud. Nextflow manages local resources and scales to distributed infrastructure with config change. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
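To make the single-machine resource limit concrete, here is a rough sketch of the throttling a script author has to write by hand. The concurrency limit, the sample names, and the `sleep`-based placeholder for real work are all hypothetical and are not taken from the tutorial's scripts.

```bash
#!/usr/bin/env bash
# Sketch: hand-rolled throttling of background jobs in fixed-size batches.
# max_jobs is a number the author has to pick and keep correct per machine.
set -u

max_jobs=2
samples=(A B C D E)
workdir=$(mktemp -d)

running=0
batches=0
for sample in "${samples[@]}"; do
    # Placeholder for a real per-sample command.
    (sleep 0.2; echo "$sample" > "$workdir/$sample.out") &
    running=$((running + 1))

    # Once the batch is full, block until every job in it has finished.
    if [ "$running" -ge "$max_jobs" ]; then
        wait
        running=0
        batches=$((batches + 1))
    fi
done

# Wait for the final, possibly partial, batch.
wait
if [ "$running" -gt 0 ]; then
    batches=$((batches + 1))
fi

completed=$(find "$workdir" -name '*.out' | wc -l | tr -d ' ')
echo "completed $completed samples in $batches batches"

rm -rf "$workdir"
```

Batching like this keeps the job count bounded, but a slow sample holds up its whole batch, and nothing here accounts for memory or for moving the same work onto a cluster scheduler.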
Direct readers to Hello Nextflow for thorough introduction. This side quest focuses on illustrating workflow management benefits rather than teaching Nextflow in depth. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Tie back to opening question about why learn a framework - Summarize key insight: scripts mix science with infrastructure - Point to Hello Nextflow for proper learning - Link to docs, nf-core, and Seqera Platform for further exploration Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Skills transfer between projects/organizations when using common frameworks. New team members contribute immediately instead of learning homegrown solutions. Community shares patterns and tools. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Frame as workflow managers generally being established tools with communities and transferable concepts, not assuming everyone uses the same specific framework. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
adamrtalbot
left a comment
Great start! I have some comments:
- reduce the gushing over workflows a bit, show don't tell
- reduce the rate of hyphens.
- we shouldn't over explain how a Nextflow process works. Just highlight the features (inputs, outputs, script, software packaging). If a user wants to learn how to use Nextflow, they should move to hello-nextflow afterwards.
- **Efficient parallelization** - Independent tasks run simultaneously, so analysis completes in hours, not days.
- **Resource awareness** - Respects memory and CPU limits. No crashed jobs or killed processes.
- **Failure recovery** - Can resume from where it stopped. A single failure doesn't waste hours of completed work.
- **Portability** - Runs on laptop, cluster, or cloud with the same code.
Polyglot?
#### 1.3.4. Add Salmon Index Download

Salmon needs a pre-built index of the reference transcriptome. We'll download a pre-built index (to save time) only if it doesn't already exist. This avoids re-downloading for every sample.
There's possibly a lesson on conditional logic here, which is easier in a workflow language.
Well, ish. It may be trickier conceptually in Nextflow, but it's more robust (detecting the directory vs the object existing).
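One way to read the fragility point in this thread is sketched below. This is an interpretation, not the tutorial's code: `fetch_index` is a hypothetical stand-in for the real download-and-extract step, and the `.complete` marker file is an assumed convention chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: a directory-exists check versus a completion-marker check.
set -u

workdir=$(mktemp -d)
index_dir="$workdir/salmon_index"

# Hypothetical stand-in for downloading and unpacking the index.
fetch_index() {
    local dest=$1
    mkdir -p "$dest"
    echo "placeholder index data" > "$dest/index.bin"
}

# Simulate an earlier run that died right after creating the directory.
mkdir -p "$index_dir"

# Fragile: only asks whether the directory exists, so the empty
# leftover directory is mistaken for a finished download.
if [ ! -d "$index_dir" ]; then
    fetch_index "$index_dir"
fi
fragile_ok=no
if [ -f "$index_dir/index.bin" ]; then
    fragile_ok=yes
fi

# Sturdier: trust the index only if a marker written after a successful
# fetch is present; otherwise rebuild in a temp directory and move it in.
if [ ! -f "$index_dir/.complete" ]; then
    rm -rf "$index_dir"
    tmp_dir=$(mktemp -d "$workdir/index.XXXXXX")
    fetch_index "$tmp_dir"
    touch "$tmp_dir/.complete"
    mv "$tmp_dir" "$index_dir"
fi
robust_ok=no
if [ -f "$index_dir/index.bin" ]; then
    robust_ok=yes
fi

echo "usable index after directory check: $fragile_ok"
echo "usable index after marker check:    $robust_ok"

rm -rf "$workdir"
```

The second variant works, but it is bookkeeping the script author has to remember to write and maintain for every conditional step.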
Changes:
- Fix LLM-cliche phrasing ("The question is how can we achieve...")
- Reformat tutorial overview table with Software column
- Add note about conditional logic fragility in scripts vs workflows
- Tone down gushing language throughout ("There has to be a better way" etc)
- Fix container accuracy - Nextflow supports multiple software packaging tools
- Streamline FASTQC process explanation - show complete process first
- Simplify container tip text
- Use val(id) instead of val(meta) throughout - simpler for intro tutorial
- Tone down LLM language patterns (removed "Remember...?" callbacks, "magic", etc)
- Add MultiQC aggregation section to BOTH bash and Nextflow parts to show
workflow benefits for complex pipelines with multiple outputs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes addressing review feedback
Adam's specific comments addressed:
Additional changes per feedback:
Unclear comments (left as-is for clarification):
Removed section on tool installation requirements in Nextflow.
- Add "Software Management" header for orphaned section
- Tone down exclamations ("It worked!" -> "This works", "Much faster!" -> "Faster, because...")
- Remove LLM phrase "worth mentioning"
- Strip trailing dash phrases that added redundant emphasis
- Tighten prose throughout
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This side quest teaches why workflow managers exist by having learners experience the limitations of scripting approaches firsthand, then see how Nextflow solves each problem.
Pedagogical Approach
Quality Framework
The tutorial opens by defining 7 production-quality standards for pipelines:
Learners then try to achieve these with scripts (Part 1) and see how workflow managers handle them (Part 2).
Part 1: Scripts
Learners build an RNA-seq pipeline in bash, progressively hitting limitations:
`&` and `wait`, no resource limits)

Uses starter files with TODOs - learners fill in tool commands, not boilerplate.
Part 2: Nextflow
Learners rebuild the same pipeline, with explicit callbacks to Part 1:
- … `container`"
- … `&` and `wait`? Nextflow infers parallelization from data flow"
- … `-resume`"

Includes a note that this isn't a proper Nextflow tutorial - directs to Hello Nextflow for that.
Before/After Blocks
Every code change uses tabbed Before/After blocks showing exactly what changes, with `hl_lines` highlighting the new/modified lines.

Key Points
Files
Documentation
- `docs/side_quests/workflow_management_fundamentals.md` - Main tutorial with Before/After blocks

Starter Files (with TODOs/placeholders)
- `side-quests/workflow_management_fundamentals/bash/` - Bash starter scripts
- `side-quests/workflow_management_fundamentals/nextflow/` - Nextflow starter files with `???` placeholders
- `side-quests/workflow_management_fundamentals/data/samples.csv` - Real RNA-seq test data

Solutions
- `side-quests/solutions/workflow_management_fundamentals/` - Complete working versions

Testing
🤖 Generated with Claude Code