Targets-based workflows leveraging slurm for dist. computing - workflow 1a and POC#13

Open
hdpriest-ui wants to merge 29 commits into ccmmf:main from hdpriest-ui:abstracted_workflows

Conversation

@hdpriest-ui
Contributor

@hdpriest-ui hdpriest-ui commented Sep 29, 2025

Updated description of PR contents:

Now contains a fully functional workflow for workflow 2a, as well as example workflows for downloading data, referencing previous workflow runs, copying data in from prior workflow runs, and executing PEcAn function calls in a distributed fashion via slurm and apptainer.

Also supports local execution in a containerized environment.

I have created 'adapter' R files which leverage @infotroph's command-line argument structure and pass it, via an XML structure, into the centralized workflow_functions.R versions of the code for stand-alone execution.

workflow_functions.R versions of workflow steps are - for the most part - copy-pasted from @infotroph's implementations.

Workflow 2a can be run as follows (from the repo root, on the ccmmf test cluster):

conda activate /home/hdpriest/miniconda3/envs/pecan-all-102425
cd ./orchestration/

Note: you will have to edit the 'workflow.base.run.directory' XML parameter in the orchestration XML below to your preferred location

Rscript 01_get_base_data.R -s workflow_orchestration.xml
Rscript 02_create_clim_files_dist.R  -s workflow_orchestration.xml
Rscript 03_build_xml_and_run.R  -s workflow_orchestration.xml
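For reference, a minimal sketch of the XML parameter mentioned in the note above; only the element name comes from this PR, and the surrounding file structure and path are assumptions:

```xml
<!-- Hypothetical fragment of workflow_orchestration.xml; edit the path
     to your preferred run location before executing the steps above. -->
<workflow.base.run.directory>/path/to/your/run/directory</workflow.base.run.directory>
```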

…preparation and analysis. Update .gitignore to exclude workflow run directories. Enhance run_pipeline scripts for better directory management and parameterization. Introduce new utility functions for data handling and workflow execution.

slurm workflow not yet functional.
added simple roxygen docs
updated pecan settings qstat to work with zero-length strings
added first draft setup shell script for one-button install
added workflow functions necessary for 1a
Added apptainer build image parent workflow
added apptainer sipnet-carb build workflow
added dockerfile to tools/ subdirectory

unlikely first attempt will build.
added line on obtaining current temp container
NOTE THE BUG: apptainer must be updated both in runscript as well as in the XML.
@hdpriest-ui
Contributor Author

GHA based workflow successfully builds sipnet-carb docker in the source repo:
https://github.com/hdpriest-ui/workflows/actions/runs/18386239890

Contributor

@dlebauer dlebauer left a comment

General comments:

  • I think that integrating targets into the workflow is a good idea. It will be worth reviewing together after Chris has also had a chance to review it.
  • Documentation (e.g. a README) would be helpful. Contents could include:
    • Overview and usage of targets for workflows - general approach that can be used across the project
    • Rationale behind divergence from more standard targets workflows, choice not to use crew or _targets.R file; use of environment vars + gsub; storing functions and args. I know you've explained in meetings, but these will be helpful to document.
    • For this specific example implementing the ensemble workflow, it would be useful to document the workflow components, including a tar_manifest and diagram of the DAG (output of tar_network() or tar_mermaid)?

Comment on lines +5 to +6
# function authors are encouraged to think carefully about the dependencies of their functions.
# if dependencies are not present, it would be ideal for functions to error informatively rather than fail on imports.
Contributor

How are dependencies used? What should the authors consider?

return(file.path(local_path, prefix_filename))
}

#' Prepare PEcAn Run Directory
Contributor

Something similar (create directory if it doesn't exist) is already done by prepare.settings() when it calls check.settings(). prepare.settings does a lot of other things but

https://github.com/PecanProject/pecan/blob/6d7a913dd9c5f6f3f992cbe3f9e3f263cd56bb6f/base/settings/R/check.all.settings.R#L479-L485

A standard pattern is:

settings <- PEcAn.settings::read.settings("pecan.xml")
settings <- PEcAn.settings::prepare.settings(settings)

But I don't see it used here in the workflows, so I'll defer to @infotroph to comment on whether not calling prepare.settings was a deliberate choice and whether it would be appropriate to use here.

Contributor

It was deliberate: Specifically because prepare.settings wants a live DB and queries it too many times to change that readily, and more generally because this workflow puts responsibility for constructing and verifying the settings into the xml_build stage.

Contributor

I'll admit to not fully understanding the db dependency, but prepare.settings is used for this purpose in the db-independent Demo 1: https://github.com/PecanProject/pecan/blob/bff6203e17cf4ff7f6c8e553f0ea16170051018b/documentation/tutorials/Demo_1_Basic_Run/run_pecan.qmd#L137

Contributor

Does this duplicate or could it be combined with workflow_run_directory_setup below?

jobids[task_id] <- PEcAn.remote::qsub_get_jobid(
out = out[length(out)],
qsub.jobid = pecan_settings$host$qsub.jobid,
stop.on.error = stop.on.error)
Contributor

Where is stop.on.error defined? (Identified by lintr as 'no visible binding for global variable'.)
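One way to resolve that lintr warning is to make stop.on.error an explicit argument of the enclosing function rather than a variable picked up from the calling environment. A hedged sketch; the wrapper function name and its other arguments are illustrative, not taken from this PR:

```r
# Illustrative only: declare stop.on.error as a parameter so it has a
# visible binding, instead of relying on a global/enclosing variable.
submit_task <- function(pecan_settings, out, task_id, jobids,
                        stop.on.error = TRUE) {
  jobids[task_id] <- PEcAn.remote::qsub_get_jobid(
    out = out[length(out)],
    qsub.jobid = pecan_settings$host$qsub.jobid,
    stop.on.error = stop.on.error
  )
  jobids
}
```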

@@ -0,0 +1,159 @@
name: build-image
Contributor

I see this is a lightly modified version of the docker-build-image.yml in PecanProject/pecan and I have a vague memory of seeing instructions for using workflow files from other repositories. Would it be worth investigating if we can call this from the PEcAn repo rather than maintain duplicate versions?

Contributor Author

I think it's definitely worth investigating. In moving it into this repository, I hesitated to create a new maintenance point for the same method. Investigation is the right word - I have questions, such as: which ghcr/docker repo will the created sipnet-carb image end up in? (Which do we want it to end up in?) Can we invoke the method in the pecan repo using secrets in the ccmmf repo?

I'll look into it.

hdpriest-ui and others added 4 commits October 20, 2025 09:59
Co-authored-by: David LeBauer <dlebauer@gmail.com>
… a distributed workflow

documents in hopefully useful state.
@dlebauer dlebauer requested a review from mdietze November 10, 2025 21:21
updated apptainer build to support develop
…can's version of this yaml.

added image-version input parameter at base apptainer sipnet-carb builder
-refactored configs into latest and devel for ease of stack testing
-refactored parameter passing: majority of workflow parameters are passed via orchestration XML
-minimized gsub replacements for clarity
- added script for XML build step
- added single function for XML build step
- leveraged targets "target_raw" methodology to enable function-call-like invocations of multiple targets in re-usable blocks
- enabled parameter passing and parsing for function-like behavior of target blocks
- combined 03 and 04 steps from workflow 2a
- workflow 2a function execution working, data routing incomplete
…g parsing with centralized functions

- added smart functional resolution for either referencing external data, or copying external data into a run.
- added argument parsing through as.numeric() to correctly parameterize centralized workflow functions
- obtained successful 2a workflow replication via targets, apptainer and slurm
- updated example workflows for new data referencing
- removed obsolete example 3 variant
- removed some obsolete functions within workflow_functions.R
- added a gha for CI of workflows
- added self hosted runner info to github action
@ashiklom ashiklom left a comment

While I appreciate the amount of effort that went into this, my initial, overall reaction to this is that this adds a lot of code and complexity with unclear purpose or benefit.

First, at a very high level, I think this PR needs to state up front what problem(s) it is trying to solve. If nothing else, that will make it easier to review to see whether those problem(s) have actually been addressed or not.

Second, the usage of targets here is a bit puzzling. Assuming the point of this is to simplify the execution of each individual workflow (e.g., run all of 2a_grass with one command; distribute workers between local and remote (qsub, etc.) automatically as needed), I would have started with doing exactly that: First, lightly refactor the scripts in each workflow into functions. Then, chain those functions together in a single _targets.R file for each workflow (e.g., 2a_grass/workflow.R). It's possible this would require some of the intense engineering that went into this PR, but I'm not convinced. Conceptually, for 2a_grass, you should end up with a script that looks something like (I'm simplifying here, but you get the idea):

tar_script(
	list(
		# NOTE: Map across multiple inputs in parallel
		tar_target(
			sipnet_inputs,
			PEcAn.SIPNET::met2model.SIPNET(...),
			pattern = map(file_info)
		),
		tar_target(ic_files, build_ic_files(...)),
		# NOTE: Depends on sipnet_inputs and ic_files
		tar_target(xml, build_pecan_xml(sipnet_inputs, ic_files)),
		# NOTE: Depends on xml
		tar_target(raw_model, run_pecan_workflow(xml)),
		# NOTE: Depends on raw_model
		tar_target(results, post_process(raw_model))
	)
)
tar_make()

If you needed to run non-locally (including with a mix of local and remote workers, and even remote workers with heterogeneous resource requirements), targets supports that via crew (https://books.ropensci.org/targets/crew.html).

It's possible this would also lead to some of the complexity introduced by this PR (I'm skeptical of this...but open to being shown otherwise!). But, if getting targets working requires the level of complexity here then I'm not convinced it's worth it. You can get pretty far with just simple wrapper scripts, a basic makefile, and if blocks skipping inputs you've already generated.

A few specific points on this PR:

  1. Where targets shines is in its ability to selectively re-run only the parts of a workflow that need re-running, without re-running everything. But from what I can tell of the scripts in orchestration, these are designed to run in isolated directories (similar to PEcAn workflows) with unique run IDs. Case in point: these scripts write a new targets pipeline on each execution. From what I can tell (it's a bit hard to follow), we then implement some complex logic (step__resolve_data_routing -> loop resolve_data_routing, turning each one into a tar_target -> localize/reference -> fiddle with tar_path_store()) to try to share artifacts across workflows in a way targets understands. But a far simpler implementation would just plop artifacts into some shared location and do logic like if (!file.exists("shared/whatever") && !overwrite){ download_whatever(...) }.
  2. There's a lot of indirection, which makes this hard to follow and debug. Case in point: I had to jump through 4 layers of function calls to figure out what step__resolve_data_routing does. Also, we've moved code out of locations where it logically makes sense (e.g., build IC files logic lives in a script called build_ic_files.R) into a very large single kitchen sink file (workflow_functions.R) that other scripts all now have to depend on.
    1. If we don't want to go full R package here (which I think is overkill), we might consider box (https://klmr.me/box/). It provides a very Pythonic and elegant way to reuse code within (and across) projects without having to resort to the full complexity of R package infrastructure. If nothing else, its namespacing capabilities would at least make it much easier to quickly identify where certain functions were coming from (e.g., box::use(mod/workflow) -> workflow$build_ic_files).
  3. Similarly, the heavy use of base R's substitute + quote also adds a lot of complexity. (Case in point: to register functions with targets, step__run_distributed_write_configs has to create a bunch of tar_target_raw with a complicated nest of substitute, quote, and get calls.) If that's what it takes to get this working with targets, it's again worth asking why we want to use targets here and whether any benefit we get from targets is worth such complexity.
    1. If we do want to stick with this, it may be worth switching to rlang's arguably more streamlined and predictable nonstandard evaluation capabilities. We're already implicitly depending on rlang through the tidyverse, so it doesn't actually add any dependencies.
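To make the rlang suggestion above concrete, here is a hedged sketch of constructing a tar_target_raw() command with rlang's injection operators instead of nested substitute()/quote()/get() calls. The helper name and arguments are illustrative, not from this PR:

```r
library(rlang)

# Illustrative only: build the target's command expression by injecting a
# function name and pre-built argument expressions, then register it with
# targets via tar_target_raw(), which accepts a raw expression as `command`.
make_step_target <- function(step_name, fn_name, arg_exprs) {
  command <- expr((!!sym(fn_name))(!!!arg_exprs))
  targets::tar_target_raw(step_name, command)
}

# e.g. make_step_target("write_configs",
#                       "run_distributed_write_configs",
#                       exprs(settings = pecan_settings, n_ensemble = 10))
```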

Contributor

@dlebauer dlebauer left a comment

Apologies - I just realized that I had made comments that were not submitted. @hdpriest-ui I believe that I brought some or all of these points up directly when we met about this, but wanted to make sure to get them posted as we work out a plan to move forward.

return(TRUE)
}

#' Start PEcAn Ecosystem Model Runs
Contributor

Many of these wrapper functions appear to be light wrappers around existing PEcAn functions.

This function calls PEcAn.workflow::runModule_start_model_runs, which is already a wrapper around this workflow step. This function mainly adds status updates and logger. But targets already provides tracking / timing / error handling.

Calling the module function directly would simplify the workflow and reduce redundancy - unless these status updates are required downstream.

I think it would be worth clarifying where a wrapper function provides sufficient value to warrant the additional layer of complexity. Perhaps there is some functionality that could be directly added to the PEcAn function.

return(file.path(local_path, prefix_filename))
}

#' Prepare PEcAn Run Directory
Contributor

Does this duplicate or could it be combined with workflow_run_directory_setup below?

### Distributed Compute for Workflows
In order to execute the workflow in a distributed manner, individual workflow steps are invoked within the specific workflow run directory. To accomplish this, a new R process is instantiated on the worker node, and the code is executed as part of the Targets framework.

This means that the workflow steps - as invoked by slurm within an R process - have access to the workflow run resources, such as data artifacts produced by preceding steps. Each step is also executed in the context of the workflow run directory, so invoking PEcAn methods within that directory is quite direct. This should also make it clear that, because each step is invoked in a new R namespace (and indeed, on an entirely different compute node), each workflow step must import its own dependencies.
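As a hypothetical illustration of the mechanism described above, the per-step invocation on a worker node might look roughly like the following; the paths, image name, and target name are assumptions, not taken from this PR:

```sh
# Illustrative only: launch one workflow step on a Slurm worker, inside the
# apptainer image, with the workflow run directory as the working context.
sbatch --wrap "apptainer exec sipnet-carb.sif \
  Rscript -e 'setwd(\"/path/to/workflow_run_dir\"); targets::tar_make(names = \"step_name\")'"
```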
Contributor

It would be helpful to document why crew + crew.cluster packages are not used for distributed computing here.

Did you review the discussions in the crew.cluster repository around using targets w/ apptainer?

And have you run across https://github.com/dynverse/babelwhale - it was referenced above and looks like a pretty handy interface analogous to system2 but with containers, i.e.

babelwhale::run(
  container,
  command, args)

jetty seems similar, though not clear if it works w/ apptainer.

Contributor

@ashiklom there are some issues with the use of singularity with crew-cluster linked above; some workarounds are proposed but it isn’t clear if they will solve the problems @hdpriest-ui ran into. That said, it would probably benefit all involved if our project could submit a fix to the crew cluster package rather than rolling our own.

Comment on lines +167 to +179
# in order to parametrize the workflow script, we have to first create placeholders, and then below, replace them with actual values.
# if we simply place the variables in the script definition above, they are evaluated at the time the script is executed by tar_make()
# that execution takes place in a different process + memory space, in which those variables are not accessible.
# so, we create the execution script, and then text-edit in the parameters.
# Read the generated script and replace placeholders with actual file paths
script_content <- readLines(tar_script_path)
script_content <- gsub("@FUNCTIONPATH@", function_path, script_content)
script_content <- gsub("@PECANXML@", pecan_xml_path, script_content)
script_content <- gsub("@CCMMFDATAURL@", ccmmf_data_tarball_url, script_content)
script_content <- gsub("@CCMMFDATAFILENAME@", ccmmf_data_filename, script_content)
script_content <- gsub("@APPTAINERSOURCEDIR@", apptainer_source_dir, script_content)
script_content <- gsub("@APPTAINERNAME@", apptainer_name, script_content)
script_content <- gsub("@REMOTECONDAENV@", remote_conda_env, script_content)
Contributor

Would tarchetypes::tar_map() or dynamic branching handle this kind of parameterization?
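To make the tar_map() question concrete, a hedged sketch of placeholder-free parameterization; the target and value names are illustrative, not from this PR:

```r
library(targets)
library(tarchetypes)

# Illustrative only: bake parameter values into static branches at pipeline
# definition time, instead of text-substituting @PLACEHOLDER@ tokens into
# the generated script.
tar_map(
  values = list(
    path = c("workflow_functions.R", "pecan.xml")
  ),
  tar_target(input_file, path, format = "file")
)
```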

return(pecan_settings)
}

#' Reference External Data Entity
Contributor

Could you add some context (here & README) explaining 1) why this function behaves differently than declaring the path with tar_target(..., format = "file")? Is it because the files are too large to duplicate? Or to meet targets' expectation that the file exists? and 2) when this function should be used.

@hdpriest-ui
Contributor Author

I have seen the feedback here @ashiklom; but I'm in some other projects for the first few days of this week. I will read more closely and respond tomorrow (Wednesday the 28th).

@hdpriest-ui
Contributor Author

For quick reference, the big driver that ended experimentation with crew was this issue:
wlandau/crew.cluster#64

We did discuss trying to resolve it, but ultimately we didn't go down that path. We can revive that if we think it's valuable.

@hdpriest-ui
Contributor Author

I'm going to chunk up my thoughts, just to keep each chunk smaller for ease of referencing.

I'll call out at the beginning here that the only portions of this effort I have strong opinions on are the workflow reproducibility and transparency outcomes. Our approach to getting there is not nearly as important to me as actually delivering a workflow framework which gives absolute certainty regarding the data and the methods that produced an outcome.

@hdpriest-ui
Contributor Author

I'm going to go through this mostly in-order. I think there's some context and history that is lost while looking at this extremely sprawling PR only here at the end of it.

The overall goal here is not just to simplify the executions of the workflows for end users - though that is an intended side effect.

This grew out of the need to build a framework for CARB which would accomplish a few goals:

  1. Provide a workflow system that would enable reproducibility and transparency for their analytical workflows surrounding the CCMMF project. Specifically, we should enable them to have clarity about what code was run, what data was utilized, and what the outcome of that analytical process was. This will have to support many independent runs, in the context of datasets which are evolving, and code that is evolving.

  2. Support distributed execution of work on CARB's HPC, while minimizing maintenance overhead for managing software installs & deployments. The framework would have to be compatible with Slurm, and support configurable/swappable software versions (e.g., run an analysis with Pecan-dev, Pecan-dev*). This is where the use of Apptainer strongly stems from.

  3. Enable people on the PEcAn side to contribute to the framework without requiring extensive re-writes of a workflow-step's code, or significantly transmuting that code to the point where the original author can't maintain it. I.e., if Chris or David writes a method for ensemble modeling or downscaling, the ideal case is that their code can be leveraged as-is within the framework.

With the above in mind, Targets became an excellent foundation to build on because of its native support for indexing analysis artifacts - both code and data being tracked on a granular level. For clarity, the current functionality allows users to either generate analysis executions within the context of novel autogenerated UUID-stamped identifiers, or to re-evaluate existing executions. This behavior isn't perfect, as PEcAn has plenty of outputs that are generated purely as side-effects, rather than returned data products, but this can be mitigated.

@hdpriest-ui
Contributor Author

So, this work began with refactoring a single workflow into a targets-mediated series of steps; but the intent was not to simplify the execution of each workflow. The intent was to build re-usable workflow steps which could be leveraged within/across multiple workflows. Workflow steps ideally would be able to be executed in a slurm-managed context or within a single node. (note from one of my other comments that unfortunately, targets + crew + slurm doesn't appear to be functional at the moment. If we want to pursue a fix in that repo, that seems like a worthy goal)

I don't think we've reached the point where each workflow step that is needed exists.

With regard to code organization & referencing: there were many times I wanted to go to a full R package, because placing everything into workflow_functions.R is absolutely not a good solution. At the very least, a project-domain structure seems indicated (e.g., 'ensemble_functions', 'downscaling_functions', etc.). I haven't messed with box before; I'm open to different approaches here.

The driver behind leveraging tar_target_raw calls is to directly hook the executed code into targets' object indexing and workflow re-evaluation support, and to bundle these steps together. By leveraging tar_target_raw calls, if the code that is executed to support the analyses changes, the workflow steps are registered for re-evaluation. This means that the full state of each analysis run is preserved within a workflow run directory permanently, and all code and data which support an analytical outcome are indexed. This sidesteps the dangers of data products generated from (for example) pecan-develop or other package labels which are not 1-to-1 with a specific iteration of a codebase.

@hdpriest-ui
Contributor Author

regarding data tracking and the varied use of data referencing vs data import.

(btw, the only hill in this PR that I'll die on is how we handle our data & code artifacts)

For clarity, when I refer to 'identity-traced' artifacts, I mean that the relevant data identifier represents an entirely static, non-variant data object. Often this is a snapshot of a particular workflow-external data entity at a given time.

There are a couple of key points that I've tried to support:

  • In the context of a workflow which is made up of a series of analytics-steps, when possible, all data artifacts needed to execute the entire workflow should exist somewhere within the framework's data storage as identity-traced artifacts. If, for example, the time-varying "dataset_A" is an input to a multi-step workflow, the version of dataset_A which was analyzed within a given run must be preserved or obtainable. Dependence on a framework-external "dataset_A" which is not identity-traced renders reproducibility, transparency, and confidence impossible.

  • there are instances within analytics-step executions where it is appropriate to create a step-local copy of a data product. There are instances where it is not appropriate. Therefore a workflow framework must support both referencing of data artifacts, as well as step-localization (i.e., copying) of data artifacts.

  • unless required, data artifacts should be referenced, not copied. Every copy of a data artifact is an opportunity for desynchronization. However, referenced data artifacts must be identity-traced.

  • necessarily, the first time that a data artifact appears within a workflow run, it will almost always be a copy of a workflow-external artifact created at a given time & place. This is a good example of when copying data is preferable (indeed, required) to referencing it.

@hdpriest-ui
Contributor Author

The comments are running the risk of becoming as sprawling & unwieldy as the PR itself. Happy to discuss how we want to move to a framework that meets these requirements.

If there's unclarity in the above, it's probably the writer's error rather than the reader's - hit me up for more discussion.
