Targets-based workflows leveraging slurm for dist. computing - workflow 1a and POC#13
hdpriest-ui wants to merge 29 commits into ccmmf:main
Conversation
…it in case i break everything.
… entirely reproducible.
…preparation and analysis. Update .gitignore to exclude workflow run directories. Enhance run_pipeline scripts for better directory management and parameterization. Introduce new utility functions for data handling and workflow execution. slurm workflow not yet functional.
added simple roxygen docs
updated pecan settings qstat to work with zero-length strings
added first draft setup shell script for one-button install
added workflow functions necessary for 1a
Added apptainer build image parent workflow
added apptainer sipnet-carb build workflow
added dockerfile to tools/ subdirectory
unlikely first attempt will build.
added line on obtaining current temp container
NOTE THE BUG: apptainer must be updated both in runscript as well as in the XML.
GHA-based workflow successfully builds sipnet-carb docker in the source repo:
General comments:
- I think that integrating targets into the workflow is a good idea. It will be worth reviewing together after Chris has also had a chance to review it.
- Documentation (e.g. a README) would be helpful. Contents could include:
- Overview and usage of targets for workflows - general approach that can be used across the project
- Rationale behind divergence from more standard targets workflows, choice not to use crew or _targets.R file; use of environment vars + gsub; storing functions and args. I know you've explained in meetings, but these will be helpful to document.
- For this specific example implementing the ensemble workflow, it would be useful to document the workflow components, including a tar_manifest and a diagram of the DAG (output of tar_network() or tar_mermaid)?
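For reference, a minimal sketch of generating those artifacts with targets (assuming the pipeline's _targets.R script exists in the working directory):

```r
library(targets)

# Tabular summary of targets and the commands that build them
tar_manifest(fields = c("name", "command"))

# Text for a mermaid.js diagram of the DAG; can be pasted into a README
tar_mermaid()
```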
```r
# function authors are encouraged to think carefully about the dependencies of their functions.
# if dependencies are not present, it would be ideal for functions to error informatively rather than fail on imports.
```
How are dependencies used? What should the authors consider?
```r
return(file.path(local_path, prefix_filename))
}

#' Prepare PEcAn Run Directory
```
Something similar (create directory if it doesn't exist) is already done by prepare.settings() when it calls check.settings(). prepare.settings does a lot of other things but
A standard pattern is:

```r
settings <- PEcAn.settings::read.settings("pecan.xml")
settings <- PEcAn.settings::prepare.settings(settings)
```

But I don't see it used here in the workflows, so I'll defer to @infotroph to comment on whether not calling prepare.settings was a deliberate choice and whether it would be appropriate to use here.
It was deliberate: Specifically because prepare.settings wants a live DB and queries it too many times to change that readily, and more generally because this workflow puts responsibility for constructing and verifying the settings into the xml_build stage.
I'll admit to not fully understanding the db dependency, but prepare.settings is used for this purpose in the db-independent Demo 1: https://github.com/PecanProject/pecan/blob/bff6203e17cf4ff7f6c8e553f0ea16170051018b/documentation/tutorials/Demo_1_Basic_Run/run_pecan.qmd#L137
Does this duplicate or could it be combined with workflow_run_directory_setup below?
```r
jobids[task_id] <- PEcAn.remote::qsub_get_jobid(
  out = out[length(out)],
  qsub.jobid = pecan_settings$host$qsub.jobid,
  stop.on.error = stop.on.error)
```
Where is stop.on.error defined? (flagged by lintr as 'no visible binding for global variable')
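One common fix, sketched here under the assumption that this code lives inside a function, is to make stop.on.error an explicit argument with a default; this documents the intent and removes the lintr warning:

```r
# Hypothetical wrapper; names other than stop.on.error are illustrative
submit_and_track <- function(pecan_settings, out, task_id, jobids,
                             stop.on.error = TRUE) {
  jobids[task_id] <- PEcAn.remote::qsub_get_jobid(
    out = out[length(out)],
    qsub.jobid = pecan_settings$host$qsub.jobid,
    stop.on.error = stop.on.error)
  jobids
}
```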
@@ -0,0 +1,159 @@
name: build-image
I see this is a lightly modified version of the docker-build-image.yml in PecanProject/pecan and I have a vague memory of seeing instructions for using workflow files from other repositories. Would it be worth investigating if we can call this from the PEcAn repo rather than maintain duplicate versions?
I think it's definitely worth investigating. In moving it into this repository, I hesitated to create a new maintenance point for the same method. Investigation is the right word - I have questions, such as: which ghcr/docker repo will the created sipnet-carb image end up in? (Which do we want it to end up in?) Can we invoke the method in the pecan repo using secrets in the ccmmf repo?
I'll look into it.
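For reference, if the PEcAn workflow were exposed with an `on: workflow_call` trigger, invoking it from the ccmmf repo would look roughly like this (a sketch; the input name and ref are assumptions, not the actual PEcAn workflow interface):

```yaml
jobs:
  build-sipnet-carb:
    # hypothetical: requires docker-build-image.yml to declare `on: workflow_call`
    uses: PecanProject/pecan/.github/workflows/docker-build-image.yml@main
    with:
      image-name: sipnet-carb   # hypothetical input
    secrets: inherit            # forwards this repo's secrets to the called workflow
```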
Co-authored-by: David LeBauer <dlebauer@gmail.com>
…intended for review at this point
…/workflows into abstracted_workflows
… a distributed workflow documents in hopefully useful state.
updated apptainer build to support develop
…can's version of this yaml. added image-version input parameter at base apptainer sipnet-carb builder
- refactored configs into latest and devel for ease of stack testing
- refactored parameter passing: majority of workflow parameters are passed via orchestration XML
- minimized gsub replacements for clarity
- added script for XML build step
- added single function for XML build step
- leveraged targets "target_raw" methodology to enable function-call like invocations of multiple targets in re-usable blocks
- enabled parameter passing and parsing for function-like behavior of target blocks
- combined 03 and 04 steps from workflow 2a
- workflow 2a function execution working, data routing incomplete
…g parsing with centralized functions
- added smart functional resolution for either referencing external data, or copying external data into a run
- added argument parsing through as.numeric() to correctly parameterize centralized workflow functions
- obtained successful 2a workflow replication via targets, apptainer and slurm
- updated example workflows for new data referencing
- removed obsolete example 3 variant
- removed some obsolete functions within workflow_functions.R
- added a gha for CI of workflows
- added self hosted runner info to github action
- workflows now trigger on self hosted runner
- to be resolved: CI location output cleanup
ashiklom
left a comment
While I appreciate the amount of effort that went into this, my initial, overall reaction to this is that this adds a lot of code and complexity with unclear purpose or benefit.
First, at a very high level, I think this PR needs to state up front what problem(s) it is trying to solve. If nothing else, that will make it easier to review to see whether those problem(s) have actually been addressed or not.
Second, the usage of targets here is a bit puzzling. Assuming the point of this is to simplify the execution of each individual workflow (e.g., run all of 2a_grass with one command; distribute workers between local and remote (qsub, etc.) automatically as needed), I would have started with doing exactly that: First, lightly refactor the scripts in each workflow into functions. Then, chain those functions together in a single _targets.R file for each workflow (e.g., 2a_grass/workflow.R). It's possible this would require some of the intense engineering that went into this PR, but I'm not convinced. Conceptually, for 2a_grass, you should end up with a script that looks something like (I'm simplifying here, but you get the idea):
```r
tar_script(
  list(
    # NOTE: Map across multiple inputs in parallel
    tar_target(
      sipnet_inputs,
      PEcAn.SIPNET::met2model.SIPNET(...),
      pattern = map(file_info)
    ),
    tar_target(ic_files, build_ic_files(...)),
    # NOTE: Depends on sipnet_inputs and ic_files
    tar_target(xml, build_pecan_xml(sipnet_inputs, ic_files)),
    # NOTE: Depends on xml
    tar_target(raw_model, run_pecan_workflow(xml)),
    # NOTE: Depends on raw_model
    tar_target(results, post_process(raw_model))
  )
)
tar_make()
```

If you needed to run non-locally (including with a mix of local and remote workers, and even remote workers with heterogeneous resource requirements), targets supports that via crew (https://books.ropensci.org/targets/crew.html).
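For concreteness, wiring crew workers to SLURM is typically a few lines in _targets.R (a sketch assuming crew.cluster is installed and SLURM submission works from the node running the pipeline):

```r
library(targets)
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    workers = 4,        # up to 4 SLURM jobs acting as workers
    seconds_idle = 60   # workers exit after a minute of inactivity
  )
)
```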
It's possible this would also lead to some of the complexity introduced by this PR (I'm skeptical of this...but open to being shown otherwise!). But, if getting targets working requires the level of complexity here then I'm not convinced it's worth it. You can get pretty far with just simple wrapper scripts, a basic makefile, and if blocks skipping inputs you've already generated.
A few specific points on this PR:
- Where targets shines is in its ability to selectively re-run only the parts of a workflow that need re-running, without re-running everything. But from what I can tell of the scripts in orchestration, these are designed to run in isolated directories (similar to PEcAn workflows) with unique run IDs. Case in point: these scripts write a new targets pipeline on each execution. From what I can tell (it's a bit hard to follow), we then implement some complex logic (step__resolve_data_routing -> loop resolve_data_routing, turning each one into a tar_target -> localize/reference -> fiddle with tar_path_store()) to try to share artifacts across workflows in a way targets understands. But a far simpler implementation would just plop artifacts into some shared location and do logic like if (!file.exists("shared/whatever") && !overwrite) { download_whatever(...) }.
- There's a lot of indirection, which makes this hard to follow and debug. Case in point: I had to jump through 4 layers of function calls to figure out what step__resolve_data_routing does. Also, we've moved code out of locations where it logically makes sense (e.g., build IC files logic lives in a script called build_ic_files.R) into a very large single kitchen-sink file (workflow_functions.R) that all other scripts now have to depend on.
  - If we don't want to go full R package here (which I think is overkill), we might consider box (https://klmr.me/box/). It provides a very Pythonic and elegant way to reuse code within (and across) projects without having to resort to the full complexity of R package infrastructure. If nothing else, its namespacing capabilities would at least make it much easier to quickly identify where certain functions were coming from (e.g., box::use(mod/workflow) -> workflow$build_ic_files).
- Similarly, the heavy use of base R's substitute + quote also adds a lot of complexity. (Case in point: to register functions with targets, step__run_distributed_write_configs has to create a bunch of tar_target_raw calls with a complicated nest of substitute, quote, and get calls.) If that's what it takes to get this working with targets, it's again worth asking why we want to use targets here and whether any benefit we get from targets is worth such complexity.
  - If we do want to stick with this, it may be worth switching to rlang's arguably more streamlined and predictable nonstandard evaluation capabilities. We're already implicitly depending on rlang through the tidyverse, so it doesn't actually add any dependencies.
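For illustration, the box pattern looks roughly like this (module path and function name are hypothetical):

```r
# ./mod/workflow.r would contain the exported functions (marked with #' @export)
box::use(
  mod/workflow   # attaches the module as `workflow`
)
workflow$build_ic_files(settings)
```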
dlebauer
left a comment
Apologies - I just realized that I had made comments that were not submitted. @hdpriest-ui I believe that I brought some or all of these points up directly when we met about this, but wanted to make sure to get them posted as we work out a plan to move forward.
```r
return(TRUE)
}

#' Start PEcAn Ecosystem Model Runs
```
Many of these wrapper functions appear to be light wrappers around existing PEcAn functions.
This function calls PEcAn.workflow::runModule_start_model_runs, which is already a wrapper around this workflow step. This function mainly adds status updates and logger. But targets already provides tracking / timing / error handling.
Calling the module function directly would simplify the workflow and reduce redundancy - unless these status updates are required downstream.
I think it would be worth clarifying where a wrapper function provides sufficient value to warrant the additional layer of complexity. Perhaps there is some functionality that could be directly added to the PEcAn function.
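The suggested simplification would look something like this (a sketch; assumes settings is already produced by an upstream target, and that runModule_start_model_runs accepts a stop.on.error argument as used elsewhere in this PR):

```r
# Call the PEcAn module directly; targets itself provides tracking/timing/error handling
tar_target(
  model_runs,
  PEcAn.workflow::runModule_start_model_runs(settings, stop.on.error = TRUE)
)
```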
```r
return(file.path(local_path, prefix_filename))
}

#' Prepare PEcAn Run Directory
```
Does this duplicate or could it be combined with workflow_run_directory_setup below?
1a_workflowed/README.md
Outdated
### Distributed Compute for Workflows

In order to execute the workflow in a distributed manner, individual workflow steps are invoked within the specific workflow run directory. To accomplish this, a new R process is instantiated on the worker node, and the code is executed as part of the Targets framework.

This means that the workflow steps - as invoked by slurm within an R process - have access to the workflow run resources, such as data artifacts produced by preceding steps. Each step is also executed in the context of the workflow run directory, so the invocation of PEcAn methods within the workflow directory becomes quite direct. This should also make it clear that, because each step is invoked within a new R namespace (and indeed, on an entirely different compute node), each workflow step must import its own dependencies.
It would be helpful to document why crew + crew.cluster packages are not used for distributed computing here.
Did you review the discussions in the crew.cluster repository around using targets w/ apptainer?
- Executing workers in singularity image wlandau/crew.cluster#41 and
- Issue running with Singularity on SLURM wlandau/crew.cluster#35
And have you run across https://github.com/dynverse/babelwhale - it was referenced above and looks like a pretty handy interface analogous to system2 but with containers, i.e.

```r
babelwhale::run(
  container,
  command, args
)
```

jetty seems similar, though it's not clear if it works w/ apptainer.
@ashiklom there are some issues with the use of singularity with crew-cluster linked above; some workarounds are proposed but it isn’t clear if they will solve the problems @hdpriest-ui ran into. That said, it would probably benefit all involved if our project could submit a fix to the crew cluster package rather than rolling our own.
1a_workflowed/run_pipeline_slurm.R
Outdated
```r
# in order to parametrize the workflow script, we have to first create placeholders, and then below, replace them with actual values.
# if we simply place the variables in the script definition above, they are evaluated at the time the script is executed by tar_make()
# that execution takes place in a different process + memory space, in which those variables are not accessible.
# so, we create the execution script, and then text-edit in the parameters.
# Read the generated script and replace placeholders with actual file paths
script_content <- readLines(tar_script_path)
script_content <- gsub("@FUNCTIONPATH@", function_path, script_content)
script_content <- gsub("@PECANXML@", pecan_xml_path, script_content)
script_content <- gsub("@CCMMFDATAURL@", ccmmf_data_tarball_url, script_content)
script_content <- gsub("@CCMMFDATAFILENAME@", ccmmf_data_filename, script_content)
script_content <- gsub("@APPTAINERSOURCEDIR@", apptainer_source_dir, script_content)
script_content <- gsub("@APPTAINERNAME@", apptainer_name, script_content)
script_content <- gsub("@REMOTECONDAENV@", remote_conda_env, script_content)
```
Would tarchetypes::tar_map() or dynamic branching handle this kind of parameterization?
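As a point of comparison, tarchetypes::tar_map() bakes parameter values into the target definitions at pipeline-definition time, which avoids the placeholder/gsub round-trip (a sketch; file paths and the downstream function are illustrative):

```r
library(targets)
library(tarchetypes)

values <- data.frame(
  pecan_xml = c("runs/a/pecan.xml", "runs/b/pecan.xml"),
  stringsAsFactors = FALSE
)

tar_map(
  values = values,
  tar_target(settings, PEcAn.settings::read.settings(pecan_xml)),
  tar_target(configs, write_configs(settings))  # hypothetical downstream step
)
```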
```r
return(pecan_settings)
}

#' Reference External Data Entity
```
Could you add some context (here & README) explaining 1) why this function behaves differently than declaring the path with tar_target(..., format = "file")? Is it because the files are too large to duplicate? Or to meet targets' expectation that the file exists? and 2) when this function should be used.
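For contrast, the idiomatic targets pattern for tracking an existing file (rerunning downstream targets whenever the file's contents change) is:

```r
tar_target(
  external_data,
  "/shared/ccmmf/inputs.tar.gz",  # hypothetical path; must exist when tar_make() runs
  format = "file"
)
```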
I have seen the feedback here @ashiklom; but I'm in some other projects the first few days this week, will read more closely and respond tomorrow (Wednesday 28th)
For quick reference, the big driver that ended experimentation with crew was this issue: we did discuss trying to resolve it, but ultimately we didn't go down that path. We can revive that if we think it's valuable.
I'm going to chunk up my thoughts, just to keep each chunk smaller for ease of referencing. I'll call out at the beginning here that the only portions of this effort I have strong opinions on are the workflow reproducibility and transparency outcomes. Our approach to getting there is not nearly as important to me as actually delivering a workflow framework which gives absolute certainty regarding the data and the methods that produced an outcome.
I'm going to go through this mostly in-order. I think there's some context and history that is lost while looking at this extremely sprawling PR only here at the end of it. The overall goal here is not just to simplify the executions of the workflows for end users - though that is an intended side effect. This grew out of the need to build a framework for CARB which would accomplish a few goals:
With the above in mind, Targets became an excellent foundation to build on because of its native support for indexing analysis artifacts - both code and data being tracked on a granular level. For clarity, the current functionality allows users either to generate analysis executions within the context of novel autogenerated UUID-stamped identifiers, or to re-evaluate existing executions. This behavior isn't perfect, as PEcAn has plenty of outputs that are generated purely as side effects rather than returned data products, but this can be mitigated.
So, this work began with refactoring a single workflow into a targets-mediated series of steps; but the intent was not to simplify the execution of each workflow. The intent was to build re-usable workflow steps which could be leveraged within/across multiple workflows. Workflow steps ideally would be able to be executed in a slurm-managed context or within a single node. (Note from one of my other comments that, unfortunately, targets + crew + slurm doesn't appear to be functional at the moment. If we want to pursue a fix in that repo, that seems like a worthy goal.) I don't think we've reached the point where each workflow step that is needed exists.

With regard to code organization & referencing: there were many times I wanted to go to a full R package, because placing everything into workflow_functions.R is absolutely not a good solution. At the very least, a project-domain structure seems indicated (e.g., 'ensemble_functions', 'downscaling_functions', etc.). I haven't messed with box before; I'm open to different approaches here.

The driver behind leveraging targets_raw calls is to directly hook the executed code into Targets' object indexing and workflow re-evaluation support, and to bundle these steps together. By leveraging targets_raw calls, if the code that is executed to support the analyses changes, the workflow steps are registered for re-evaluation. This means that the full state of each analysis run is preserved within a workflow run directory permanently, and all code and data which supports an analytical outcome is indexed. This sidesteps the dangers of data products generated from (for example) pecan-develop or other package labels which are not 1-1 with a specific iteration of a codebase.
Regarding data tracking and the varied use of data referencing vs data import (btw, the only hill in this PR that I'll die on is how we handle our data & code artifacts): for clarity, when I refer to 'identity-traced' artifacts, I mean that the relevant data identifier represents an entirely static, non-variant data object. Often this is a snapshot of a particular workflow-external data entity at a given time. There are a couple of key points that I've tried to support:
The comments are running the risk of becoming as sprawling & unwieldy as the PR itself. Happy to discuss how we want to move to a framework that fills the requirement. If there's unclarity in the above, it's probably due to the writer's error rather than the reader's - hit me up for more discussion.
Updated description of PR contents:
Now contains fully-functional workflow for workflow 2a, as well as example workflows for downloading data, referencing previous workflow runs, copying data in from prior workflow runs, and executing PEcAn function calls in a distributed fashion via slurm and apptainer.
Also supports local execution in a containerized environment.
Have created 'adapter' R files which leverage @infotroph's command line argument structure and pass it via an XML structure into the workflow_functions.R centralized versions of code for stand-alone execution.
workflow_functions.R versions of workflow steps are - for the most part - copy-pasted from @infotroph's implementations.
workflow 2a can be realized via (from repo root, on ccmmf test cluster):

```sh
conda activate /home/hdpriest/miniconda3/envs/pecan-all-102425
cd ./orchestration/
```

Note: you will have to edit the 'workflow.base.run.directory' XML parameter in the orchestration XML below to your preferred location