Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
81f7cb7 to
8a43489
Compare
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
|
This PR is ready for review. The publish dir now publishes each file in a directory instead of just the directory, which should improve detection of Reports in Seqera Platform and fix some issues with Fusion symlinks (#4725). It does not yet solve #3372 , for that we also need to verify the checksum of published files when deciding whether a file needs to be re-published. I can implement that here or in a separate PR, up to you @pditommaso . |
|
Looks good, thanks Ben! In the case where publish method is symlinks and we are emitting/publishing a directory, what does the published output look like? Is it a symlinked directory full of symlinks, or a regular directory full of symlinks, or something else entirely? |
|
It will be a regular directory of symlinks |
pditommaso
left a comment
There was a problem hiding this comment.
Not sure we should do this. On some storage traversing a directory path is extremely expensive e.g. Lustre, NFS, etc.
|
I share your concern, although copying a directory also involves walking the directory tree at some level, so may it's not an issue. Either way, there are other reasons we should do this. In general, I think it is a bad practice to define a directory output. When I see a process with a directory output, I know nothing about the file structure within that directory. So I think we should either not support directory outputs or support them correctly. Currently we just support it incorrectly:
The performance implications you mention only happen if the user decides to use directory outputs.. I think we could simply document this point and encourage users to use flat files as much as possible. Consider also that we already walk the work directory to collect the output files: which has similar performance issues when there are nested directories |
|
Indeed, when we publish a directory we are already traversing it: This PR is just doing the traversal earlier |
|
@robsyme what do you think about what I said about publishing a symlink directory? It is certainly a change in behavior, but I'm wondering if it's more desirable to symlink all of the files rather than only symlink the directory. If not, I can modify the logic to only traverse the directory when the publish mode isn't symlink, but I feel like that would make the behavior inconsistent based on the publish mode. |
|
Thanks for moving this forward Ben, and apologies for my slow response times! My feeling is that we're stuck between a rock and a hard place, but inconsistencies between publishDir modes is the worse outcome. As for
I think it's inevitable that people will want to publish results that contain directories (many tools have complex/nested outputs). |
|
Indeed, even though a directory output is harder to reason about, it seems to be convenient for the first iteration of a Nextflow process for certain tools |
|
I think this is solved by #5005 |
|
The fusion symlinks are no longer an issue, but there are other reasons to merge this:
|
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
|
@bentsherman What's the status of this fix? Encountering is systematically with the https://github.com/nf-core/spatialvi pipeline. Alternatively, any known workaround? |
|
The PR works but I'm not sure if we still want to merge it or not. We might take a different approach. I'm not sure why your issue is happening because you don't need to create directories beforehand in S3. It could be a different bug in Nextflow. Feel free to submit an issue here if you aren't able to solve it. A workaround to publishing a directory is to have a tuple of paths for each individual file in the directory. But this gets very impractical if there are many files |
5a93547 to
27345a6
Compare
b4b321e to
069653d
Compare
d9fa5cd to
d752bc2
Compare
Close #3372
I just changed the publish dir to explicitly publish each file in a directory. I think it means that if the publish mode is symlink, each file in the directory will be symlinked instead of just the directory.
@robsyme do we just need to add a checksum comparison to S3 copies and uploads? I think I can just rip the code from this PR #3802