-
Notifications
You must be signed in to change notification settings - Fork 93
feat(processing): add unblob dedicated temporary directory handling #1222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
python/unblob/processing.py
Outdated
| """Set environment variables so all subprocesses and handlers use our temp dir""" | ||
| for var in ("TMP", "TMPDIR", "TEMP", "TEMPDIR"): | ||
| os.environ[var] = self.tmp_dir.as_posix() | ||
| atexit.register(self._cleanup_tmp_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic for programs using unblob as library, as these variables remain set in the hosting process's environment. We should set and reset them in a tighter scope.
For example, Processor.process_task looks like a fine candidate, as it by default will spawn child processes, not poisoning the caller's environment at all (given they use process_num > 1).
Another possibility is doing this at process_file level, but it still affects the hosting process unfortunately, but at least we can restore the variables upon returning from the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also implemented tests in test_necessary_resources_can_be_created_in_sandbox
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see the original comment being resolved in the code - below I can find the env variables being set, but not restored, and here atexit.register still being used.
Have you pushed your changes?
I am also not sure how it would landlock to a /tmp/call-specific-tempdir if used as a library on the second call with another ExtractionConfig.
One solution I can think of is process_file forking in case of landlock, setting the environment variables in the fork, and cleaning up the temporary directory when the fork exits. Since the environment variables would be set in the child, they need no restoring.
(actually multiprocessing should be used somehow, as https://docs.python.org/3/library/os.html#os.fork has many problems and known not to work on macos)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@e3krisztian do I still need to do something here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implicit atexit-cleanup pattern still bothers me, and potentially anger provoking when the directory is unintentionally removed.
I would prefer a breaking change, except that is ruled out.
There is another solution: we could make the default '/tmp' (or the first value set for any of the TMP... variables), which should work backward compatibly. The only problem left is the cleanup, which would attempt to remove '/tmp' now.
For cleanup we could introduce another temp-dir for sub-processes:
- make a default for
config.tmp_dirbased on TMP... variables with fallback to/tmp - use another sub-directory (created with
mkdtemp(dir=config.tmp_dir)) ofconfig.tmp_diras temporary directory for running tasks, immediately cleaning up the sub-directory when exiting the task (we could add both themkdtempand the cleanup to thetmp_dircontext manager) - do not clean up
config.tmp_dirhere: it is pre-existing, should be cleaned up at caller site if set explicitly (e.g. inunblob.cli.cli)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vlaci raised a concern, that we should consider doing this only in a more limited context inside Command extractors, as only external commands use tempfiles.
d6d9358 to
c5a7634
Compare
Move temporary directory creation and cleanup from ExtractionConfig to CLI to make the lifecycle explicit and prevent unintended cleanup. The implicit atexit cleanup in ExtractionConfig could unexpectedly remove user-provided directories. This change moves that responsibility to the CLI layer where it's visible and under explicit control. Changes: - ExtractionConfig.tmp_dir defaults to system temp (no auto-cleanup) - Renamed tmp_dir() context manager to task_tmp_dir() - task_tmp_dir() creates and cleans up task-specific subdirectories - CLI explicitly creates unblob temp directory and manages cleanup - Tests updated for proper sandbox isolation Result: Library code has no implicit cleanup. CLI owns the lifecycle. Users can safely provide their own tmp_dir without cleanup surprises.
1fb0171 to
50ff71a
Compare
Draft implementation for #1216
It fails on
test_archive_successbecause we cannot instantiateExtractionConfigwith an arbitrary tmp directory right now.Some unit testing is also required for proper coverage.