DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561
DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561hsinfang wants to merge 4 commits into
Conversation
d4f69bd to
4bf23e3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #561 +/- ##
==========================================
+ Coverage 88.77% 88.80% +0.02%
==========================================
Files 159 160 +1
Lines 22053 22217 +164
Branches 2623 2631 +8
==========================================
+ Hits 19578 19729 +151
- Misses 1839 1849 +10
- Partials 636 639 +3 ☔ View full report in Codecov by Sentry. |
4bf23e3 to
d6ef7aa
Compare
| return zstandard.ZstdCompressionDict(b"") | ||
| self.comms.log.info("Training compression dictionary.") | ||
| training_inputs: list[bytes] = [] | ||
| training_inputs: list[bytes | bytearray | memoryview[int]] = [] |
There was a problem hiding this comment.
I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.
There was a problem hiding this comment.
I took it from mypy's suggestion directly, but this has now been fixed by ac0bcb1
| self.butler = butler.clone(collections=input_collections) | ||
| self.output_run = output_run | ||
| self.skip_existing_in = skip_existing_in | ||
| self.ignore_metadata_for: frozenset[str] = frozenset(ignore_metadata_for) |
There was a problem hiding this comment.
We probably want to raise here if ignore_metadata_for is not empty but skip_existing_in empty (since then ignore_metadata_for would have no effect).
7481f5b to
cf8760b
Compare
The skip_existing_in behavior of QuantumGraphBuilder was previously only covered through test_separable_pipeline_executor.py, where SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder. No tests exercised the builder directly at the unit level.
cf8760b to
3e4cff4
Compare
Daytime AP runs against data produced by Prompt Processing, which does not retain all output types. With --skip-existing-in, tasks whose metadata exists are skipped even when their outputs are absent, confusing downstream tasks. For tasks listed in the new retry_missing_outputs_for, the completion signal is changed from task metadata to all non-metadata, non-log outputs existing in skip_existing_in. If any are absent the quantum runs and regenerates them. When all outputs are present the quantum is skipped, and discard_output_in_the_way removes any copies already in output_run, so the behavior is identical to the standard skip path in that case. When outputs are absent, the quantum is not skipped and they cannot be "in the way" in a fresh output_run. _skip_quantum_if_metadata_exists is renamed to _skip_quantum_if_done, because the old name baked in the implementation detail (checking metadata). After retry_missing_outputs_for the function can also skip based on regular outputs, so the name no longer described the behavior.
SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option and get tested there.
3e4cff4 to
52abd2a
Compare
Checklist
package-docs builddoc/changes