Update supported file list to match new Archivematica processing #37
This commit updates the supported file list that these tests check conversions against. It removes the intermediate file types that the legacy process generated but the new Archivematica pipeline does not. It also removes some file types that we don't officially support, increases the time the tests wait for files to finish processing, and changes how the tests determine whether a file is done processing to reflect the behavior of the new pipeline.
Pull request overview
This PR updates the functional upload/conversion test harness to align with the new Archivematica processing pipeline by updating the supported/expected output formats list, increasing polling timeout, and changing “processing complete” detection to include expected derivative formats.
Changes:
- Update `supported_file_types.csv` to remove legacy intermediate/unsupported conversions and reflect new expected outputs.
- Add expected-format loading and pass expected formats into upload polling to detect completion based on produced formats.
- Increase upload polling timeout from 60s to 300s.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| permanent_upload/validation.py | Refactors dataset loading and adds load_expected_formats() used to drive “expected derivative formats” checks. |
| permanent_upload/permanent.py | Updates post-upload polling to wait for both OK status and presence of expected formats. |
| permanent_upload/data/supported_file_types.csv | Updates expected conversion outputs to match the new Archivematica pipeline. |
| permanent_upload/main.py | Wires expected formats into upload polling and increases timeout for the new pipeline. |
Comments suppressed due to low confidence (1)
permanent_upload/main.py:54
- The f-string building the base URL uses double quotes inside the `{...}` expression while the f-string itself is also delimited by double quotes. This pattern is a syntax error in f-strings on Python versions before 3.12 (e.g., similar to `f"{foo[\"bar\"]}"`); Python 3.12 and later allow the enclosing quote character to be reused inside the expression. Using single quotes inside the expression avoids the parse error on older versions.
timeout = 300
print(f"Current timeout is {timeout} seconds")
api = PermanentAPI(
f"https://{"app" if environment == "www" else "app." + environment}.permanent.org"
)
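A minimal sketch of the suggested fix, assuming `environment` is a plain string such as `"www"` or an environment name. The helper name `build_base_url` is hypothetical and is not part of the PR; it only isolates the f-string so the quoting can be seen on its own.

```python
def build_base_url(environment: str) -> str:
    """Return the app base URL for an environment (hypothetical helper)."""
    # Single quotes inside the braces keep this line valid on Python
    # versions before 3.12, where the outer quote cannot be reused.
    return f"https://{'app' if environment == 'www' else 'app.' + environment}.permanent.org"


print(build_base_url("www"))
print(build_base_url("dev"))
```

The expression itself is unchanged; only the inner quote character differs from the flagged line.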

def validate_supported_types(results, data_file="data/supported_file_types.csv"):
    validation_dataset = _load_validation_dataset(data_file)
    for result in results:
        extension = result[0].split(".")[-1]
        assert validation_dataset[extension]

extension = os.path.splitext(f)[1].lstrip(".")
expected_formats = formats_by_extension.get(extension, set())
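The two snippets above derive the extension in different ways (`split(".")[-1]` and `os.path.splitext`). The sketch below compares them on a few made-up filenames; the helper names are hypothetical. The two agree for ordinary names but differ when a name has no dot: `split` returns the whole name, while `splitext` returns an empty string.

```python
import os


def ext_by_split(filename: str) -> str:
    """Extension as the text after the last dot (whole name if no dot)."""
    return filename.split(".")[-1]


def ext_by_splitext(filename: str) -> str:
    """Extension via os.path.splitext (empty string if no extension)."""
    return os.path.splitext(filename)[1].lstrip(".")


# Illustrative filenames only; they are not taken from the test data.
for name in ("photo.jpg", "archive.tar.gz", "README"):
    print(name, repr(ext_by_split(name)), repr(ext_by_splitext(name)))
```

With the `formats_by_extension.get(extension, set())` lookup, an empty extension simply yields an empty expected set rather than a lookup on the full filename.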

def load_expected_formats(data_file="data/supported_file_types.csv"):
    return {
        ext: set(row["conversions"].split(","))
        for ext, row in _load_validation_dataset(data_file).items()
    }
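A self-contained sketch of the same idea, reading from a string instead of the real file. The column name `extension`, the function name, and the sample rows are assumptions made for illustration; only the `conversions` column appears in the snippet above. One detail worth noting is that `"".split(",")` returns `[""]`, so a row with no conversions would otherwise produce the non-empty set `{""}`; this sketch filters empty entries out.

```python
import csv
import io


def load_expected_formats_from_text(csv_text: str) -> dict[str, set[str]]:
    """Map each extension to the set of formats expected after processing."""
    formats = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop empty entries so a blank "conversions" cell gives an empty set.
        conversions = {c.strip() for c in row["conversions"].split(",") if c.strip()}
        formats[row["extension"]] = conversions
    return formats


# Made-up sample data; the real CSV layout may differ.
SAMPLE = 'extension,conversions\njpg,jpg\ndocx,"docx,pdf"\ntxt,\n'
expected = load_expected_formats_from_text(SAMPLE)
print(expected)
```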

actual_formats = {
    vo["type"].split(".")[-1] for vo in (record.get("FileVOs") or [])
}
processing_complete = bool(expected_formats) and expected_formats.issubset(
    actual_formats
)
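The check above can be wrapped in a small function to make its edge cases easier to see. This is a sketch only: the function name is hypothetical, and the `type` strings in the usage lines are invented placeholders, not values known to come from the API.

```python
def is_processing_complete(record: dict, expected_formats: set) -> bool:
    """True when every expected format appears among the record's files."""
    # A missing or None "FileVOs" entry is treated as an empty list.
    actual_formats = {
        vo["type"].split(".")[-1] for vo in (record.get("FileVOs") or [])
    }
    # An empty expectation never counts as complete, mirroring the bool() guard.
    return bool(expected_formats) and expected_formats.issubset(actual_formats)


# Placeholder records for illustration.
partial = {"FileVOs": [{"type": "a.b.docx"}]}
full = {"FileVOs": [{"type": "a.b.docx"}, {"type": "a.b.pdf"}]}
print(is_processing_complete(partial, {"docx", "pdf"}))
print(is_processing_complete(full, {"docx", "pdf"}))
```

The `bool(expected_formats)` guard matters because an empty set is a subset of every set, so without it an extension with no expected formats would be reported as complete immediately.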
This really is all about correctness: the polling changes are necessary to detect correctly when a file has finished processing. The previous logic concluded that processing was complete too early, and then all the tests failed because the converted files were not actually present yet.
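The behavior described in this reply can be sketched as a generic polling loop. Everything here is an assumption for illustration: the function name, its parameters, and the shape of the canned records are hypothetical, and the real harness may be structured differently. The point is that the loop only stops once the completion predicate holds, not merely when the status looks fine.

```python
import time


def poll_until(fetch, is_done, timeout=300, interval=5,
               clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until is_done(result) is true or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"processing not complete after {timeout} seconds")
        sleep(interval)


# Canned responses standing in for the API (hypothetical data).
responses = iter([
    {"status": "ok", "formats": {"docx"}},          # status fine, derivative missing
    {"status": "ok", "formats": {"docx", "pdf"}},   # all expected formats present
])
record = poll_until(
    fetch=lambda: next(responses),
    is_done=lambda r: r["status"] == "ok" and {"docx", "pdf"} <= r["formats"],
    interval=0,
)
print(record["formats"])
```

A status-only predicate would have returned the first canned record, which is the "too early" case the reply describes.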