Description
We should improve/rewrite the script to check for corrupt or empty PDFs to prepare it for the migration to Airflow.
This includes rethinking which function parameters make sense. Some ideas:
input_url:
Reads existing BOITE_O0XXX files in the shared CERNBox directory and checks whether the given file numbers contain corrupt/empty PDFs on S3. Example URL with some files: https://cernbox.cern.ch/s/QslvWRIPsBcDAOK
List of Numbers:
A list of BOITE file numbers which should be checked on S3.
Range of Numbers:
Start and end of a number range to check on S3. For example, 205..300 would check every folder in this range, inclusive (205, 206, 207, ..., 300).
bucket_name:
Name of the S3 bucket to be checked
base_prefix:
S3 path to the PDF files to be checked (e.g. raw/PDF/ or raw/CORRECTIONS_2/PDF_OCR/).
output_url:
CERNBox URL to which the generated log file should be uploaded. Example folder where files can be uploaded: https://cernbox.cern.ch/s/OBzMIMo6fDb7gCc
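
To make the discussion more concrete, here is a minimal sketch of how the parameters listed above could be grouped and how the list and range of numbers could be merged into one set of folders to check. The names `CheckParams`, `parse_range`, `resolve_numbers`, `numbers` and `number_range` are hypothetical and not taken from the existing script. The `start..end` syntax follows the 205..300 example, and treating the range as inclusive is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CheckParams:
    """Hypothetical container for the proposed parameters."""
    bucket_name: str                       # S3 bucket to be checked
    base_prefix: str                       # e.g. "raw/PDF/"
    input_url: Optional[str] = None        # CERNBox share with BOITE files
    numbers: List[int] = field(default_factory=list)  # explicit BOITE numbers
    number_range: Optional[Tuple[int, int]] = None    # (start, end), inclusive
    output_url: Optional[str] = None       # CERNBox share for the log file


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a 'start..end' string such as '205..300' into (205, 300)."""
    start_text, sep, end_text = spec.partition("..")
    if not sep:
        raise ValueError(f"expected 'start..end', got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"range start {start} is greater than end {end}")
    return start, end


def resolve_numbers(params: CheckParams) -> List[int]:
    """Merge the explicit list and the inclusive range, sorted and de-duplicated."""
    selected = set(params.numbers)
    if params.number_range is not None:
        start, end = params.number_range
        selected.update(range(start, end + 1))
    return sorted(selected)
```

With this shape, a caller could pass a list, a range, or both, and the checking code would only ever see one list of numbers. How `input_url` should feed into that list is left open here.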
Work involved
- Think about what parameters make sense
- Establish a connection to CERNBox to view files and upload files
- Implement useful parameters into the script
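
For reference while rewriting, the core check could stay independent of S3 and CERNBox by working on raw bytes. The sketch below is a cheap heuristic and not the existing script's logic, which is not shown in this issue. The function name `classify_pdf` is hypothetical, "empty" is assumed to mean a zero-byte object, and "corrupt" is assumed to mean a missing `%PDF-` header near the start or a missing `%%EOF` marker near the end. A file that passes may still be unreadable, so a PDF parsing library would be needed for a stricter check.

```python
def classify_pdf(data: bytes) -> str:
    """Return 'empty', 'corrupt', or 'ok' based on a cheap structural check."""
    # Assumption: "empty" means the object has no bytes at all.
    if len(data) == 0:
        return "empty"
    # A PDF normally starts with a "%PDF-" header; look in the first 1024 bytes
    # to tolerate a small amount of leading data.
    if b"%PDF-" not in data[:1024]:
        return "corrupt"
    # A complete PDF normally ends with an "%%EOF" marker; a missing marker
    # usually indicates a truncated upload.
    if b"%%EOF" not in data[-1024:]:
        return "corrupt"
    return "ok"
```

Keeping the check as a pure function would make it easy to unit test and to call from an Airflow task once the object bytes have been fetched from the bucket.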