Rewrite Script to check for corrupt or empty PDFs #22

@PascalEgn

Description

We should improve/rewrite the script to check for corrupt or empty PDFs to prepare it for the migration to Airflow.

This includes rethinking which function parameters make sense. Some ideas:

input_url:

Reads the existing BOITE_O0XXX files in the shared CERNBox directory and checks whether the given file numbers contain corrupt or empty PDFs on S3. Example URL with some files: https://cernbox.cern.ch/s/QslvWRIPsBcDAOK

List of Numbers:

A list of BOITE file numbers which should be checked on S3.

Range of Numbers:

Start and end of a number range to check on S3. For example, 205..300 would check all folders in this range (205, 206, 207, ..., 300).
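A minimal sketch of how a range such as 205..300 could be expanded into individual numbers. The function name `parse_number_range` is hypothetical, and treating both ends as inclusive follows the example above.

```python
from typing import List


def parse_number_range(spec: str) -> List[int]:
    """Expand a 'START..END' string into an inclusive list of integers.

    Hypothetical helper; the '..' separator is taken from the example
    in this issue.
    """
    start_text, separator, end_text = spec.partition("..")
    if not separator:
        raise ValueError(f"expected 'START..END', got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"range start {start} is greater than end {end}")
    # range() excludes its stop value, so add 1 to include the end.
    return list(range(start, end + 1))
```

For instance, `parse_number_range("205..300")` would return the 96 numbers from 205 to 300 inclusive.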

bucket_name:

Name of the S3 bucket to be checked.

base_prefix:

S3 path to the PDF files to be checked (e.g. raw/PDF/ or raw/CORRECTIONS_2/PDF_OCR/).
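One possible way to scan everything under bucket_name and base_prefix is sketched below. It assumes a client that behaves like `boto3.client("s3")` and uses only its `list_objects_v2` paginator and `get_object` with a `Range` header. The function names and the "empty"/"corrupt"/"ok" labels are hypothetical. The header test is only a cheap heuristic: it catches zero-byte files and files that do not start with `%PDF-`, but a truncated or internally damaged PDF would still need a real PDF parser.

```python
from typing import Iterator, Tuple

PDF_MAGIC = b"%PDF-"  # every well-formed PDF starts with this marker


def classify_pdf_head(size: int, head: bytes) -> str:
    """Return 'empty', 'corrupt', or 'ok' from the object size and first bytes."""
    if size == 0:
        return "empty"
    if not head.startswith(PDF_MAGIC):
        return "corrupt"
    return "ok"


def scan_prefix(s3_client, bucket_name: str, prefix: str) -> Iterator[Tuple[str, str]]:
    """Yield (key, status) for every .pdf object under the given prefix.

    s3_client is assumed to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            key, size = obj["Key"], obj["Size"]
            if not key.lower().endswith(".pdf"):
                continue
            head = b""
            if size > 0:
                # Fetch only the first five bytes instead of the whole file.
                response = s3_client.get_object(
                    Bucket=bucket_name, Key=key, Range="bytes=0-4"
                )
                head = response["Body"].read()
            yield key, classify_pdf_head(size, head)
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves credential handling to the caller, which may fit an Airflow task better than creating the client inside the function.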

output_url:

CERNBox URL to which the generated log file should be uploaded. Example folder where files can be uploaded: https://cernbox.cern.ch/s/OBzMIMo6fDb7gCc
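The log format is not specified in this issue. As one possible shape, the sketch below renders check results as CSV text that could then be uploaded to output_url. The function name `build_log`, the `(key, status)` input pairs, and the column names are all assumptions. The upload step itself is left out because the CERNBox upload mechanism still has to be worked out.

```python
import csv
import io
from typing import Iterable, Tuple


def build_log(results: Iterable[Tuple[str, str]]) -> str:
    """Render (key, status) pairs as CSV text, keeping only problem files.

    Hypothetical format: a header row followed by one row per file
    whose status is anything other than 'ok'.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "status"])
    for key, status in results:
        if status != "ok":
            writer.writerow([key, status])
    return buffer.getvalue()
```

Returning a string rather than writing to disk keeps the function independent of where the log ends up.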

Work involved

  • Think about what parameters make sense
  • Establish connection to view and upload files from and to CERNBox
  • Implement useful parameters into the script
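To make the parameter discussion concrete, the ideas above could be collected in a single container, sketched below. The class name `CheckParams` and the field names `numbers` and `number_range` are hypothetical. The rule that exactly one of the three input selectors must be given is an assumption, not something decided in this issue.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CheckParams:
    """Hypothetical parameter set for the PDF check."""

    bucket_name: str
    base_prefix: str
    output_url: str
    input_url: Optional[str] = None
    numbers: Optional[Sequence[int]] = None
    number_range: Optional[Tuple[int, int]] = None

    def __post_init__(self) -> None:
        # Assumption: the three ways of selecting BOITE numbers are
        # mutually exclusive, so exactly one must be provided.
        selectors = [self.input_url, self.numbers, self.number_range]
        given = [s for s in selectors if s is not None]
        if len(given) != 1:
            raise ValueError(
                "provide exactly one of input_url, numbers, number_range"
            )
```

A call such as `CheckParams(bucket_name="my-bucket", base_prefix="raw/PDF/", output_url="...", number_range=(205, 300))` would pass validation, while omitting all three selectors would raise `ValueError`. The bucket name here is a placeholder.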

Acceptance criteria

Screenshots(Optional)

Metadata

Labels

File Import Project: This task is related to the file import project of digitization
