support docx files ingestion - initial happy path #757

Open
Niharika0306 wants to merge 3 commits into IBM:main from Niharika0306:word_support

Conversation

@Niharika0306
Contributor

This PR includes initial changes to support digitisation and ingestion of docx files. The following items still need to be fixed:

  1. page_count and provenance are missing for docx files
  2. table captions are missing
  3. TOC and header level matching

Signed-off-by: Niharika Gurram <niharika.gurram1@ibm.com>
@dharaneeshvrd
Member

@Niharika0306 Please fix the UTs

@dharaneeshvrd
Member

Never mind, the PR that fixes the UTs is still not merged!

raise ValueError(f"File has .docx extension but invalid DOCX format: {filename}")

# Keep old function for backward compatibility (deprecated)
def validate_pdf_file(filename: str, content) -> None:
Member

This can be removed if you are replacing it with the new validate_document_file.

raise ValueError(f"File is empty: {filename}")

# Validate extension and format
allowed_extensions = {'.pdf', '.docx'}
Member

Extension validation should be done before validating the content.
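
To illustrate the ordering suggested here, a minimal sketch follows. It is not the PR's implementation: the `bytes` type for `content`, the magic-byte checks, the constant names, and the "Unsupported file extension" and PDF error messages are all assumptions made for the example. Only the function name, the allowed extensions, and the empty-file and DOCX messages come from the diff.

```python
import os

# Taken from the diff: the two extensions the PR accepts.
ALLOWED_EXTENSIONS = {".pdf", ".docx"}

# Assumed format checks (not from the PR): PDFs start with "%PDF-",
# and DOCX files are ZIP archives, which start with "PK\x03\x04".
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"


def validate_document_file(filename: str, content: bytes) -> None:
    """Raise ValueError if the file is not an acceptable PDF or DOCX."""
    # 1. Check the extension first: it is cheap and needs no content.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        # Hypothetical message; the PR's wording is not shown.
        raise ValueError(f"Unsupported file extension: {filename}")

    # 2. Only then look at the content.
    if not content:
        raise ValueError(f"File is empty: {filename}")
    if ext == ".pdf" and not content.startswith(PDF_MAGIC):
        # Hypothetical message, mirroring the DOCX one from the diff.
        raise ValueError(f"File has .pdf extension but invalid PDF format: {filename}")
    if ext == ".docx" and not content.startswith(ZIP_MAGIC):
        raise ValueError(f"File has .docx extension but invalid DOCX format: {filename}")
```

With this ordering, an unsupported file such as `notes.txt` is rejected on its extension even when it is also empty, so the caller gets the more specific error.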

format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
format_options={
InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
# InputFormat.DOCX: WordFormatOption(pipeline_options=word_pipeline_options)
Member

Why is this commented out?

# # Configure Word/DOCX pipeline options (use base PipelineOptions)
# word_pipeline_options = PipelineOptions()
# if artifacts_path:
# word_pipeline_options.artifacts_path = artifacts_path
Member

If this is commented out, how would it work with offline artifacts?

# Only set artifacts_path if DOCLING_MODELS_PATH environment variable is set
# Get artifacts path if set
docling_models_path = os.environ.get('DOCLING_MODELS_PATH')
artifacts_path = None
Member

What is this for?
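
For context, one plausible reading of the snippet is sketched below: the path to pre-downloaded models is taken from `DOCLING_MODELS_PATH` when it is set, and left as `None` otherwise so that the pipeline options keep their default. The helper name `resolve_artifacts_path` and its `env` parameter are hypothetical and not part of the PR; treating an empty value as unset is also an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_artifacts_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the offline model directory, or None if it is not configured.

    Hypothetical helper: the PR reads the variable inline instead.
    """
    raw = env.get("DOCLING_MODELS_PATH")
    if not raw:
        # Unset or empty: leave artifacts_path as None (assumed behaviour).
        return None
    return Path(raw)


artifacts_path = resolve_artifacts_path()
# The caller would then set pipeline_options.artifacts_path only when
# artifacts_path is not None, as the "if artifacts_path:" check in the diff does.
```

Passing the environment as a parameter keeps the lookup easy to test without modifying `os.environ`.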
