support docx files ingestion - initial happy path #757
Open
Niharika0306 wants to merge 3 commits into
Signed-off-by: Niharika Gurram <niharika.gurram1@ibm.com>
9bfb856 to 3dbd5d0
Member
@Niharika0306 Please fix the unit tests (UTs).
Member
Never mind, the PR that fixes the UTs is still not merged!
        raise ValueError(f"File has .docx extension but invalid DOCX format: {filename}")

# Keep old function for backward compatibility (deprecated)
def validate_pdf_file(filename: str, content) -> None:
Member
This can be removed if you are replacing it with the new validate_document_file.
        raise ValueError(f"File is empty: {filename}")

    # Validate extension and format
    allowed_extensions = {'.pdf', '.docx'}
Member
Extension validation should be done before validating the content.
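A minimal sketch of the ordering suggested above: check the extension first, then emptiness, then the format. It assumes `content` is raw `bytes` and uses a leading-bytes check (`%PDF-` for PDF, the ZIP local-file header for DOCX) as a cheap stand-in for whatever format check the PR actually performs. The "Unsupported file extension" and "invalid PDF format" messages and the constant names are hypothetical; the other two messages are taken from the diff.

```python
from pathlib import Path

# Extensions taken from the diff; the constant names are illustrative.
ALLOWED_EXTENSIONS = {".pdf", ".docx"}
PDF_MAGIC = b"%PDF-"       # PDF files start with this header
ZIP_MAGIC = b"PK\x03\x04"  # DOCX is a ZIP container; this is the ZIP local-file header


def validate_document_file(filename: str, content: bytes) -> None:
    """Raise ValueError if the file is not an acceptable PDF or DOCX upload."""
    # 1. Extension first: cheap, and rejects unsupported types before touching the bytes.
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {filename}")

    # 2. Then reject empty uploads.
    if not content:
        raise ValueError(f"File is empty: {filename}")

    # 3. Finally check that the leading bytes match the claimed extension.
    #    This is a heuristic only: any ZIP archive passes the .docx check.
    if ext == ".pdf" and not content.startswith(PDF_MAGIC):
        raise ValueError(f"File has .pdf extension but invalid PDF format: {filename}")
    if ext == ".docx" and not content.startswith(ZIP_MAGIC):
        raise ValueError(f"File has .docx extension but invalid DOCX format: {filename}")
```

With this ordering, a `.txt` upload fails on its extension even when it is also empty, so the caller always sees the most basic problem first.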
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
        # InputFormat.DOCX: WordFormatOption(pipeline_options=word_pipeline_options)
    # # Configure Word/DOCX pipeline options (use base PipelineOptions)
    # word_pipeline_options = PipelineOptions()
    # if artifacts_path:
    #     word_pipeline_options.artifacts_path = artifacts_path
Member
If we comment this out, how would it work with offline artifacts?
    # Only set artifacts_path if DOCLING_MODELS_PATH environment variable is set
    # Get artifacts path if set
    docling_models_path = os.environ.get('DOCLING_MODELS_PATH')
    artifacts_path = None
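The environment lookup in this snippet could be factored into a small helper so that the PDF and DOCX pipeline options share one source for the offline artifacts path. This is a sketch only: `resolve_artifacts_path` and its `env` parameter are hypothetical names, and how the returned path is passed to the converter is deliberately left out because it depends on code not shown here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_artifacts_path(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the offline models directory, or None if DOCLING_MODELS_PATH is unset or empty."""
    # Allow a mapping to be injected for testing; default to the real environment.
    env = os.environ if env is None else env
    value = env.get("DOCLING_MODELS_PATH")
    if not value:
        return None
    return Path(value)
```

A caller would then set `artifacts_path` on each pipeline options object only when the helper returns a path, which matches the "only set if the variable is set" intent of the comments in the diff.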
This PR includes initial changes to support digitisation and ingestion of DOCX files. The following items are still to be fixed: