Fix/change safe defaults #80

Vbhatt03 · 2025-12-25T08:40:50Z

Changed the default backend to docling and increased max_batch_files from 1 to 5.

Copilot

Pull request overview

This PR updates the default backend configuration for document processing to use "docling" and increases batch processing capacity. The changes aim to improve throughput by processing more files per batch while simplifying backend selection.

Key Changes:

Changed default backend from auto-selection (safe/docling based on GPU needs) to always use "docling"
Increased max_batch_files from 1 to 5 for docling backend
Version bump from 0.1.2 to 0.1.3
Removed repro_rapidocr_onnx and deepseek-ocr directories

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.


File	Description
src/glossapi/corpus/phase_extract.py	Updated backend resolution logic to default to "docling" and increased batch size from 1 to 5 files
pyproject.toml	Version bump from 0.1.2 to 0.1.3
.gitignore	Added .venv and excluded deepseek-ocr/ and repro_rapidocr_onnx/ directories
repro_rapidocr_onnx/*	Deleted entire directory containing ONNX-related scripts and documentation
deepseek-ocr/*	Deleted entire directory containing DeepSeek OCR runner and AWS deployment files


Copilot · 2025-12-25T08:45:57Z

src/glossapi/corpus/phase_extract.py


        # Configure batch/backend policy based on resolved choice
        if backend_choice == "docling":
            # Keep docling runs conservative: process one document per batch for stability


The comment on line 109 is now inconsistent with the code. The comment states "Keep docling runs conservative: process one document per batch for stability" but the code now sets max_batch_files=5, not 1. Either update the comment to reflect the new batching strategy or reconsider if this comment is still accurate.
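For context on what the new setting implies, here is a minimal sketch of how a `max_batch_files` limit might group input files. `iter_batches` is a hypothetical helper written for illustration, not code from `phase_extract.py`, and it assumes that `max_batch_files` is simply an upper bound on the number of documents handed to the backend per batch.

```python
from typing import Iterator, List, Sequence


def iter_batches(paths: Sequence[str], max_batch_files: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `max_batch_files` paths.

    Hypothetical helper: assumes max_batch_files is an upper bound on
    the number of documents processed together in one batch.
    """
    if max_batch_files < 1:
        raise ValueError("max_batch_files must be at least 1")
    for start in range(0, len(paths), max_batch_files):
        # Slice never overruns: the final batch may be shorter.
        yield list(paths[start:start + max_batch_files])


if __name__ == "__main__":
    files = [f"doc{i}.pdf" for i in range(7)]
    # Old setting: one document per batch.
    print([len(b) for b in iter_batches(files, 1)])
    # New setting from this PR: up to five documents per batch.
    print([len(b) for b in iter_batches(files, 5)])
```

Under that assumption, a comment such as "process up to five documents per batch" would describe the new behaviour more accurately than the existing "one document per batch" wording.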

Copilot · 2025-12-25T08:45:58Z

src/glossapi/corpus/phase_extract.py

        needs_gpu = bool(force_ocr or formula_enrichment or code_enrichment)
        if choice == "auto":
-            choice = "docling" if needs_gpu else "safe"
+            choice = "docling" #removed safe option


The change makes "docling" the default for all "auto" cases, regardless of whether GPU features are needed. Previously, when choice was "auto", the code would select "docling" if needs_gpu was true, otherwise "safe". Now it always selects "docling". This may impact users without GPU capabilities who were relying on the automatic fallback to "safe" mode. Consider whether this change could break existing workflows or if there should be additional logic to check GPU availability before defaulting to "docling".

Suggested change

-            choice = "docling" #removed safe option
+            # When auto is requested, prefer 'docling' only if GPU‑dependent features are needed;
+            # otherwise default to the safer CPU‑oriented backend.
+            choice = "docling" if needs_gpu else "safe"
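The alternative raised in this comment, checking GPU availability before defaulting to "docling", could look roughly like the sketch below. `resolve_backend` and its `gpu_available` parameter are hypothetical names introduced for illustration; the actual function in `phase_extract.py` may be structured differently, and how GPU availability would be detected is deliberately left to the caller.

```python
VALID_BACKENDS = {"auto", "safe", "docling"}


def resolve_backend(choice: str, *, needs_gpu: bool, gpu_available: bool) -> str:
    """Resolve a requested backend name to a concrete backend.

    Hypothetical sketch: explicit choices are honoured as given, and
    "auto" falls back to "safe" only when no GPU feature is requested
    and no GPU is available.
    """
    choice = choice.lower()
    if choice not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {choice!r}")
    if choice != "auto":
        # An explicit "safe" or "docling" request is never overridden.
        return choice
    if needs_gpu:
        # OCR or enrichment features were requested, so docling is required.
        return "docling"
    # No GPU-dependent feature requested: prefer docling only if a GPU exists.
    return "docling" if gpu_available else "safe"


if __name__ == "__main__":
    print(resolve_backend("auto", needs_gpu=False, gpu_available=False))
    print(resolve_backend("auto", needs_gpu=False, gpu_available=True))
```

This keeps "docling" as the default on GPU machines, which is the intent of the PR, while preserving the "safe" fallback for CPU-only users who rely on "auto".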

fffoivos and others added 2 commits November 29, 2025 20:37

Sync master with development (omit experimental folders); bump 0.1.3

46f6889

Changed from 'safe' to 'docling' and increased batching

2e828cb

Copilot AI review requested due to automatic review settings December 25, 2025 08:40

Copilot started reviewing on behalf of Vbhatt03 December 25, 2025 08:41
