Skip to content

Conversation

@Vbhatt03
Copy link

Changed the default to docling and also increased the max_batch_files to 5

Copilot AI review requested due to automatic review settings December 25, 2025 08:40
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the default backend configuration for document processing to use "docling" and increases batch processing capacity. The changes aim to improve throughput by processing more files per batch while simplifying backend selection.

Key Changes:

  • Changed default backend from auto-selection (safe/docling based on GPU needs) to always use "docling"
  • Increased max_batch_files from 1 to 5 for docling backend
  • Version bump from 0.1.2 to 0.1.3
  • Removed repro_rapidocr_onnx and deepseek-ocr directories

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/glossapi/corpus/phase_extract.py Updated backend resolution logic to default to "docling" and increased batch size from 1 to 5 files
pyproject.toml Version bump from 0.1.2 to 0.1.3
.gitignore Added .venv and excluded deepseek-ocr/ and repro_rapidocr_onnx/ directories
repro_rapidocr_onnx/* Deleted entire directory containing ONNX-related scripts and documentation
deepseek-ocr/* Deleted entire directory containing DeepSeek OCR runner and AWS deployment files

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


# Configure batch/backend policy based on resolved choice
if backend_choice == "docling":
# Keep docling runs conservative: process one document per batch for stability
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 109 is now inconsistent with the code. The comment states "Keep docling runs conservative: process one document per batch for stability" but the code now sets max_batch_files=5, not 1. Either update the comment to reflect the new batching strategy or reconsider if this comment is still accurate.

Copilot uses AI. Check for mistakes.
needs_gpu = bool(force_ocr or formula_enrichment or code_enrichment)
if choice == "auto":
choice = "docling" if needs_gpu else "safe"
choice = "docling" #removed safe option
Copy link

Copilot AI Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes "docling" the default for all "auto" cases, regardless of whether GPU features are needed. Previously, when choice was "auto", the code would select "docling" if needs_gpu was true, otherwise "safe". Now it always selects "docling". This may impact users without GPU capabilities who were relying on the automatic fallback to "safe" mode. Consider whether this change could break existing workflows or if there should be additional logic to check GPU availability before defaulting to "docling".

Suggested change
choice = "docling" #removed safe option
# When auto is requested, prefer 'docling' only if GPU‑dependent features are needed;
# otherwise default to the safer CPU‑oriented backend.
choice = "docling" if needs_gpu else "safe"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants