-
Notifications
You must be signed in to change notification settings - Fork 22
Fix/change safe defaults #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR updates the default backend configuration for document processing to use "docling" and increases batch processing capacity. The changes aim to improve throughput by processing more files per batch while simplifying backend selection.
Key Changes:
- Changed default backend from auto-selection (safe/docling based on GPU needs) to always use "docling"
- Increased max_batch_files from 1 to 5 for docling backend
- Version bump from 0.1.2 to 0.1.3
- Removed repro_rapidocr_onnx and deepseek-ocr directories
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/glossapi/corpus/phase_extract.py | Updated backend resolution logic to default to "docling" and increased batch size from 1 to 5 files |
| pyproject.toml | Version bump from 0.1.2 to 0.1.3 |
| .gitignore | Added .venv and excluded deepseek-ocr/ and repro_rapidocr_onnx/ directories |
| repro_rapidocr_onnx/* | Deleted entire directory containing ONNX-related scripts and documentation |
| deepseek-ocr/* | Deleted entire directory containing DeepSeek OCR runner and AWS deployment files |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Configure batch/backend policy based on resolved choice | ||
| if backend_choice == "docling": | ||
| # Keep docling runs conservative: process one document per batch for stability |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 109 is now inconsistent with the code. The comment states "Keep docling runs conservative: process one document per batch for stability" but the code now sets max_batch_files=5, not 1. Either update the comment to reflect the new batching strategy or reconsider if this comment is still accurate.
| needs_gpu = bool(force_ocr or formula_enrichment or code_enrichment) | ||
| if choice == "auto": | ||
| choice = "docling" if needs_gpu else "safe" | ||
| choice = "docling" #removed safe option |
Copilot
AI
Dec 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change makes "docling" the default for all "auto" cases, regardless of whether GPU features are needed. Previously, when choice was "auto", the code would select "docling" if needs_gpu was true, otherwise "safe". Now it always selects "docling". This may impact users without GPU capabilities who were relying on the automatic fallback to "safe" mode. Consider whether this change could break existing workflows or if there should be additional logic to check GPU availability before defaulting to "docling".
| choice = "docling" #removed safe option | |
| # When auto is requested, prefer 'docling' only if GPU‑dependent features are needed; | |
| # otherwise default to the safer CPU‑oriented backend. | |
| choice = "docling" if needs_gpu else "safe" |
Changed the default to docling and also increased the max_batch_files to 5