codeflash-ai bot commented Dec 20, 2025

📄 18% (0.18x) speedup for `OCRAgentPaddle.get_text_from_image` in `unstructured/partition/utils/ocr_models/paddle_ocr.py`

⏱️ Runtime: 210 microseconds → 178 microseconds (best of 10 runs)

📝 Explanation and details

The optimized code achieves a **17% speedup** by replacing the original `parse_data` method with a new `_parse_data_fast` method that leverages **Numba JIT compilation** for the most computationally expensive operations.

**Key optimizations:**

1. **Numba-accelerated min/max calculations**: The bottleneck operation of finding bounding box coordinates (min/max of the x and y arrays) is moved to a separate `_get_minmax_numba` function decorated with `@njit(cache=True)`. This compiles to native machine code and eliminates Python interpreter overhead for these mathematical operations (see the sketch after this list).

2. **Vectorized coordinate processing**: Instead of calling `min()` and `max()` on Python lists for each text region individually, the code now:
   - Pre-allocates a numpy array `flat_minmax` to store all bounding box coordinates
   - Converts coordinate lists to numpy arrays before passing them to the JIT-compiled function
   - Processes all coordinate calculations in batch

3. **Reduced Python overhead**: The original code performed list comprehensions and min/max operations in pure Python for each text region. The optimized version moves these operations into compiled code, significantly reducing interpreter overhead.
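
The names `_get_minmax_numba`, `_parse_data_fast`, and `flat_minmax` come from the PR description; the bodies below are a minimal sketch of the pattern, not the merged code. The `(n, 4)` array layout, the `float64` dtype, and the PaddleOCR-style `[box, (text, confidence)]` input layout are assumptions for illustration.

```python
import numpy as np
from numba import njit


@njit(cache=True)
def _get_minmax_numba(xs, ys):
    # Compiled to native code on first call; cache=True persists the
    # compiled result to disk so later runs skip recompilation.
    return xs.min(), xs.max(), ys.min(), ys.max()


def _parse_data_fast(ocr_data):
    # Assumed input layout (PaddleOCR-style): each entry is
    # [box, (text, confidence)], where box is four [x, y] corner points.
    n = len(ocr_data)
    flat_minmax = np.empty((n, 4), dtype=np.float64)  # pre-allocated, as described in the PR
    for i, (box, _) in enumerate(ocr_data):
        pts = np.asarray(box, dtype=np.float64)  # list -> ndarray for the JIT helper
        x1, x2, y1, y2 = _get_minmax_numba(pts[:, 0], pts[:, 1])
        flat_minmax[i, 0] = x1
        flat_minmax[i, 1] = y1
        flat_minmax[i, 2] = x2
        flat_minmax[i, 3] = y2
    return flat_minmax
```

The per-region Python work shrinks to one `np.asarray` conversion and one call into compiled code, which is where the interpreter-overhead savings come from.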

**Why this works**: OCR data parsing is typically compute-intensive with many repetitive mathematical operations on coordinate arrays. Numba's nopython mode eliminates Python's dynamic typing overhead and compiles these operations to efficient machine code, while the caching ensures the compilation cost is only paid once.
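
A quick way to observe the one-time compilation cost that `cache=True` amortizes, reusing `_get_minmax_numba` from the sketch above (the printed numbers will vary by machine and are purely illustrative):

```python
import time

import numpy as np

pts = np.random.rand(4, 2)

t0 = time.perf_counter()
_get_minmax_numba(pts[:, 0], pts[:, 1])  # cold call: JIT compile, or load from the disk cache
t1 = time.perf_counter()
_get_minmax_numba(pts[:, 0], pts[:, 1])  # warm call: native-speed execution only
t2 = time.perf_counter()

print(f"cold call: {t1 - t0:.4f}s, warm call: {t2 - t1:.6f}s")
```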

The optimization maintains identical behavior and output format, making it a safe performance improvement that preserves all existing functionality while accelerating the critical parsing loop that dominates execution time (88.3% of runtime in the original profiler results).
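
Since the report claims identical behavior, a sanity check along these lines compares the compiled helper against the pure-Python `min()`/`max()` path it replaced (illustrative coordinates, continuing from the sketch above):

```python
import numpy as np

# One bounding box in the assumed four-corner format (illustrative data).
box = [[10.0, 10.0], [90.0, 12.0], [88.0, 40.0], [11.0, 38.0]]

# Pure-Python reference, mirroring the kind of min/max the original parse_data computed.
xs = [p[0] for p in box]
ys = [p[1] for p in box]
expected = (min(xs), max(xs), min(ys), max(ys))

pts = np.asarray(box)
assert _get_minmax_numba(pts[:, 0], pts[:, 1]) == expected
```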

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 24 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
**⚙️ Existing Unit Tests and Runtime**

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `partition/pdf_image/test_ocr.py::test_get_ocr_from_image_google_vision` | 159μs | 130μs | 22.2% ✅ |
| `partition/pdf_image/test_ocr.py::test_get_ocr_text_from_image_paddle` | 49.8μs | 46.9μs | 6.13% ✅ |

To edit these changes, `git checkout codeflash/optimize-OCRAgentPaddle.get_text_from_image-mjdsesqh` and push.


codeflash-ai bot requested a review from aseembits93 on December 20, 2025 at 04:17
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: Medium (Optimization Quality according to Codeflash) on Dec 20, 2025