codeflash-ai bot commented Dec 20, 2025

📄 18% (0.18x) speedup for `OCRAgentPaddle.get_text_from_image` in `unstructured/partition/utils/ocr_models/paddle_ocr.py`

⏱️ Runtime: 210 microseconds → 178 microseconds (best of 10 runs)

📝 Explanation and details

The optimized code achieves a **17% speedup** by replacing the original `parse_data` method with a new `_parse_data_fast` method that leverages **Numba JIT compilation** for the most computationally expensive operations.

**Key optimizations:**

1. **Numba-accelerated min/max calculations**: The bottleneck operation of finding bounding box coordinates (min/max of the x and y arrays) is moved to a separate `_get_minmax_numba` function decorated with `@njit(cache=True)`. This compiles to native machine code and eliminates Python interpreter overhead for these mathematical operations (see the sketch after this list).

2. **Vectorized coordinate processing**: Instead of calling `min()` and `max()` on Python lists for each text region individually, the code now:
   - Pre-allocates a numpy array `flat_minmax` to store all bounding box coordinates
   - Converts coordinate lists to numpy arrays before passing them to the JIT-compiled function
   - Processes all coordinate calculations in batch

3. **Reduced Python overhead**: The original code performed list comprehensions and min/max operations in pure Python for each text region. The optimized version moves these operations into compiled code, significantly reducing interpreter overhead.
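
The names `_get_minmax_numba`, `_parse_data_fast`, and `flat_minmax` come from the PR description; the bodies below are a minimal sketch of the pattern, not the merged code. The `(n, 4)` array layout, the `float64` dtype, and the PaddleOCR-style `[box, (text, confidence)]` input layout are assumptions for illustration.

```python
import numpy as np
from numba import njit


@njit(cache=True)
def _get_minmax_numba(xs, ys):
    # Compiled to native code on first call; cache=True persists the
    # compiled result to disk so later runs skip recompilation.
    return xs.min(), xs.max(), ys.min(), ys.max()


def _parse_data_fast(ocr_data):
    # Assumed input layout (PaddleOCR-style): each entry is
    # [box, (text, confidence)], where box is four [x, y] corner points.
    n = len(ocr_data)
    flat_minmax = np.empty((n, 4), dtype=np.float64)  # pre-allocated, as described in the PR
    for i, (box, _) in enumerate(ocr_data):
        pts = np.asarray(box, dtype=np.float64)  # list -> ndarray for the JIT helper
        x1, x2, y1, y2 = _get_minmax_numba(pts[:, 0], pts[:, 1])
        flat_minmax[i, 0] = x1
        flat_minmax[i, 1] = y1
        flat_minmax[i, 2] = x2
        flat_minmax[i, 3] = y2
    return flat_minmax
```

The per-region Python work shrinks to one `np.asarray` conversion and one call into compiled code, which is where the interpreter-overhead savings come from.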

**Why this works**: OCR data parsing is typically compute-intensive with many repetitive mathematical operations on coordinate arrays. Numba's nopython mode eliminates Python's dynamic typing overhead and compiles these operations to efficient machine code, while the caching ensures the compilation cost is only paid once.
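
A quick way to observe the one-time compilation cost that `cache=True` amortizes, reusing `_get_minmax_numba` from the sketch above (the printed numbers will vary by machine and are purely illustrative):

```python
import time

import numpy as np

pts = np.random.rand(4, 2)

t0 = time.perf_counter()
_get_minmax_numba(pts[:, 0], pts[:, 1])  # cold call: JIT compile, or load from the disk cache
t1 = time.perf_counter()
_get_minmax_numba(pts[:, 0], pts[:, 1])  # warm call: native-speed execution only
t2 = time.perf_counter()

print(f"cold call: {t1 - t0:.4f}s, warm call: {t2 - t1:.6f}s")
```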

The optimization maintains identical behavior and output format, making it a safe performance improvement that preserves all existing functionality while accelerating the critical parsing loop that dominates execution time (88.3% of runtime in the original profiler results).
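
Since the report claims identical behavior, a sanity check along these lines compares the compiled helper against the pure-Python `min()`/`max()` path it replaced (illustrative coordinates, continuing from the sketch above):

```python
import numpy as np

# One bounding box in the assumed four-corner format (illustrative data).
box = [[10.0, 10.0], [90.0, 12.0], [88.0, 40.0], [11.0, 38.0]]

# Pure-Python reference, mirroring the kind of min/max the original parse_data computed.
xs = [p[0] for p in box]
ys = [p[1] for p in box]
expected = (min(xs), max(xs), min(ys), max(ys))

pts = np.asarray(box)
assert _get_minmax_numba(pts[:, 0], pts[:, 1]) == expected
```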

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 24 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
**⚙️ Existing Unit Tests and Runtime**

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `partition/pdf_image/test_ocr.py::test_get_ocr_from_image_google_vision` | 159μs | 130μs | 22.2% ✅ |
| `partition/pdf_image/test_ocr.py::test_get_ocr_text_from_image_paddle` | 49.8μs | 46.9μs | 6.13% ✅ |

To edit these changes, `git checkout codeflash/optimize-OCRAgentPaddle.get_text_from_image-mjdsesqh` and push.


codeflash-ai bot requested a review from aseembits93 on December 20, 2025 at 04:17
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: Medium (Optimization Quality according to Codeflash) on Dec 20, 2025