⚡️ Speed up method OCRAgentPaddle.get_text_from_image by 18%
#56
📄 18% (0.18x) speedup for `OCRAgentPaddle.get_text_from_image` in `unstructured/partition/utils/ocr_models/paddle_ocr.py`

⏱️ Runtime: 210 microseconds → 178 microseconds (best of 10 runs)

📝 Explanation and details
The optimized code achieves an 18% speedup by replacing the original `parse_data` method with a new `_parse_data_fast` method that uses Numba JIT compilation for the most computationally expensive operations.

Key optimizations:
- Numba-accelerated min/max calculations: The bottleneck operation of finding bounding box coordinates (the min and max of the x and y arrays) is moved to a separate `_get_minmax_numba` function decorated with `@njit(cache=True)`. This compiles to native machine code and eliminates Python interpreter overhead for these operations.
- Vectorized coordinate processing: Instead of calling `min()` and `max()` on Python lists for each text region individually, the code now uses `flat_minmax` to store all bounding box coordinates.
- Reduced Python overhead: The original code performed list comprehensions and min/max operations in pure Python for each text region. The optimized version moves these operations into compiled code, significantly reducing interpreter overhead.
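For reference, the per-region pure-Python pattern described above might look like the following minimal sketch. The sample data layout (a four-point quadrilateral paired with a text and confidence tuple) and the names `sample_regions` and `bounding_boxes` are assumptions made for illustration; they are not taken from the PR's code.

```python
# Hypothetical sample data: each entry pairs a 4-point quadrilateral
# with a (text, confidence) tuple. This layout is an assumption.
sample_regions = [
    ([[10.0, 5.0], [50.0, 5.0], [50.0, 20.0], [10.0, 20.0]], ("Hello", 0.98)),
    ([[12.0, 30.0], [80.0, 32.0], [79.0, 48.0], [11.0, 46.0]], ("world", 0.95)),
]


def bounding_boxes(regions):
    """Return (x_min, y_min, x_max, y_max, text) for each region."""
    boxes = []
    for points, (text, _confidence) in regions:
        # One list comprehension per axis, then min/max in pure Python.
        xs = [point[0] for point in points]
        ys = [point[1] for point in points]
        boxes.append((min(xs), min(ys), max(xs), max(ys), text))
    return boxes


boxes = bounding_boxes(sample_regions)
```

Each region costs two list comprehensions and four builtin calls, which is the kind of interpreter overhead the PR description says it targets.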
Why this works: OCR data parsing is typically compute-intensive with many repetitive mathematical operations on coordinate arrays. Numba's nopython mode eliminates Python's dynamic typing overhead and compiles these operations to efficient machine code, while the caching ensures the compilation cost is only paid once.
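A rough sketch of a Numba-compiled min/max helper is shown below. It is not the PR's `_get_minmax_numba`: the name `minmax_xy`, its signature, and the `as_coords` helper are hypothetical. The sketch also uses a plain `@njit` rather than `@njit(cache=True)` so that it runs when executed outside a source file, and it falls back to a no-op decorator when Numba is not installed.

```python
try:
    import numpy as np
    from numba import njit
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(*args, **kwargs):
        """No-op stand-in so the sketch still runs without Numba."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def minmax_xy(xs, ys):
    """Single pass over both axes; returns (x_min, y_min, x_max, y_max)."""
    x_min = xs[0]
    x_max = xs[0]
    y_min = ys[0]
    y_max = ys[0]
    for i in range(1, len(xs)):
        if xs[i] < x_min:
            x_min = xs[i]
        if xs[i] > x_max:
            x_max = xs[i]
        if ys[i] < y_min:
            y_min = ys[i]
        if ys[i] > y_max:
            y_max = ys[i]
    return x_min, y_min, x_max, y_max


def as_coords(values):
    """Use a float64 array when Numba is available, else a plain list."""
    return np.array(values, dtype=np.float64) if HAVE_NUMBA else list(values)


xs = as_coords([10.0, 50.0, 50.0, 10.0])
ys = as_coords([5.0, 5.0, 20.0, 20.0])
box = minmax_xy(xs, ys)
```

The first call to a JIT-compiled function includes compilation time, so a benchmark should warm the function up before timing it. With `cache=True`, as the PR describes, the compiled code is stored on disk and reused across processes.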
The optimization maintains identical behavior and output format, making it a safe performance improvement that preserves all existing functionality while accelerating the critical parsing loop that dominates execution time (88.3% of runtime in the original profiler results).
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
- `partition/pdf_image/test_ocr.py::test_get_ocr_from_image_google_vision`
- `partition/pdf_image/test_ocr.py::test_get_ocr_text_from_image_paddle`

To edit these changes, run `git checkout codeflash/optimize-OCRAgentPaddle.get_text_from_image-mjdsesqh` and push.