⚡️ Speed up function aggregate_embedded_text_by_block by 70%
#50
📄 70% (0.70x) speedup for aggregate_embedded_text_by_block in unstructured/partition/pdf_image/pdfminer_processing.py
⏱️ Runtime: 3.98 milliseconds → 2.34 milliseconds (best of 30 runs)
📝 Explanation and details
The optimization introduces Numba JIT compilation to accelerate the most computationally intensive parts of the bounding box comparison algorithm, achieving a 70% speedup.
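As a rough illustration of the approach (not the PR's actual code), the sketch below compares an explicit-loop kernel decorated with Numba's `@njit` against a NumPy broadcasting version of an "almost subregion" check. The function names, the `[x1, y1, x2, y2]` box layout, and the ratio-over-threshold rule are assumptions made for this example. The `cache=True` and `fastmath=True` options mentioned in the PR are left out so the snippet runs in any context, and a no-op fallback stands in for `njit` if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for portability): run the loops as plain Python.
    def njit(func):
        return func


@njit
def is_almost_subregion_loops(boxes1, boxes2, threshold):
    """Hypothetical kernel: out[i, j] is True when the part of boxes1[i]
    covered by boxes2[j] exceeds `threshold` of boxes1[i]'s own area."""
    n = boxes1.shape[0]
    m = boxes2.shape[0]
    out = np.zeros((n, m), dtype=np.bool_)
    for i in range(n):
        x1, y1, x2, y2 = boxes1[i, 0], boxes1[i, 1], boxes1[i, 2], boxes1[i, 3]
        area1 = (x2 - x1) * (y2 - y1)
        if area1 <= 0.0:
            continue  # degenerate box: leave the whole row False
        for j in range(m):
            w = min(x2, boxes2[j, 2]) - max(x1, boxes2[j, 0])
            h = min(y2, boxes2[j, 3]) - max(y1, boxes2[j, 1])
            if w > 0.0 and h > 0.0:
                out[i, j] = (w * h) / area1 > threshold
    return out


def is_almost_subregion_broadcast(boxes1, boxes2, threshold):
    """Same rule written with NumPy broadcasting, for comparison.
    This version allocates several intermediate (n, m) arrays."""
    b1 = boxes1[:, None, :]
    b2 = boxes2[None, :, :]
    w = np.clip(np.minimum(b1[..., 2], b2[..., 2]) - np.maximum(b1[..., 0], b2[..., 0]), 0, None)
    h = np.clip(np.minimum(b1[..., 3], b2[..., 3]) - np.maximum(b1[..., 1], b2[..., 1]), 0, None)
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    safe_area1 = np.where(area1 > 0, area1, 1.0)  # avoid division by zero
    return (area1[:, None] > 0) & ((w * h) / safe_area1[:, None] > threshold)


boxes1 = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=np.float64)
boxes2 = np.array([[0, 0, 10, 6], [100, 100, 110, 110]], dtype=np.float64)

result_loops = is_almost_subregion_loops(boxes1, boxes2, 0.5)
result_broadcast = is_almost_subregion_broadcast(boxes1, boxes2, 0.5)
print(result_loops)
```

Both versions should agree on every pair. Which one is faster depends on array sizes and on whether the compiled kernel is already warm, so any timing comparison needs to exclude the first (compiling) call.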
Key optimizations applied:
- Numba JIT compilation: Added @njit(cache=True, fastmath=True) decorators to create compiled versions of the core computational functions:
  - _get_coords_from_bboxes_numba() for coordinate extraction
  - _areas_of_boxes_and_intersection_area_numba() for area calculations
  - _bboxes1_is_almost_subregion_of_bboxes2_numba() for the main comparison logic
- Optimized computation flow: The original code used NumPy broadcasting and vectorized operations, but the optimized version uses explicit loops within Numba-compiled functions, which can be faster for certain array sizes due to reduced memory overhead and better cache locality.
- Precision handling: Switched to np.float64 for higher-precision calculations while maintaining the same rounding behavior.

Why this leads to speedup:
- The cache=True parameter ensures compiled functions are cached for subsequent calls

Performance characteristics from tests:
Impact on workloads:
Based on the function reference, this optimization significantly benefits PDF processing workflows where aggregate_embedded_text_by_block is called repeatedly in merge_out_layout_with_ocr_layout() for each invalid text element. Since OCR processing typically involves many bounding box comparisons, this 70% speedup directly translates to faster document processing times, especially for documents with many text regions requiring OCR text aggregation.

✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
partition/pdf_image/test_pdfminer_processing.py::test_aggregate_by_block
🌀 Generated Regression Tests and Runtime
To edit these changes, run git checkout codeflash/optimize-aggregate_embedded_text_by_block-mjdfd5oc and push.