⚡️ Speed up function `boxes_self_iou` by 79% #49

codeflash-ai · 2025-12-19T21:52:50Z

📄 79% (0.79x) speedup for `boxes_self_iou` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 6.11 milliseconds → 3.41 milliseconds (best of 31 runs)
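The headline percentage appears to be the original runtime divided by the optimized runtime, minus one. That formula is inferred from the numbers reported here, not taken from Codeflash's documentation, and `speedup_percent` is a hypothetical helper name used only for illustration:

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent; both runtimes must use the same unit."""
    return (original / optimized - 1.0) * 100.0


# Runtimes reported above, in milliseconds.
print(f"{speedup_percent(6.11, 3.41):.0f}% faster")
```

Under this reading, "79% (0.79x)" means the optimized function finishes in roughly 56% of the original time (3.41 / 6.11).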

📝 Explanation and details

The optimized code achieves a 79% speedup by replacing NumPy's vectorized operations with Numba-compiled JIT functions for the core IoU computation.

**Key Optimizations:**

1. **Numba JIT Compilation**: The critical `areas_of_boxes_and_intersection_area` function is replaced with `_areas_of_boxes_and_intersection_area_numba` and `_boxes_iou_numba`, both decorated with `@njit(cache=True, fastmath=True)`. This compiles the functions to native machine code, eliminating Python interpreter overhead.
2. **Explicit Loop Implementation**: Instead of NumPy's vectorized operations with array broadcasting and transpose operations, the optimized version uses explicit nested loops. While this seems counterintuitive, Numba makes these loops extremely fast while avoiding the memory allocation overhead of intermediate arrays.
3. **Memory Efficiency**: The explicit loops avoid creating large intermediate arrays that NumPy's vectorized operations would generate (like `boxb_area.T` and broadcast operations), reducing memory pressure and cache misses.
4. **Type Consistency**: The code ensures float64 compatibility for Numba functions, converting input arrays when necessary.
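The points above can be illustrated with a minimal sketch of an explicit-loop IoU kernel. This is not the code from the PR: `pairwise_iou_mask` and `self_iou_sketch` are hypothetical names, the `+ 1` in widths and heights is an assumption taken from the arithmetic in the generated test comments, and `cache=True` is omitted so the snippet also runs from an interactive session. If Numba is not installed, the fallback decorator lets the same loops run as plain (slow) Python.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback: run the loops uncompiled if Numba is missing
    def njit(*_args, **_kwargs):
        def decorator(fn):
            return fn
        return decorator


@njit(fastmath=True)  # the PR reports using cache=True as well
def pairwise_iou_mask(coords, threshold):
    """Return an (n, n) boolean matrix, True where IoU(box i, box j) > threshold."""
    n = coords.shape[0]
    out = np.zeros((n, n), dtype=np.bool_)  # the only (n, n) allocation
    for i in range(n):
        ax1 = coords[i, 0]
        ay1 = coords[i, 1]
        ax2 = coords[i, 2]
        ay2 = coords[i, 3]
        area_a = (ax2 - ax1 + 1.0) * (ay2 - ay1 + 1.0)
        for j in range(n):
            bx1 = coords[j, 0]
            by1 = coords[j, 1]
            bx2 = coords[j, 2]
            by2 = coords[j, 3]
            area_b = (bx2 - bx1 + 1.0) * (by2 - by1 + 1.0)
            # Intersection extents; non-positive means the boxes do not overlap.
            inter_w = min(ax2, bx2) - max(ax1, bx1) + 1.0
            inter_h = min(ay2, by2) - max(ay1, by1) + 1.0
            inter = 0.0
            if inter_w > 0.0 and inter_h > 0.0:
                inter = inter_w * inter_h
            union = area_a + area_b - inter
            if union > 0.0 and inter / union > threshold:
                out[i, j] = True
    return out


def self_iou_sketch(boxes, threshold=0.5):
    """Convert the input to a contiguous float64 (n, 4) array, then run the kernel."""
    coords = np.ascontiguousarray(boxes, dtype=np.float64).reshape(-1, 4)
    return pairwise_iou_mask(coords, threshold)


print(self_iou_sketch([[0, 0, 10, 10], [5, 5, 15, 15]], threshold=0.1))
```

Converting to float64 once before the call keeps Numba from compiling a separate specialization for every input dtype, which is presumably the motivation for the type-consistency step.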

**Performance Impact** (ranges taken from the per-test timings in the generated regression tests):

- **Small inputs** (0-2 boxes): 152-391% faster, because fixed per-call overhead dominates at this size
- **Medium inputs** (100-200 boxes): 89-94% faster as Numba's compiled loops outperform NumPy's vectorized operations
- **Large inputs** (500 boxes): still significant gains (48-59% faster), where memory efficiency becomes crucial

The optimization particularly benefits scenarios with frequent IoU calculations on moderate-sized bounding box sets, where the overhead of NumPy's array operations and memory allocations becomes significant compared to Numba's optimized machine code execution.
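For contrast, here is a sketch of the broadcasting style that the optimization replaces. It is an illustration under assumptions, not the repository's original function: `self_iou_broadcast` is a hypothetical name, and the `+ 1` convention and the small epsilon guard are assumed. Each NumPy expression that combines a column with its transpose materializes a new `(n, n)` temporary.

```python
import numpy as np


def self_iou_broadcast(boxes, threshold=0.5):
    """Vectorized self-IoU mask built from broadcast (n, 1) x (1, n) operations."""
    coords = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    x1, y1, x2, y2 = np.split(coords, 4, axis=1)  # four (n, 1) columns

    # Every line below allocates at least one (n, n) intermediate array.
    inter_w = np.maximum(np.minimum(x2, x2.T) - np.maximum(x1, x1.T) + 1.0, 0.0)
    inter_h = np.maximum(np.minimum(y2, y2.T) - np.maximum(y1, y1.T) + 1.0, 0.0)
    inter = inter_w * inter_h

    area = (x2 - x1 + 1.0) * (y2 - y1 + 1.0)  # (n, 1)
    union = area + area.T - inter             # (n, n)
    return inter / np.maximum(union, 1e-9) > threshold


print(self_iou_broadcast([[0, 0, 10, 10], [5, 5, 15, 15]], threshold=0.1))
```

For 500 boxes, each float64 `(n, n)` temporary occupies 500 × 500 × 8 bytes, about 2 MB, and the expression chain creates several of them per call. An explicit-loop kernel needs only the boolean output matrix.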

✅ Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | ✅ 13 Passed |
| 🌀 Generated Regression Tests | ✅ 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| `partition/pdf_image/test_pdfminer_processing.py::test_boxes_self_iou` | 120μs | 32.7μs | 269%✅ |

🌀 Generated Regression Tests and Runtime

from __future__ import annotations

import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import boxes_self_iou

DEFAULT_ROUND = 15


# Helper class for bbox
class BBox:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2


# ------------------- UNIT TESTS -------------------

# 1. BASIC TEST CASES


def test_single_box_self_iou():
    # One box, IOU with itself should be True (diagonal)
    box = BBox(0, 0, 10, 10)
    codeflash_output = boxes_self_iou([box], threshold=0.5)
    result = codeflash_output  # 64.0μs -> 25.4μs (152% faster)


def test_two_identical_boxes():
    # Two identical boxes, IOU should be True for both
    box1 = BBox(1, 1, 5, 5)
    box2 = BBox(1, 1, 5, 5)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.5)
    result = codeflash_output  # 45.4μs -> 12.4μs (266% faster)


def test_two_non_overlapping_boxes():
    # Two boxes, no overlap, IOU should be False except diagonal
    box1 = BBox(0, 0, 2, 2)
    box2 = BBox(10, 10, 12, 12)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.1)
    result = codeflash_output  # 41.1μs -> 10.8μs (282% faster)


def test_partial_overlap_boxes():
    # Two boxes, partial overlap, IOU threshold 0.1 (should be True), 0.5 (should be False)
    box1 = BBox(0, 0, 4, 4)
    box2 = BBox(2, 2, 6, 6)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.1)
    result_low = codeflash_output  # 40.2μs -> 10.3μs (291% faster)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.5)
    result_high = codeflash_output  # 33.8μs -> 7.38μs (358% faster)


def test_boxes_as_numpy_array():
    # Provide boxes as numpy array
    arr = np.array([[0, 0, 3, 3], [2, 2, 5, 5]], dtype=np.float32)
    codeflash_output = boxes_self_iou(arr, threshold=0.1)
    result = codeflash_output  # 37.2μs -> 7.75μs (381% faster)


# 2. EDGE TEST CASES


def test_empty_input():
    # No boxes, should return shape (0, 0)
    codeflash_output = boxes_self_iou([], threshold=0.5)
    result = codeflash_output  # 35.3μs -> 8.92μs (296% faster)


def test_zero_area_boxes():
    # Box with zero area (x1==x2, y1==y2)
    box1 = BBox(1, 1, 1, 1)
    box2 = BBox(2, 2, 2, 2)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.0)
    result = codeflash_output  # 39.7μs -> 10.1μs (293% faster)


def test_negative_coordinates():
    # Boxes with negative coordinates
    box1 = BBox(-5, -5, 0, 0)
    box2 = BBox(-3, -3, 2, 2)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.1)
    result = codeflash_output  # 39.2μs -> 9.83μs (299% faster)


def test_threshold_extremes():
    # Threshold = 0 (all overlaps pass), threshold = 1 (only perfect overlap passes)
    box1 = BBox(0, 0, 4, 4)
    box2 = BBox(2, 2, 6, 6)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.0)
    result_zero = codeflash_output  # 39.0μs -> 9.71μs (301% faster)
    codeflash_output = boxes_self_iou([box1, box2], threshold=1.0)
    result_one = codeflash_output  # 33.5μs -> 7.12μs (370% faster)


def test_float_precision_rounding():
    # Test rounding effects
    box1 = BBox(0.0000001, 0.0000001, 1.0000001, 1.0000001)
    box2 = BBox(0, 0, 1, 1)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.99, round_to=7)
    result = codeflash_output  # 38.5μs -> 9.88μs (290% faster)


def test_large_coordinates():
    # Large values to test for overflow/precision
    box1 = BBox(1e6, 1e6, 1e6 + 100, 1e6 + 100)
    box2 = BBox(1e6 + 50, 1e6 + 50, 1e6 + 150, 1e6 + 150)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.2)
    result = codeflash_output  # 38.2μs -> 9.71μs (294% faster)


def test_invalid_box_order():
    # x2 < x1 or y2 < y1 (should result in zero area)
    box1 = BBox(5, 5, 1, 1)
    box2 = BBox(10, 10, 8, 8)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.0)
    result = codeflash_output  # 39.2μs -> 10.0μs (291% faster)


# 3. LARGE SCALE TEST CASES


def test_many_boxes_no_overlap():
    # 100 boxes, all far apart, off-diagonal should be False
    boxes = [BBox(i * 10, i * 10, i * 10 + 1, i * 10 + 1) for i in range(100)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.1)
    result = codeflash_output  # 140μs -> 72.5μs (93.5% faster)
    # Diagonal should be True, off-diagonal should be False
    for i in range(100):
        for j in range(100):
            if i != j:
                pass


def test_many_boxes_full_overlap():
    # 100 identical boxes, all IOUs should be True
    boxes = [BBox(0, 0, 5, 5) for _ in range(100)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.5)
    result = codeflash_output  # 133μs -> 70.2μs (89.7% faster)


def test_many_boxes_partial_overlap():
    # 100 boxes, each overlaps with next one
    boxes = [BBox(i, 0, i + 2, 2) for i in range(100)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.1)
    result = codeflash_output  # 132μs -> 69.8μs (89.6% faster)
    for i in range(100):
        # Next box overlaps
        if i < 99:
            pass
        # Far boxes do not overlap
        if i < 98:
            pass


def test_large_input_performance():
    # 500 boxes, all identical, should be fast and all True
    boxes = [BBox(0, 0, 10, 10) for _ in range(500)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.5)
    result = codeflash_output  # 1.37ms -> 899μs (52.3% faster)


def test_large_input_sparse_overlap():
    # 500 boxes, only every 10th overlaps with next
    boxes = []
    for i in range(500):
        if i % 10 == 0:
            boxes.append(BBox(i, i, i + 5, i + 5))
        else:
            boxes.append(BBox(i * 10, i * 10, i * 10 + 1, i * 10 + 1))
    codeflash_output = boxes_self_iou(boxes, threshold=0.1)
    result = codeflash_output  # 1.35ms -> 906μs (48.4% faster)
    for i in range(500):
        if i % 10 == 0 and i < 499 and (i + 1) % 10 == 1:
            pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import boxes_self_iou

DEFAULT_ROUND = 15


# Helper class for test cases
class BBox:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2


# ---------------- BASIC TEST CASES ----------------


def test_single_box_self_iou():
    # Single box should have IoU 1 with itself
    box = BBox(0, 0, 10, 10)
    codeflash_output = boxes_self_iou([box], threshold=0.99)
    result = codeflash_output  # 40.6μs -> 9.79μs (315% faster)


def test_two_identical_boxes():
    # Two identical boxes should have IoU 1 with each other
    box1 = BBox(1, 1, 5, 5)
    box2 = BBox(1, 1, 5, 5)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.99)
    result = codeflash_output  # 39.6μs -> 9.88μs (301% faster)


def test_two_non_overlapping_boxes():
    # Two boxes that do not overlap should have IoU 0
    box1 = BBox(0, 0, 1, 1)
    box2 = BBox(10, 10, 12, 12)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.01)
    result = codeflash_output  # 38.8μs -> 9.58μs (304% faster)


def test_partial_overlap_boxes():
    # Boxes that partially overlap should have IoU < 1
    box1 = BBox(0, 0, 4, 4)
    box2 = BBox(2, 2, 6, 6)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.2)
    result = codeflash_output  # 38.6μs -> 9.54μs (305% faster)


def test_threshold_behavior():
    # Test thresholding: lower threshold makes more matches
    box1 = BBox(0, 0, 10, 10)
    box2 = BBox(5, 5, 15, 15)
    # Compute IoU: intersection area = (10-5+1)*(10-5+1)=6*6=36
    # Area1 = 11*11=121, Area2=11*11=121, union=121+121-36=206, IoU=36/206~0.174
    # threshold=0.1: should be True, threshold=0.2: should be False
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.1)
    result_low = codeflash_output  # 38.8μs -> 9.46μs (310% faster)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.2)
    result_high = codeflash_output  # 33.8μs -> 7.04μs (381% faster)


# ---------------- EDGE TEST CASES ----------------


def test_empty_input():
    # No boxes should return empty array
    codeflash_output = boxes_self_iou([], threshold=0.5)
    result = codeflash_output  # 34.6μs -> 8.62μs (301% faster)


def test_zero_area_box():
    # Box with zero area (x1==x2, y1==y2)
    box1 = BBox(1, 1, 1, 1)
    box2 = BBox(2, 2, 2, 2)
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.01)
    result = codeflash_output  # 38.9μs -> 9.58μs (306% faster)


def test_negative_coordinates():
    # Boxes with negative coordinates
    box1 = BBox(-5, -5, 0, 0)
    box2 = BBox(-3, -3, 2, 2)
    # They overlap at (-3,-3)-(0,0): area = (0-(-3)+1)*(0-(-3)+1)=4*4=16
    # Area1: (0-(-5)+1)^2=6*6=36, Area2: (2-(-3)+1)^2=6*6=36, union=36+36-16=56, IoU=16/56~0.286
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.25)
    result = codeflash_output  # 38.4μs -> 9.46μs (306% faster)


def test_float_precision():
    # Boxes with float coordinates
    box1 = BBox(0.1, 0.1, 1.9, 1.9)
    box2 = BBox(1.0, 1.0, 2.0, 2.0)
    # Overlap: (1.9-1.0+1)*(1.9-1.0+1)=1.9*1.9=3.61
    # Area1: (1.9-0.1+1)^2=2.8*2.8=7.84, Area2: (2.0-1.0+1)^2=2*2=4
    # Union=7.84+4-3.61=8.23, IoU=3.61/8.23~0.439
    codeflash_output = boxes_self_iou([box1, box2], threshold=0.4)
    result = codeflash_output  # 38.4μs -> 9.42μs (308% faster)


def test_mixed_types_input():
    # Accepts np.ndarray as input
    arr = np.array([[0, 0, 2, 2], [1, 1, 3, 3]], dtype=np.float32)
    codeflash_output = boxes_self_iou(arr, threshold=0.1)
    result = codeflash_output  # 36.4μs -> 7.42μs (391% faster)


def test_large_threshold():
    # IoU is at most 1, so a threshold > 1 can never be exceeded: every entry should be False, including the diagonal
    box1 = BBox(0, 0, 10, 10)
    box2 = BBox(5, 5, 15, 15)
    codeflash_output = boxes_self_iou([box1, box2], threshold=1.1)
    result = codeflash_output  # 38.5μs -> 9.46μs (307% faster)


# ---------------- LARGE SCALE TEST CASES ----------------


def test_many_boxes_sparse_overlap():
    # 100 boxes, only diagonal should be True
    boxes = [BBox(i * 10, i * 10, i * 10 + 5, i * 10 + 5) for i in range(100)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.01)
    result = codeflash_output  # 136μs -> 70.9μs (92.1% faster)


def test_many_boxes_dense_overlap():
    # 100 boxes, all overlap completely
    boxes = [BBox(0, 0, 10, 10) for _ in range(100)]
    codeflash_output = boxes_self_iou(boxes, threshold=0.99)
    result = codeflash_output  # 131μs -> 69.7μs (89.1% faster)


def test_large_array_input():
    # Use np.ndarray input for 500 boxes
    arr = np.zeros((500, 4), dtype=np.float32)
    for i in range(500):
        arr[i] = [i, i, i + 1, i + 1]
    codeflash_output = boxes_self_iou(arr, threshold=0.01)
    result = codeflash_output  # 1.20ms -> 754μs (58.9% faster)


def test_performance_large_overlap():
    # 200 boxes, every box overlaps with every other (same coordinates)
    arr = np.array([[0, 0, 100, 100]] * 200, dtype=np.float32)
    codeflash_output = boxes_self_iou(arr, threshold=0.99)
    result = codeflash_output  # 248μs -> 131μs (89.2% faster)


def test_scalability_varied_boxes():
    # 100 boxes, half overlap, half don't
    boxes = [BBox(0, 0, 10, 10) for _ in range(50)] + [
        BBox(i * 20, i * 20, i * 20 + 5, i * 20 + 5) for i in range(50)
    ]
    codeflash_output = boxes_self_iou(boxes, threshold=0.1)
    result = codeflash_output  # 133μs -> 70.2μs (89.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-boxes_self_iou-mjdenx7e` and push.

codeflash-ai bot requested a review from aseembits93 December 19, 2025 21:52

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Dec 19, 2025
