@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 353% (3.53x) speedup for _get_bbox_to_page_ratio in unstructured/partition/pdf_image/analysis/bbox_visualisation.py

⏱️ Runtime: 930 microseconds → 205 microseconds (best of 250 runs)

📝 Explanation and details

The optimization applies Numba's Just-In-Time (JIT) compilation using the @njit(cache=True) decorator to dramatically speed up this mathematical computation function.

Key changes:

  • Added from numba import njit import
  • Applied @njit(cache=True) decorator to the function
  • No changes to the algorithm logic itself
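The PR text does not reproduce the function body, so the following is only a minimal sketch of what the decorated function might look like. The body is inferred from the generated tests (bbox as `(x1, y1, x2, y2)`, page size as `(width, height)`, result equal to the ratio of the two diagonals); the name `bbox_to_page_ratio` and the `ImportError` fallback are illustrative assumptions, not part of the change.

```python
import math

try:
    from numba import njit  # the import this PR adds
except ImportError:
    # Assumption for illustration only: a no-op stand-in so the sketch
    # still runs where Numba is not installed. The PR has no such fallback.
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(cache=True)
def bbox_to_page_ratio(bbox, page_size):
    """Hypothetical stand-in for _get_bbox_to_page_ratio."""
    x1, y1, x2, y2 = bbox
    page_width, page_height = page_size
    # Squaring the differences makes the result independent of corner order.
    bbox_diagonal = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    page_diagonal = math.sqrt(page_width**2 + page_height**2)
    return bbox_diagonal / page_diagonal


print(bbox_to_page_ratio((0, 0, 100, 200), (100, 200)))
print(bbox_to_page_ratio((0.0, 0.0, 3.0, 4.0), (6.0, 8.0)))
```

Note that `cache=True` needs the function to live in a regular source file so Numba can locate a cache directory next to it.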

Why this leads to a speedup:
Numba compiles the function's Python bytecode to optimized machine code the first time it is called, eliminating Python's interpreter overhead for numerical computations on later calls. The function performs several floating-point operations (math.sqrt, exponentiation, arithmetic) that benefit significantly from native machine code execution. The cache=True parameter writes the compiled code to an on-disk cache, so later processes can load it instead of recompiling.

Performance characteristics:

  • 353% speedup (930μs → 205μs) demonstrates Numba's effectiveness on math-heavy functions
  • The line profiler shows no timing data for the optimized version because Numba-compiled code runs outside Python's profiling mechanisms
  • Most test cases show 180-370% speedups. The looped large-scale tests gain the most (about 300-370%); the smallest gains are on the ZeroDivisionError tests (about 3-16%) and the float-input test (31%)
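The percentages in this report appear to follow the usual "percent faster" convention, (original / optimized - 1) × 100. The sketch below works through that arithmetic; the helper names are hypothetical. With the rounded runtimes quoted here the formula gives roughly 354%, so the reported 353% was presumably computed from unrounded timings.

```python
def runtime_ratio(original_time, optimized_time):
    """How many times longer the original takes than the optimized version."""
    return original_time / optimized_time


def speedup_percent(original_time, optimized_time):
    """Percent faster: the runtime ratio minus one, expressed as a percentage."""
    return (runtime_ratio(original_time, optimized_time) - 1.0) * 100.0


# Overall figures from this report, in microseconds.
print(f"ratio: {runtime_ratio(930, 205):.2f}x")
print(f"speedup: {speedup_percent(930, 205):.1f}%")

# One per-test figure, in nanoseconds (1.29μs -> 416ns).
print(f"per-test speedup: {speedup_percent(1290, 416):.0f}%")
```

Under this convention a 353% speedup corresponds to a runtime ratio of about 4.5, and the "3.53x" in the title reads as the same gain written as a fraction.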

Impact on workloads:
Based on function_references, this function is called from _get_optimal_value_for_bbox(), which suggests it's used in document analysis pipelines where bounding box calculations are performed repeatedly. The speedup should matter most when processing documents with many bounding boxes: the large-scale test cases, which call the function in loops of up to 1,000 bboxes, show improvements of about 300% or more.

Optimization effectiveness:
Most effective for computational workloads with repeated calls to this function, especially when processing large documents or batch operations where the function is called hundreds or thousands of times.
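One way to check the repeated-call claim locally is a small timing loop like the sketch below. It is not the Codeflash harness: `plain_ratio`, `time_calls`, and the bbox list are illustrative assumptions, and the function body is inferred from the generated tests. The warm-up call keeps Numba's one-time compilation out of the measurement.

```python
import math
import timeit


def plain_ratio(bbox, page_size):
    """Pure-Python reference, inferred from the generated tests."""
    x1, y1, x2, y2 = bbox
    page_width, page_height = page_size
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2) / math.sqrt(
        page_width**2 + page_height**2
    )


try:
    from numba import njit

    jit_ratio = njit(plain_ratio)  # compiled lazily on the first call
except ImportError:
    jit_ratio = plain_ratio  # assumption: fall back so the sketch still runs


def time_calls(func, bboxes, page_size, repeats=5):
    """Best-of-N wall time, in seconds, for one pass over all bboxes."""

    def run():
        for bbox in bboxes:
            func(bbox, page_size)

    return min(timeit.repeat(run, number=1, repeat=repeats))


page_size = (1000, 2000)
bboxes = [(0, 0, i, 2 * i) for i in range(1, 1001)]

jit_ratio(bboxes[0], page_size)  # warm-up: trigger compilation before timing

plain_seconds = time_calls(plain_ratio, bboxes, page_size)
jit_seconds = time_calls(jit_ratio, bboxes, page_size)
print(f"plain: {plain_seconds * 1e6:.0f}μs, jit: {jit_seconds * 1e6:.0f}μs")
```

Actual numbers depend on the machine and the Numba version, so the output will not necessarily match the figures in this report.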

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 1067 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import math

# imports
import pytest

from unstructured.partition.pdf_image.analysis.bbox_visualisation import _get_bbox_to_page_ratio

# unit tests

# --- BASIC TEST CASES ---


def test_bbox_same_as_page():
    # BBox is exactly the size of the page, so ratio should be 1.0
    bbox = (0, 0, 100, 200)
    page_size = (100, 200)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.29μs -> 416ns (210% faster)


def test_bbox_half_width_height():
    # BBox is half width and half height of page, so diagonal is half
    bbox = (0, 0, 50, 100)
    page_size = (100, 200)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.17μs -> 416ns (180% faster)
    # Diagonal of bbox: sqrt(50^2 + 100^2) = sqrt(2500+10000)=sqrt(12500)
    # Diagonal of page: sqrt(100^2 + 200^2) = sqrt(10000+40000)=sqrt(50000)
    expected = math.sqrt(12500) / math.sqrt(50000)


def test_bbox_square_on_rect_page():
    # BBox is a square on a rectangular page
    bbox = (10, 20, 60, 70)  # width=50, height=50
    page_size = (100, 200)
    expected = math.sqrt(50**2 + 50**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.12μs -> 375ns (200% faster)


def test_bbox_line_horizontal():
    # BBox is a horizontal line (height=0)
    bbox = (10, 20, 60, 20)  # width=50, height=0
    page_size = (100, 200)
    expected = math.sqrt(50**2 + 0**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.17μs -> 375ns (211% faster)


def test_bbox_line_vertical():
    # BBox is a vertical line (width=0)
    bbox = (10, 20, 10, 70)  # width=0, height=50
    page_size = (100, 200)
    expected = math.sqrt(0**2 + 50**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.12μs -> 333ns (238% faster)


# --- EDGE TEST CASES ---


def test_bbox_zero_area():
    # BBox with zero area (all points the same)
    bbox = (10, 20, 10, 20)
    page_size = (100, 200)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.17μs -> 375ns (211% faster)


def test_bbox_negative_coordinates():
    # BBox with negative coordinates, but positive width/height
    bbox = (-10, -20, 10, 20)
    page_size = (100, 200)
    bbox_width = 10 - (-10)  # 20
    bbox_height = 20 - (-20)  # 40
    expected = math.sqrt(bbox_width**2 + bbox_height**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.08μs -> 375ns (189% faster)


def test_bbox_coords_reversed():
    # BBox with x2 < x1 or y2 < y1 (should still work, diagonal is abs)
    bbox = (50, 60, 10, 20)
    page_size = (100, 200)
    bbox_width = 10 - 50  # -40
    bbox_height = 20 - 60  # -40
    expected = math.sqrt(bbox_width**2 + bbox_height**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.17μs -> 333ns (250% faster)


def test_page_size_zero():
    # Page with zero width and height (should raise ZeroDivisionError)
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        _get_bbox_to_page_ratio(bbox, page_size)  # 1.54μs -> 1.33μs (15.6% faster)


def test_bbox_large_coordinates():
    # Very large bbox and page coordinates
    bbox = (0, 0, 1_000_000, 2_000_000)
    page_size = (1_000_000, 2_000_000)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.54μs -> 541ns (185% faster)


def test_bbox_outside_page():
    # BBox coordinates outside the page (should still compute ratio)
    bbox = (200, 300, 400, 500)
    page_size = (100, 200)
    bbox_width = 400 - 200  # 200
    bbox_height = 500 - 300  # 200
    expected = math.sqrt(200**2 + 200**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.17μs -> 416ns (180% faster)


# --- LARGE SCALE TEST CASES ---


def test_many_bboxes_on_large_page():
    # Test with many bboxes on a large page
    page_size = (1000, 1000)
    page_diag = math.sqrt(1000**2 + 1000**2)
    for i in range(1, 1001, 100):  # 1, 101, ..., 901
        bbox = (0, 0, i, i)
        expected = math.sqrt(i**2 + i**2) / page_diag
        codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
        ratio = codeflash_output  # 9.25μs -> 2.29μs (304% faster)


def test_varied_bboxes_large_scale():
    # Test with 1000 bboxes of increasing size
    page_size = (1000, 2000)
    page_diag = math.sqrt(1000**2 + 2000**2)
    for i in range(1, 1001):
        bbox = (0, 0, i, 2 * i)
        bbox_diag = math.sqrt(i**2 + (2 * i) ** 2)
        expected = bbox_diag / page_diag
        codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
        ratio = codeflash_output  # 860μs -> 183μs (369% faster)


def test_large_random_bboxes():
    # Test with bboxes with random coordinates, but deterministic
    page_size = (500, 500)
    page_diag = math.sqrt(500**2 + 500**2)
    for i in range(0, 1000, 100):
        x1, y1 = i % 250, (i * 2) % 250
        x2, y2 = (x1 + 100) % 500, (y1 + 150) % 500
        bbox_width = x2 - x1
        bbox_height = y2 - y1
        bbox_diag = math.sqrt(bbox_width**2 + bbox_height**2)
        expected = bbox_diag / page_diag
        codeflash_output = _get_bbox_to_page_ratio((x1, y1, x2, y2), page_size)
        ratio = codeflash_output  # 8.96μs -> 2.21μs (306% faster)


def test_large_bbox_small_page():
    # BBox much larger than page (ratio > 1)
    bbox = (0, 0, 1000, 1000)
    page_size = (10, 10)
    expected = math.sqrt(1000**2 + 1000**2) / math.sqrt(10**2 + 10**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    ratio = codeflash_output  # 1.21μs -> 333ns (263% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import math

# imports
import pytest

from unstructured.partition.pdf_image.analysis.bbox_visualisation import _get_bbox_to_page_ratio

# unit tests

# -------------------- Basic Test Cases --------------------


def test_bbox_same_as_page():
    # BBox covers the whole page: ratio should be 1.0
    bbox = (0, 0, 100, 200)
    page_size = (100, 200)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.17μs -> 375ns (211% faster)


def test_bbox_half_size():
    # BBox is exactly half the width and height of the page
    bbox = (0, 0, 50, 100)
    page_size = (100, 200)
    # Diagonal of bbox: sqrt(50^2 + 100^2) = sqrt(2500 + 10000) = sqrt(12500)
    # Diagonal of page: sqrt(100^2 + 200^2) = sqrt(10000 + 40000) = sqrt(50000)
    expected = math.sqrt(12500) / math.sqrt(50000)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 375ns (200% faster)


def test_bbox_square_on_rect_page():
    # BBox is square on a rectangular page
    bbox = (10, 10, 60, 60)  # 50x50
    page_size = (100, 200)
    expected = math.sqrt(50**2 + 50**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 375ns (200% faster)


def test_bbox_rect_on_square_page():
    # BBox is rectangle on a square page
    bbox = (0, 0, 30, 60)
    page_size = (100, 100)
    expected = math.sqrt(30**2 + 60**2) / math.sqrt(100**2 + 100**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.08μs -> 375ns (189% faster)


def test_bbox_offset_from_origin():
    # BBox is not at origin but same size as page
    bbox = (5, 10, 105, 210)
    page_size = (100, 200)
    expected = math.sqrt(100**2 + 200**2) / math.sqrt(100**2 + 200**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 334ns (237% faster)


# -------------------- Edge Test Cases --------------------


def test_bbox_zero_area():
    # BBox has zero area (x1==x2, y1==y2)
    bbox = (10, 10, 10, 10)
    page_size = (100, 100)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.21μs -> 375ns (222% faster)


def test_bbox_line_horizontal():
    # BBox is a horizontal line (y1==y2)
    bbox = (10, 20, 60, 20)
    page_size = (100, 100)
    expected = math.sqrt((60 - 10) ** 2 + 0**2) / math.sqrt(100**2 + 100**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.17μs -> 375ns (211% faster)


def test_bbox_line_vertical():
    # BBox is a vertical line (x1==x2)
    bbox = (30, 40, 30, 90)
    page_size = (100, 100)
    expected = math.sqrt(0**2 + (90 - 40) ** 2) / math.sqrt(100**2 + 100**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 333ns (238% faster)


def test_bbox_negative_coordinates():
    # BBox has negative coordinates
    bbox = (-10, -10, 10, 10)
    page_size = (20, 20)
    expected = math.sqrt(20**2 + 20**2) / math.sqrt(20**2 + 20**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 375ns (200% faster)


def test_bbox_larger_than_page():
    # BBox is larger than the page
    bbox = (0, 0, 200, 200)
    page_size = (100, 100)
    expected = math.sqrt(200**2 + 200**2) / math.sqrt(100**2 + 100**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.12μs -> 333ns (238% faster)


def test_page_zero_size():
    # Page has zero width and height (should raise ZeroDivisionError)
    bbox = (0, 0, 10, 10)
    page_size = (0, 0)
    with pytest.raises(ZeroDivisionError):
        _get_bbox_to_page_ratio(bbox, page_size)  # 1.46μs -> 1.42μs (2.96% faster)


def test_bbox_coordinates_swapped():
    # x2 < x1 and y2 < y1 (negative width/height)
    bbox = (10, 10, 0, 0)
    page_size = (10, 10)
    # Diagonal is sqrt((-10)^2 + (-10)^2) = sqrt(200)
    expected = math.sqrt(100 + 100) / math.sqrt(100 + 100)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.25μs -> 542ns (131% faster)


def test_bbox_floats():
    # BBox and page_size with float values
    bbox = (0.0, 0.0, 3.0, 4.0)
    page_size = (6.0, 8.0)
    # bbox diagonal: 5, page diagonal: 10
    expected = 0.5
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 875ns -> 666ns (31.4% faster)


# -------------------- Large Scale Test Cases --------------------


def test_large_bbox_and_page():
    # Large bbox and page values
    bbox = (0, 0, 1000000, 1000000)
    page_size = (1000000, 1000000)
    expected = 1.0
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.54μs -> 458ns (236% faster)


def test_many_random_bboxes_on_large_page():
    # Test many bboxes on a large page for performance and correctness
    page_size = (999, 999)
    for i in range(1, 1000, 100):  # 10 cases, avoid >1000 iterations
        bbox = (0, 0, i, i)
        expected = math.sqrt(i**2 + i**2) / math.sqrt(999**2 + 999**2)
        codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
        result = codeflash_output  # 9.21μs -> 2.29μs (302% faster)


def test_large_page_small_bbox():
    # Very small bbox on a very large page
    bbox = (0, 0, 1, 1)
    page_size = (10000, 10000)
    expected = math.sqrt(1**2 + 1**2) / math.sqrt(10000**2 + 10000**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.17μs -> 333ns (250% faster)


def test_large_number_of_varied_bboxes():
    # Test up to 1000 different bboxes for robustness
    page_size = (500, 500)
    for i in range(1, 1000, 111):  # 10 cases
        bbox = (i, i, 500 - i, 500 - i)
        bbox_width = 500 - 2 * i
        bbox_height = 500 - 2 * i
        expected = math.sqrt(bbox_width**2 + bbox_height**2) / math.sqrt(500**2 + 500**2)
        codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
        result = codeflash_output  # 8.29μs -> 2.08μs (298% faster)


def test_large_bbox_negative_coordinates():
    # Large bbox with negative coordinates, page also large
    bbox = (-1000, -1000, 1000, 1000)
    page_size = (2000, 2000)
    expected = math.sqrt(2000**2 + 2000**2) / math.sqrt(2000**2 + 2000**2)
    codeflash_output = _get_bbox_to_page_ratio(bbox, page_size)
    result = codeflash_output  # 1.17μs -> 333ns (250% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-_get_bbox_to_page_ratio-mjdkzmao and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 00:49
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 20, 2025