⚡️ Speed up function `_mark_non_table_inferred_for_removal_if_has_subregion_relationship` by 114% #47

codeflash-ai · 2025-12-19T21:24:32Z

📄 114% speedup (about 2.15x as fast) for `_mark_non_table_inferred_for_removal_if_has_subregion_relationship` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 16.7 milliseconds → 7.78 milliseconds (best of 86 runs)
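
The headline percentage can be reproduced from the two runtimes above. The following is a minimal sketch of that arithmetic, assuming the figure is computed as (old - new) / new; the helper name `speedup_percent` is illustrative and not part of Codeflash.

```python
def speedup_percent(old_runtime: float, new_runtime: float) -> float:
    """Percent by which the new runtime is faster, measured against the new runtime."""
    return (old_runtime - new_runtime) / new_runtime * 100.0


old_ms, new_ms = 16.7, 7.78  # runtimes quoted in the report, in milliseconds

# With these rounded inputs the result is close to 115%; the reported 114%
# presumably comes from the unrounded timings.
print(f"{speedup_percent(old_ms, new_ms):.1f}% faster")

# The same measurement expressed as a ratio of runtimes (a little over 2x).
print(f"{old_ms / new_ms:.2f}x as fast")
```

Read this way, "114% faster" and "roughly 2.15x as fast" describe the same measurement.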

📝 Explanation and details

The optimization replaces the original two separate function calls to `bboxes1_is_almost_subregion_of_bboxes2()` and the underlying `areas_of_boxes_and_intersection_area()` with a single, fused Numba-compiled function `_areas_and_subregion_mask()`.

**Key Changes:**

- **Numba JIT compilation**: The `@njit(cache=True, fastmath=True)` decorator compiles the intersection area computation and subregion logic to optimized machine code, eliminating Python interpretation overhead
- **Fused operations**: Instead of separately computing intersection areas, box areas, and then applying the subregion threshold check, everything is done in one pass within the compiled loop
- **Eliminated intermediate arrays**: The original code created large intermediate matrices for `inter_area`, `boxa_area`, and `boxb_area` that consumed memory and required additional vectorized operations
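
The body of `_areas_and_subregion_mask()` is not shown in this description, so the following is only a sketch of what a fused kernel of this kind could look like, not the PR's code. The name `fused_subregion_mask` is hypothetical. The subregion rule used here (intersection covers more than `threshold` of box A, and A is no larger than B) is an assumption. The sketch omits `cache=True` and falls back to plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(fastmath=True)
def fused_subregion_mask(boxes_a, boxes_b, threshold):
    """Return an (N, M) boolean mask: True where boxes_a[i] is almost inside boxes_b[j].

    Boxes are rows of (x1, y1, x2, y2). Areas, intersection, and the threshold
    test are all computed inside one nested loop (assumed rule, see lead-in).
    """
    n = boxes_a.shape[0]
    m = boxes_b.shape[0]
    mask = np.zeros((n, m), dtype=np.bool_)
    for i in range(n):
        ax1 = boxes_a[i, 0]
        ay1 = boxes_a[i, 1]
        ax2 = boxes_a[i, 2]
        ay2 = boxes_a[i, 3]
        area_a = (ax2 - ax1) * (ay2 - ay1)
        for j in range(m):
            bx1 = boxes_b[j, 0]
            by1 = boxes_b[j, 1]
            bx2 = boxes_b[j, 2]
            by2 = boxes_b[j, 3]
            # Width and height of the overlap; non-positive means no overlap.
            w = min(ax2, bx2) - max(ax1, bx1)
            h = min(ay2, by2) - max(ay1, by1)
            if w <= 0.0 or h <= 0.0:
                continue
            area_b = (bx2 - bx1) * (by2 - by1)
            # Multiply instead of divide so zero-area boxes need no special case.
            if area_a <= area_b and w * h > threshold * area_a:
                mask[i, j] = True
    return mask


inferred = np.array([[1.0, 1.0, 2.0, 2.0], [10.0, 10.0, 12.0, 12.0]])
extracted = np.array([[0.0, 0.0, 3.0, 3.0]])
mask = fused_subregion_mask(inferred, extracted, 0.5)
print(mask.any(axis=1))  # per inferred box: is it inside any extracted box?
```

Because the mask is filled cell by cell, the only (N, M) allocation in this sketch is the boolean output itself.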

**Why It's Faster:**
The original implementation had two expensive calls (82% and 37% of runtime respectively in the line profiler) that involved:

1. Converting coordinates via `get_coords_from_bboxes()`
2. Computing intersection areas using vectorized NumPy operations with broadcasting
3. Creating large intermediate arrays and applying mathematical operations across them

The Numba version eliminates the NumPy vectorization overhead by using explicit nested loops that compile to efficient machine code, avoiding temporary array allocations and reducing memory bandwidth requirements.
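
For contrast, the broadcast-based pattern described above can be sketched as follows. This illustrates the general technique and is not the library's actual `areas_of_boxes_and_intersection_area()`. The function name `pairwise_subregion_mask_numpy`, the epsilon value, and the subregion rule are assumptions made for the example.

```python
import numpy as np


def pairwise_subregion_mask_numpy(boxes_a, boxes_b, threshold, eps=1e-7):
    """Return an (N, M) boolean mask using NumPy broadcasting.

    Boxes are rows of (x1, y1, x2, y2). Each broadcast step below allocates
    a full (N, M) temporary array.
    """
    ax1, ay1, ax2, ay2 = np.split(boxes_a, 4, axis=1)  # each of shape (N, 1)
    bx1, by1, bx2, by2 = np.split(boxes_b, 4, axis=1)  # each of shape (M, 1)

    # (N, 1) against (1, M) broadcasts to (N, M): several temporaries per line.
    inter_w = np.maximum(np.minimum(ax2, bx2.T) - np.maximum(ax1, bx1.T), 0)
    inter_h = np.maximum(np.minimum(ay2, by2.T) - np.maximum(ay1, by1.T), 0)
    inter_area = inter_w * inter_h  # (N, M)

    area_a = (ax2 - ax1) * (ay2 - ay1)  # (N, 1)
    area_b = (bx2 - bx1) * (by2 - by1)  # (M, 1)

    # Assumed rule: the overlap covers more than `threshold` of box A,
    # and A is no larger than B.
    return (inter_area / np.maximum(area_a, eps) > threshold) & (area_a <= area_b.T)


inferred = np.array([[1.0, 1.0, 2.0, 2.0], [10.0, 10.0, 12.0, 12.0]])
extracted = np.array([[0.0, 0.0, 3.0, 3.0]])
print(pairwise_subregion_mask_numpy(inferred, extracted, 0.5))
```

For N inferred and M extracted boxes, every intermediate here is an N×M array, which is the memory traffic a single-pass loop avoids.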

**Impact on Workloads:**
Based on the function reference, this optimization directly benefits PDF layout merging operations in `array_merge_inferred_layout_with_extracted_layout()`, which is a core function for document processing. The 114% speedup is particularly valuable for:

- Large documents with many layout elements (the generated tests show 89-94% speedups on the 500+ element scenarios)
- Batch document processing where this function is called repeatedly
- Real-time document analysis workflows where latency matters

Per-test speedups in the generated regression tests range from about 89% to 811%. Most small-input tests (a handful of boxes) gain roughly 500-800%, the 50-100 element tests roughly 155-292%, and the 300-800 element tests roughly 89-104%. The larger cases, where the O(N×M) cost of comparing all inferred against all extracted elements dominates, save the most absolute time per call (for example 6.99ms → 3.69ms at N = 800).

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

from __future__ import annotations

import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import (
    _mark_non_table_inferred_for_removal_if_has_subregion_relationship,
)

DEFAULT_ROUND = 15


def get_coords_from_bboxes(bboxes, round_to: int = DEFAULT_ROUND) -> np.ndarray:
    """Convert a list of boxes' coordinates into a NumPy array of shape (N, 4)."""
    if isinstance(bboxes, np.ndarray):
        return bboxes.round(round_to)
    coords = np.zeros((len(bboxes), 4), dtype=np.float32)
    for i, bbox in enumerate(bboxes):
        coords[i, :] = [bbox.x1, bbox.y1, bbox.x2, bbox.y2]
    return coords.round(round_to)


# --- Minimal bounding-box and LayoutElements stand-ins for testing ---
class DummyBox:
    """A simple bounding box class for test purposes."""

    def __init__(self, x1, y1, x2, y2):
        self.x1 = float(x1)
        self.y1 = float(y1)
        self.x2 = float(x2)
        self.y2 = float(y2)


class LayoutElements:
    """A minimal stand-in for unstructured_inference.inference.layoutelement.LayoutElements."""

    def __init__(self, boxes):
        # boxes: list of DummyBox or np.ndarray of shape (N, 4)
        if isinstance(boxes, np.ndarray):
            self.element_coords = boxes
        else:
            # Assume boxes is a list of DummyBox
            self.element_coords = get_coords_from_bboxes(boxes)


# --- Unit Tests ---

# 1. BASIC TEST CASES


def test_no_overlap_all_remain():
    # No inferred is subregion of extracted and vice versa
    inferred = LayoutElements([DummyBox(0, 0, 1, 1), DummyBox(5, 5, 6, 6)])
    extracted = LayoutElements([DummyBox(10, 10, 11, 11)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 70.1μs -> 8.58μs (717% faster)


def test_inferred_is_subregion_of_extracted():
    # inferred[0] is inside extracted[0]
    inferred = LayoutElements([DummyBox(1, 1, 2, 2), DummyBox(10, 10, 12, 12)])
    extracted = LayoutElements([DummyBox(0, 0, 3, 3)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 67.8μs -> 8.17μs (730% faster)


def test_extracted_is_subregion_of_inferred():
    # extracted[0] is inside inferred[1]
    inferred = LayoutElements([DummyBox(0, 0, 1, 1), DummyBox(0, 0, 10, 10)])
    extracted = LayoutElements([DummyBox(1, 1, 2, 2)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 67.2μs -> 7.92μs (748% faster)


def test_both_subregion_relationships():
    # inferred[0] is subregion of extracted[0], extracted[1] is subregion of inferred[1]
    inferred = LayoutElements([DummyBox(1, 1, 2, 2), DummyBox(10, 10, 20, 20)])
    extracted = LayoutElements([DummyBox(0, 0, 3, 3), DummyBox(12, 12, 15, 15)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 69.7μs -> 8.92μs (682% faster)


def test_threshold_effect():
    # inferred[0] overlaps but not enough to trigger threshold
    inferred = LayoutElements([DummyBox(0, 0, 2, 2)])
    extracted = LayoutElements([DummyBox(1, 1, 3, 3)])
    inferred_to_keep = np.array([True])
    # With high threshold, should not remove
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.9
    )
    result = codeflash_output  # 64.8μs -> 7.75μs (737% faster)
    # With low threshold, should remove
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.1
    )
    result2 = codeflash_output  # 59.6μs -> 6.54μs (811% faster)


# 2. EDGE TEST CASES


def test_empty_inferred():
    # No inferred boxes
    inferred = LayoutElements([])
    extracted = LayoutElements([DummyBox(0, 0, 1, 1)])
    inferred_to_keep = np.array([], dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 64.8μs -> 8.00μs (710% faster)


def test_empty_extracted():
    # No extracted boxes
    inferred = LayoutElements([DummyBox(0, 0, 1, 1)])
    extracted = LayoutElements([])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 62.3μs -> 7.46μs (735% faster)


def test_all_inferred_removed():
    # All inferred boxes are subregions of extracted
    inferred = LayoutElements([DummyBox(1, 1, 2, 2), DummyBox(3, 3, 4, 4)])
    extracted = LayoutElements([DummyBox(0, 0, 5, 5)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 68.2μs -> 7.96μs (758% faster)


def test_all_inferred_already_false():
    inferred = LayoutElements([DummyBox(0, 0, 1, 1), DummyBox(2, 2, 3, 3)])
    extracted = LayoutElements([DummyBox(0, 0, 10, 10)])
    inferred_to_keep = np.array([False, False])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 67.5μs -> 7.75μs (772% faster)


def test_touching_but_not_overlapping():
    # Boxes touch at edge but do not overlap
    inferred = LayoutElements([DummyBox(0, 0, 1, 1)])
    extracted = LayoutElements([DummyBox(1, 1, 2, 2)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 64.4μs -> 7.54μs (754% faster)


def test_zero_area_boxes():
    # Boxes with zero area
    inferred = LayoutElements([DummyBox(1, 1, 1, 1)])  # zero area
    extracted = LayoutElements([DummyBox(0, 0, 2, 2)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 64.5μs -> 7.75μs (732% faster)


def test_floating_point_precision():
    # Boxes with nearly equal coordinates, test rounding
    inferred = LayoutElements([DummyBox(0.000000000000001, 0, 1, 1)])
    extracted = LayoutElements([DummyBox(0, 0, 1, 1)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 63.7μs -> 7.71μs (726% faster)


# 3. LARGE SCALE TEST CASES


def test_many_boxes_no_overlap():
    # 100 inferred, 100 extracted, no overlap at all
    inferred_boxes = [DummyBox(i * 10, i * 10, i * 10 + 5, i * 10 + 5) for i in range(100)]
    extracted_boxes = [
        DummyBox(1000 + i * 10, 1000 + i * 10, 1000 + i * 10 + 5, 1000 + i * 10 + 5)
        for i in range(100)
    ]
    inferred = LayoutElements(inferred_boxes)
    extracted = LayoutElements(extracted_boxes)
    inferred_to_keep = np.ones(100, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 207μs -> 65.8μs (215% faster)


def test_many_boxes_all_subregions():
    # 50 inferred, each is subregion of corresponding extracted
    inferred_boxes = [DummyBox(i, i, i + 2, i + 2) for i in range(50)]
    extracted_boxes = [DummyBox(i - 1, i - 1, i + 3, i + 3) for i in range(50)]
    inferred = LayoutElements(inferred_boxes)
    extracted = LayoutElements(extracted_boxes)
    inferred_to_keep = np.ones(50, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 102μs -> 26.0μs (292% faster)


def test_large_mixed_relationships():
    # 100 inferred, 100 extracted, half overlap as subregions, half do not
    inferred_boxes = [DummyBox(i, i, i + 2, i + 2) for i in range(50)] + [
        DummyBox(1000 + i, 1000 + i, 1000 + i + 2, 1000 + i + 2) for i in range(50)
    ]
    extracted_boxes = [DummyBox(i - 1, i - 1, i + 3, i + 3) for i in range(50)] + [
        DummyBox(2000 + i, 2000 + i, 2000 + i + 2, 2000 + i + 2) for i in range(50)
    ]
    inferred = LayoutElements(inferred_boxes)
    extracted = LayoutElements(extracted_boxes)
    inferred_to_keep = np.ones(100, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 201μs -> 66.4μs (203% faster)
    # First 50 should be removed, last 50 remain
    expected = np.array([False] * 50 + [True] * 50)


def test_performance_large_sparse():
    # 500 inferred, 500 extracted, only a few overlap
    inferred_boxes = [DummyBox(i * 2, i * 2, i * 2 + 1, i * 2 + 1) for i in range(500)]
    extracted_boxes = [DummyBox(i * 4, i * 4, i * 4 + 1, i * 4 + 1) for i in range(500)]
    inferred = LayoutElements(inferred_boxes)
    extracted = LayoutElements(extracted_boxes)
    inferred_to_keep = np.ones(500, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 2.51ms -> 1.30ms (92.1% faster)
    # Only pairs with i*2 == j*4 (i even, j = i // 2) overlap
    # For i in 0, 2, 4, ..., 498, inferred[i] coincides with extracted[i // 2]
    expected = np.ones(500, dtype=bool)
    for i in range(0, 500, 2):
        expected[i] = False


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import numpy as np

# imports
from unstructured.partition.pdf_image.pdfminer_processing import (
    _mark_non_table_inferred_for_removal_if_has_subregion_relationship,
)


# Minimal LayoutElement and LayoutElements for testing
class LayoutElement:
    def __init__(self, x1, y1, x2, y2):
        self.x1 = x1
        self.y1 = y1
        self.x2 = x2
        self.y2 = y2


class LayoutElements(list):
    @property
    def element_coords(self):
        # Returns np.ndarray of shape (N, 4)
        return np.array([[el.x1, el.y1, el.x2, el.y2] for el in self], dtype=np.float32)


# ========== Unit tests ==========

# ---- Basic Test Cases ----


def test_no_overlap_all_inferred_kept():
    # No subregion relationship, all inferred should be kept
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(20, 20, 30, 30), LayoutElement(40, 40, 50, 50)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 76.8μs -> 11.9μs (546% faster)


def test_inferred_is_subregion_of_extracted():
    # Inferred is subregion of extracted, should be removed
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(1, 1, 5, 5), LayoutElement(20, 20, 30, 30)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 72.9μs -> 11.2μs (553% faster)


def test_extracted_is_subregion_of_inferred():
    # Extracted is subregion of inferred, inferred should be removed
    extracted = LayoutElements([LayoutElement(2, 2, 4, 4)])
    inferred = LayoutElements([LayoutElement(0, 0, 10, 10), LayoutElement(20, 20, 30, 30)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 72.4μs -> 10.9μs (566% faster)


def test_multiple_inferred_and_extracted_mixed():
    # Mixed relationships
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10), LayoutElement(30, 30, 40, 40)])
    inferred = LayoutElements(
        [
            LayoutElement(1, 1, 5, 5),  # subregion of first extracted
            LayoutElement(30, 30, 40, 40),  # identical to second extracted
            LayoutElement(15, 15, 18, 18),  # no relation
        ]
    )
    inferred_to_keep = np.array([True, True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 76.2μs -> 12.7μs (502% faster)


def test_threshold_effect():
    # Subregion only if threshold is met
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(0, 0, 5, 5)])  # covers 1/4 of extracted
    inferred_to_keep = np.array([True])
    # threshold 0.6: should not be subregion
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.6
    )
    result = codeflash_output  # 69.1μs -> 10.7μs (545% faster)
    # threshold 0.2: should be subregion
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.2
    )
    result2 = codeflash_output  # 62.8μs -> 8.08μs (677% faster)


# ---- Edge Test Cases ----


def test_all_inferred_removed():
    # All inferred are subregions or contain extracted, all should be removed
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10), LayoutElement(20, 20, 30, 30)])
    inferred = LayoutElements(
        [
            LayoutElement(1, 1, 5, 5),  # subregion of first
            LayoutElement(0, 0, 10, 10),  # identical to first
            LayoutElement(21, 21, 29, 29),  # subregion of second
        ]
    )
    inferred_to_keep = np.array([True, True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 111μs -> 39.5μs (182% faster)


def test_all_inferred_already_false():
    # inferred_to_keep is already all False, should stay False
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(1, 1, 5, 5), LayoutElement(20, 20, 30, 30)])
    inferred_to_keep = np.array([False, False])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 78.2μs -> 13.8μs (465% faster)


def test_identical_boxes():
    # Inferred and extracted are identical, should be removed
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 71.3μs -> 12.0μs (492% faster)


def test_boxes_touching_but_not_overlapping():
    # Boxes touch at edge but do not overlap
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(10, 10, 20, 20)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 68.8μs -> 11.0μs (525% faster)


def test_float_precision_boxes():
    # Boxes with float coordinates, test rounding and precision
    extracted = LayoutElements([LayoutElement(0.0000001, 0.0000001, 10.0000001, 10.0000001)])
    inferred = LayoutElements([LayoutElement(1.0000001, 1.0000001, 5.0000001, 5.0000001)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 68.6μs -> 10.8μs (535% faster)


def test_negative_coordinates():
    # Boxes with negative coordinates
    extracted = LayoutElements([LayoutElement(-10, -10, 0, 0)])
    inferred = LayoutElements([LayoutElement(-9, -9, -1, -1), LayoutElement(1, 1, 5, 5)])
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 73.5μs -> 11.3μs (549% faster)


# ---- Large Scale Test Cases ----


def test_large_number_of_inferred_and_extracted():
    # Many inferred and extracted, only some overlap
    N = 500
    # Extracted: 500 overlapping boxes stepping along the diagonal
    extracted = LayoutElements(
        [LayoutElement(i * 2, i * 2, i * 2 + 5, i * 2 + 5) for i in range(N)]
    )
    # Inferred: 500 boxes, every 10th is subregion of an extracted, rest are disjoint
    inferred = LayoutElements(
        [
            (
                LayoutElement(i * 2 + 1, i * 2 + 1, i * 2 + 3, i * 2 + 3)
                if i % 10 == 0
                else LayoutElement(1000 + i, 1000 + i, 1005 + i, 1005 + i)
            )
            for i in range(N)
        ]
    )
    inferred_to_keep = np.ones(N, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 3.02ms -> 1.56ms (94.0% faster)
    # Every 10th inferred should be removed
    expected = np.ones(N, dtype=bool)
    expected[::10] = False


def test_large_all_disjoint():
    # Large N, all boxes disjoint, all inferred kept
    N = 800
    extracted = LayoutElements(
        [LayoutElement(i * 10, i * 10, i * 10 + 5, i * 10 + 5) for i in range(N)]
    )
    inferred = LayoutElements(
        [
            LayoutElement(10000 + i * 10, 10000 + i * 10, 10000 + i * 10 + 5, 10000 + i * 10 + 5)
            for i in range(N)
        ]
    )
    inferred_to_keep = np.ones(N, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 6.99ms -> 3.69ms (89.4% faster)


def test_large_all_subregions():
    # Large N, all inferred are subregions of extracted, all removed
    N = 300
    extracted = LayoutElements(
        [LayoutElement(i * 10, i * 10, i * 10 + 10, i * 10 + 10) for i in range(N)]
    )
    inferred = LayoutElements(
        [LayoutElement(i * 10 + 1, i * 10 + 1, i * 10 + 5, i * 10 + 5) for i in range(N)]
    )
    inferred_to_keep = np.ones(N, dtype=bool)
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 1.30ms -> 638μs (104% faster)


def test_large_inferred_to_keep_initial_mask():
    # Large N, some inferred_to_keep already False, only True ones can be removed
    N = 100
    extracted = LayoutElements(
        [LayoutElement(i * 10, i * 10, i * 10 + 10, i * 10 + 10) for i in range(N)]
    )
    inferred = LayoutElements(
        [LayoutElement(i * 10 + 1, i * 10 + 1, i * 10 + 5, i * 10 + 5) for i in range(N)]
    )
    inferred_to_keep = np.array([i % 2 == 0 for i in range(N)])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 307μs -> 120μs (155% faster)
    # All False: even indices are removed as subregions, odd indices started as False
    expected = np.zeros(N, dtype=bool)


# ---- Miscellaneous Robustness ----


def test_inferred_and_extracted_same_object():
    # inferred_layout and extracted_layout are the same object
    boxes = [LayoutElement(0, 0, 10, 10), LayoutElement(20, 20, 30, 30)]
    layout = LayoutElements(boxes)
    inferred_to_keep = np.array([True, True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        layout, layout, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 75.5μs -> 12.5μs (506% faster)


def test_inferred_and_extracted_overlap_but_not_subregion():
    # Overlap but not subregion (intersection area too small)
    extracted = LayoutElements([LayoutElement(0, 0, 10, 10)])
    inferred = LayoutElements([LayoutElement(9, 9, 20, 20)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.9
    )
    result = codeflash_output  # 71.3μs -> 10.8μs (558% faster)


def test_inferred_larger_than_extracted():
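    # Inferred is larger than extracted and fully contains it, so this is the
    # extracted-is-subregion case; as in test_extracted_is_subregion_of_inferred,
    # the inferred box is expected to be removed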
    extracted = LayoutElements([LayoutElement(5, 5, 10, 10)])
    inferred = LayoutElements([LayoutElement(0, 0, 20, 20)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 68.9μs -> 10.8μs (539% faster)


def test_extracted_larger_than_inferred():
    # Extracted is larger than inferred, inferred is subregion, should be removed
    extracted = LayoutElements([LayoutElement(0, 0, 20, 20)])
    inferred = LayoutElements([LayoutElement(5, 5, 10, 10)])
    inferred_to_keep = np.array([True])
    codeflash_output = _mark_non_table_inferred_for_removal_if_has_subregion_relationship(
        extracted, inferred, inferred_to_keep.copy(), 0.5
    )
    result = codeflash_output  # 68.8μs -> 10.5μs (556% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
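
How Codeflash performs this comparison is not shown in the tests above. The following is a minimal sketch of an equivalence check of that kind. The helper `outputs_match` and both toy functions are hypothetical stand-ins, not the real original and optimized implementations.

```python
import numpy as np


def outputs_match(original_fn, optimized_fn, *args):
    """Run both implementations on separate copies of the inputs and compare results."""

    def fresh(values):
        # Copy arrays so an in-place update by one call cannot leak into the other.
        return [v.copy() if isinstance(v, np.ndarray) else v for v in values]

    return np.array_equal(original_fn(*fresh(args)), optimized_fn(*fresh(args)))


# Toy stand-ins for an "original" and an "optimized" function (both hypothetical).
def clear_flagged_loop(keep, flagged):
    for i in range(len(keep)):
        if flagged[i]:
            keep[i] = False
    return keep


def clear_flagged_vectorized(keep, flagged):
    keep[flagged] = False
    return keep


keep = np.array([True, True, False, True])
flagged = np.array([False, True, True, False])
print(outputs_match(clear_flagged_loop, clear_flagged_vectorized, keep, flagged))
```

Copying the inputs matters for functions like the one under test, which update `inferred_to_keep` in place; the tests above pass `inferred_to_keep.copy()` for the same reason.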

To edit these changes, `git checkout codeflash/optimize-_mark_non_table_inferred_for_removal_if_has_subregion_relationship-mjddnijp` and push.


codeflash-ai bot requested a review from aseembits93 December 19, 2025 21:24

codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Dec 19, 2025
