⚡️ Speed up method `_DocxPartitioner._header_footer_text` by 11% #60

codeflash-ai · 2025-12-20T05:46:58Z

📄 11% (0.11x) speedup for `_DocxPartitioner._header_footer_text` in `unstructured/partition/docx.py`

⏱️ Runtime : 251 microseconds → 226 microseconds (best of 250 runs)
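
The headline percentage follows from the two runtimes above. As a quick check, assuming the speedup is reported as the old runtime divided by the new runtime, minus one (the per-test figures further down are consistent with that reading; the helper name `speedup` is hypothetical):

```python
def speedup(old_runtime: float, new_runtime: float) -> float:
    """Relative speedup as a fraction: 0.11 means 11% faster."""
    return old_runtime / new_runtime - 1.0


# Headline numbers from this report, in microseconds.
overall = speedup(251, 226)
print(f"{overall:.2f}x, {overall:.0%}")  # prints "0.11x, 11%"
```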

📝 Explanation and details

The optimization replaces a generator-based approach with a direct list accumulation, resulting in an 11% performance improvement.

**Key Changes:**

- **Removed nested generator function**: The original code used a nested `iter_hdrftr_texts()` generator function that yielded text items, then filtered and joined them in a generator expression.
- **Direct list accumulation**: The optimized version builds a list directly by iterating through blocks and only appending non-empty text items.
- **Eliminated generator overhead**: By avoiding the generator pattern and the filtering step in the final join operation, the code reduces Python's generator creation and iteration overhead.
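
The diff itself is not reproduced in this description, so the following is only a minimal sketch of the two shapes described above. It operates on plain strings rather than docx block items, and the names `join_texts_generator` and `join_texts_list` are hypothetical stand-ins, not the code in `unstructured/partition/docx.py`.

```python
from typing import Iterable, Iterator, List


def join_texts_generator(raw_texts: Iterable[str]) -> str:
    """Original shape: nested generator, then a filter inside the join."""

    def iter_texts() -> Iterator[str]:
        for raw in raw_texts:
            yield raw.strip()

    return "\n".join(text for text in iter_texts() if text)


def join_texts_list(raw_texts: Iterable[str]) -> str:
    """Optimized shape: append only non-empty items, then join once."""
    texts: List[str] = []
    for raw in raw_texts:
        text = raw.strip()
        if text:
            texts.append(text)
    return "\n".join(texts)


# Both shapes drop empty and whitespace-only items and give the same string.
sample = ["  Header  ", "", "   ", "Page 1"]
assert join_texts_generator(sample) == join_texts_list(sample) == "Header\nPage 1"
```

The list version does the emptiness check at the point where each item is produced, so the final `join` receives an already filtered list.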

**Why This is Faster:**
The original code had multiple layers of abstraction: a generator function that yielded items, then a generator expression that filtered empty strings, and finally a join operation. Each generator creates overhead for Python's iterator protocol. The optimized version eliminates this by:

1. Checking if text is non-empty before adding to the list (avoiding empty string filtering later)
2. Using a simple list append instead of yield/next mechanics
3. Performing a single join operation on a pre-filtered list

**Performance Characteristics:**

- **Small datasets** (single paragraphs/tables): 75-100% faster due to reduced function call overhead
- **Medium datasets** (multiple paragraphs/tables): 60-85% faster from eliminated generator mechanics
- **Large datasets** (500+ items): 2-4% faster as the benefits are diluted by the dominant cost of string operations
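
One rough way to explore how the gap changes with input size is a `timeit` comparison of the two patterns. This is an illustrative sketch on plain strings with hypothetical function names. Absolute numbers depend on the interpreter and machine, and it does not reproduce the measurements reported in this PR.

```python
import timeit


def with_generators(items):
    """Generator function plus a filtering generator expression."""

    def iter_items():
        for item in items:
            yield item.strip()

    return "\n".join(text for text in iter_items() if text)


def with_list(items):
    """Direct list accumulation with the emptiness check inline."""
    texts = []
    for item in items:
        text = item.strip()
        if text:
            texts.append(text)
    return "\n".join(texts)


for size in (1, 10, 1000):
    items = [f"Para {i}" for i in range(size)]
    number = 50_000 // size  # keep total work roughly constant across sizes
    for func in (with_generators, with_list):
        # Best of several repeats reduces noise from other processes.
        best = min(timeit.repeat(lambda: func(items), number=number, repeat=3))
        print(f"n={size:>4} {func.__name__:<16} {best / number * 1e6:.2f} us per call")
```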

This optimization is particularly effective for document parsing workloads where headers/footers are processed frequently, as it reduces the per-call overhead without changing the algorithmic complexity.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 119 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

from typing import List

# imports
from unstructured.partition.docx import _DocxPartitioner


# Mocks for docx objects
class MockParagraph:
    def __init__(self, text):
        self.text = text


class MockCell:
    def __init__(self, text):
        self.text = text


class MockTable:
    def __init__(self, cells: List[MockCell]):
        self._cells = cells


class MockHeaderFooter:
    """
    Mocks _Header or _Footer object.
    Accepts a list of block items (paragraphs or tables).
    """

    def __init__(self, blocks):
        self._blocks = blocks

    def iter_inner_content(self):
        return iter(self._blocks)


# Minimal DocxPartitionerOptions for constructor
class DocxPartitionerOptions:
    pass


# unit tests


# Helper to create partitioner
def partitioner():
    return _DocxPartitioner(DocxPartitionerOptions())


# -------------------
# Basic Test Cases
# -------------------


def test_empty_header_footer_returns_empty_string():
    # No blocks at all
    hdrftr = MockHeaderFooter([])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 916ns -> 416ns (120% faster)


def test_header_footer_with_only_whitespace_paragraphs():
    # Only whitespace paragraphs
    hdrftr = MockHeaderFooter([MockParagraph("   "), MockParagraph("\t\n")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.17μs -> 625ns (86.7% faster)


def test_single_paragraph():
    hdrftr = MockHeaderFooter([MockParagraph("Hello World")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.00μs -> 500ns (100% faster)


def test_multiple_paragraphs():
    hdrftr = MockHeaderFooter(
        [MockParagraph("First"), MockParagraph("Second"), MockParagraph("Third")]
    )
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.17μs -> 666ns (75.1% faster)


def test_paragraphs_with_leading_trailing_whitespace():
    hdrftr = MockHeaderFooter(
        [MockParagraph("  foo  "), MockParagraph("\tbar\n"), MockParagraph("baz")]
    )
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.12μs -> 625ns (80.0% faster)


def test_single_table_with_cells():
    table = MockTable([MockCell("cell1"), MockCell("cell2"), MockCell("cell3")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 959ns -> 500ns (91.8% faster)


def test_paragraph_and_table_mixed():
    table = MockTable([MockCell("A"), MockCell("B")])
    hdrftr = MockHeaderFooter([MockParagraph("Intro"), table, MockParagraph("Outro")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.08μs -> 625ns (73.3% faster)


def test_table_with_cells_having_whitespace():
    table = MockTable([MockCell("  x "), MockCell("  "), MockCell("y")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 959ns -> 500ns (91.8% faster)


def test_multiple_tables():
    table1 = MockTable([MockCell("foo"), MockCell("bar")])
    table2 = MockTable([MockCell("baz"), MockCell("qux")])
    hdrftr = MockHeaderFooter([table1, table2])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.04μs -> 583ns (78.7% faster)


def test_paragraph_with_newlines_inside():
    # Paragraph text may contain newlines
    hdrftr = MockHeaderFooter([MockParagraph("Line1\nLine2\nLine3")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.00μs -> 500ns (100% faster)


# -------------------
# Edge Test Cases
# -------------------


def test_paragraphs_and_tables_with_empty_texts():
    table = MockTable([MockCell(""), MockCell(" "), MockCell("cell")])
    hdrftr = MockHeaderFooter(
        [MockParagraph(""), MockParagraph("   "), table, MockParagraph("next")]
    )
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.17μs -> 708ns (64.8% faster)


def test_table_with_all_empty_cells():
    table = MockTable([MockCell(" "), MockCell(""), MockCell("\t")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 958ns -> 500ns (91.6% faster)


def test_table_with_some_empty_cells():
    table = MockTable([MockCell("A"), MockCell(" "), MockCell("B")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 958ns -> 500ns (91.6% faster)


def test_paragraph_and_table_with_mixed_empty_and_nonempty():
    table = MockTable([MockCell(""), MockCell("X")])
    hdrftr = MockHeaderFooter([MockParagraph(""), table, MockParagraph("Y")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.12μs -> 666ns (68.9% faster)


def test_table_with_cells_containing_newlines():
    table = MockTable([MockCell("foo\nbar"), MockCell("baz")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 916ns -> 459ns (99.6% faster)


def test_paragraph_with_only_newlines():
    hdrftr = MockHeaderFooter([MockParagraph("\n\n")])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 958ns -> 500ns (91.6% faster)


def test_table_with_cells_having_only_whitespace_and_newlines():
    table = MockTable([MockCell(" "), MockCell("\n"), MockCell("\t")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 916ns -> 459ns (99.6% faster)


def test_header_footer_with_mixed_types_and_order():
    table1 = MockTable([MockCell("foo")])
    table2 = MockTable([MockCell("bar"), MockCell("baz")])
    hdrftr = MockHeaderFooter(
        [MockParagraph("start"), table1, MockParagraph("middle"), table2, MockParagraph("end")]
    )
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.21μs -> 750ns (61.1% faster)


def test_table_with_leading_trailing_whitespace_in_cells():
    table = MockTable([MockCell("  a "), MockCell("b  "), MockCell("   c   ")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 875ns -> 500ns (75.0% faster)


def test_table_with_cells_containing_tabs_and_newlines():
    table = MockTable([MockCell("foo\tbar"), MockCell("baz\nqux")])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 916ns -> 458ns (100% faster)


# -------------------
# Large Scale Test Cases
# -------------------


def test_large_number_of_paragraphs():
    # 500 paragraphs with unique text
    paragraphs = [MockParagraph(f"Para {i}") for i in range(500)]
    hdrftr = MockHeaderFooter(paragraphs)
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 26.5μs -> 25.4μs (4.10% faster)
    expected = "\n".join(f"Para {i}" for i in range(500))


def test_large_table_with_many_cells():
    # Table with 500 cells
    table = MockTable([MockCell(f"Cell{i}") for i in range(500)])
    hdrftr = MockHeaderFooter([table])
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.00μs -> 541ns (84.8% faster)
    expected = " ".join(f"Cell{i}" for i in range(500))


def test_large_mixed_content():
    # 250 paragraphs, 250 tables (each with 2 cells)
    paragraphs = [MockParagraph(f"P{i}") for i in range(250)]
    tables = [MockTable([MockCell(f"T{i}A"), MockCell(f"T{i}B")]) for i in range(250)]
    # Interleave paragraphs and tables
    blocks = []
    for i in range(250):
        blocks.append(paragraphs[i])
        blocks.append(tables[i])
    hdrftr = MockHeaderFooter(blocks)
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 26.3μs -> 25.4μs (3.60% faster)
    # Expected: paragraphs and tables interleaved, one block per line:
    # P0\nT0A T0B\nP1\nT1A T1B\n...
    expected_lines = []
    for i in range(250):
        expected_lines.append(f"P{i}")
        expected_lines.append(f"T{i}A T{i}B")
    expected = "\n".join(expected_lines)


def test_large_table_with_some_empty_cells():
    # Table with 1000 cells, every 10th cell is empty
    cells = [MockCell("" if i % 10 == 0 else f"Cell{i}") for i in range(1000)]
    table = MockTable(cells)
    hdrftr = MockHeaderFooter([table])
    # Only non-empty cells should be included
    expected = " ".join(f"Cell{i}" for i in range(1000) if i % 10 != 0)
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 1.00μs -> 541ns (84.8% faster)


def test_large_header_footer_with_only_whitespace():
    # 1000 paragraphs, all whitespace
    paragraphs = [MockParagraph("   ") for _ in range(1000)]
    hdrftr = MockHeaderFooter(paragraphs)
    codeflash_output = partitioner()._header_footer_text(hdrftr)
    result = codeflash_output  # 51.5μs -> 50.2μs (2.66% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import List

# imports
from unstructured.partition.docx import _DocxPartitioner


# Mocks for docx objects
class MockParagraph:
    def __init__(self, text):
        self.text = text


class MockCell:
    def __init__(self, text):
        self.text = text


class MockRow:
    def __init__(self, cells: List[MockCell]):
        self.cells = cells


class MockTable:
    def __init__(self, rows: List[MockRow]):
        self.rows = rows


class MockHeaderFooter:
    def __init__(self, blocks):
        self._blocks = blocks

    def iter_inner_content(self):
        return iter(self._blocks)


# Minimal options stub for _DocxPartitioner
class DocxPartitionerOptions:
    pass


# ========== UNIT TESTS ==========

# Basic Test Cases


def test_single_paragraph():
    """Single paragraph, normal text."""
    hdrftr = MockHeaderFooter([MockParagraph("Hello world!")])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 958ns -> 500ns (91.6% faster)


def test_multiple_paragraphs():
    """Multiple paragraphs, normal text."""
    hdrftr = MockHeaderFooter(
        [MockParagraph("First line."), MockParagraph("Second line."), MockParagraph("Third line.")]
    )
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.12μs -> 666ns (68.9% faster)


def test_paragraphs_with_whitespace():
    """Paragraphs with leading/trailing whitespace are stripped."""
    hdrftr = MockHeaderFooter(
        [
            MockParagraph("   Leading space"),
            MockParagraph("Trailing space   "),
            MockParagraph("   Both sides   "),
            MockParagraph("NoSpace"),
        ]
    )
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.17μs -> 708ns (64.7% faster)


def test_empty_paragraphs_are_omitted():
    """Paragraphs with only whitespace or empty are omitted."""
    hdrftr = MockHeaderFooter(
        [
            MockParagraph(""),
            MockParagraph("   "),
            MockParagraph("Text"),
            MockParagraph("  \n  "),
            MockParagraph("Another"),
        ]
    )
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.21μs -> 750ns (61.1% faster)


def test_table_single_row_single_cell():
    """Table with one row and one cell."""
    table = MockTable([MockRow([MockCell("TableCell")])])
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 875ns -> 500ns (75.0% faster)


def test_table_multiple_rows_and_cells():
    """Table with multiple rows and cells."""
    table = MockTable(
        [MockRow([MockCell("A1"), MockCell("A2")]), MockRow([MockCell("B1"), MockCell("B2")])]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 917ns -> 500ns (83.4% faster)


def test_mixed_paragraphs_and_table():
    """Header/footer with paragraphs and a table."""
    table = MockTable([MockRow([MockCell("Cell1"), MockCell("Cell2")])])
    hdrftr = MockHeaderFooter([MockParagraph("Para1"), table, MockParagraph("Para2")])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.08μs -> 666ns (62.6% faster)


def test_table_with_empty_cells():
    """Table with empty cells and whitespace cells."""
    table = MockTable(
        [
            MockRow([MockCell(""), MockCell("   "), MockCell("Cell")]),
            MockRow([MockCell("Another"), MockCell("")]),
        ]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    # Only non-empty cells are included, but empty cells still yield spaces
    codeflash_output = part._header_footer_text(hdrftr)  # 958ns -> 500ns (91.6% faster)


def test_header_footer_with_only_empty_content():
    """Header/footer with only empty paragraphs/tables."""
    table = MockTable([MockRow([MockCell(" "), MockCell("")])])
    hdrftr = MockHeaderFooter([MockParagraph(""), MockParagraph("   "), table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.08μs -> 625ns (73.3% faster)


def test_paragraph_with_linebreaks():
    """Paragraph containing line breaks should preserve them."""
    hdrftr = MockHeaderFooter([MockParagraph("Line1\nLine2\nLine3"), MockParagraph("Another")])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.00μs -> 583ns (71.5% faster)


# Edge Test Cases


def test_table_with_all_empty_cells():
    """Table where all cells are empty or whitespace."""
    table = MockTable(
        [
            MockRow([MockCell(" "), MockCell("  "), MockCell("")]),
            MockRow([MockCell(""), MockCell("   ")]),
        ]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 917ns -> 500ns (83.4% faster)


def test_header_footer_with_no_blocks():
    """Header/footer with no blocks at all."""
    hdrftr = MockHeaderFooter([])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 833ns -> 416ns (100% faster)


def test_table_with_mixed_empty_and_nonempty_cells():
    """Table with a mix of empty and non-empty cells."""
    table = MockTable(
        [
            MockRow([MockCell("A"), MockCell(" "), MockCell("B")]),
            MockRow([MockCell(""), MockCell("C")]),
        ]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 958ns -> 458ns (109% faster)


def test_paragraphs_with_only_newlines():
    """Paragraphs that are just newlines should be omitted."""
    hdrftr = MockHeaderFooter(
        [
            MockParagraph("\n"),
            MockParagraph("\n\n"),
            MockParagraph("Text"),
            MockParagraph("  \n  "),
            MockParagraph("Another"),
        ]
    )
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.21μs -> 750ns (61.1% faster)


def test_table_with_cells_with_newlines():
    """Table cells containing newlines are included with spaces."""
    table = MockTable(
        [
            MockRow([MockCell("Cell1\nCell2"), MockCell("Cell3")]),
            MockRow([MockCell("Cell4\nCell5")]),
        ]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 958ns -> 500ns (91.6% faster)


def test_table_with_leading_trailing_whitespace_cells():
    """Table cells with leading/trailing whitespace are stripped."""
    table = MockTable(
        [MockRow([MockCell("   CellA  "), MockCell("CellB   "), MockCell("   CellC")])]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 916ns -> 500ns (83.2% faster)


def test_table_and_paragraphs_with_only_whitespace():
    """Header/footer with only whitespace paragraphs and tables."""
    table = MockTable([MockRow([MockCell(" "), MockCell("   ")])])
    hdrftr = MockHeaderFooter([MockParagraph("   "), table, MockParagraph(""), MockParagraph("  ")])
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 1.12μs -> 708ns (58.9% faster)


# Large Scale Test Cases


def test_large_number_of_paragraphs():
    """Header/footer with a large number of paragraphs."""
    paragraphs = [MockParagraph(f"Para {i}") for i in range(1000)]
    hdrftr = MockHeaderFooter(paragraphs)
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)
    result = codeflash_output  # 51.8μs -> 50.5μs (2.64% faster)
    # Should be all paragraphs joined by newline
    expected = "\n".join([f"Para {i}" for i in range(1000)])


def test_large_table():
    """Header/footer with a large table (100 rows, 10 cells per row)."""
    table = MockTable(
        [MockRow([MockCell(f"R{row}C{col}") for col in range(10)]) for row in range(100)]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    # All cell texts joined by spaces
    expected = " ".join([f"R{row}C{col}" for row in range(100) for col in range(10)])
    codeflash_output = part._header_footer_text(hdrftr)  # 1.04μs -> 584ns (78.3% faster)


def test_large_mixed_content():
    """Header/footer with a large mix of paragraphs and tables."""
    paragraphs = [MockParagraph(f"Para{i}") for i in range(500)]
    table = MockTable([MockRow([MockCell(f"T{i}C{j}") for j in range(5)]) for i in range(100)])
    blocks = []
    # Interleave paragraphs and tables
    for i in range(500):
        blocks.append(paragraphs[i])
        if i % 50 == 0:
            blocks.append(table)
    hdrftr = MockHeaderFooter(blocks)
    part = _DocxPartitioner(DocxPartitionerOptions())
    # Build expected output
    expected_parts = []
    for i in range(500):
        expected_parts.append(f"Para{i}")
        if i % 50 == 0:
            expected_parts.append(" ".join([f"T{k}C{j}" for k in range(100) for j in range(5)]))
    expected = "\n".join(expected_parts)
    codeflash_output = part._header_footer_text(hdrftr)  # 26.8μs -> 26.0μs (3.05% faster)


def test_large_table_with_some_empty_cells():
    """Large table with some empty cells scattered throughout."""
    table = MockTable(
        [
            MockRow(
                [
                    MockCell(f"R{row}C{col}") if (row + col) % 10 != 0 else MockCell(" ")
                    for col in range(10)
                ]
            )
            for row in range(100)
        ]
    )
    hdrftr = MockHeaderFooter([table])
    part = _DocxPartitioner(DocxPartitionerOptions())
    # Only non-empty cells included
    expected = " ".join(
        [f"R{row}C{col}" for row in range(100) for col in range(10) if (row + col) % 10 != 0]
    )
    codeflash_output = part._header_footer_text(hdrftr)  # 1.04μs -> 583ns (78.6% faster)


def test_large_header_footer_with_only_empty_content():
    """Large header/footer with only empty paragraphs and tables."""
    paragraphs = [MockParagraph("   ") for _ in range(500)]
    table = MockTable([MockRow([MockCell(" ") for _ in range(10)]) for _ in range(50)])
    blocks = paragraphs + [table]
    hdrftr = MockHeaderFooter(blocks)
    part = _DocxPartitioner(DocxPartitionerOptions())
    codeflash_output = part._header_footer_text(hdrftr)  # 26.2μs -> 25.3μs (3.45% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
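
The comparison that comment refers to is performed by Codeflash's own tooling, which is not shown here. As a minimal sketch of the idea only, assuming two callables with the same signature (the names `assert_same_output`, `original`, and `optimized` are hypothetical):

```python
from typing import Any, Callable, Iterable, Tuple


def assert_same_output(
    original: Callable[..., Any],
    optimized: Callable[..., Any],
    cases: Iterable[Tuple[Any, ...]],
) -> int:
    """Call both implementations on each argument tuple and compare results.

    Returns the number of cases checked; raises AssertionError on a mismatch.
    """
    checked = 0
    for args in cases:
        expected = original(*args)
        actual = optimized(*args)
        assert actual == expected, f"mismatch for {args!r}: {actual!r} != {expected!r}"
        checked += 1
    return checked


# Toy stand-ins for a "before" and "after" implementation.
def original(texts):
    return "\n".join(t for t in (s.strip() for s in texts) if t)


def optimized(texts):
    kept = []
    for s in texts:
        t = s.strip()
        if t:
            kept.append(t)
    return "\n".join(kept)


cases = [([],), (["  a ", "", "b"],), (["\n\n"],)]
print(assert_same_output(original, optimized, cases))  # prints 3
```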

To edit these changes, `git checkout codeflash/optimize-_DocxPartitioner._header_footer_text-mjdvlnc9` and push.


codeflash-ai bot requested a review from aseembits93 December 20, 2025 05:47

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: Medium` (Optimization Quality according to Codeflash) labels Dec 20, 2025
