@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 593% (5.93x) speedup for _DocxPartitioner._style_based_element_type in unstructured/partition/docx.py

⏱️ Runtime : 5.53 milliseconds → 798 microseconds (best of 116 runs)

📝 Explanation and details

The optimization achieves a 593% speedup by moving the STYLE_TO_ELEMENT_MAPPING dictionary from inside the method to module level as a global constant.

What changed:

  • Moved the 30-entry dictionary definition from inside _style_based_element_type() to the module level as STYLE_TO_ELEMENT_MAPPING
  • The method now simply references the pre-built dictionary instead of reconstructing it on every call

Why this is dramatically faster:
The original code rebuilt the 30-entry dictionary on every method invocation. The line profiler attributes 58.7% of total execution time to this dictionary creation (33.7ms out of 57.6ms total). Each call had to allocate a new dictionary and insert every entry again before performing a single lookup, which adds up quickly when the method is called repeatedly.

With the dictionary at module level, it is constructed once when the module is imported, which eliminates the repeated work. In the optimized version the dictionary lookup accounts for 53.3% of a much smaller total time.
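
The change described above can be sketched as follows. This is a simplified illustration rather than the actual diff: the `Text`, `Title`, and `ListItem` classes are stand-ins for the real element types, the mapping is abbreviated to a few entries, the function names `style_type_before` and `style_type_after` are hypothetical, and the "Normal" fallback is assumed from the test descriptions in this report.

```python
from types import SimpleNamespace


# Stand-in element types; placeholders for the real element classes (assumption).
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


def style_type_before(paragraph):
    """Original shape: the dict literal is rebuilt on every call."""
    mapping = {
        "Caption": Text,
        "Heading 1": Title,
        "List Bullet": ListItem,
        "Title": Title,
        # remaining entries omitted in this sketch
    }
    # Assumed fallback: a missing style or style name is treated as "Normal".
    style_name = (paragraph.style and paragraph.style.name) or "Normal"
    return mapping.get(style_name)


# Optimized shape: the same mapping, built once when the module is imported.
STYLE_TO_ELEMENT_MAPPING = {
    "Caption": Text,
    "Heading 1": Title,
    "List Bullet": ListItem,
    "Title": Title,
}


def style_type_after(paragraph):
    """Optimized shape: only a dictionary lookup happens per call."""
    style_name = (paragraph.style and paragraph.style.name) or "Normal"
    return STYLE_TO_ELEMENT_MAPPING.get(style_name)


if __name__ == "__main__":
    heading = SimpleNamespace(style=SimpleNamespace(name="Heading 1"))
    unstyled = SimpleNamespace(style=None)
    print(style_type_before(heading) is style_type_after(heading))
    print(style_type_after(unstyled))
```

Both functions return the same result for any input; the only difference is where the dictionary is built.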

Performance characteristics:

  • Individual test cases show speedups of roughly 256% to 645%, so the benefit holds across style types
  • Large-scale tests with 800-1000 paragraphs show gains of 518-645%, in line with the per-call improvement being repeated for every paragraph
  • Edge cases (None styles, unknown styles) speed up by a similar margin (roughly 256-458%), and no test case regressed
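
The percentages in this report are consistent with computing speedup as (original - optimized) / optimized. That formula is an inference from the reported numbers, not a documented definition, and the helper names below are hypothetical. A minimal sketch using the headline runtimes:

```python
def speedup_percent(original_ns, optimized_ns):
    """Percent faster, assuming speedup = (original - optimized) / optimized."""
    return (original_ns - optimized_ns) / optimized_ns * 100


def runtime_ratio(original_ns, optimized_ns):
    """How many times longer the original run took than the optimized one."""
    return original_ns / optimized_ns


if __name__ == "__main__":
    # Headline figures from this report: 5.53 ms -> 798 microseconds.
    original_ns, optimized_ns = 5_530_000, 798_000
    print(round(speedup_percent(original_ns, optimized_ns)))   # 593
    print(round(runtime_ratio(original_ns, optimized_ns), 2))  # 6.93
```

Read this way, a 593% speedup means the optimized run takes roughly one seventh of the original time.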

This optimization is especially valuable for document processing workloads where _style_based_element_type() is called repeatedly for each paragraph in potentially large documents, making the cumulative time savings substantial.
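
To get a feel for the cumulative effect on a document with many paragraphs, the two approaches can be timed with `timeit`. This is a rough sketch under stated assumptions: the mapping is abbreviated and maps to plain strings, all names are hypothetical, and the extra `build_mapping()` call adds overhead that the original inline dict literal did not have, so absolute numbers will not match the figures in this report.

```python
import timeit


def build_mapping():
    """Return a fresh style-name -> element-name dict (abbreviated, hypothetical)."""
    return {
        "Caption": "Text",
        "Heading 1": "Title",
        "Heading 2": "Title",
        "List": "ListItem",
        "List Bullet": "ListItem",
        "Quote": "Text",
        "Subtitle": "Title",
        "Title": "Title",
    }


def classify_rebuilding(style_names):
    # Rebuilds the dict for every paragraph, as the original method did per call.
    return [build_mapping().get(name) for name in style_names]


MAPPING = build_mapping()  # built once, like a module-level constant


def classify_cached(style_names):
    # Reuses the prebuilt dict; only the lookup runs per paragraph.
    return [MAPPING.get(name) for name in style_names]


def measure(paragraph_count=1000, repeats=50):
    """Time both variants over a synthetic list of `paragraph_count` style names."""
    names = ["Heading 1", "List Bullet", "Normal", "Title"] * (paragraph_count // 4)
    rebuilding = timeit.timeit(lambda: classify_rebuilding(names), number=repeats)
    cached = timeit.timeit(lambda: classify_cached(names), number=repeats)
    return rebuilding, cached


if __name__ == "__main__":
    rebuilding, cached = measure()
    print(f"rebuild per paragraph: {rebuilding:.4f}s, module-level: {cached:.4f}s")
```

The gap between the two timings grows linearly with the number of paragraphs, since the saved work is a fixed cost per call.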

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 5947 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from types import SimpleNamespace

# Function under test, imported from the package. DocxPartitionerOptions is
# replaced by a dummy class below because only the constructor needs it.
from unstructured.partition.docx import _DocxPartitioner


# Dummy DocxPartitionerOptions for __init__ signature
class DocxPartitionerOptions:
    pass


# Dummy element types for testing
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


# Helper to create a mock Paragraph object with a given style name
def make_paragraph(style_name=None):
    if style_name is None:
        # Simulate paragraph.style is None
        return SimpleNamespace(style=None)
    else:
        # Simulate paragraph.style.name
        style = SimpleNamespace(name=style_name)
        return SimpleNamespace(style=style)


# Basic Test Cases


def test_heading_styles():
    """Test that heading styles map to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    for i in range(1, 10):
        para = make_paragraph(f"Heading {i}")
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 9.46μs -> 1.79μs (428% faster)


def test_title_style():
    """Test that 'Title' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Title")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_subtitle_style():
    """Test that 'Subtitle' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Subtitle")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.25μs -> 292ns (328% faster)


def test_tocheading_style():
    """Test that 'TOCHeading' style maps to Title."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("TOCHeading")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_list_styles():
    """Test that various list styles map to ListItem."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    list_styles = [
        "List",
        "List 2",
        "List 3",
        "List Bullet",
        "List Bullet 2",
        "List Bullet 3",
        "List Continue",
        "List Continue 2",
        "List Continue 3",
        "List Number",
        "List Number 2",
        "List Number 3",
        "List Paragraph",
    ]
    for style in list_styles:
        para = make_paragraph(style)
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 12.7μs -> 2.04μs (521% faster)


def test_text_styles():
    """Test that various text styles map to Text."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    text_styles = ["Caption", "Intense Quote", "Macro Text", "No Spacing", "Quote"]
    for style in text_styles:
        para = make_paragraph(style)
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 5.12μs -> 873ns (487% faster)


def test_unknown_style_returns_none():
    """Test that unknown style names return None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("MyCustomStyle")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 291ns (358% faster)


def test_normal_style_returns_none():
    """Test that 'Normal' style returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("Normal")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 291ns (358% faster)


# Edge Test Cases


def test_paragraph_style_is_none():
    """Test that a paragraph whose style is None returns None (treated as 'Normal')."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph(None)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_paragraph_style_name_is_none():
    """Test that a paragraph whose style.name is None returns None (treated as 'Normal')."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=None)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_style_name_case_sensitivity():
    """Test that style names are case sensitive."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("title")  # lower-case, should not match
    codeflash_output = partitioner._style_based_element_type(para)  # 2.38μs -> 667ns (256% faster)


def test_style_name_with_whitespace():
    """Test that style names with extra whitespace do not match."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph(" Title ")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.67μs -> 333ns (401% faster)


def test_style_name_is_empty_string():
    """Test that empty string style name returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    para = make_paragraph("")
    codeflash_output = partitioner._style_based_element_type(para)  # 1.62μs -> 291ns (458% faster)


def test_style_name_is_integer():
    """Test that non-string style name returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=123)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.50μs -> 333ns (350% faster)


def test_style_name_is_none_and_style_is_object():
    """Test that style.name is None and style is not None returns None."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    style = SimpleNamespace(name=None)
    para = SimpleNamespace(style=style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.38μs -> 292ns (371% faster)


# Large Scale Test Cases


def test_large_number_of_paragraphs_known_styles():
    """Test performance and correctness with many paragraphs of known styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    styles = [
        "Heading 1",
        "Heading 2",
        "Heading 3",
        "List",
        "List Bullet",
        "Title",
        "Caption",
        "Quote",
    ]
    expected_types = {
        "Heading 1": Title,
        "Heading 2": Title,
        "Heading 3": Title,
        "List": ListItem,
        "List Bullet": ListItem,
        "Title": Title,
        "Caption": Text,
        "Quote": Text,
    }
    paragraphs = [make_paragraph(style) for style in styles * 100]  # 800 paragraphs
    for para in paragraphs:
        style_name = para.style.name
        expected = expected_types[style_name]
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 737μs -> 100μs (637% faster)


def test_large_number_of_paragraphs_unknown_styles():
    """Test performance and correctness with many paragraphs of unknown styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    paragraphs = [make_paragraph(f"CustomStyle{i}") for i in range(1000)]
    for para in paragraphs:
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 955μs -> 154μs (518% faster)


def test_large_number_of_paragraphs_mixed_styles():
    """Test performance and correctness with a mix of known and unknown styles."""
    partitioner = _DocxPartitioner(DocxPartitionerOptions())
    known_styles = ["Heading 1", "List", "Title", "Caption"]
    unknown_styles = [f"Unknown{i}" for i in range(500)]
    paragraphs = []
    # Alternate known and unknown styles
    for i in range(500):
        paragraphs.append(make_paragraph(known_styles[i % len(known_styles)]))
        paragraphs.append(make_paragraph(unknown_styles[i]))
    # 1000 paragraphs total
    for i, para in enumerate(paragraphs):
        if i % 2 == 0:
            # known style
            style = known_styles[(i // 2) % len(known_styles)]
            expected = {"Heading 1": Title, "List": ListItem, "Title": Title, "Caption": Text}[
                style
            ]
            codeflash_output = partitioner._style_based_element_type(para)
        else:
            # unknown style
            codeflash_output = partitioner._style_based_element_type(para)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# Second generated test module (pytest fixture and Dummy* classes instead of SimpleNamespace)
# imports
import pytest

from unstructured.partition.docx import _DocxPartitioner


# Dummy classes for element types (simulate unstructured.documents.elements)
class Text:
    pass


class Title:
    pass


class ListItem:
    pass


# Dummy Paragraph and Style classes to simulate python-docx
class DummyStyle:
    def __init__(self, name):
        self.name = name


class DummyParagraph:
    def __init__(self, style):
        self.style = style


# Dummy DocxPartitionerOptions (not used in the function, but required for __init__)
class DocxPartitionerOptions:
    pass


# unit tests


@pytest.fixture
def partitioner():
    # Returns an instance of _DocxPartitioner for use in tests
    return _DocxPartitioner(DocxPartitionerOptions())


# ---------------------------
# 1. Basic Test Cases
# ---------------------------


def test_heading_styles_return_title(partitioner):
    # Test all heading styles map to Title
    for i in range(1, 10):
        para = DummyParagraph(DummyStyle(f"Heading {i}"))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 9.25μs -> 1.50μs (517% faster)


def test_caption_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Caption"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_quote_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Quote"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_subtitle_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("Subtitle"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_title_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("Title"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_tocheading_returns_title(partitioner):
    para = DummyParagraph(DummyStyle("TOCHeading"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_list_styles_return_listitem(partitioner):
    list_styles = [
        "List",
        "List 2",
        "List 3",
        "List Bullet",
        "List Bullet 2",
        "List Bullet 3",
        "List Continue",
        "List Continue 2",
        "List Continue 3",
        "List Number",
        "List Number 2",
        "List Number 3",
        "List Paragraph",
    ]
    for style in list_styles:
        para = DummyParagraph(DummyStyle(style))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 12.5μs -> 2.04μs (513% faster)


def test_macro_text_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("Macro Text"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.38μs -> 292ns (371% faster)


def test_no_spacing_returns_text(partitioner):
    para = DummyParagraph(DummyStyle("No Spacing"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.33μs -> 292ns (357% faster)


# ---------------------------
# 2. Edge Test Cases
# ---------------------------


def test_normal_style_returns_none(partitioner):
    # "Normal" style is not mapped, should return None
    para = DummyParagraph(DummyStyle("Normal"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_none_style_returns_none(partitioner):
    # Paragraph.style is None, should be treated as "Normal"
    para = DummyParagraph(None)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.25μs -> 250ns (400% faster)


def test_style_object_with_none_name_returns_none(partitioner):
    # Paragraph.style.name is None, should be treated as "Normal"
    style = DummyStyle(None)
    para = DummyParagraph(style)
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (416% faster)


def test_unknown_style_returns_none(partitioner):
    # Unknown style should return None
    para = DummyParagraph(DummyStyle("MyCustomStyle"))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_style_name_empty_string_returns_none(partitioner):
    # Empty string style name should return None
    para = DummyParagraph(DummyStyle(""))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 250ns (417% faster)


def test_style_name_whitespace_returns_none(partitioner):
    # Whitespace style name should return None
    para = DummyParagraph(DummyStyle("   "))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.29μs -> 291ns (344% faster)


def test_style_name_case_sensitivity(partitioner):
    # Should be case-sensitive: "heading 1" should not match "Heading 1"
    para = DummyParagraph(DummyStyle("heading 1"))
    codeflash_output = partitioner._style_based_element_type(para)  # 2.00μs -> 500ns (300% faster)


def test_style_name_with_extra_spaces(partitioner):
    # Should not match if extra spaces are present
    para = DummyParagraph(DummyStyle(" Heading 1 "))
    codeflash_output = partitioner._style_based_element_type(para)  # 1.54μs -> 375ns (311% faster)


# ---------------------------
# 3. Large Scale Test Cases
# ---------------------------


def test_large_batch_of_paragraphs_mixed_styles(partitioner):
    # Create a large number of paragraphs with a mix of known and unknown styles
    known_styles = [
        "Heading 1",
        "Caption",
        "List Bullet",
        "Quote",
        "Subtitle",
        "Title",
        "TOCHeading",
    ]
    unknown_styles = ["CustomStyleA", "Unknown", "heading 1", "", "   ", None]
    paragraphs = []
    # Alternate between known and unknown styles
    for i in range(500):
        style_name = known_styles[i % len(known_styles)]
        paragraphs.append(DummyParagraph(DummyStyle(style_name)))
        style_name = unknown_styles[i % len(unknown_styles)]
        paragraphs.append(DummyParagraph(DummyStyle(style_name)))
    # Check that known styles return correct types, unknown return None
    for i, para in enumerate(paragraphs):
        style_name = para.style.name if para.style else None
        if style_name in known_styles:
            expected_type = {
                "Heading 1": Title,
                "Caption": Text,
                "List Bullet": ListItem,
                "Quote": Text,
                "Subtitle": Title,
                "Title": Title,
                "TOCHeading": Title,
            }[style_name]
            codeflash_output = partitioner._style_based_element_type(para)
        else:
            codeflash_output = partitioner._style_based_element_type(para)


def test_all_mapping_styles_are_covered(partitioner):
    # Ensure every style in the mapping returns the correct type
    mapping = {
        "Caption": Text,
        "Heading 1": Title,
        "Heading 2": Title,
        "Heading 3": Title,
        "Heading 4": Title,
        "Heading 5": Title,
        "Heading 6": Title,
        "Heading 7": Title,
        "Heading 8": Title,
        "Heading 9": Title,
        "Intense Quote": Text,
        "List": ListItem,
        "List 2": ListItem,
        "List 3": ListItem,
        "List Bullet": ListItem,
        "List Bullet 2": ListItem,
        "List Bullet 3": ListItem,
        "List Continue": ListItem,
        "List Continue 2": ListItem,
        "List Continue 3": ListItem,
        "List Number": ListItem,
        "List Number 2": ListItem,
        "List Number 3": ListItem,
        "List Paragraph": ListItem,
        "Macro Text": Text,
        "No Spacing": Text,
        "Quote": Text,
        "Subtitle": Title,
        "TOCHeading": Title,
        "Title": Title,
    }
    for style_name, expected_type in mapping.items():
        para = DummyParagraph(DummyStyle(style_name))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 28.2μs -> 4.28μs (559% faster)


def test_large_number_of_unknown_styles_returns_none(partitioner):
    # Test with 1000 paragraphs with unknown styles
    for i in range(1000):
        para = DummyParagraph(DummyStyle(f"UnknownStyle{i}"))
        codeflash_output = partitioner._style_based_element_type(
            para
        )  # 939μs -> 140μs (568% faster)


def test_large_number_of_normal_and_none_styles_returns_none(partitioner):
    # Test with 500 paragraphs with "Normal" and 500 with None style
    for i in range(500):
        para_normal = DummyParagraph(DummyStyle("Normal"))
        para_none = DummyParagraph(None)
        codeflash_output = partitioner._style_based_element_type(
            para_normal
        )  # 459μs -> 61.7μs (645% faster)
        codeflash_output = partitioner._style_based_element_type(para_none)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
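
The comparison mentioned in the comment above can be pictured as running both versions on the same inputs and collecting any disagreements. The sketch below is one way such a check could look; it is not Codeflash's actual harness, and `find_mismatches` and the two lookup functions are hypothetical.

```python
def find_mismatches(original, optimized, inputs):
    """Return the inputs on which the two callables produce different outputs."""
    return [value for value in inputs if original(value) != optimized(value)]


def lookup_original(style_name):
    # Stand-in for the original method: builds its mapping on every call.
    return {"Title": "Title", "List": "ListItem"}.get(style_name)


_MAPPING = {"Title": "Title", "List": "ListItem"}


def lookup_optimized(style_name):
    # Stand-in for the optimized method: reuses a prebuilt mapping.
    return _MAPPING.get(style_name)


if __name__ == "__main__":
    samples = ["Title", "List", "Normal", "", None]
    print(find_mismatches(lookup_original, lookup_optimized, samples))
```

An empty result means the two versions agree on every sampled input, which is evidence of equivalence on those inputs only.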

To edit these changes, git checkout codeflash/optimize-_DocxPartitioner._style_based_element_type-mjdwusew and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 06:22
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Dec 20, 2025