@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 8% (0.08x) speedup for stage_for_datasaur in unstructured/staging/datasaur.py

⏱️ Runtime : 1.69 milliseconds -> 1.56 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization replaces the explicit loop-based result construction with a list comprehension. This change eliminates the intermediate result list initialization and the repeated append() operations.

Key changes:

  • Removed result: List[Dict[str, Any]] = [] initialization
  • Replaced the for i, item in enumerate(elements): loop with a single list comprehension: return [{"text": item.text, "entities": _entities[i]} for i, item in enumerate(elements)]
  • Eliminated multiple result.append(data) calls
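
The change described above can be sketched in isolation. This is a minimal illustration, not the library's actual source: `FakeElement`, `build_with_loop`, and `build_with_comprehension` are hypothetical names, and the validation that `stage_for_datasaur` performs on entities is left out.

```python
from typing import Any, Dict, List


class FakeElement:
    """Hypothetical stand-in for an element that exposes a .text attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def build_with_loop(elements, entities):
    # Loop shape: start from an empty list and append one dict per element.
    result: List[Dict[str, Any]] = []
    for i, item in enumerate(elements):
        data = {"text": item.text, "entities": entities[i]}
        result.append(data)
    return result


def build_with_comprehension(elements, entities):
    # Comprehension shape: the same dicts, built in a single expression.
    return [{"text": item.text, "entities": entities[i]} for i, item in enumerate(elements)]


elements = [FakeElement("foo bar"), FakeElement("baz qux")]
entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
loop_result = build_with_loop(elements, entities)
comprehension_result = build_with_comprehension(elements, entities)
```

Both functions return equal lists, which is the property the regression tests check between the original and optimized code.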

Why this is faster:
In CPython, a list comprehension compiles to a bytecode loop that appends with the dedicated LIST_APPEND instruction, so it avoids the per-iteration attribute lookup and method call that an explicit result.append(data) requires. The optimization eliminates the overhead of:

  • Looking up the append method on result in every iteration
  • Multiple function calls to append()
  • Temporary variable assignment (data)
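
One way to check where the difference comes from is to compare the bytecode of the two shapes. The sketch below assumes CPython (opcode names are an implementation detail and vary between versions); `opnames`, `with_loop`, and `with_comprehension` are hypothetical helper names.

```python
import dis
from types import CodeType


def opnames(code: CodeType) -> set:
    """Collect opcode names from a code object and any code objects nested in it."""
    names = {instr.opname for instr in dis.get_instructions(code)}
    for const in code.co_consts:
        # Before Python 3.12 a comprehension lives in its own nested code object.
        if isinstance(const, CodeType):
            names |= opnames(const)
    return names


def with_loop(items):
    result = []
    for item in items:
        result.append(item)  # attribute lookup plus a call on every iteration
    return result


def with_comprehension(items):
    return [item for item in items]  # appends via the LIST_APPEND instruction


loop_ops = opnames(with_loop.__code__)
comprehension_ops = opnames(with_comprehension.__code__)
uses_list_append = ("LIST_APPEND" in comprehension_ops, "LIST_APPEND" in loop_ops)
```

Only the comprehension's bytecode should contain `LIST_APPEND`; the explicit loop goes through a regular method call instead. Running `dis.dis(with_loop)` prints the full listing for closer inspection.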

Performance characteristics:
The annotated test timings show this optimization is most effective for larger inputs: the two 1000-element tests without entities run 18-20% faster (102μs -> 87.0μs and 103μs -> 86.7μs), and the 500- and 1000-element tests with entities run about 7-10% faster. Tests with only a few elements finish in roughly 1-3μs and vary from about 9% faster to about 17% slower, so the comprehension offers no clear benefit at that size.
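
A measurement of this kind can be approximated with `timeit`. This is a rough sketch under stated assumptions, not the harness that produced the numbers in this report: `FakeElement`, `build_with_loop`, `build_with_comprehension`, and `compare` are hypothetical names, and absolute timings depend on the machine and Python version.

```python
import timeit


class FakeElement:
    """Hypothetical stand-in for an element that exposes a .text attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def build_with_loop(elements, entities):
    result = []
    for i, item in enumerate(elements):
        data = {"text": item.text, "entities": entities[i]}
        result.append(data)
    return result


def build_with_comprehension(elements, entities):
    return [{"text": item.text, "entities": entities[i]} for i, item in enumerate(elements)]


def compare(n: int, repeat: int = 5, number: int = 200) -> tuple:
    """Return the best per-call time in seconds for (loop, comprehension) at size n."""
    elements = [FakeElement(f"text_{i}") for i in range(n)]
    entities = [[] for _ in range(n)]
    loop_best = min(
        timeit.repeat(lambda: build_with_loop(elements, entities), repeat=repeat, number=number)
    )
    comp_best = min(
        timeit.repeat(
            lambda: build_with_comprehension(elements, entities), repeat=repeat, number=number
        )
    )
    return loop_best / number, comp_best / number


timings = {n: compare(n) for n in (1, 10, 1000)}
for n, (loop_s, comp_s) in timings.items():
    print(f"n={n:>5}: loop {loop_s * 1e6:.2f}μs, comprehension {comp_s * 1e6:.2f}μs")
```

Taking the minimum over several repeats reduces the influence of scheduling noise, which matters most for the small inputs where a single call takes about a microsecond.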

Impact on workloads:
This optimization will benefit any application processing substantial amounts of text data for Datasaur formatting, particularly document processing pipelines or batch entity annotation workflows where hundreds or thousands of text elements are processed together.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 6 Passed |
| 🌀 Generated Regression Tests | 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 3 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| staging/test_datasaur.py::test_datasaur_raises_with_bad_type | 2.67μs | 2.50μs | 6.64%✅ |
| staging/test_datasaur.py::test_datasaur_raises_with_missing_entity_text | 1.04μs | 1.04μs | -0.096%⚠️ |
| staging/test_datasaur.py::test_datasaur_raises_with_missing_key | 2.08μs | 1.96μs | 6.33%✅ |
| staging/test_datasaur.py::test_datasaur_raises_with_wrong_length | 1.08μs | 1.04μs | 4.03%✅ |
| staging/test_datasaur.py::test_stage_for_datasaur | 1.29μs | 1.33μs | -3.08%⚠️ |
| staging/test_datasaur.py::test_stage_for_datasaur_with_entities | 2.50μs | 2.46μs | 1.67%✅ |

🌀 Generated Regression Tests and Runtime
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Mock class for Text, as per unstructured.documents.elements.Text
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_single_element_no_entities():
    # Single Text element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.12μs -> 1.25μs (10.0% slower)


def test_multiple_elements_no_entities():
    # Multiple Text elements, no entities
    elements = [Text("a"), Text("b"), Text("c")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.38μs -> 1.38μs (0.000% faster)
    for i, letter in enumerate(["a", "b", "c"]):
        pass


def test_single_element_with_single_entity():
    # Single element, one entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.04μs (0.000% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "NOUN", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.50μs -> 2.58μs (3.21% slower)


def test_elements_with_mixed_entities():
    # Some elements have entities, some do not
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.000% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_empty_elements_list():
    # Empty input list
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 791ns -> 875ns (9.60% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements length
    elements = [Text("foo"), Text("bar")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity is missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0}]]  # missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.83μs -> 1.75μs (4.74% faster)


def test_entity_wrong_type():
    # Entity has wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": "0", "end_idx": 3}]
    ]  # 'start_idx' should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.42μs -> 2.33μs (3.60% faster)


def test_entity_extra_keys():
    # Entity has extra keys (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3, "confidence": 0.99}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.04μs (2.01% slower)


def test_entities_is_none():
    # entities explicitly passed as None
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements, None)
    result = codeflash_output  # 1.04μs -> 1.08μs (3.79% slower)


def test_entity_empty_list():
    # entities is a list of empty lists (should be valid)
    elements = [Text("foo"), Text("bar")]
    entities = [[], []]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.42μs -> 1.50μs (5.60% slower)


def test_entity_text_not_matching_element():
    # Entity text does not match element text (should not error)
    elements = [Text("foobar")]
    entities = [[{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.00μs (0.000% faster)


def test_entity_indices_out_of_bounds():
    # Entity indices out of text bounds (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 10}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.10% slower)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_large_number_of_elements():
    # Test with 1000 elements, no entities
    n = 1000
    elements = [Text(str(i)) for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 102μs -> 87.0μs (18.1% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Test with 500 elements, each with one entity
    n = 500
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 244μs -> 227μs (7.83% faster)
    for i in range(n):
        pass


def test_large_number_of_entities_per_element():
    # Test with 10 elements, each with 100 entities
    elements = [Text(f"text_{i}") for i in range(10)]
    entities = [
        [{"text": f"t_{j}", "type": "TYPE", "start_idx": j, "end_idx": j + 1} for j in range(100)]
        for _ in range(10)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 356μs -> 337μs (5.73% faster)
    for i in range(10):
        for j in range(100):
            pass


# ---------------------------
# Mutation Testing Guards
# ---------------------------


def test_mutation_guard_wrong_text_key():
    # Changing the output key 'text' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.00μs -> 1.04μs (4.03% slower)


def test_mutation_guard_wrong_entities_key():
    # Changing the output key 'entities' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 958ns -> 1.00μs (4.20% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Dummy Text class for testing, since unstructured.documents.elements.Text is not available
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# --------------------- Basic Test Cases ---------------------


def test_single_element_no_entities():
    # One element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.17μs -> 1.21μs (3.47% slower)


def test_multiple_elements_no_entities():
    # Multiple elements, no entities
    elements = [Text("foo"), Text("bar"), Text("baz")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.29μs -> 1.33μs (3.15% slower)


def test_single_element_with_valid_entities():
    # One element, one valid entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.00μs (2.05% faster)


def test_multiple_elements_with_entities():
    # Multiple elements, each with their own entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "WORD", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.62μs -> 2.50μs (5.00% faster)


def test_multiple_elements_some_empty_entities():
    # Multiple elements, some with no entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [],
        [{"text": "baz", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.048% slower)


# --------------------- Edge Test Cases ---------------------


def test_empty_elements_list():
    # No elements
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 750ns -> 875ns (14.3% slower)


def test_empty_elements_with_empty_entities():
    # No elements, entities is empty list
    elements = []
    entities = []
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 833ns -> 1.00μs (16.7% slower)


def test_entities_length_mismatch():
    # entities list length does not match elements list length
    elements = [Text("foo"), Text("bar")]
    entities = [[]]  # Should be length 2
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)


def test_entity_missing_key():
    # Entity dict missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0}]]  # Missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.92μs -> 1.75μs (9.49% faster)


def test_entity_wrong_type():
    # Entity dict with wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": "zero", "end_idx": 3}]
    ]  # start_idx should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.46μs -> 2.33μs (5.36% faster)


def test_entity_extra_keys():
    # Entity dict with extra keys (should be ignored)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3, "extra": "ignored"}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.05% slower)


def test_entity_with_empty_string():
    # Entity with empty string values (should be allowed)
    elements = [Text("")]
    entities = [[{"text": "", "type": "", "start_idx": 0, "end_idx": 0}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 1.96μs (0.000% faster)


def test_entity_with_negative_indices():
    # Entity with negative indices (should be allowed, not validated)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": -1, "end_idx": -1}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.83μs -> 1.88μs (2.24% slower)


# --------------------- Large Scale Test Cases ---------------------


def test_large_number_of_elements_no_entities():
    # Large number of elements, no entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 103μs -> 86.7μs (19.7% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_with_entities():
    # Large number of elements, each with one entity
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 502μs -> 470μs (6.85% faster)
    for i in range(n):
        pass


def test_large_number_of_elements_some_with_entities():
    # Large number of elements, only even indices have entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        (
            [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
            if i % 2 == 0
            else []
        )
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 309μs -> 282μs (9.66% faster)
    for i in range(n):
        if i % 2 == 0:
            pass
        else:
            pass


# --------------------- Determinism Test ---------------------


def test_determinism():
    # Running the function twice with the same input should yield the same result
    elements = [Text("foo"), Text("bar")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "bar", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result1 = codeflash_output  # 2.75μs -> 2.67μs (3.15% faster)
    codeflash_output = stage_for_datasaur(elements, entities)
    result2 = codeflash_output  # 1.58μs -> 1.54μs (2.66% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest

from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur


def test_stage_for_datasaur():
    stage_for_datasaur(
        [
            Text(
                "",
                element_id=None,
                coordinates=None,
                coordinate_system=None,
                metadata=None,
                detection_origin="",
                embeddings=[],
            )
        ],
        entities=[[]],
    )


def test_stage_for_datasaur_2():
    with pytest.raises(
        ValueError,
        match="If\\ entities\\ is\\ specified,\\ it\\ must\\ be\\ the\\ same\\ length\\ as\\ elements\\.",
    ):
        stage_for_datasaur([], entities=[[]])


def test_stage_for_datasaur_3():
    with pytest.raises(
        ValueError,
        match="Key\\ 'text'\\ was\\ expected\\ but\\ not\\ present\\ in\\ the\\ Datasaur\\ entity\\.",
    ):
        stage_for_datasaur(
            [
                Text(
                    "",
                    element_id=None,
                    coordinates=None,
                    coordinate_system=None,
                    metadata=None,
                    detection_origin="",
                    embeddings=[0.0],
                )
            ],
            entities=[[{}, {}]],
        )
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur | 1.29μs | 1.46μs | -11.4%⚠️ |
| codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_2 | 916ns | 959ns | -4.48%⚠️ |
| codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_3 | 1.71μs | 1.67μs | 2.52%✅ |

To edit these changes git checkout codeflash/optimize-stage_for_datasaur-mjdt0e1s and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 04:34
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 20, 2025