@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 650% (6.50x) speedup for get_default_pandas_dtypes in unstructured/staging/base.py

⏱️ Runtime: 17.4 milliseconds → 2.33 milliseconds (best of 117 runs)

📝 Explanation and details

The optimized code implements function-level caching to avoid recreating the pandas dtype dictionary on every call. The key optimization is using a function attribute (get_default_pandas_dtypes._cache) to store the computed dictionary after the first invocation.

Key changes:

  • Added a cache check using hasattr() to see if the cache exists
  • Store the complete dtype dictionary in _cache on first call
  • Return _cache.copy() on subsequent calls to prevent mutation of the cached data (sketched below)
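
A minimal sketch of this pattern, with the dtype map abbreviated (the actual function lists all ~40 columns that appear in the tests below), not the literal diff:

```python
import pandas as pd


def get_default_pandas_dtypes() -> dict:
    # Subsequent calls: hand back a shallow copy so callers cannot mutate the cache.
    if hasattr(get_default_pandas_dtypes, "_cache"):
        return get_default_pandas_dtypes._cache.copy()

    dtypes = {
        "text": pd.StringDtype(),        # ...plus the other string-typed columns
        "page_number": "Int64",          # nullable integer columns
        "is_continuation": "boolean",    # nullable boolean column
        "detection_class_prob": float,   # float columns
        "embeddings": object,            # list/dict-like columns stay object
    }
    # First call: build the dictionary once and stash it on the function object.
    get_default_pandas_dtypes._cache = dtypes
    return dtypes.copy()
```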

Why this optimization works:

  • Eliminates repeated object creation: The original code creates ~40 pd.StringDtype() objects plus other dtype instances on every call. These object instantiations are expensive in Python.
  • Reduces memory allocation overhead: Creating the dictionary and all its values repeatedly causes significant garbage collection pressure.
  • Leverages shallow copying: dict.copy() is much faster than recreating all the dtype objects from scratch (see the small example below).
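
The copy is shallow, so the dtype values themselves are shared between calls; that is safe here because pandas dtype objects and strings like "Int64" are effectively immutable. A small illustration:

```python
import pandas as pd

cached = {"text": pd.StringDtype(), "page_number": "Int64"}
clone = cached.copy()

assert clone is not cached               # a new dict object...
assert clone["text"] is cached["text"]   # ...that still shares the dtype values
```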

Performance impact based on function usage:
The convert_to_dataframe function reference shows this function is called in a data processing pipeline where set_dtypes=True triggers get_default_pandas_dtypes(). Given the test results showing 350-690% speedups across various scenarios, this optimization is particularly valuable when:

  • Processing multiple dataframes in batch operations
  • Called repeatedly in loops or data processing pipelines
  • Used in performance-critical staging operations (illustrated by the sketch below)
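
For example, a hypothetical batch-staging loop would hit the cache on every iteration after the first. The `convert_to_dataframe(elements, set_dtypes=True)` call is assumed from the description above; check `unstructured.staging.base` for the exact signature:

```python
# Illustrative only: each iteration triggers get_default_pandas_dtypes(); with the
# cache it costs one dict copy instead of ~40 fresh dtype objects per dataframe.
from unstructured.staging.base import convert_to_dataframe  # signature assumed


def stage_batches(element_batches):
    return [convert_to_dataframe(elements, set_dtypes=True) for elements in element_batches]
```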

Test case analysis:
The optimization performs consistently well across all test scenarios:

  • Simple calls: 211-398% faster
  • Multiple calls: 692% faster (showing cache effectiveness)
  • Large-scale operations: 365-397% faster

This caching approach maintains correctness by returning copies, preventing callers from accidentally mutating the shared cache while delivering substantial performance gains for repeated invocations.
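
In practice this means the returned dictionary can be edited freely without leaking into later calls, as the regression tests below also verify:

```python
from unstructured.staging.base import get_default_pandas_dtypes

first = get_default_pandas_dtypes()
first["text"] = "Int64"            # mutate the returned copy

second = get_default_pandas_dtypes()
assert second["text"] != "Int64"   # the cached default is untouched
assert first is not second         # every call returns a fresh dict
```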

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 37 Passed |
| 🌀 Generated Regression Tests | 537 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| staging/test_base.py::test_default_pandas_dtypes | 44.2μs | 10.7μs | 314% ✅ |
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import pandas as pd

# imports
from unstructured.staging.base import get_default_pandas_dtypes

# unit tests

# --- Basic Test Cases ---


def test_return_type_is_dict():
    # The function should return a dictionary
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 53.2μs -> 17.1μs (211% faster)
    assert isinstance(result, dict)


def test_expected_keys_present():
    # The dictionary should contain all expected keys
    expected_keys = [
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "category_depth",
        "image_path",
        "languages",
        "page_number",
        "page_name",
        "url",
        "link_urls",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "subject",
        "section",
        "header_footer_type",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "text_as_html",
        "max_characters",
        "is_continuation",
        "detection_class_prob",
        "sender",
        "coordinates_points",
        "coordinates_system",
        "coordinates_layout_width",
        "coordinates_layout_height",
        "data_source_url",
        "data_source_version",
        "data_source_record_locator",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
        "data_source_permissions_data",
        "embeddings",
    ]
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.4μs -> 8.54μs (350% faster)
    for key in expected_keys:
        assert key in result


def test_no_extra_keys():
    # The dictionary should not contain any unexpected keys
    expected_keys = set(
        [
            "text",
            "type",
            "element_id",
            "filename",
            "filetype",
            "file_directory",
            "last_modified",
            "attached_to_filename",
            "parent_id",
            "category_depth",
            "image_path",
            "languages",
            "page_number",
            "page_name",
            "url",
            "link_urls",
            "link_texts",
            "links",
            "sent_from",
            "sent_to",
            "subject",
            "section",
            "header_footer_type",
            "emphasized_text_contents",
            "emphasized_text_tags",
            "text_as_html",
            "max_characters",
            "is_continuation",
            "detection_class_prob",
            "sender",
            "coordinates_points",
            "coordinates_system",
            "coordinates_layout_width",
            "coordinates_layout_height",
            "data_source_url",
            "data_source_version",
            "data_source_record_locator",
            "data_source_date_created",
            "data_source_date_modified",
            "data_source_date_processed",
            "data_source_permissions_data",
            "embeddings",
        ]
    )
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 8.17μs (363% faster)
    result_keys = set(result.keys())
    assert result_keys.issubset(expected_keys)


def test_string_dtype_values():
    # All keys ending with _id, _name, _filename, _type, _path, _url, _version, _directory, _created, _modified, _processed should have pd.StringDtype()
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 7.92μs (377% faster)
    string_keys = [
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "image_path",
        "page_name",
        "url",
        "link_urls",
        "subject",
        "section",
        "header_footer_type",
        "text_as_html",
        "sender",
        "coordinates_system",
        "data_source_url",
        "data_source_version",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
    ]
    for key in string_keys:
        assert isinstance(result[key], pd.StringDtype)


def test_int64_dtype_values():
    # Keys that should have "Int64"
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.4μs -> 7.79μs (380% faster)
    int_keys = ["category_depth", "page_number", "max_characters"]
    for key in int_keys:
        assert result[key] == "Int64"


def test_boolean_dtype_value():
    # Key that should have "boolean"
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 7.67μs (394% faster)
    assert result["is_continuation"] == "boolean"


def test_float_dtype_values():
    # Keys that should have float
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 7.79μs (386% faster)
    float_keys = ["detection_class_prob", "coordinates_layout_width", "coordinates_layout_height"]
    for key in float_keys:
        assert result[key] == float


def test_object_dtype_values():
    # Keys that should have object
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 7.58μs (398% faster)
    object_keys = [
        "languages",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "coordinates_points",
        "data_source_record_locator",
        "data_source_permissions_data",
        "embeddings",
    ]
    for key in object_keys:
        assert result[key] == object


# --- Edge Test Cases ---


def test_dict_is_not_empty():
    # The returned dictionary should not be empty
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.3μs -> 7.75μs (382% faster)
    assert len(result) > 0


def test_key_names_are_strings():
    # All keys in the dictionary should be strings
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.7μs -> 7.71μs (389% faster)
    for key in result.keys():
        assert isinstance(key, str)


def test_values_are_valid_types():
    # All values should be valid pandas dtypes or Python types
    valid_types = {pd.StringDtype(), "Int64", "boolean", float, object}
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.2μs -> 7.71μs (383% faster)
    for key, value in result.items():
        assert value in valid_types


def test_no_mutation_on_call():
    # Calling the function multiple times should return equal but not the same object (no mutation)
    codeflash_output = get_default_pandas_dtypes()
    first = codeflash_output  # 38.1μs -> 7.75μs (392% faster)
    codeflash_output = get_default_pandas_dtypes()
    second = codeflash_output  # 34.0μs -> 4.88μs (597% faster)
    assert first == second
    assert first is not second


def test_keys_are_unique():
    # There should be no duplicate keys
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.8μs -> 7.75μs (388% faster)
    keys = list(result.keys())
    assert len(keys) == len(set(keys))


def test_can_be_used_as_dataframe_dtypes():
    # The returned dictionary should be usable as the dtype argument for pd.DataFrame
    import pandas as pd

    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.9μs -> 7.71μs (391% faster)
    # Create a DataFrame with the keys as columns
    df = pd.DataFrame([dict.fromkeys(result.keys())])
    # Should not raise error when setting dtypes (except for object, which is always allowed)
    try:
        df = df.astype(result)
    except Exception as e:
        raise AssertionError(f"Failed to use dictionary as dtypes: {e}")


def test_modifying_returned_dict_does_not_affect_future_calls():
    # Modifying the returned dict should not affect future calls
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 42.0μs -> 10.3μs (308% faster)
    result["text"] = "Int64"
    codeflash_output = get_default_pandas_dtypes()
    new_result = codeflash_output  # 33.2μs -> 5.04μs (559% faster)
    assert new_result["text"] != "Int64"


# --- Large Scale Test Cases ---


def test_large_dataframe_with_dtypes():
    # Test creating a large DataFrame using the returned dtypes dictionary
    import pandas as pd

    codeflash_output = get_default_pandas_dtypes()
    dtypes = codeflash_output  # 38.6μs -> 8.29μs (365% faster)
    # Create 1000 rows with all columns set to None
    data = [dict.fromkeys(dtypes.keys()) for _ in range(1000)]
    df = pd.DataFrame(data)
    # Should not raise error when setting dtypes
    try:
        df = df.astype(dtypes)
    except Exception as e:
        raise AssertionError(f"Failed to use dtypes with large DataFrame: {e}")


def test_performance_on_multiple_calls():
    # Test that multiple calls do not degrade performance or memory
    import time

    start = time.time()
    for _ in range(500):
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 15.9ms -> 2.01ms (692% faster)
    elapsed = time.time() - start
    assert elapsed < 5  # generous illustrative bound, not a strict benchmark


def test_large_dict_keys_and_values():
    # Test that all keys and values can be iterated efficiently
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.1μs -> 7.92μs (382% faster)
    keys = list(result.keys())
    values = list(result.values())
    assert len(keys) == len(values) > 0


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

# imports
from unstructured.staging.base import get_default_pandas_dtypes

# unit tests

# --- Basic Test Cases ---


def test_return_type_is_dict():
    # The function should return a dictionary
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.1μs -> 7.96μs (379% faster)
    assert isinstance(result, dict)


def test_dict_keys_are_expected_columns():
    # The dict keys should match the expected column names
    expected_keys = {
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "category_depth",
        "image_path",
        "languages",
        "page_number",
        "page_name",
        "url",
        "link_urls",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "subject",
        "section",
        "header_footer_type",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "text_as_html",
        "max_characters",
        "is_continuation",
        "detection_class_prob",
        "sender",
        "coordinates_points",
        "coordinates_system",
        "coordinates_layout_width",
        "coordinates_layout_height",
        "data_source_url",
        "data_source_version",
        "data_source_record_locator",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
        "data_source_permissions_data",
        "embeddings",
    }
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.9μs -> 7.88μs (381% faster)
    assert set(result.keys()) == expected_keys


def test_string_dtype_columns():
    # Columns expected to have pd.StringDtype()
    string_columns = [
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "image_path",
        "page_name",
        "url",
        "link_urls",
        "subject",
        "section",
        "header_footer_type",
        "text_as_html",
        "sender",
        "coordinates_system",
        "data_source_url",
        "data_source_version",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
    ]
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.1μs -> 7.71μs (394% faster)
    for col in string_columns:
        dtype = result[col]
        assert isinstance(dtype, pd.StringDtype)


def test_int64_dtype_columns():
    # Columns expected to have "Int64" dtype
    int_columns = ["category_depth", "page_number", "max_characters"]
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.0μs -> 7.75μs (390% faster)
    for col in int_columns:
        dtype = result[col]
        assert dtype == "Int64"


def test_boolean_dtype_column():
    # "is_continuation" should be "boolean"
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.2μs -> 7.67μs (386% faster)
    assert result["is_continuation"] == "boolean"


def test_float_dtype_columns():
    # Columns expected to have float dtype
    float_columns = [
        "detection_class_prob",
        "coordinates_layout_width",
        "coordinates_layout_height",
    ]
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.9μs -> 7.71μs (391% faster)
    for col in float_columns:
        dtype = result[col]
        assert dtype == float


def test_object_dtype_columns():
    # Columns expected to have object dtype
    object_columns = [
        "languages",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "coordinates_points",
        "data_source_record_locator",
        "data_source_permissions_data",
        "embeddings",
    ]
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.7μs -> 7.83μs (381% faster)
    for col in object_columns:
        dtype = result[col]
        assert dtype == object


# --- Edge Test Cases ---


def test_no_extra_keys():
    # There should be no extra keys in the result
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.6μs -> 7.83μs (380% faster)
    allowed_keys = {
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "category_depth",
        "image_path",
        "languages",
        "page_number",
        "page_name",
        "url",
        "link_urls",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "subject",
        "section",
        "header_footer_type",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "text_as_html",
        "max_characters",
        "is_continuation",
        "detection_class_prob",
        "sender",
        "coordinates_points",
        "coordinates_system",
        "coordinates_layout_width",
        "coordinates_layout_height",
        "data_source_url",
        "data_source_version",
        "data_source_record_locator",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
        "data_source_permissions_data",
        "embeddings",
    }
    for key in result.keys():
        assert key in allowed_keys


def test_dtype_types_are_valid():
    # All dtypes should be valid pandas dtypes or python types
    valid_types = (str, type, object, pd.StringDtype)
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.2μs -> 7.58μs (390% faster)
    for key, dtype in result.items():
        assert isinstance(dtype, valid_types)


def test_column_names_are_strings():
    # All column names should be strings
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.3μs -> 7.67μs (387% faster)
    for key in result.keys():
        assert isinstance(key, str)


def test_column_dtypes_are_not_none():
    # No dtype should be None
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.4μs -> 7.62μs (390% faster)
    for key, dtype in result.items():
        assert dtype is not None


def test_mutation_missing_key():
    # If a key is missing, the test should fail
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.5μs -> 7.75μs (384% faster)
    # Remove a key and check that the test fails
    keys = list(result.keys())
    keys.remove("text")
    mutated_result = {k: result[k] for k in keys}
    # Simulate the test
    expected_keys = set(result.keys())
    mutated_keys = set(mutated_result.keys())
    assert mutated_keys != expected_keys


def test_mutation_wrong_dtype():
    # If a dtype is wrong, the test should fail
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.4μs -> 7.58μs (393% faster)
    mutated_result = result.copy()
    mutated_result["text"] = "Int64"  # Should be pd.StringDtype()


# --- Large Scale Test Cases ---


def test_large_scale_column_count():
    # The function should scale to a large number of columns (simulate by extending the dict)
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.5μs -> 7.75μs (384% faster)
    # Add 900 dummy columns
    for i in range(900):
        result[f"dummy_col_{i}"] = float
    assert len(result) >= 900


def test_large_scale_dtype_consistency():
    # All dummy columns should have the correct dtype
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.7μs -> 7.71μs (389% faster)
    for i in range(900):
        result[f"dummy_col_{i}"] = float
    for i in range(900):
        assert result[f"dummy_col_{i}"] == float


def test_large_scale_performance():
    # The function should not take excessive time for many columns
    import time

    start = time.time()
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.7μs -> 7.71μs (389% faster)
    for i in range(900):
        result[f"dummy_col_{i}"] = "Int64"
    elapsed = time.time() - start
    assert elapsed < 5  # generous illustrative bound, not a strict benchmark


def test_large_scale_no_duplicate_keys():
    # There should be no duplicate keys in the dict, even after adding many
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 38.0μs -> 7.62μs (398% faster)
    for i in range(900):
        result[f"dummy_col_{i}"] = object
    keys = list(result.keys())
    assert len(keys) == len(set(keys))


def test_large_scale_all_dtypes_not_none():
    # All dtypes should not be None, even after large scale addition
    codeflash_output = get_default_pandas_dtypes()
    result = codeflash_output  # 37.9μs -> 7.62μs (397% faster)
    for i in range(900):
        result[f"dummy_col_{i}"] = "Int64"
    for dtype in result.values():
        assert dtype is not None


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.staging.base import get_default_pandas_dtypes


def test_get_default_pandas_dtypes():
    get_default_pandas_dtypes()
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_e8goshnj/tmp3uobdmct/test_concolic_coverage.py::test_get_default_pandas_dtypes | 37.6μs | 7.12μs | 427% ✅ |

To edit these changes, `git checkout codeflash/optimize-get_default_pandas_dtypes-mje5yhes` and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 10:36
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 20, 2025