Improve performance of tabmat.from_pandas for sparse columns #378

@fbrueggemann-axa

Description

Instead of

if (coldata != 0).mean() <= sparse_threshold:

we should consider

if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
    # Read the density directly from the sparse index metadata (constant time).
    sparse_density = coldata.sparse.density
else:
    # Fall back to the dense computation for non-sparse columns.
    sparse_density = (coldata != 0).mean()

if sparse_density <= sparse_threshold:
    ...

pandas.Series.sparse.density operates in constant time (e.g., 3e-5s on my machine), while pandas.Series.ne and pandas.Series.mean are linear in the number of non-zero entries for sparse columns (e.g., 3e-3s on my machine for one million rows with density 0.1).

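A minimal way to reproduce the difference (an illustrative sketch; the series construction and the timing harness are my own, not from tabmat, and exact timings will vary by machine and pandas version):

import time

import numpy as np
import pandas as pd

# One million rows with density 0.1, mirroring the measurement above.
rng = np.random.default_rng(0)
values = np.where(rng.random(1_000_000) < 0.1, 1.0, 0.0)
coldata = pd.Series(values).astype(pd.SparseDtype("float64", fill_value=0.0))

start = time.perf_counter()
fast = coldata.sparse.density  # constant time: read from the sparse index metadata
print(f"sparse.density = {fast:.3f} in {time.perf_counter() - start:.1e}s")

start = time.perf_counter()
slow = (coldata != 0).mean()  # linear in the number of stored entries
print(f"(!= 0).mean()  = {slow:.3f} in {time.perf_counter() - start:.1e}s")
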
The proposed changes reduce the time to convert a pandas.DataFrame with 130 sparse columns (and a few categoricals) and one million rows from 0.5s to 0.15s.

Edit: It's possible for pandas.Series.sparse.sp_values to contain zeros (e.g., when multiplying two sparse arrays, the product's sp_index is the union rather than the intersection of the factors' sp_index). In particular, coldata.sparse.density and (coldata != 0).mean() are not equivalent in those cases. Because the scipy.sparse.coo_matrix is built from pandas.Series.sparse.sp_values, though, coldata.sparse.density appears to be the more sensible measure. Consider the example below, where coldata.sparse.density = 1 and (coldata != 0).mean() = 0: with the current tabmat version it produces a "sparse" matrix with 2000 explicitly stored zero entries.

import pandas as pd
import tabmat as tm

df = (
    pd.DataFrame(
        {
            "col_a": [1, 0] * 1000,
            "col_b": [0, 1] * 1000,
        }
    )
    .astype(pd.SparseDtype(fill_value=0))
    # col_a and col_b never overlap, so col_c is identically zero, but its
    # sp_index is the union of the factors' indices: 2000 explicit zeros.
    .assign(col_c=lambda x: x["col_a"] * x["col_b"])
)
tm.from_pandas(df.filter(["col_c"]))._array.shape
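
For reference, the two measures disagree exactly as described (a quick check against the df above; the expected values follow from the union sp_index behavior):

df["col_c"].sparse.density    # 1.0: every position has an explicitly stored value
(df["col_c"] != 0).mean()     # 0.0: all of the stored values are zero
df["col_c"].sparse.sp_values  # the 2000 explicit zeros backing the "sparse" column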
