Description
Instead of
tabmat/src/tabmat/constructor.py
Line 158 in c0c8626
if (coldata != 0).mean() <= sparse_threshold:
we should consider
if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
    sparse_density = coldata.sparse.density
else:
    sparse_density = (coldata != 0).mean()
if sparse_density <= sparse_threshold:
    ...
pandas.Series.sparse.density operates in constant time (e.g., 3e-5 s on my machine), while pandas.Series.ne and pandas.Series.mean are linear in the number of non-zero entries for sparse columns (e.g., 3e-3 s on my machine for one million rows with density 0.1).
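For sparse columns whose stored values contain no zeros, the two expressions agree, so the fast path returns the same result; a minimal sketch (the series construction and sizes are illustrative, chosen to match the one-million-row, density-0.1 measurement above):

```python
import pandas as pd

# Illustrative only: .sparse.density reads stored metadata in O(1),
# while (s != 0).mean() evaluates a comparison over the column.
s = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1] * 100_000).astype(
    pd.SparseDtype("int64", fill_value=0)
)
# One million rows with density 0.1: both expressions yield 0.1,
# but only the first is independent of the number of stored values.
assert s.sparse.density == (s != 0).mean() == 0.1
```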
The proposed changes reduce the time to convert a pandas.DataFrame with 130 sparse columns (and a few categoricals) and one million rows from 0.5s to 0.15s.
Edit: It is possible for pandas.Series.sparse.sp_values to contain zeros (e.g., when multiplying two sparse arrays, the product's sp_index is the union rather than the intersection of the factors' sp_index). In those cases, coldata.sparse.density and (coldata != 0).mean() are not equivalent. However, because the conversion to scipy.sparse.coo_matrix uses pandas.Series.sparse.sp_values, coldata.sparse.density appears to be the more sensible measure. Consider the example below, where coldata.sparse.density = 1 and (coldata != 0).mean() = 0, which with the current tabmat version results in a "sparse" matrix containing 2000 explicitly stored zero entries.
import pandas as pd
import tabmat as tm

df = (
    pd.DataFrame(
        {
            "col_a": [1, 0] * 1000,
            "col_b": [0, 1] * 1000,
        }
    )
    .astype(pd.SparseDtype(fill_value=0))
    .assign(col_c=lambda x: x["col_a"] * x["col_b"])
)
tm.from_pandas(df.filter(["col_c"]))._array.shape
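The density/mean discrepancy itself can be checked with pandas alone, without tabmat; a sketch assuming the union-of-sp_index multiplication behavior described above:

```python
import pandas as pd

# Multiplying two sparse series unions their sp_index, so the product
# stores an explicit (zero) value at every position covered by either factor.
a = pd.Series([1, 0] * 1000, dtype=pd.SparseDtype("int64", 0))
b = pd.Series([0, 1] * 1000, dtype=pd.SparseDtype("int64", 0))
c = a * b

assert (c.sparse.sp_values == 0).all()  # all stored values are zero
assert c.sparse.density == 1.0          # every entry is stored explicitly
assert (c != 0).mean() == 0.0           # yet the column contains no non-zeros
```

This is exactly why `._array` above comes out as a "sparse" matrix holding 2000 zero entries.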