Improve performance of tabmat.from_pandas for sparse columns #378

@fbrueggemann-axa

Description

Instead of

if (coldata != 0).mean() <= sparse_threshold:

we should consider

if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
    # Read the density directly from the sparse index metadata (constant time).
    sparse_density = coldata.sparse.density
else:
    # Fall back to the dense computation for non-sparse columns.
    sparse_density = (coldata != 0).mean()

if sparse_density <= sparse_threshold:
    ...

pandas.Series.sparse.density operates in constant time (e.g., 3e-5s on my machine), while pandas.Series.ne and pandas.Series.mean are linear in the number of non-zero entries for sparse columns (e.g., 3e-3s on my machine for one million rows with density 0.1).

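A minimal way to reproduce the difference (an illustrative sketch; the series construction and the timing harness are my own, not from tabmat, and exact timings will vary by machine and pandas version):

import time

import numpy as np
import pandas as pd

# One million rows with density 0.1, mirroring the measurement above.
rng = np.random.default_rng(0)
values = np.where(rng.random(1_000_000) < 0.1, 1.0, 0.0)
coldata = pd.Series(values).astype(pd.SparseDtype("float64", fill_value=0.0))

start = time.perf_counter()
fast = coldata.sparse.density  # constant time: read from the sparse index metadata
print(f"sparse.density = {fast:.3f} in {time.perf_counter() - start:.1e}s")

start = time.perf_counter()
slow = (coldata != 0).mean()  # linear in the number of stored entries
print(f"(!= 0).mean()  = {slow:.3f} in {time.perf_counter() - start:.1e}s")
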
The proposed changes reduce the time to convert a pandas.DataFrame with 130 sparse columns (and a few categoricals) and one million rows from 0.5s to 0.15s.

Edit: It's possible for pandas.Series.sparse.sp_values to contain zeros (e.g., when multiplying two sparse arrays, the product's sp_index is the union rather than the intersection of the factors' sp_index). In particular, coldata.sparse.density and (coldata != 0).mean() are not equivalent in those cases. Because the scipy.sparse.coo_matrix is built from pandas.Series.sparse.sp_values, though, coldata.sparse.density appears to be the more sensible measure. Consider the example below, where coldata.sparse.density = 1 and (coldata != 0).mean() = 0: with the current tabmat version it produces a "sparse" matrix with 2000 explicitly stored zero entries.

import pandas as pd
import tabmat as tm

df = (
    pd.DataFrame(
        {
            "col_a": [1, 0] * 1000,
            "col_b": [0, 1] * 1000,
        }
    )
    .astype(pd.SparseDtype(fill_value=0))
    # col_a and col_b never overlap, so col_c is identically zero, but its
    # sp_index is the union of the factors' indices: 2000 explicit zeros.
    .assign(col_c=lambda x: x["col_a"] * x["col_b"])
)
tm.from_pandas(df.filter(["col_c"]))._array.shape
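
For reference, the two measures disagree exactly as described (a quick check against the df above; the expected values follow from the union sp_index behavior):

df["col_c"].sparse.density    # 1.0: every position has an explicitly stored value
(df["col_c"] != 0).mean()     # 0.0: all of the stored values are zero
df["col_c"].sparse.sp_values  # the 2000 explicit zeros backing the "sparse" column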
