Dask performance

Finally getting around to looking at this :)

I think there were two reasons dask was relatively slow:

1. Just using a single partition. Dask works by splitting the input into multiple partitions and processing them in parallel. For CSV input, it apparently uses one partition per file by default. The `blocksize` keyword (a size in bytes) can be used to split a large CSV into multiple partitions. I used `blocksize=20000000` (about 20 MB per partition), which gave ~10-20 partitions.
2. The [distributed](https://distributed.dask.org/) scheduler can sometimes be faster than the default one. I mostly just used it for the better dashboard and profiling.

```python
In [1]: from distributed import Client

In [2]: import dask.dataframe as dd

In [3]: client = Client()

In [4]: %time _ = dd.read_csv("resources/large_dataset.csv", blocksize=20000000).groupby("state").count().compute()
CPU times: user 30.2 ms, sys: 9.5 ms, total: 39.7 ms
Wall time: 613 ms

In [5]: %time _ = dd.read_csv("resources/large_dataset.csv", blocksize=20000000).groupby("state").count().compute()
CPU times: user 26.9 ms, sys: 6.14 ms, total: 33 ms
Wall time: 369 ms

In [6]: import pandas as pd

In [7]: %time _ = pd.read_csv("resources/large_dataset.csv").groupby("state").count()
CPU times: user 1.11 s, sys: 109 ms, total: 1.22 s
Wall time: 1.22 s

In [8]: %time _ = pd.read_csv("resources/large_dataset.csv").groupby("state").count()
CPU times: user 1.09 s, sys: 143 ms, total: 1.23 s
Wall time: 1.22 s
```

I think that without my changes it took ~2-4 seconds, compared with ~400 ms now.
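For picking a `blocksize`, here is a rough sketch of the arithmetic. It assumes the CSV is cut into byte ranges of at most `blocksize` bytes, so the partition count is roughly the file size divided by `blocksize`, rounded up. `estimate_partitions` is a hypothetical helper written for illustration, not part of dask.

```python
import math
import os


def estimate_partitions(path, blocksize):
    """Estimate how many partitions a file yields for a given blocksize.

    Hypothetical helper: assumes the file is split into byte ranges of at
    most `blocksize` bytes each, so this is only an approximation.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be a positive number of bytes")
    size = os.path.getsize(path)
    # Even an empty file still produces one partition.
    return max(1, math.ceil(size / blocksize))
```

For example, `estimate_partitions("resources/large_dataset.csv", 20_000_000)` gives the estimate for the file used above. The actual count can be read from the `npartitions` attribute of the dataframe returned by `dd.read_csv`.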
