
add siuba python code (will need to re-run nb)#2

Open
machow wants to merge 2 commits into BodonFerenc:master from machow:feat-siuba-library

Conversation


@machow machow commented Feb 28, 2020

Hey! I really enjoyed your article and think it hit on some very important points about how difficult complex groupbys can be in pandas. I maintain a library called siuba, which is a python port of dplyr.

I think it might be a contender for the fastest (up to 10 million rows)?

It sits on top of pandas and, I think, resolves the issues you lay out in your article. Namely, it...

  • avoids the renaming
  • allows multiple columns in agg ops
  • runs the agg ops much faster!
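To illustrate that last point, here is a minimal sketch of the slow pattern in plain pandas that these fast ops avoid: a weighted average via `groupby(...).apply` runs a Python lambda once per group, rather than as vectorized operations (the toy columns `bucket`, `qty`, `weight` are made up for illustration):

```python
import pandas as pd

t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [1.0, 2.0, 3.0],
    "weight": [1.0, 3.0, 2.0],
})

# Slow pattern: .apply calls a Python function once per group,
# so the cost grows with the number of groups
w_avg = t.groupby("bucket").apply(
    lambda g: (g.qty * g.weight).sum() / g.weight.sum()
)
print(w_avg)
```

siuba's fast ops instead translate expressions like `(_.qty * _.weight).sum()` into vectorized, grouped pandas operations.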

I've added a notebook for siuba in this PR (alongside your original agg code, to show it produces equivalent results!).

Note that you will likely need to re-run the notebook to calibrate it to your profiling results.

@daroczig

Wow, this is awesome -- thanks a lot for your replies on Twitter and contributing the PR!
I hope @BodonFerenc will also find this valuable and merge soon :)

@machow
Author

machow commented Feb 28, 2020

No problem! Being able to see these grouped agg comparisons across different code bases is really helpful. Really appreciate the wide net being cast here!

@BodonFerenc
Owner

@machow What would be the precise siuba syntax for the query in the article?

@machow
Author

machow commented Mar 4, 2020

Should be this (with t replaced with the data)...

from siuba import _, group_by
from siuba.experimental.pd_groups import fast_summarize

(
    t >>
    group_by(_.bucket) >>
    fast_summarize(
        NR = _.bucket.count(),
        TOTAL_QTY = _.qty.sum(),
        AVG_QTY = _.qty.mean(),
        TOTAL_RISK = _.risk.sum(),
        AVG_RISK = _.risk.mean(),
        W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
        W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
    )
)

You could also skip the pipe and do...

g_bucket = t.groupby("bucket")
fast_summarize(
    g_bucket,
    NR = _.bucket.count(),
    TOTAL_QTY = _.qty.sum(),
    AVG_QTY = _.qty.mean(),
    TOTAL_RISK = _.risk.sum(),
    AVG_RISK = _.risk.mean(),
    W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
    W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
)
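For comparison, here is a sketch of the closest plain-pandas equivalent using named aggregation: the weighted averages need temporary product columns first, since `agg` can only see one column per entry (toy data and the helper column names `w_qty`/`w_risk` are made up for illustration):

```python
import pandas as pd

t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [1.0, 2.0, 3.0],
    "risk": [0.5, 0.25, 0.75],
    "weight": [1.0, 3.0, 2.0],
})

# Precompute the weighted terms, aggregate, then finish the division
res = (
    t.assign(w_qty=t.qty * t.weight, w_risk=t.risk * t.weight)
     .groupby("bucket")
     .agg(NR=("qty", "count"),
          TOTAL_QTY=("qty", "sum"),
          AVG_QTY=("qty", "mean"),
          TOTAL_RISK=("risk", "sum"),
          AVG_RISK=("risk", "mean"),
          W_SUM_QTY=("w_qty", "sum"),
          W_SUM_RISK=("w_risk", "sum"),
          W_SUM=("weight", "sum"))
)
res["W_AVG_QTY"] = res.W_SUM_QTY / res.W_SUM
res["W_AVG_RISK"] = res.W_SUM_RISK / res.W_SUM
res = res.drop(columns=["W_SUM_QTY", "W_SUM_RISK", "W_SUM"])
print(res)
```

The extra assign/drop bookkeeping is exactly the renaming overhead that the siuba query above expresses in a single `fast_summarize` call.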
