
add siuba python code (will need to re-run nb)#2

Open
machow wants to merge 2 commits into BodonFerenc:master from machow:feat-siuba-library

Conversation


@machow machow commented Feb 28, 2020

Hey! I really enjoyed your article and think it hit on some very important points about how difficult complex groupbys can be in pandas. I maintain a library called siuba, which is a python port of dplyr.

I think it might be a contender for the fastest (up to 10 million rows)?

It sits on top of pandas and, I think, resolves the issues you lay out in your article. Namely, it...

  • avoids the renaming
  • allows multiple columns in agg ops
  • runs the agg ops much faster!
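To illustrate that last point, here is a minimal sketch of the slow pattern in plain pandas that these fast ops avoid: a weighted average via `groupby(...).apply` runs a Python lambda once per group, rather than as vectorized operations (the toy columns `bucket`, `qty`, `weight` are made up for illustration):

```python
import pandas as pd

t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [1.0, 2.0, 3.0],
    "weight": [1.0, 3.0, 2.0],
})

# Slow pattern: .apply calls a Python function once per group,
# so the cost grows with the number of groups
w_avg = t.groupby("bucket").apply(
    lambda g: (g.qty * g.weight).sum() / g.weight.sum()
)
print(w_avg)
```

siuba's fast ops instead translate expressions like `(_.qty * _.weight).sum()` into vectorized, grouped pandas operations.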

I've added a notebook for siuba in this PR (alongside your original agg code, to show it produces equivalent results!).

Note that you will likely need to re-run the notebook to calibrate it to your profiling results.

@daroczig

Wow, this is awesome -- thanks a lot for your replies on Twitter and contributing the PR!
I hope @BodonFerenc will also find this valuable and merge soon :)

@machow
Author

machow commented Feb 28, 2020

No problem! Being able to see these grouped agg comparisons across different code bases is really helpful. Really appreciate the wide net being cast here!

@BodonFerenc
Owner

@machow What would be the precise siuba syntax for the query in the article?

@machow
Author

machow commented Mar 4, 2020

Should be this (with t replaced with the data)...

from siuba import _, group_by
from siuba.experimental.pd_groups import fast_summarize

(
    t >>
    group_by(_.bucket) >>
    fast_summarize(
        NR = _.bucket.count(),
        TOTAL_QTY = _.qty.sum(),
        AVG_QTY = _.qty.mean(),
        TOTAL_RISK = _.risk.sum(),
        AVG_RISK = _.risk.mean(),
        W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
        W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
    )
)

You could also skip the pipe and do...

g_bucket = t.groupby("bucket")
fast_summarize(
    g_bucket,
    NR = _.bucket.count(),
    TOTAL_QTY = _.qty.sum(),
    AVG_QTY = _.qty.mean(),
    TOTAL_RISK = _.risk.sum(),
    AVG_RISK = _.risk.mean(),
    W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
    W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
)
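For comparison, here is a sketch of the closest plain-pandas equivalent using named aggregation: the weighted averages need temporary product columns first, since `agg` can only see one column per entry (toy data and the helper column names `w_qty`/`w_risk` are made up for illustration):

```python
import pandas as pd

t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [1.0, 2.0, 3.0],
    "risk": [0.5, 0.25, 0.75],
    "weight": [1.0, 3.0, 2.0],
})

# Precompute the weighted terms, aggregate, then finish the division
res = (
    t.assign(w_qty=t.qty * t.weight, w_risk=t.risk * t.weight)
     .groupby("bucket")
     .agg(NR=("qty", "count"),
          TOTAL_QTY=("qty", "sum"),
          AVG_QTY=("qty", "mean"),
          TOTAL_RISK=("risk", "sum"),
          AVG_RISK=("risk", "mean"),
          W_SUM_QTY=("w_qty", "sum"),
          W_SUM_RISK=("w_risk", "sum"),
          W_SUM=("weight", "sum"))
)
res["W_AVG_QTY"] = res.W_SUM_QTY / res.W_SUM
res["W_AVG_RISK"] = res.W_SUM_RISK / res.W_SUM
res = res.drop(columns=["W_SUM_QTY", "W_SUM_RISK", "W_SUM"])
print(res)
```

The extra assign/drop bookkeeping is exactly the renaming overhead that the siuba query above expresses in a single `fast_summarize` call.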
