add siuba python code (will need to re-run nb) #2
Open
machow wants to merge 2 commits into BodonFerenc:master
Wow, this is awesome -- thanks a lot for your replies on Twitter and contributing the PR!
Author
No problem! Being able to see these grouped agg comparisons across different code bases is really helpful. Really appreciate the wide net being cast here!
Owner
@machow What would be the precise siuba syntax for the query in the article?
Author
Should be this (with imports):

    from siuba import _, group_by
    from siuba.experimental.pd_groups import fast_summarize

    (
        t >>
        group_by(_.bucket) >>
        fast_summarize(
            NR = _.bucket.count(),
            TOTAL_QTY = _.qty.sum(),
            AVG_QTY = _.qty.mean(),
            TOTAL_RISK = _.risk.sum(),
            AVG_RISK = _.risk.mean(),
            W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
            W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
        )
    )

You could also skip the pipe and do...

    g_bucket = t.groupby("bucket")

    fast_summarize(g_bucket,
        NR = _.bucket.count(),
        TOTAL_QTY = _.qty.sum(),
        AVG_QTY = _.qty.mean(),
        TOTAL_RISK = _.risk.sum(),
        AVG_RISK = _.risk.mean(),
        W_AVG_QTY = (_.qty * _.weight).sum() / _.weight.sum(),
        W_AVG_RISK = (_.risk * _.weight).sum() / _.weight.sum()
    )
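For anyone who wants to sanity-check the output of the siuba calls above, here is a rough plain-pandas sketch of the same seven aggregates. It is not the article's code or the notebook's; the toy values and the helper columns `qty_w` and `risk_w` are made up for illustration, and only the column names `bucket`, `qty`, `risk` and `weight` come from the snippets above.

```python
import pandas as pd

# Toy data: column names follow the snippets above, values are invented.
t = pd.DataFrame({
    "bucket": ["a", "a", "b", "b"],
    "qty":    [1.0, 3.0, 2.0, 4.0],
    "risk":   [10.0, 20.0, 30.0, 50.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})

# Precompute the weighted columns (hypothetical helper names) so that
# every aggregate becomes a plain grouped sum or mean.
tmp = t.assign(qty_w=t.qty * t.weight, risk_w=t.risk * t.weight)
g = tmp.groupby("bucket")

res = pd.DataFrame({
    # size() matches _.bucket.count() here because the key has no nulls.
    "NR": g.size(),
    "TOTAL_QTY": g["qty"].sum(),
    "AVG_QTY": g["qty"].mean(),
    "TOTAL_RISK": g["risk"].sum(),
    "AVG_RISK": g["risk"].mean(),
    "W_AVG_QTY": g["qty_w"].sum() / g["weight"].sum(),
    "W_AVG_RISK": g["risk_w"].sum() / g["weight"].sum(),
}).rename_axis("bucket").reset_index()

print(res)
```

On this toy frame, bucket `a` should come out with a weighted average quantity of (1·1 + 3·3) / 4 = 2.5, which makes it easy to compare by eye against whatever the siuba version returns for the same input.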
Hey! I really enjoyed your article and think it hit on some very important points about how difficult complex groupbys can be in pandas. I maintain a library called siuba, which is a Python port of dplyr.
I think it might be a contender for fastest (up to 10 million rows)?
It sits on top of pandas, and I think resolves the issues you lay out in your article. Namely it...
I've added a notebook for siuba in this PR (alongside your original agg code, to show it produces equivalent results!).
Note that you will likely need to re-run the notebook to calibrate it to your profiling results.