Remove codepath asymmetry in dataframe count() #9136
On a mixed-type frame this will cause a big drop in perf, as .values will convert everything to object (and may not actually work correctly), so I believe some tests should fail and a more comprehensive perf metric is needed. notnull(frame) will do a block-by-block comparison.
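To make the object-conversion concern concrete, here is a minimal sketch, assuming a small hypothetical frame with one float column and one string column (the column names and data are illustrative, not from the PR). It only shows that `.values` on a mixed-type frame produces a single object array, while `notnull()` can be applied to the frame directly; it does not measure performance.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one float column, one string column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# .values needs a single dtype for the whole frame, so a float/string
# mix is upcast to object.
values_dtype = df.values.dtype

# notnull() is applied to the frame itself and returns a boolean frame
# of the same shape; summing it gives the per-column non-null counts.
mask = df.notnull()
counts = mask.sum(axis=0)

print(values_dtype)
print(counts)
```

On a frame with a single numeric dtype, `.values` would not need this upcast, which is presumably why benchmarks on homogeneous frames alone can be misleading.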
@jreback Test suite passes for me and here's a full vbench run:
The isnull determination is probably OK, but the vbenches are all very misleading. Try this. In fact, we should add a mixed-type vbench.
@qwhelan so I think you can use this perf fix, but only if
@jreback Thanks for the pointer. I'll add a mixed-type case to the vb_suite and investigate this further in the next few days (getting on a plane in a few hours).
@qwhelan thanks! certainly can be perf improved.
Still have some cleanup to do, but here's a new vbench with the mixed-dtype case:
@jreback I think this patch is ready - let me know if there's anything I should address. The axis information doesn't seem to be relevant as to which branch to take here, as
@qwhelan ok this looks great! Ping when green, and feel free to look for more of these types of things anytime. And of course not everything is profiled (though hopefully tested).
@jreback Travis is giving this a green - looks good to go. I'm seeing some good candidates for investigation, so I'll try and find some time to dig in over the next few weeks.
@qwhelan thank you sir!
@jreback I noticed a codepath asymmetry in core.frame.count that leads to a substantial difference in dropna() performance depending on the axis. Using the path that df.dropna(axis=0) takes yields a 2.5-5x improvement.
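For readers unfamiliar with the two axes involved, the following is a small sketch of what count() and dropna() return along each axis. The frame, its column names, and the positions of the missing values are assumptions chosen for illustration; the snippet shows behavior only and makes no claim about timings or about pandas internals.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 float frame with two missing values.
df = pd.DataFrame(
    np.arange(12, dtype="float64").reshape(4, 3),
    columns=["a", "b", "c"],
)
df.iloc[1, 0] = np.nan  # row 1, column "a"
df.iloc[2, 2] = np.nan  # row 2, column "c"

# Non-null counts per column (axis=0) and per row (axis=1).
col_counts = df.count(axis=0)
row_counts = df.count(axis=1)

# dropna(axis=0) drops rows containing a NaN;
# dropna(axis=1) drops columns containing a NaN.
rows_kept = df.dropna(axis=0)
cols_kept = df.dropna(axis=1)

print(col_counts.tolist(), row_counts.tolist())
print(rows_kept.index.tolist(), cols_kept.columns.tolist())
```

Both axes answer the same kind of question (how many non-null entries lie along a line of the frame), which is why one would expect them to perform similarly.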