ENH: numexpr on boolean frames #2925
hell yeah. |
@y-p not sure how to test these setup.py changes...anything look weird? |
@y-p also...can add bottleneck to travis build for full-dep? (fyi bottleneck optional dep has been there a long time....not sure why we don't have it listed....) and maybe numexpr/bottleneck for all runs? (though numexpr currently is python 2.X only I think, 3.X being worked on) |
IMO dependencies are either optional or they're not, so I'd put this under I think wes considers performance to be a core feature of pandas, so unless it I have minimal knowledge of setuptools, but as I understand it you're using go ahead and add this to full-dep, please update ci/print_version as well. I'm happy to give my opinion, but policy is really up to wes and I can't speak for him. |
ok...thanks....issue is that numexpr currently is not py3 ready (I think 2.02 will do that).....so what do we do about that? |
I suspect numexpr is built from source when installed with pip, and FULL_DEPS Not sure if you mean for travis or setup.py about 2.x/3.x deps, but the answer is yes to both, |
do you know any way to defer evaluation of something like:
I can think of using a lambda but these both change the 'interface' ? |
Python is a strict language, so no way I'm aware of that doesn't df[(df.lazy('A') > 0) & (df.lazy('B') > 0)] |
k...that's what I thought...... of course I think you made the suggestion before to accept a lambda expression as an indexer...which |
you're right, lambda is basically the same thing. Scheme promises are actually lambdas. |
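The lambda-as-indexer idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: `TinyFrame` is a hypothetical stand-in written for this note, not pandas code, and pandas at the time of this thread did not accept callables this way.

```python
class TinyFrame:
    """Hypothetical stand-in for a frame: a dict of equal-length columns."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getitem__(self, key):
        if callable(key):
            # The callable is the deferred expression: nothing is computed
            # until the frame invokes it here, passing itself as the argument.
            mask = key(self)
            return TinyFrame({
                name: [v for v, keep in zip(values, mask) if keep]
                for name, values in self.columns.items()
            })
        return self.columns[key]


df = TinyFrame({"A": [1, -2, 3], "B": [4, 5, -6]})

# The lambda hands the whole condition to __getitem__ unevaluated, so the
# frame (not the caller) decides how and when to evaluate it.
selected = df[lambda f: [a > 0 and b > 0 for a, b in zip(f["A"], f["B"])]]
```

Because the frame receives the condition as one unit, it would be free to pass it to a faster evaluator, which is the motivation raised in the thread.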
I think an alternative (less verbose) interface using strings could be a nice addition but then you may end up writing a mini query language. I've often wanted something like this for dropping and selecting variables. Perhaps select could be extended to do this?
or something. |
this actually drops very easily into numexpr......I was trying NOT to force the user to do that... reason of course is that if I give the entire expression to numexpr (rather than the terms individually) it is MUCH faster |
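To make the "entire expression vs. individual terms" distinction concrete, here is a rough sketch using `numexpr.evaluate`. The function name `multi_and` and the sample data are made up for illustration, and both NumPy and numexpr are treated as optional so the snippet degrades to a no-op when either is missing. It only checks that the two approaches agree; it does not measure speed.

```python
try:
    import numpy as np
    import numexpr as ne
except ImportError:  # both are optional for this sketch
    np = ne = None


def multi_and(a, b, c):
    """Return (term_by_term, whole_expression) masks for three > 0 tests."""
    # Term by term: each comparison is a separate numexpr call, and the
    # intermediate boolean arrays are combined afterwards by NumPy.
    terms = [ne.evaluate("x > 0", local_dict={"x": x}) for x in (a, b, c)]
    term_by_term = terms[0] & terms[1] & terms[2]

    # Whole expression: one call, so numexpr sees all three terms at once.
    whole = ne.evaluate("(a > 0) & (b > 0) & (c > 0)",
                        local_dict={"a": a, "b": b, "c": c})
    return term_by_term, whole


if ne is not None:
    rng = np.random.RandomState(0)
    a, b, c = (rng.randn(1000) for _ in range(3))
    term_by_term, whole = multi_and(a, b, c)
    same = bool(np.array_equal(term_by_term, whole))
else:
    same = None  # dependencies missing; nothing to compare
```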
Yeah patsy uses tokenize to do parsing of arbitrary expressions. Working with strings vs. using lambda would likely be much more work in protecting the user from himself, but it's less typing for users. I'm not saying it should be part of this PR or anything, just a thought I've had in the back of my mind. My motivation is that it's a bit gentler for users who are not necessarily advanced pythonistas. |
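As a small illustration of the tokenize approach mentioned above, the standard-library `tokenize` module can pull the identifiers out of an expression string, which is the first step toward mapping them onto columns. The helper name `referenced_names` is hypothetical, and this is not how patsy itself is implemented; it is only a sketch of the idea.

```python
import io
import keyword
import tokenize


def referenced_names(expr):
    """Return the identifiers used in expr, in order of first appearance."""
    names = []
    for tok in tokenize.generate_tokens(io.StringIO(expr).readline):
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(tok.string)
                and tok.string not in names):
            names.append(tok.string)
    return names


cols = referenced_names("col1 > 0 & col2 < col3**2")
```

A real string interface would still have to validate the names against the frame's columns and restrict the allowed operators, which is the "protecting the user from himself" work referred to above.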
This is like a dependency that is optional, but should be installed if possible (from the thread below) http://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py |
I was reading here
|
reading again...I think you are right! ok....so how to get the installers to install |
@wesm thoughts on the setup issue (to require Numexpr or make it highly recommended )? |
Just a +1 for numexpr all over the place and a +1 for df["col1 boolop col2 boolop col3**2"] syntax sugar. Prefixing columns with the df name, usually longer than 'df' is a lot of typing, plus the whole-expression optimization argument. |
How much of the numexpr performance win is coming from parallelization? |
I'm +1 on numexpr but only as an optional dependency |
this should be completely optional, it will detect and just use regular stuff. I modified setup to install it if it can, not sure if that part works though |
I added to vbench a single threaded (vs normal it uses so a fair amount of the speedup IS due to parallelization
|
BLD: add highly recommended dependencies section and setup.py changes to optionally build these
ENH: add binary_ops vbench
add binary_ops accelerations via ne
created expressions.py core module
ENH: added support for remainder of comp and compare methods
added vbench for add/mult
BUG: catching error in boolean comparisons if type is int and trying to put NA
fixed failing tests from boolean comparisons
removed radd operations from numexpr
numexpr to now ignore a failing type operation and fail over
TST: failing test because float16 not upcasted on 32-bit....weird
TST: added test_expressions test suite
added ability to turn off usage of numexpr (mainly for testing)
restricted ops to allowed dtypes and min shape of input
TST: updating test_expressions to test with 1 and num_cores
ENH: added module level function in core/expressions to change use of numexpr
BUG: using set_use_numexpr now changes the evaluation functions
@wesm I updated vbench to do all the cases (no numexpr, numexpr single-threaded, numexpr multi-threaded). Seems to do the right thing when multi-threading is enabled. It is completely optional (like bottleneck), so I think a worthwhile addition. |
Agreed. thanks |
Do you know any easy way to 'encourage' users to install numexpr/bottleneck? (e.g. make it a dependency that the package managers should try to install, but not fail if it doesn't work) |
As long as it's stated in the README I think people who care about speed will install them. |
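For reference, setuptools has an `extras_require` keyword that lets a package name optional dependency groups, which is the mechanism discussed in the Stack Overflow thread linked earlier. The sketch below only builds the keyword arguments; the extra name `performance` and the helper `setup_kwargs` are hypothetical, and this is not taken from the PR's actual setup.py.

```python
# Hypothetical extra name; the package names come from the discussion above.
EXTRAS_REQUIRE = {
    "performance": ["numexpr", "bottleneck"],
}


def setup_kwargs():
    """Keyword arguments one might pass on to setuptools.setup()."""
    return {
        "name": "pandas",
        "extras_require": EXTRAS_REQUIRE,
    }


kwargs = setup_kwargs()
```

Extras are installed only when the user asks for them by name, and a failed build of an extra still fails that install command. So this does not give the "try, but don't fail" behaviour asked about above, and a README note remains the simpler option.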
simple usage of using numexpr on boolean type frame comparison
frame_multi_and is the case we are talking about below: a boolean expression ANDed with another.
numexpr can only optimize each sub-expression, and not the 3 terms together
original issue is #724
here are some runs (on travis ci), for these 4 routines
So numexpr helps in the boolean case both with and w/o parallelization
For add/multiply, the parallelization helps (and numexpr alone doesn't do much)