ENH: numexpr on boolean frames #2925

Merged
merged 3 commits into from
Mar 9, 2013

jreback
Contributor

@jreback jreback commented Feb 25, 2013

simple use of numexpr for boolean-type frame comparisons

name                                                                    
indexing_dataframe_boolean                 13.3200   125.3521     0.1063
frame_mult                                 21.7157    36.6353     0.5928
frame_add                                  22.0432    36.5021     0.6039
frame_multi_and                           385.4241   407.4359     0.9460
  • optional use of numexpr (though we ought to recommend it highly)
  • example usage; there are many more cases like this

frame_multi_and is the case we are talking about below: a boolean expression and-ed with another.
Here we can only optimize each sub-expression, not the 3 terms together
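
To make the distinction concrete, here is a minimal sketch (not the code in this PR) contrasting term-by-term evaluation with handing the whole expression to numexpr. The function names are hypothetical, and numexpr is treated as optional, so the sketch falls back to NumPy when it is missing.

```python
import numpy as np

try:
    import numexpr as ne  # optional accelerator
except ImportError:
    ne = None


def mask_by_terms(a, b, c):
    # Each comparison and each `&` produces a full-size temporary array.
    return (a > 0) & (b > 0) & (c > 0)


def mask_whole_expression(a, b, c):
    # Hypothetical helper: give numexpr all three terms at once so it can
    # evaluate them block by block without the intermediate masks.
    if ne is None:
        return mask_by_terms(a, b, c)  # plain NumPy fallback
    return ne.evaluate(
        "(a > 0) & (b > 0) & (c > 0)",
        local_dict={"a": a, "b": b, "c": c},
    )


rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal(100_000) for _ in range(3))
same = np.array_equal(mask_by_terms(a, b, c), mask_whole_expression(a, b, c))
```

Both functions return the same mask; only the second gives numexpr a chance to optimize across terms.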

original issue is #724

here are some runs (on travis ci) for these 4 routines

  • with no numexpr (no_ne)
  • with numexpr single threaded (_st)
  • with numexpr (no suffix)

So numexpr helps in the boolean case both with and w/o parallelization.
For add/multiply it is the parallelization that helps (numexpr alone doesn't do much)

frame_mult                              :    27.0671 [ms]
frame_mult_st                           :    85.1929 [ms]
frame_mult_no_ne                        :    86.2602 [ms]
frame_add                               :    26.3325 [ms]
frame_add_st                            :    74.8805 [ms]
frame_add_no_ne                         :    73.9357 [ms]
indexing_dataframe_boolean              :    15.0413 [ms]
indexing_dataframe_boolean_st           :    25.8873 [ms]
indexing_dataframe_boolean_no_ne        :   288.4722 [ms]
frame_multi_and                         :   637.8641 [ms]
frame_multi_and_st                      :   675.7760 [ms]
frame_multi_and_no_ne                   :   589.3791 [ms]

@ghost

ghost commented Feb 25, 2013

hell yeah.

@jreback
Contributor Author

jreback commented Feb 25, 2013

@y-p not sure how to test these setup.py changes...anything look weird?
and is the notion of 'highly recommended dependencies' ok?

@jreback
Contributor Author

jreback commented Feb 25, 2013

@y-p also...can add bottleneck to travis build for full-dep? (fyi bottleneck optional dep has been there a long time....not sure why we don't have it listed....)

and maybe numexpr/bottleneck for all runs? (though I think numexpr currently only supports python 2.X, with 3.X being worked on)
bottleneck is good for 3.X

@ghost

ghost commented Feb 25, 2013

IMO dependencies are either optional or they're not, so I'd put this under
optional, and maybe issue a "you really should" warning at install if the deps are unavailable.

I think wes considers performance to be a core feature of pandas, so unless it
limits the supported platforms, something providing 10x performance should become the default.
So, if this becomes integrated into more parts of pandas (and that would be awesome)
then it should probably become a required dep (does numexpr support all platforms?).

I have minimal knowledge of setuptools, but as I understand it you're using
extras_require incorrectly, and even if you were using it correctly it's not what you want, since those deps
won't get installed unless some pretty obscure things are done.

go ahead and add this to full-dep, please update ci/print_version as well.
As long as it's not a hard dependency, it shouldn't be part of the other builds.

I'm happy to give my opinion, but policy is really up to wes and I can't speak for him.
Mostly, I just merge the docstring fix PRs ;)

@jreback
Contributor Author

jreback commented Feb 25, 2013

ok...thanks....issue is that numexpr currently is not py3 ready (I think 2.02 will do that).....so what do we do about that?
can you have a dependency only for 2.X?

@ghost

ghost commented Feb 25, 2013

I suspect numexpr is built from source when installed with pip, and FULL_DEPS
is already pretty close to the travis time-limit, so try installing it with apt-get, similar to how scipy
is installed.

Not sure if you mean for travis or setup.py about 2.x/3.x deps, but the answer is yes to both,
and both already do that for certain packages.

@jreback
Contributor Author

jreback commented Feb 25, 2013

@y-p

do you know any way to defer evaluation of something like:

df[(df.A > 0) & (df.B > 0)]

I can think of using a lambda
df[lambda x: (x.A > 0) & (x.B > 0)]
or string
df['(df.A > 0) & (df.B > 0)']

but these both change the 'interface'

?

@ghost

ghost commented Feb 25, 2013

Python is a strict language, so no way I'm aware of that doesn't
change the interface. You could implement a lazy/promise style
for attribute access, like:

df[(df.lazy('A') > 0) & (df.lazy('B') > 0)]
and then do special handling in __getitem__, but that would have
to grow into something pretty large, since it doesn't make sense
only as a special case.
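
As a rough illustration of the lazy/promise idea, here is a toy sketch in plain Python. `Frame` and `Lazy` are hypothetical stand-ins, not pandas classes, and only `>` and `&` are implemented. The point is that `__getitem__` receives an unevaluated expression tree, so it sees the whole expression at once.

```python
import operator


class Lazy:
    """Deferred expression node; nothing is computed until evaluate()."""

    def __init__(self, fn, text):
        self._fn = fn      # callable taking one row (a dict) and returning a value
        self.text = text   # textual form of the whole expression

    def evaluate(self, row):
        return self._fn(row)

    def _binary(self, other, op, symbol):
        if not isinstance(other, Lazy):
            # Wrap plain constants so both operands are Lazy nodes.
            other = Lazy(lambda row, value=other: value, repr(other))
        return Lazy(
            lambda row: op(self.evaluate(row), other.evaluate(row)),
            f"({self.text} {symbol} {other.text})",
        )

    def __gt__(self, other):
        return self._binary(other, operator.gt, ">")

    def __and__(self, other):
        return self._binary(other, operator.and_, "&")


class Frame:
    """Toy column store: a dict mapping column names to equal-length lists."""

    def __init__(self, columns):
        self.columns = columns

    def lazy(self, name):
        return Lazy(lambda row: row[name], name)

    def __getitem__(self, key):
        if not isinstance(key, Lazy):
            raise TypeError("this sketch only supports Lazy indexers")
        n_rows = len(next(iter(self.columns.values())))
        rows = [{k: v[i] for k, v in self.columns.items()} for i in range(n_rows)]
        keep = [i for i, row in enumerate(rows) if key.evaluate(row)]
        return Frame({k: [v[i] for i in keep] for k, v in self.columns.items()})


frame = Frame({"A": [1, -1, 2, 3], "B": [1, 2, -3, 4]})
expr = (frame.lazy("A") > 0) & (frame.lazy("B") > 0)
selected = frame[expr]
```

Here `expr.text` holds the full expression as a string, which is the form an engine like numexpr could take in one piece. Covering every operator and alignment rule is where this would "grow into something pretty large".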

@jreback
Contributor Author

jreback commented Feb 25, 2013

k...that's what I thought......

of course I think you made the suggestion before to accept a lambda expression as an indexer...which
essentially solves the problem anyhow
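
For reference, later pandas releases do accept a callable as an indexer, so the lambda form can be sketched as below (the column names and data are made up). Note that deferring the call does not by itself hand the whole expression to numexpr: the body of the lambda still runs term by term once it is called.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -1, 2, 3], "B": [1, 2, -3, 4]})

# Eager: the mask is built before __getitem__ is called.
eager = df[(df.A > 0) & (df.B > 0)]

# Deferred: __getitem__ receives the function and calls it with the frame.
deferred = df[lambda x: (x.A > 0) & (x.B > 0)]
```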

@ghost

ghost commented Feb 25, 2013

you're right, lambda is basically the same thing. Scheme promises are actually lambdas.

@jseabold
Contributor

I think an alternative (less verbose) interface using strings could be a nice addition but then you may end up writing a mini query language. I've often wanted something like this for dropping and selecting variables. Perhaps select could be extended to do this?

df.select("A > 0 & B > 0", axis=1)

or something.

@jreback
Contributor Author

jreback commented Feb 25, 2013

this actually drops very easily into numexpr......I was trying NOT to force the user to do that...
I think this is what patsy does, right?

reason of course is that if I give the entire expression to numexpr (rather than the terms individually) it is MUCH faster
fyi...I actually have to 'parse' it anyhow to make sure things are aligned and such....so a lambda is a bit easier....

@jseabold
Contributor

Yeah patsy uses tokenize to do parsing of arbitrary expressions. Working with strings vs. using lambda would likely be much more work in protecting the user from himself, but it's less typing for users. I'm not saying it should be part of this PR or anything, just a thought I've had in the back of my mind. My motivation is that it's a bit gentler for users who are not necessarily advanced pythonistas.
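
To give a feel for what "protecting the user from himself" involves, here is a small sketch of a string-based row filter. It uses Python's `ast` module rather than `tokenize`, and `select_rows` is a hypothetical helper, not a pandas or patsy API. It whitelists node types before evaluating the expression row by row over a dict of lists.

```python
import ast

# Node types the mini-language accepts; anything else (calls, attribute
# access, subscripts, ...) is rejected before evaluation.
_ALLOWED = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.BinOp, ast.BitAnd,
    ast.BitOr, ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq,
    ast.NotEq, ast.Name, ast.Load, ast.Constant,
)


def select_rows(columns, expr):
    """Return the indices of rows for which `expr` is true."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in columns:
            raise KeyError(node.id)
    code = compile(tree, "<expr>", "eval")
    n_rows = len(next(iter(columns.values())))
    return [
        i
        for i in range(n_rows)
        if eval(code, {"__builtins__": {}}, {k: v[i] for k, v in columns.items()})
    ]


columns = {"A": [1, -1, 2, 3], "B": [1, 2, -3, 4]}
rows = select_rows(columns, "(A > 0) & (B > 0)")
```

The parentheses matter under Python's grammar: `&` binds tighter than `>`, so `A > 0 & B > 0` parses as the chained comparison `A > (0 & B) > 0`. A real mini query language would have to decide whether to keep that precedence.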

@jreback
Contributor Author

jreback commented Feb 26, 2013

This is like a dependency that is optional, but should be installed if possible (from the thread below)
the extras_require is 'supposed' to install numexpr if it can, but not fail the whole install if it cannot

http://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py

@ghost

ghost commented Feb 26, 2013

I was reading here

These requirements will not be automatically installed unless another 
package depends on them (directly or indirectly) by including the desired “extras”
in square brackets after the associated project name. (Or if the extras were listed 
in a requirement spec on the EasyInstall command line.)
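
In setup.py terms, the behaviour described in that passage looks roughly like the fragment below. This is a sketch only: the package name and the extra name `performance` are illustrative assumptions, not what this PR uses.

```python
# Hypothetical keyword arguments for setuptools.setup(); nothing here is
# taken from the pandas setup.py.
setup_kwargs = dict(
    name="example-package",
    version="0.1",
    # Always installed along with the package.
    install_requires=["numpy"],
    # Installed only when the extra is requested explicitly.
    extras_require={"performance": ["numexpr", "bottleneck"]},
)

# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**setup_kwargs)
```

With this layout, `pip install example-package` pulls in only numpy, while `pip install "example-package[performance]"` also installs numexpr and bottleneck. There is no "install if possible, otherwise skip" mode in between.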

@jreback
Contributor Author

jreback commented Feb 26, 2013

reading again...I think you are right!

ok....so how to get the installers to install numexpr and bottleneck, but w/o breaking the install process?

@jreback
Contributor Author

jreback commented Feb 27, 2013

@wesm thoughts on the setup issue (require numexpr, or make it highly recommended)?
(same for bottleneck too I think)

@alvorithm

Just a +1 for numexpr all over the place and a +1 for df["col1 boolop col2 boolop col3**2"] syntax sugar. Prefixing columns with the df name, usually longer than 'df' is a lot of typing, plus the whole-expression optimization argument.

@wesm
Member

wesm commented Mar 8, 2013

How much of the numexpr performance win is coming from parallelization?

@wesm
Member

wesm commented Mar 8, 2013

I'm +1 on numexpr but only as an optional dependency

@jreback
Contributor Author

jreback commented Mar 8, 2013

this should be completely optional: it detects whether numexpr is installed and otherwise just uses the regular code path

I modified setup to install it if it can, not sure if that part works though
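
The detect-and-fall-back pattern can be sketched as follows. This is a simplified illustration, not the module added in this PR. The `evaluate` signature and the `_MIN_ELEMENTS` threshold are assumptions made for the example.

```python
import operator

import numpy as np

try:
    import numexpr as ne
    _HAVE_NUMEXPR = True
except ImportError:
    ne = None
    _HAVE_NUMEXPR = False

# Assumed cutoff: below this size the numexpr overhead is unlikely to pay off.
_MIN_ELEMENTS = 10_000


def evaluate(op, op_str, a, b, use_numexpr=True):
    """Apply a binary operator, using numexpr when available and worthwhile."""
    if use_numexpr and _HAVE_NUMEXPR and a.size >= _MIN_ELEMENTS:
        try:
            return ne.evaluate(f"a {op_str} b", local_dict={"a": a, "b": b})
        except Exception:
            # Unsupported dtype or operation: fail over to the regular path.
            pass
    return op(a, b)


x = np.arange(20_000, dtype="float64")
y = np.ones(20_000)
total = evaluate(operator.add, "+", x, y)
mask = evaluate(operator.gt, ">", x, y)
```

Passing `use_numexpr=False` forces the plain NumPy path, which is handy for comparing results in tests.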

@jreback
Contributor Author

jreback commented Mar 8, 2013

I added a single-threaded variant to vbench (the normal run uses
as many threads as there are cores)

so a fair amount of the speedup IS due to parallelization
(these are all WITH numexpr; they come from a travis ci run,
so they are not comparable to the numbers above)

indexing_dataframe_boolean_st           :    41.7481 [ms]
indexing_dataframe_boolean              :    16.3809 [ms]
frame_add_st                            :    85.0020 [ms]
frame_add                               :    25.4194 [ms]
frame_mult_st                           :    80.9674 [ms]
frame_mult                              :    28.3772 [ms]
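
A single-threaded versus default comparison of this kind can be reproduced outside vbench with something like the sketch below. `time_add` is a hypothetical helper. It relies on `numexpr.set_num_threads`, which returns the previous setting, and it returns `None` when numexpr is not installed.

```python
import timeit

import numpy as np

try:
    import numexpr as ne
except ImportError:
    ne = None


def time_add(n=1_000_000, number=5):
    """Return (single_threaded, default_threads) timings in seconds."""
    if ne is None:
        return None
    rng = np.random.default_rng(0)
    operands = {"a": rng.standard_normal(n), "b": rng.standard_normal(n)}

    def run():
        return ne.evaluate("a + b", local_dict=operands)

    default = min(timeit.repeat(run, number=number, repeat=3))
    previous = ne.set_num_threads(1)  # switch to one thread, remember the old value
    try:
        single = min(timeit.repeat(run, number=number, repeat=3))
    finally:
        ne.set_num_threads(previous)  # restore the earlier thread count
    return single, default


timings = time_add(n=100_000, number=2)
```

Absolute numbers depend heavily on the machine and array size, so only the ratio between the two timings is meaningful.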

@jreback jreback closed this Mar 8, 2013
@jreback jreback reopened this Mar 8, 2013
jreback added 2 commits March 8, 2013 21:20
BLD: add highly recommended dependencies section and setup.py changes to optionally build these

ENH: add binary_ops vbench
     add binary_ops accelerations via ne
     created expressions.py core module

ENH: added support for remainder of comp and compare methods
     added vbench for add/mult

BUG: catching error in boolean comparisons if type is int and trying to put NA
     fixed failing tests from boolean comparisons
     removed radd operations from numexpr
     numexpr to now ignore a failing type operation and fail over

TST: failing test because float16 not upcasted on 32-bit....weird

TST: added test_expressions test suite
     added ability to turn off usage of numexpr (mainly for testing)
     restricted ops to allowed dtypes and min shape of input
TST: updating test_expressions to test with 1 and num_cores

ENH: added module level function in core/expressions to change use of numexpr

BUG: using set_use_numexpr now changes the evaluation functions
@jreback
Contributor Author

jreback commented Mar 9, 2013

@wesm I updated vbench to do all the cases (no numexpr, numexpr single-threaded, numexpr multi-threaded). Seems to do the right thing when multi-threads are enabled. It is completely optional (like bottleneck), so I think it's a worthwhile addition.

@wesm
Member

wesm commented Mar 9, 2013

Agreed. thanks

@jreback
Contributor Author

jreback commented Mar 9, 2013

Do you know any easy way to 'encourage' users to install numexpr/bottleneck? (e.g. make it a dependency that the package managers should try to install, but not fail if it doesn't work)

@wesm
Member

wesm commented Mar 9, 2013

As long as it's stated in the README I think people who care about speed will install them.

jreback added a commit that referenced this pull request Mar 9, 2013
ENH: numexpr on boolean frames
@jreback jreback merged commit 13f54e5 into pandas-dev:master Mar 9, 2013
Labels: Enhancement, Performance