ENH: numexpr on boolean frames #2925

jreback · 2013-02-25T14:49:58Z

simple use of numexpr for boolean-type frame comparisons

                                            t_head     t_base      ratio
name
indexing_dataframe_boolean                 13.3200   125.3521     0.1063
frame_mult                                 21.7157    36.6353     0.5928
frame_add                                  22.0432    36.5021     0.6039
frame_multi_and                           385.4241   407.4359     0.9460

optional use of numexpr (though we ought to suggest it highly)
example usage, many more cases like this

frame_multi_and is the case we are talking about below: a boolean expression anded with another.
numexpr can only optimize each sub-expression, not the 3 terms together

original issue is #724

here are some runs (on travis ci) for these 4 routines

with no numexpr (_no_ne)
with numexpr single threaded (_st)
with numexpr (no suffix)

So numexpr helps in the boolean case with and w/o parallelization
for add/multiply, the parallelization helps (and numexpr alone doesn't do much)

frame_mult                              :    27.0671 [ms]
frame_mult_st                           :    85.1929 [ms]
frame_mult_no_ne                        :    86.2602 [ms]
frame_add                               :    26.3325 [ms]
frame_add_st                            :    74.8805 [ms]
frame_add_no_ne                         :    73.9357 [ms]
indexing_dataframe_boolean              :    15.0413 [ms]
indexing_dataframe_boolean_st           :    25.8873 [ms]
indexing_dataframe_boolean_no_ne        :   288.4722 [ms]
frame_multi_and                         :   637.8641 [ms]
frame_multi_and_st                      :   675.7760 [ms]
frame_multi_and_no_ne                   :   589.3791 [ms]
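
To make the comparison easier to read, the timings above can be turned into speedup factors relative to the no-numexpr run. This is only a small sketch over the numbers quoted above; the dict layout and the `speedup` helper are made up for illustration and are not part of the PR.

```python
# Timings in ms, copied from the run above:
# (numexpr multi-threaded, numexpr single-threaded, no numexpr)
timings = {
    "frame_mult": (27.0671, 85.1929, 86.2602),
    "frame_add": (26.3325, 74.8805, 73.9357),
    "indexing_dataframe_boolean": (15.0413, 25.8873, 288.4722),
    "frame_multi_and": (637.8641, 675.7760, 589.3791),
}


def speedup(name):
    """Return (multi-threaded, single-threaded) speedup over the no-numexpr run."""
    multi, single, no_ne = timings[name]
    return no_ne / multi, no_ne / single


for name in timings:
    multi_x, single_x = speedup(name)
    print("%-28s numexpr: %5.2fx   single-threaded: %5.2fx" % (name, multi_x, single_x))
```

On these numbers the boolean indexing case is roughly 19x faster with numexpr and roughly 11x faster single-threaded, add/mult only gain (about 3x) when threads are enabled, and frame_multi_and comes out below 1x in both numexpr runs.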

ghost · 2013-02-25T15:11:06Z

hell yeah.

jreback · 2013-02-25T16:23:16Z

@y-p not sure how to test these setup.py changes...anything look weird?
and is the notion of 'highly recommended dependencies' ok?

jreback · 2013-02-25T16:44:35Z

@y-p also...can add bottleneck to travis build for full-dep? (fyi bottleneck optional dep has been there a long time....not sure why we don't have it listed....)

and maybe numexpr/bottleneck for all runs? (though numexpr currently is only python 2.X I think), 3.X being worked on
bottleneck good for 3.X

ghost · 2013-02-25T16:50:24Z

IMO dependencies are either optional or they're not, so I'd put this under
optional, and maybe issue a "you really should" warning at install if the deps are unavailable.

I think wes considers performance to be a core feature of pandas, so unless it
limits the supported platforms, something providing 10x performance should become the default.
So, if this becomes integrated into more parts of pandas (and that would be awesome)
then it should probably become a required dep (does numexpr support all platforms?).

I have minimal knowledge of setuptools, but as I understand it you're using
extras_require incorrectly, and even if you were using it correctly it's not what you want, since those deps
won't get installed unless some pretty obscure things are done.

go ahead and add this to full-dep, please update ci/print_version as well.
As long as it's not a hard dependency, it shouldn't be part of the other builds.

I'm happy to give my opinion, but policy is really up to wes and I can't speak for him.
Mostly, I just merge the docstring fix PRs ;)

jreback · 2013-02-25T16:56:36Z

ok...thanks....issue is that numexpr currently is not py3 ready (I think 2.02 will do that).....so what do we do about that?
can u have a dependency only for 2.X?

ghost · 2013-02-25T17:01:53Z

I suspect numexpr is built from source when installed with pip, and FULL_DEPS
is already pretty close to the travis time-limit, try to install with apt-get, similar to how scipy
is installed.

Not sure if you mean for travis or setup.py about 2.x/3.x deps, but the answer is yes to both,
and both already do that for certain packages.

jreback · 2013-02-25T19:11:56Z

@y-p

do you know any way to defer evaluation of something like:

df[(df.A > 0) & (df.B > 0)]

I can think of using a lambda
df[lambda x: (df.A > 0) & (df.B > 0)]
or string
df['(df.A > 0) & (df.B > 0)']

but these both change the 'interface'

?

ghost · 2013-02-25T19:57:31Z

Python is a strict language, so no way I'm aware of that doesn't
change the interface. You could implement a lazy/promise style
for attribute access, like:

df[(df.lazy('A') > 0) & (df.lazy('B') > 0)]
and then do special handling in __getitem__, but that would have
to grow into something pretty large, since it doesn't make sense
only as a special case.
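
A rough sketch of what the lazy/promise idea could look like, using a toy frame backed by plain lists rather than pandas. Everything here (`Lazy`, `LazyFrame`, the `lazy` method) is hypothetical and only illustrates deferring evaluation until `__getitem__`; it is not how pandas works.

```python
import operator


class Lazy:
    """A deferred expression: nothing is computed until evaluate() is called."""

    def __init__(self, func):
        self._func = func  # callable taking a dict of column name -> list

    def evaluate(self, columns):
        return self._func(columns)

    def _binary(self, other, op):
        def run(columns):
            left = self.evaluate(columns)
            if isinstance(other, Lazy):
                right = other.evaluate(columns)
                return [op(l, r) for l, r in zip(left, right)]
            # Scalar on the right-hand side: apply it to every element.
            return [op(l, other) for l in left]

        return Lazy(run)

    def __gt__(self, other):
        return self._binary(other, operator.gt)

    def __and__(self, other):
        return self._binary(other, operator.and_)


class LazyFrame:
    """Toy stand-in for a DataFrame: a dict of equal-length column lists."""

    def __init__(self, columns):
        self.columns = columns

    def lazy(self, name):
        # Returns a promise for the column instead of the column itself.
        return Lazy(lambda columns: columns[name])

    def __getitem__(self, key):
        if isinstance(key, Lazy):
            # The whole expression tree is available here, before any work is done.
            mask = key.evaluate(self.columns)
            return LazyFrame({
                name: [v for v, keep in zip(values, mask) if keep]
                for name, values in self.columns.items()
            })
        return self.columns[key]


df = LazyFrame({"A": [1, -1, 2], "B": [1, 1, -2]})
expr = (df.lazy("A") > 0) & (df.lazy("B") > 0)  # no comparison has run yet
print(df[expr].columns)
```

Even this toy version needs an operator overload per supported operation, which is the "would have to grow into something pretty large" concern.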

jreback · 2013-02-25T19:59:56Z

k...that's what I thought......

of course I think you made the suggestion before to accept a lambda expression as an indexer...which
essentially solves the problem anyhow
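
For illustration, accepting a callable as an indexer could be sketched like this on a toy frame of plain lists. `CallableFrame` is a hypothetical stand-in, not pandas code; the point is only that the expression is not evaluated until `__getitem__` calls it.

```python
class CallableFrame:
    """Toy stand-in for a DataFrame: a dict of equal-length column lists."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, key):
        if callable(key):
            # The mask is only computed here, inside __getitem__,
            # so the frame controls when (and how) the expression runs.
            mask = key(self)
            return CallableFrame({
                name: [v for v, keep in zip(values, mask) if keep]
                for name, values in self.columns.items()
            })
        return self.columns[key]


df = CallableFrame({"A": [1, -1, 2], "B": [1, 1, -2]})
selected = df[lambda x: [a > 0 and b > 0 for a, b in zip(x["A"], x["B"])]]
print(selected.columns)  # {'A': [1], 'B': [1]}
```

Note that the callable is opaque: the frame can delay it, but it cannot inspect the terms inside it the way a string or an expression tree would allow.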

ghost · 2013-02-25T20:00:59Z

you're right, lambda is basically the same thing. Scheme promises are actually lambdas.

jseabold · 2013-02-25T20:02:56Z

I think an alternative (less verbose) interface using strings could be a nice addition but then you may end up writing a mini query language. I've often wanted something like this for dropping and selecting variables. Perhaps select could be extended to do this?

df.select("A > 0 & B > 0", axis=1)

or something.

jreback · 2013-02-25T20:07:25Z

this actually drops very easily into numexpr......I was trying NOT to force the user to do that...
I think this is what patsy does, right?

reason of course is that if I give the entire expression to numexpr (rather than the terms individually) it is MUCH faster
fyi...I actually have to 'parse' it anyhow to make sure things are aligned and such....so a lambda is a bit easier....

jseabold · 2013-02-25T20:16:20Z

Yeah patsy uses tokenize to do parsing of arbitrary expressions. Working with strings vs. using lambda would likely be much more work in protecting the user from himself, but it's less typing for users. I'm not saying it should be part of this PR or anything, just a thought I've had in the back of my mind. My motivation is that it's a bit gentler for users who are not necessarily advanced pythonistas.
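
One concrete example of "protecting the user from himself": in Python's grammar `&` binds tighter than `>`, so a string like "A > 0 & B > 0" does not mean what it looks like. The sketch below uses the standard-library `ast` module (not `tokenize`, which is what patsy is said to use above) to show this; the helper names are made up for illustration.

```python
import ast


def column_names(expr):
    """Parse a query string as a Python expression and return the names it uses."""
    tree = ast.parse(expr, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


def is_chained_comparison(expr):
    """True when the top-level node is a comparison with more than one operator."""
    body = ast.parse(expr, mode="eval").body
    return isinstance(body, ast.Compare) and len(body.ops) > 1


print(column_names("A > 0 & B > 0"))               # ['A', 'B']
print(is_chained_comparison("A > 0 & B > 0"))      # True: parsed as A > (0 & B) > 0
print(is_chained_comparison("(A > 0) & (B > 0)"))  # False: top-level node is the & operation
```

A string-based interface would have to either require the parentheses or rewrite the precedence itself, which is part of the extra work mentioned above.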

jreback · 2013-02-26T19:46:48Z

This is like a dependency that is optional, but should be installed if possible (from the thread below)
the extras_require is 'supposed' to install numexpr if it can, but not fail the whole install if it cannot

http://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py

ghost · 2013-02-26T20:01:30Z

I was reading here

These requirements will not be automatically installed unless another 
package depends on them (directly or indirectly) by including the desired “extras”
in square brackets after the associated project name. (Or if the extras were listed 
in a requirement spec on the EasyInstall command line.)
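
To illustrate what that passage means in practice, here is a minimal sketch of building an `extras_require` mapping, including the Python 2-only case discussed earlier. The extra name "performance" and the `build_extras` helper are hypothetical, and the assumption that numexpr is Python 2 only is taken from this thread as of 2013.

```python
import sys


def build_extras(version_info=sys.version_info):
    """Return an extras_require mapping; the "performance" extra name is hypothetical."""
    extras = ["bottleneck"]
    # Assumption taken from the discussion: numexpr was Python 2 only at the time.
    if version_info[0] == 2:
        extras.append("numexpr")
    return {"performance": extras}


# In setup.py this would be passed along the lines of:
#     setup(..., extras_require=build_extras())
# The listed packages are NOT installed by a plain install of the project;
# they are only pulled in when the extra is requested explicitly, e.g.:
#     pip install "pandas[performance]"
print(build_extras())
```

So extras are an opt-in mechanism, which is why they don't give the "install it if you can, but don't fail" behaviour being asked for.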

jreback · 2013-02-26T20:11:57Z

reading again...I think you are right!

ok....so how to get the installers to install numexpr and bottleneck, but w/o breaking the install process?

jreback · 2013-02-27T22:51:20Z

@wesm thoughts on the setup issue (to require numexpr or make it highly recommended)?
(same for bottleneck too I think)

alvorithm · 2013-03-07T10:36:45Z

Just a +1 for numexpr all over the place and a +1 for df["col1 boolop col2 boolop col3**2"] syntax sugar. Prefixing columns with the df name, usually longer than 'df' is a lot of typing, plus the whole-expression optimization argument.

wesm · 2013-03-08T20:57:41Z

How much of the numexpr performance win is coming from parallelization?

wesm · 2013-03-08T20:57:52Z

I'm +1 on numexpr but only as an optional dependency

jreback · 2013-03-08T21:06:10Z

this should be completely optional, it detects whether numexpr is installed and otherwise just uses the regular code path

I modified setup to install it if it can, not sure if that part works though
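
The detect-and-fall-back behaviour can be sketched roughly as follows. This is not the code in the PR's expressions module: the `evaluate` helper, the `_MIN_ELEMENTS` threshold value and the operator table are assumptions made for illustration, loosely following the "min shape of input" and "turn off usage of numexpr" notes in the commit message.

```python
import importlib.util
import operator

# Detect the optional dependency once, at import time.
_HAVE_NUMEXPR = importlib.util.find_spec("numexpr") is not None

# Hypothetical threshold: below this many elements the plain path is used.
_MIN_ELEMENTS = 10000

# Regular (non-numexpr) implementations of the supported operations.
_FALLBACK = {
    "+": operator.add,
    "*": operator.mul,
    ">": operator.gt,
    "&": operator.and_,
}


def evaluate(op, a, b, use_numexpr=True):
    """Apply a binary op with numexpr when available and worthwhile, else plainly."""
    size = getattr(a, "size", 0)  # array-likes expose .size; scalars count as 0
    if use_numexpr and _HAVE_NUMEXPR and size >= _MIN_ELEMENTS:
        import numexpr as ne

        return ne.evaluate("a %s b" % op, local_dict={"a": a, "b": b})
    return _FALLBACK[op](a, b)


print(evaluate("+", 2, 3))  # small inputs always take the regular path
```

Callers never need to know whether numexpr is installed; passing `use_numexpr=False` forces the regular path, which is handy for testing both branches.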

jreback · 2013-03-08T23:18:52Z

I added single-threaded runs to vbench (normally numexpr uses
the number of cores)

so a fair amount of the speedup IS due to parallelization
(these are all WITH numexpr; they come from a travis ci run,
so they're not comparable to the numbers above)

indexing_dataframe_boolean_st           :    41.7481 [ms]
indexing_dataframe_boolean              :    16.3809 [ms]
frame_add_st                            :    85.0020 [ms]
frame_add                               :    25.4194 [ms]
frame_mult_st                           :    80.9674 [ms]
frame_mult                              :    28.3772 [ms]

BLD: add highly recommended dependencies section and setup.py changes to optionally build these
ENH: add binary_ops vbench add binary_ops accelerations via ne created expressions.py core module
ENH: added support for remainder of comp and compare methods added vbench for add/mult
BUG: catching error in boolean comparisons if type is int and trying to put NA fixed failing tests from boolean comparisons removed radd operations from numexpr numexpr to now ignore a failing type operation and fail over
TST: failing test because float16 not upcasted on 32-bit....weird
TST: added test_expressions test suite added ability to turn off usage of numexpr (mainly for testing) restricted ops to allowed dtypes and min shape of input

TST: updating test_expressions to test with 1 and num_cores
ENH: added module level function in core/expressions to change use of numexpr
BUG: using set_use_numexpr now changes the evaluation functions

jreback · 2013-03-09T15:36:17Z

@wesm I updated vbench to do all the cases (no numexpr, numexpr single-threaded, numexpr multi-threaded). Seems to do the right thing when multi-threads are enabled. It is completely optional (like bottleneck), so I think it's a worthwhile addition.

wesm · 2013-03-09T18:26:06Z

Agreed. thanks

jreback · 2013-03-09T18:35:14Z

Do you know any easy way to 'encourage' users to install numexpr/bottleneck? (e.g. make it a dependency that the package managers should try to install, but not fail if it doesn't work)

wesm · 2013-03-09T18:46:14Z

As long as it's stated in the README I think people who care about speed will install them.

ENH: numexpr on boolean frames

jreback closed this Mar 8, 2013

jreback reopened this Mar 8, 2013

jreback added 2 commits March 8, 2013 21:20

ENH: added ability to use single or multi-threads in numexpr testing

385ff82

CLN: remove setup.py changes

e273828

jreback added a commit that referenced this pull request Mar 9, 2013

Merge pull request #2925 from jreback/compare

13f54e5

ENH: numexpr on boolean frames

jreback merged commit 13f54e5 into pandas-dev:master Mar 9, 2013
