ENH: Add regex=True flag to str_contains #5879

unutbu · 2014-01-08T15:17:03Z

Using regex=False can be faster when full regex searching is not needed.
See http://stackoverflow.com/q/20951840/190597

Example use case:

import string
import itertools as IT
import numpy as np
from pandas import Series

def make_series(letters, strlen, size):
    return Series(
        np.fromiter(IT.cycle(letters), count=size*strlen, dtype='|S1')
        .view('|S{}'.format(strlen)))

many = make_series('matchthis'+string.uppercase, strlen=19, size=10000) # 31% matches
few = make_series('matchthis'+string.uppercase*42, strlen=19, size=10000) # 1% matches

In [115]: %timeit many.str.contains('matchthis')
100 loops, best of 3: 4.93 ms per loop

In [116]: %timeit many.str.contains('matchthis', regex=False)
100 loops, best of 3: 2.44 ms per loop

In [118]: %timeit few.str.contains('matchthis')
100 loops, best of 3: 4.81 ms per loop

In [119]: %timeit few.str.contains('matchthis', regex=False)
100 loops, best of 3: 2.37 ms per loop

jreback · 2014-01-08T15:22:58Z

can you add a vbench for both of these? (with and w/o a regex)....

no vbenches for strings at ALL!

pls add vb_suite/strings.py and add to vb_suite/suite.py

bonus points for adding benches for more string ops (could be another PR if you want)!

jreback · 2014-01-08T15:33:38Z

also, pls add a release notes refering this as an API change (you can put in the 0.13.1 bucket)

cancan101 · 2014-01-08T15:53:12Z

Would is_regex be a better name for the parameter?

jreback · 2014-01-08T15:58:10Z

@cancan101 no regex is used in filter/replace so consistent

unutbu · 2014-01-11T04:42:54Z

I added some Benchmarks. Note that I modified _wrap_result (commit da8d451) because x.str.cat returns a string, and strings have no ndim attribute. Without the change x.str.cat raises an AttributeError.

jreback · 2014-01-11T15:16:33Z

can you post the run of this benchmarks ?

unutbu · 2014-01-11T17:10:48Z

Running

time ./test_perf.sh -b master -t str-contains -r strings

yields

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
strings_lower                                |   3.5593 |   3.8073 |   0.9349 |
strings_title                                |   4.1630 |   4.3960 |   0.9470 |
strings_match                                |   5.2484 |   5.4413 |   0.9645 |
series_value_counts_strings                  |   6.8393 |   7.0876 |   0.9650 |
strings_upper                                |   3.9666 |   4.0997 |   0.9675 |
contains_many                                |   4.8403 |   4.9907 |   0.9699 |
strings_rstrip                               |   3.2744 |   3.3757 |   0.9700 |
strings_get                                  |   2.6247 |   2.7027 |   0.9711 |
contains_few                                 |   4.7243 |   4.8621 |   0.9717 |
strings_lstrip                               |   3.2550 |   3.3450 |   0.9731 |
strings_strip                                |   3.5814 |   3.6740 |   0.9748 |
strings_slice                                |   2.5616 |   2.6247 |   0.9760 |
strings_pad                                  |   4.1074 |   4.1869 |   0.9810 |
strings_startswith                           |   3.5427 |   3.6077 |   0.9820 |
strings_endswith                             |   3.5497 |   3.6040 |   0.9849 |
strings_count                                |   5.4746 |   5.5540 |   0.9857 |
strings_len                                  |   2.1100 |   2.1390 |   0.9864 |
strings_center                               |   4.1567 |   4.2037 |   0.9888 |
strings_repeat                               |   4.1867 |   4.2174 |   0.9927 |
strings_findall                              |   8.4347 |   8.4640 |   0.9965 |
match_strings                                |   0.4900 |   0.4887 |   1.0026 |
strings_extract                              | 570.6987 | 568.8750 |   1.0032 |
strings_replace                              |  12.6190 |  12.5709 |   1.0038 |
strings_join_split                           |  31.8976 |  31.4066 |   1.0156 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [b2bdbf5] : PERF: add benchmarks for every str method
Base   [0bab303] : BLD: python3 unicode issue in print_versions

jreback · 2014-01-11T17:20:45Z

@unutbu grt! hmm.....maybe have a look at the last 3 to see if anything obvious can do? (esp extract)?

jreback · 2014-01-14T18:53:30Z

@unutbu can you squash to a smaller num of commits....

@y-p @jtratner look ok?

ghost · 2014-01-14T19:21:20Z

Where's the perf difference? all those benchmarks look like 1.0x to me, or is that 4% it?

unutbu · 2014-01-14T20:02:24Z

@y-p: There should be no performance gain shown in the Benchmarks (because the default behavior, regex=True, was left unchanged). The regex=False keyword in x.str.contains(pat, regex=False) can result in a 2x perf gain for users when the pattern is a plain string however. See http://stackoverflow.com/q/20951840/190597 for a use case.

jreback · 2014-01-14T20:08:36Z

@unutbu I think that you show these benchmarks, but they are not appearing in the above list; this is a case where you compare across benchmarks
e.g. maybe paste the result of these (can just be in ipython as well - in the top section would be great)

strings_contains_many = Benchmark("many.str.contains('matchthis')", setup)
strings_contains_few = Benchmark("few.str.contains('matchthis')", setup)
strings_contains_many_noregex = Benchmark(
     "many.str.contains('matchthis', regex=False)", setup)
strings_contains_few_noregex = Benchmark(
     "few.str.contains('matchthis',

ghost · 2014-01-14T20:37:01Z

can you add a vbench for both of these? (with and w/o a regex)....

No matter, I tried this here and it does indeed give a nice boost.

+1

unutbu · 2014-01-14T21:16:47Z

@jreback: I'd be glad to but I don't know how.

When I run ./test_perf.sh -b master -t str-contains -r strings

those Benchmarks appear to be missing. And some others (with no strings in its
name), like contains_many and contains_few, are present.

I still don't understand how to use vbench very well. Is there any documentation?
I would like to learn how to run benchmarks from IPython.

jreback · 2014-01-14T21:21:51Z

@unutbu that looks right....

try test.perf.sh -H -r strings. I have always found it has issues when the benchmarks themselvs are new

I just meant to do timeite from IPython....for a quick comparison of old/new that is ok (for display purposes in a PR).

ghost · 2014-01-14T21:43:07Z

The filtering is on the benchmark name (first arg of Benchmark()) not the name of the python variable.

Using regex=False can be faster when full regex searching is not needed. See http://stackoverflow.com/q/20951840/190597 TST: Add a test for str_contains with regex=False BUG: Not all strings.str_* functions return an object with an ndim attribute. PERF: add benchmarks for every str method

jreback · 2014-01-14T22:28:01Z

if regex is False but the pattern is actually a regex should this raise? (not sure if its easy to determine if its an actual regex)

ghost · 2014-01-14T22:34:48Z

any regex is also a valid string, so you can't disambiguate.

unutbu · 2014-01-14T22:39:12Z

Someone might have a list of regex strings and wish to find those that contain a literal string like (.*). So the pattern could look like regex and yet one might still wish to search for it like a plain string. Maybe that's a stretch, but it seems conceivable. :)

@jreback: I tried test_perf.sh -H -r strings and it did find and run the missing benchmarks, but with no comparison to master (of course). I then reran ./test_perf.sh -b master -t str-contains -r strings but there was no change in the benchmarks found.

ghost · 2014-01-14T22:42:47Z

a real world example:

'Mr. Smith'

Do I want to match Mr./Mrs smith or an exact match?

jreback · 2014-01-14T22:42:55Z

yep...ok

unutbu · 2014-01-14T22:52:02Z

@y-p: I think filtering is being done in test_perf.py with

benchmarks = [x for x in benchmarks if re.search(args.regex,x.name)]

and x.name is being set in vbench/benchmark.py by the _get_assigned_name function, which ultimately does this:

varname = line.split('=', 1)[0].strip()
return varname

Moreover,

from vbench.api import Benchmark

foo = Benchmark('bar','baz')
print(foo.name)

prints foo, so it looks like the Benchmark().name attribute is being set to the variable name.

But I must be missing something because I don't know why some Benchmarks are not being run.

Maybe this is a clue: contains_many and contains_few were old variable names for Benchmarks I made in vb_suite/strings.py. They no longer exist in the str-contains commit, nor in master. They do exist in vb_suite/benchmark.db, though. For some reason these benchmarks are getting called when they should not even exist...

ghost · 2014-01-14T22:56:43Z

That explaines why I got different results from you, erase the db and they'll probably vanish.

I had no idea wes used the inspect trick. Usually frowned upon as too much magic.
https://www.youtube.com/watch?v=8e0l_Dt28MQ

Update: I'm trying to do too many things at once, Of course the first arg is not the name.

ENH: Add regex=True flag to str_contains

jreback · 2014-01-15T00:26:43Z

@unutbu thanks for the PR....nice work!

jreback added a commit that referenced this pull request Jan 15, 2014

Merge pull request #5879 from unutbu/str-contains

9d9b6dd

ENH: Add regex=True flag to str_contains

jreback merged commit 9d9b6dd into pandas-dev:master Jan 15, 2014

unutbu deleted the str-contains branch January 15, 2014 11:47

davclark mentioned this pull request Jun 24, 2014

regex=False as default behavior, and expand regex disabling options #7563

Closed

Uh oh!

ENH: Add regex=True flag to str_contains #5879

ENH: Add regex=True flag to str_contains #5879

Uh oh!

Conversation

unutbu commented Jan 8, 2014

Uh oh!

jreback commented Jan 8, 2014

Uh oh!

jreback commented Jan 8, 2014

Uh oh!

cancan101 commented Jan 8, 2014

Uh oh!

jreback commented Jan 8, 2014

Uh oh!

unutbu commented Jan 11, 2014

Uh oh!

jreback commented Jan 11, 2014

Uh oh!

unutbu commented Jan 11, 2014

Uh oh!

jreback commented Jan 11, 2014

Uh oh!

jreback commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

unutbu commented Jan 14, 2014

Uh oh!

jreback commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

unutbu commented Jan 14, 2014

Uh oh!

jreback commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

jreback commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

unutbu commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

jreback commented Jan 14, 2014

Uh oh!

unutbu commented Jan 14, 2014

Uh oh!

ghost commented Jan 14, 2014

Uh oh!

jreback commented Jan 15, 2014

Uh oh!

Uh oh!