Commit 0d8997a

ENH: add in, not in, and string/list query support
1 parent cca0173 commit 0d8997a

23 files changed, +1973 -332 lines changed

bench/bench_with_subset.R

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
library(microbenchmark)
2+
library(data.table)
3+
4+
5+
data.frame.subset.bench <- function (n=1e7, times=30) {
6+
    df <- data.frame(a=rnorm(n), b=rnorm(n), c=rnorm(n))
7+
    print(microbenchmark(subset(df, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c),
8+
                         times=times))
9+
}
10+
11+
12+
# data.table allows something very similar to query with an expression
13+
# but we have chained comparisons AND we're faster BOO YAH!
14+
data.table.subset.expression.bench <- function (n=1e7, times=30) {
15+
    dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n))
16+
    print(microbenchmark(dt[, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c],
17+
                         times=times))
18+
}
19+
20+
21+
# compare against subset with data.table for good measure
22+
data.table.subset.bench <- function (n=1e7, times=30) {
23+
    dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n))
24+
    print(microbenchmark(subset(dt, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c),
25+
                         times=times))
26+
}
27+
28+
29+
data.frame.with.bench <- function (n=1e7, times=30) {
30+
    df <- data.frame(a=rnorm(n), b=rnorm(n), c=rnorm(n))
31+
32+
    print(microbenchmark(with(df, a + b * (c ^ 2 + b ^ 2 - a) / (a * c) ^ 3),
33+
                         times=times))
34+
}
35+
36+
37+
data.table.with.bench <- function (n=1e7, times=30) {
38+
    dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n))
39+
    print(microbenchmark(with(dt, a + b * (c ^ 2 + b ^ 2 - a) / (a * c) ^ 3),
40+
                         times=times))
41+
}
42+
43+
44+
bench <- function () {
45+
    data.frame.subset.bench()
46+
    data.table.subset.expression.bench()
47+
    data.table.subset.bench()
48+
    data.frame.with.bench()
49+
    data.table.with.bench()
50+
}
51+
52+
53+
bench()

bench/bench_with_subset.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
#!/usr/bin/env python
2+
3+
"""
4+
Microbenchmarks for comparison with R's "with" and "subset" functions
5+
"""
6+
7+
from __future__ import print_function
8+
from timeit import repeat as timeit  # timeit.repeat() accepts ``repeat``; timeit.timeit() does not
9+
10+
11+
def bench_with(n=1e7, times=10, repeat=3):
12+
    setup = "from pandas import DataFrame\n"
13+
    setup += "from numpy.random import randn\n"
14+
    setup += "df = DataFrame(randn(%d, 3), columns=list('abc'))\n" % n
15+
    setup += "s = 'a + b * (c ** 2 + b ** 2 - a) / (a * c) ** 3'"
16+
    print('DataFrame.eval:')
17+
    print(timeit('df.eval(s)', setup=setup, repeat=repeat, number=times))
18+
19+
20+
def bench_subset(n=1e7, times=10, repeat=3):
21+
    setup = "from pandas import DataFrame\n"
22+
    setup += "from numpy.random import randn\n"
23+
    setup += "df = DataFrame(randn(%d, 3), columns=list('abc'))\n" % n
24+
    setup += "s = 'a <= b <= (c ** 2 + b ** 2 - a) and b > c'"
25+
    print('DataFrame.query:')
26+
    print(timeit('df.query(s)', setup=setup, repeat=repeat, number=times))
27+
    print('DataFrame.__getitem__:')
28+
    print(timeit('df[s]', setup=setup, repeat=repeat, number=times))
29+
30+
31+
def bench():
32+
    bench_with()
33+
    bench_subset()
34+
35+
36+
if __name__ == '__main__':
37+
    bench()
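
Note on the timing calls in this script: in the standard library, the `repeat` and `number` keywords are accepted together by `timeit.repeat`, while `timeit.timeit` takes only `number`. The following is a minimal sketch of that repeated-timing pattern. The workload (`stmt`, `setup`) is a hypothetical stand-in chosen so the snippet runs without pandas; it is not the benchmark from this commit.

```python
from timeit import repeat

# Hypothetical stand-in workload (an assumption for illustration only);
# the script in this commit times DataFrame.eval and DataFrame.query instead.
setup = "data = list(range(1000))"
stmt = "sum(x * x for x in data)"

n_trials = 3   # how many independent trials to run
n_calls = 10   # how many executions of stmt per trial

# repeat() returns a list with one total elapsed time (in seconds) per trial.
trials = repeat(stmt, setup=setup, repeat=n_trials, number=n_calls)

# The minimum is commonly used as the least noisy summary of the trials.
best_per_call = min(trials) / n_calls

print(trials)
print("best per call: %.6f s" % best_per_call)
```

Printing the whole list, as the script does, leaves the choice of summary statistic to the reader; taking the minimum is one common convention, not the only one.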

doc/source/api.rst

Lines changed: 1 addition & 0 deletions
@@ -514,6 +514,7 @@ Computations / Descriptive Stats
514 514
DataFrame.cumsum
515 515
DataFrame.describe
516 516
DataFrame.diff
517+
DataFrame.eval
517 518
DataFrame.kurt
518 519
DataFrame.mad
519 520
DataFrame.max

doc/source/comparison_with_r.rst

Lines changed: 81 additions & 15 deletions
@@ -1,28 +1,88 @@
1 1
.. currentmodule:: pandas
2 2
.. _compare_with_r:
3 3

4-
*******************************
5 4
Comparison with R / R libraries
6 5
*******************************
7 6

8-
Since pandas aims to provide a lot of the data manipulation and analysis
9-
functionality that people use R for, this page was started to provide a more
10-
detailed look at the R language and it's many 3rd party libraries as they
11-
relate to pandas. In offering comparisons with R and CRAN libraries, we care
12-
about the following things:
7+
Since ``pandas`` aims to provide a lot of the data manipulation and analysis
8+
functionality that people use `R <http://www.r-project.org/>`__ for, this page
9+
was started to provide a more detailed look at the `R language
10+
<http://en.wikipedia.org/wiki/R_(programming_language)>`__ and its many
11+
third-party libraries as they relate to ``pandas``. In comparisons with R and CRAN
12+
libraries, we care about the following things:
13 13

14-
- **Functionality / flexibility**: what can / cannot be done with each tool
15-
- **Performance**: how fast are operations. Hard numbers / benchmarks are
14+
- **Functionality / flexibility**: what can/cannot be done with each tool
15+
- **Performance**: how fast are operations. Hard numbers/benchmarks are
16 16
preferable
17-
- **Ease-of-use**: is one tool easier or harder to use (you may have to be
18-
the judge of this given side-by-side code comparisons)
17+
- **Ease-of-use**: Is one tool easier/harder to use (you may have to be
18+
the judge of this, given side-by-side code comparisons)
19+
20+
This page is also here to offer a bit of a translation guide for users of these
21+
R packages.
22+
23+
Base R
24+
------
25+
26+
|subset|_
27+
~~~~~~~~~~
28+
29+
.. versionadded:: 0.13
30+
31+
The :meth:`~pandas.DataFrame.query` method is similar to the base R ``subset``
32+
function. In R you might want to get the rows of a ``data.frame`` where one
33+
column's values are less than another column's values:
34+
35+
.. code-block:: r
36+
37+
   df <- data.frame(a=rnorm(10), b=rnorm(10))
38+
   subset(df, a <= b)
39+
   df[df$a <= df$b,]  # note the comma
40+
41+
In ``pandas``, there are a few ways to perform subsetting. You can use
42+
:meth:`~pandas.DataFrame.query`, pass an expression string as if it were an
43+
index/slice, or use standard boolean indexing:
44+
45+
.. ipython:: python
46+
47+
   from pandas import DataFrame
48+
   from numpy.random import randn
49+
50+
   df = DataFrame({'a': randn(10), 'b': randn(10)})
51+
   df.query('a <= b')
52+
   df['a <= b']
53+
   df[df.a <= df.b]
54+
   df.loc[df.a <= df.b]
19 55
20-
As I do not have an encyclopedic knowledge of R packages, feel free to suggest
21-
additional CRAN packages to add to this list. This is also here to offer a big
22-
of a translation guide for users of these R packages.
56+
For more details and examples see :ref:`the query documentation
57+
<indexing.query>`.
23 58

24-
data.frame
25-
----------
59+
60+
|with|_
61+
~~~~~~~~
62+
63+
.. versionadded:: 0.13
64+
65+
An expression using a data.frame called ``df`` in R with the columns ``a`` and
66+
``b`` would be evaluated using ``with`` like so:
67+
68+
.. code-block:: r
69+
70+
   df <- data.frame(a=rnorm(10), b=rnorm(10))
71+
   with(df, a + b)
72+
   df$a + df$b  # same as the previous expression
73+
74+
In ``pandas`` the equivalent expression, using the
75+
:meth:`~pandas.DataFrame.eval` method, would be:
76+
77+
.. ipython:: python
78+
79+
   df = DataFrame({'a': randn(10), 'b': randn(10)})
80+
   df.eval('a + b')
81+
   df.a + df.b  # same as the previous expression
82+
83+
In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than
84+
evaluation in pure Python. For more details and examples see :ref:`the eval
85+
documentation <enhancingperf.eval>`.
26 86

27 87
zoo
28 88
---
@@ -36,3 +96,9 @@ plyr
36 96
reshape / reshape2
37 97
------------------
38 98

99+
100+
.. |with| replace:: ``with``
101+
.. _with: http://finzi.psych.upenn.edu/R/library/base/html/with.html
102+
103+
.. |subset| replace:: ``subset``
104+
.. _subset: http://finzi.psych.upenn.edu/R/library/base/html/subset.html
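
For readers who want to try the features named in the commit title, here is a minimal sketch of string-based filtering and evaluation. It assumes a reasonably recent pandas release is installed; the toy frame and its values are invented for illustration and do not come from this commit. It uses only `DataFrame.query`, `DataFrame.eval` and boolean indexing, and deliberately avoids the `df['a <= b']` form shown in the diff, which may not be available in other pandas versions.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 1, 3, 0, 9],
})

# Row filter as a string expression, next to the equivalent boolean mask.
by_query = df.query("a <= b")
by_mask = df[df["a"] <= df["b"]]

# Chained comparison inside the expression string.
chained = df.query("a <= b <= 5")
chained_mask = df[(df["a"] <= df["b"]) & (df["b"] <= 5)]

# Membership tests with `in` / `not in` against a list literal.
in_list = df.query("b in [1, 3]")
not_in_list = df.query("b not in [1, 3]")

# Column arithmetic as a string expression, next to plain column access.
by_eval = df.eval("a + b")
by_plain = df["a"] + df["b"]

print(by_query)
print(in_list)
print(by_eval)
```

Each string form is paired with an explicit mask or arithmetic expression so the two spellings can be compared row by row.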
