Commit fc9d021

Merge pull request #5866 from chappers/r-compare
DOC: Flesh out the R comparison section of docs (GH3980)
2 parents ef787b4 + 0533c70 commit fc9d021

1 file changed

+123
-17
lines changed

doc/source/comparison_with_r.rst

Lines changed: 123 additions & 17 deletions
@@ -30,6 +30,77 @@ R packages.
3030
Base R
3131
------
3232

33+
|aggregate|_
34+
~~~~~~~~~~~~
35+
36+
In R you may want to split data into subsets and compute the mean for each.
37+
Using a data.frame called ``df`` and splitting it into groups ``by1`` and
38+
``by2``:
39+
40+
.. code-block:: r
41+
42+
df <- data.frame(
43+
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
44+
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
45+
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
46+
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
47+
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
48+
49+
The :meth:`~pandas.DataFrame.groupby` method is similar to the base R ``aggregate``
50+
function.
51+
52+
.. ipython:: python
53+
54+
from pandas import DataFrame
55+
df = DataFrame({
56+
'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
57+
'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
58+
'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
59+
'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
60+
np.nan]
61+
})
62+
63+
g = df.groupby(['by1','by2'])
64+
g[['v1','v2']].mean()
65+
66+
For more details and examples see :ref:`the groupby documentation
67+
<groupby.split>`.
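
As a minimal sketch of the split-and-average idea above, the following uses a smaller, made-up frame (the values are illustrative assumptions, not the data from the example). It also shows that rows whose group key is missing are left out of the result by default:

```python
import numpy as np
import pandas as pd

# Hypothetical, reduced data: string-only keys keep the sketch easy to follow.
df = pd.DataFrame({
    "v1": [1, 3, 5, 7, 8, 3],
    "v2": [11, 33, 55, 77, 88, 33],
    "by1": ["red", "blue", "red", "blue", np.nan, "red"],
    "by2": ["wet", "dry", "wet", "dry", "wet", "dry"],
})

# Split on both keys, then average the two value columns per group.
# The row whose by1 key is NaN does not form a group.
means = df.groupby(["by1", "by2"])[["v1", "v2"]].mean()
print(means)
```

Here the ``(red, wet)`` group averages the first and third rows, and only three groups appear because the row with a missing ``by1`` is dropped.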
68+
69+
|tapply|_
70+
~~~~~~~~~
71+
72+
``tapply`` is similar to ``aggregate``, but data can be in a ragged array,
73+
since the subclass sizes are possibly irregular. Using a data.frame called
74+
``baseball``, and retrieving information based on the array ``team``:
75+
76+
.. code-block:: r
77+
78+
baseball <-
79+
data.frame(team = gl(5, 5,
80+
labels = paste("Team", LETTERS[1:5])),
81+
player = sample(letters, 25),
82+
batting.average = runif(25, .200, .400))
83+
84+
tapply(baseball$batting.average, baseball$team,
85+
max)
86+
87+
In ``pandas`` we may use the :meth:`~pandas.pivot_table` method to handle this:
88+
89+
.. ipython:: python
90+
91+
import random
92+
import string
93+
94+
baseball = DataFrame({
95+
'team': ["team %d" % (x+1) for x in range(5)]*5,
96+
'player': random.sample(list(string.ascii_lowercase),25),
97+
'batting avg': np.random.uniform(.200, .400, 25)
98+
})
99+
baseball.pivot_table(values='batting avg', cols='team', aggfunc=np.max)
100+
101+
For more details and examples see :ref:`the reshaping documentation
102+
<reshaping.pivot>`.
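
The per-team maximum that ``tapply`` computes can also be sketched with ``groupby``, shown here as an alternative to the ``pivot_table`` call above rather than as the documented approach. The seeded generator is an assumption added so the sketch is reproducible. Note also that the ``cols`` keyword reflects the ``pivot_table`` signature at the time of this change; later pandas releases use ``columns`` instead.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for reproducibility, not part of the original).
rng = np.random.default_rng(0)

baseball = pd.DataFrame({
    "team": ["team %d" % (x + 1) for x in range(5)] * 5,
    "batting avg": rng.uniform(0.200, 0.400, 25),
})

# One maximum per team, comparable to tapply(values, groups, max) in R.
best = baseball.groupby("team")["batting avg"].max()
print(best)
```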
103+
33104
|subset|_
34105
~~~~~~~~~~
35106

@@ -51,9 +122,6 @@ index/slice as well as standard boolean indexing:
51122

52123
.. ipython:: python
53124
54-
from pandas import DataFrame
55-
from numpy import random
56-
57125
df = DataFrame({'a': random.randn(10), 'b': random.randn(10)})
58126
df.query('a <= b')
59127
df[df.a <= df.b]
@@ -120,8 +188,6 @@ table below shows how these data structures could be mapped in Python.
120188
An expression using a data.frame called ``df`` in R where you want to
121189
summarize ``x`` by ``month``:
122190

123-
124-
125191
.. code-block:: r
126192
127193
require(plyr)
@@ -140,16 +206,14 @@ summarize ``x`` by ``month``:
140206
In ``pandas`` the equivalent expression, using the
141207
:meth:`~pandas.DataFrame.groupby` method, would be:
142208

143-
144-
145209
.. ipython:: python
146210
147211
df = DataFrame({
148-
'x': random.uniform(1., 168., 120),
149-
'y': random.uniform(7., 334., 120),
150-
'z': random.uniform(1.7, 20.7, 120),
212+
'x': np.random.uniform(1., 168., 120),
213+
'y': np.random.uniform(7., 334., 120),
214+
'z': np.random.uniform(1.7, 20.7, 120),
151215
'month': [5,6,7,8]*30,
152-
'week': random.randint(1,4, 120)
216+
'week': np.random.randint(1,4, 120)
153217
})
154218
155219
grouped = df.groupby(['month','week'])
@@ -235,8 +299,8 @@ For more details and examples see :ref:`the reshaping documentation
235299
|cast|_
236300
~~~~~~~
237301

238-
An expression using a data.frame called ``df`` in R to cast into a higher
239-
dimensional array:
302+
In R, ``acast`` takes a data.frame called ``df`` and casts it
303+
into a higher dimensional array:
240304

241305
.. code-block:: r
242306
@@ -256,18 +320,60 @@ In Python the best way is to make use of :meth:`~pandas.pivot_table`:
256320
.. ipython:: python
257321
258322
df = DataFrame({
259-
'x': random.uniform(1., 168., 12),
260-
'y': random.uniform(7., 334., 12),
261-
'z': random.uniform(1.7, 20.7, 12),
323+
'x': np.random.uniform(1., 168., 12),
324+
'y': np.random.uniform(7., 334., 12),
325+
'z': np.random.uniform(1.7, 20.7, 12),
262326
'month': [5,6,7]*4,
263327
'week': [1,2]*6
264328
})
265329
mdf = pd.melt(df, id_vars=['month', 'week'])
266330
pd.pivot_table(mdf, values='value', rows=['variable','week'],
267331
cols=['month'], aggfunc=np.mean)
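
For readers on later pandas releases, where ``rows`` and ``cols`` were renamed to ``index`` and ``columns``, the same melt-then-pivot step might look like the sketch below. The deterministic values are an assumption chosen so the result can be checked by hand; they replace the random draws used above.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in values (assumed for illustration only).
df = pd.DataFrame({
    "x": np.arange(12, dtype=float),
    "y": np.arange(12, dtype=float) * 10,
    "z": np.arange(12, dtype=float) * 100,
    "month": [5, 6, 7] * 4,
    "week": [1, 2] * 6,
})

# Long format: one row per (month, week, variable) observation.
mdf = pd.melt(df, id_vars=["month", "week"])

# Cast back to a wide table: (variable, week) on the rows, month on the columns.
table = pd.pivot_table(mdf, values="value", index=["variable", "week"],
                       columns=["month"], aggfunc="mean")
print(table)
```

Each ``(month, week)`` pair occurs twice in this frame, so every cell is the mean of two observations.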
268332
333+
Similarly, ``dcast`` can be applied to a data.frame called ``df`` in R to
334+
aggregate information based on ``Animal`` and ``FeedType``:
335+
336+
.. code-block:: r
337+
338+
df <- data.frame(
339+
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
340+
'Animal2', 'Animal3'),
341+
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
342+
Amount = c(10, 7, 4, 2, 5, 6, 2)
343+
)
344+
345+
dcast(df, Animal ~ FeedType, sum, fill=NaN)
346+
# Alternative method using base R
347+
with(df, tapply(Amount, list(Animal, FeedType), sum))
348+
349+
Python can approach this in two different ways. The first, as above, is by
350+
using :meth:`~pandas.pivot_table`:
351+
352+
.. ipython:: python
353+
354+
df = DataFrame({
355+
'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
356+
'Animal2', 'Animal3'],
357+
'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
358+
'Amount': [10, 7, 4, 2, 5, 6, 2],
359+
})
360+
361+
df.pivot_table(values='Amount', rows='Animal', cols='FeedType', aggfunc='sum')
362+
363+
The second approach is to use the :meth:`~pandas.DataFrame.groupby` method:
364+
365+
.. ipython:: python
366+
367+
df.groupby(['Animal','FeedType'])['Amount'].sum()
368+
269369
For more details and examples see :ref:`the reshaping documentation
270-
<reshaping.pivot>`.
370+
<reshaping.pivot>` or :ref:`the groupby documentation<groupby.split>`.
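
The two approaches differ mainly in the shape of the result. The sketch below, written with the ``index``/``columns`` keyword names used by later pandas releases (an assumption relative to the ``rows``/``cols`` spelling in this change), compares them on the same ``Animal``/``FeedType`` data:

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Animal1", "Animal2", "Animal3", "Animal2", "Animal1",
               "Animal2", "Animal3"],
    "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
    "Amount": [10, 7, 4, 2, 5, 6, 2],
})

# Wide result: one row per Animal, one column per FeedType.
# Combinations that never occur (Animal3 with feed B) show up as NaN.
wide = df.pivot_table(values="Amount", index="Animal", columns="FeedType",
                      aggfunc="sum")

# Long result: a Series indexed by (Animal, FeedType).
# Combinations that never occur are simply absent.
long = df.groupby(["Animal", "FeedType"])["Amount"].sum()

print(wide)
print(long)
```

The wide form resembles the output of ``dcast``, while the long form is often more convenient for further grouped computation.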
371+
372+
.. |aggregate| replace:: ``aggregate``
373+
.. _aggregate: http://finzi.psych.upenn.edu/R/library/stats/html/aggregate.html
374+
375+
.. |tapply| replace:: ``tapply``
376+
.. _tapply: http://finzi.psych.upenn.edu/R/library/base/html/tapply.html
271377

272378
.. |with| replace:: ``with``
273379
.. _with: http://finzi.psych.upenn.edu/R/library/base/html/with.html
