-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Implement Keyword Aggregation for DataFrame.agg and Series.agg #29116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 45 commits
7e461a1
1314059
8bcb313
7bc368d
cf5c6c3
5298331
4fb74b5
97209be
ca273ff
1d2ab15
3ca193c
c8f80ed
8c738e9
d4d9ea4
2a6de27
21e09f9
058a8e9
0da68d8
15e3659
438398d
d47b790
832b8d9
5a3b690
4fb86f0
a1369bf
ef981a3
82c8960
c610391
679ba59
2ee2628
31f7033
2acb244
dfbd67a
ff5e60f
7c6c891
3e55fcb
6d74b29
532337e
400ff3e
05af2de
6206fa4
34199ad
15d099c
c56f05f
d3f0620
20ecfda
89b8e6b
8aa1cc9
091ca75
c2d5104
50ebdaf
0484f5e
425c802
d5c2c6c
8bb9714
2607c5d
0545231
0a27889
a66053e
7311ef0
da2ff37
bcc5bc3
0825027
d3c35f5
cef2b50
b96a942
3123284
7bb3bd0
3da2e2a
3ce91fc
1426ee2
5893a0e
cc85db4
0f55073
90d52ba
381a697
238b4cc
f8e1891
66e9b38
f4d8a4f
c3e34a0
0c0dbad
61f6201
88c7751
e2b957a
99f75b2
30b7296
baea583
04bffe6
42091c3
fc13e19
1403426
3d9655e
d78c57c
0487928
7435ac5
f1cd16c
469691c
1bb35b5
7a6f496
3730f7d
cd8d00f
6dddd55
96dc3ed
075b85b
a44471c
0e2eae4
2fb4b27
5e04185
56d0f89
7f4839e
65d578b
3e6a06c
8651447
a7439fe
449d40f
9fd8ec5
736bea2
d20be20
35b2b17
74d6169
0546224
f5f0e68
54ff962
ac57023
484e42c
47e6598
f28b452
81b4186
1fd4b5b
47dc5fe
9190f7f
89de59e
8493383
9a9dd7f
c75c882
fa61db7
00a1ccf
26b380a
165ea83
d6923f2
f6a5cc1
a747ab6
44405e8
faea906
7e30a61
3d20524
05921af
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6588,16 +6588,51 @@ def _gotitem( | |
**_shared_doc_kwargs | ||
) | ||
@Appender(_shared_docs["aggregate"]) | ||
def aggregate(self, func, axis=0, *args, **kwargs): | ||
def aggregate(self, func=None, axis=0, *args, **kwargs): | ||
from pandas.core.groupby.generic import ( | ||
_is_multi_agg_with_relabel, | ||
_maybe_mangle_lambdas, | ||
_normalize_keyword_aggregation, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. importing private functions is sub-optimal. it looks like all of the stuff that actually uses these functions doesn't rely on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, i made it in the and because of this, I think in the follow up PR, I could do a cleaning on the other one in |
||
) | ||
|
||
axis = self._get_axis_number(axis) | ||
|
||
relabeling = func is None and _is_multi_agg_with_relabel(**kwargs) | ||
if relabeling: | ||
func, indexes, order = _normalize_keyword_aggregation(kwargs) | ||
reordered_indexes = [ | ||
pair[0] for pair in sorted(zip(indexes, order), key=lambda t: t[1]) | ||
] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. most of this is already implemented in the SelectionMixin There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. emm, i think this implementation is sort of post-processing after aggregation from SelectionMixin? no? |
||
kwargs = {} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do you need to do this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right, removed this, not needed |
||
elif func is None: | ||
# nicer error message | ||
raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).") | ||
|
||
func = _maybe_mangle_lambdas(func) | ||
|
||
result = None | ||
try: | ||
result, how = self._aggregate(func, axis=axis, *args, **kwargs) | ||
except TypeError: | ||
pass | ||
if result is None: | ||
return self.apply(func, axis=axis, args=args, **kwargs) | ||
|
||
if relabeling: | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# when there are more than one column being used in aggregate, the order | ||
# of result will be reversed, and in case the func is not used by other | ||
# columns, there might be NaN values, so separate these two cases | ||
reordered_result = DataFrame(index=indexes) | ||
idx = 0 | ||
for col, funcs in func.items(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is extremely inefficient. instead try appending to a list and concat at the end. why do you need .values? that is not very idiomatic here, nor is the dropna There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can also just reorder the indicies and do an indexing selection would be way better There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, I reduced half of the code in this loop, i am pretty sure the previous version was due to some order issue. for instance in the example below, i think a correct behaviour should be: df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [1, 2, 3, 4], "C": [3,4,5,6]})
df.agg(ab=("C", np.max), foo=("A", max), bar=("B", "mean"), cat=("A", "min"),
dat=("B", "max"), f=("A", "max"), g=("C", "min")) the column of new aggregated df is : Any advice to simply the code would be much appreciated! |
||
v = reordered_indexes[idx : idx + len(funcs)] | ||
if len(func) > 1: | ||
reordered_result.loc[v, col] = result[col][::-1].dropna().values | ||
else: | ||
reordered_result.loc[v, col] = result[col].values | ||
idx = idx + len(funcs) | ||
result = reordered_result | ||
return result | ||
|
||
def _aggregate(self, arg, axis=0, *args, **kwargs): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3800,9 +3800,17 @@ def _gotitem(self, key, ndim, subset=None): | |
**_shared_doc_kwargs | ||
) | ||
@Appender(generic._shared_docs["aggregate"]) | ||
def aggregate(self, func, axis=0, *args, **kwargs): | ||
def aggregate(self, func=None, axis=0, *args, **kwargs): | ||
# Validate the axis parameter | ||
self._get_axis_number(axis) | ||
|
||
if func is None: | ||
# This is due to order issue of dictionary in PY35, e.g. if {"foo" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. py35 has been dropped; can this be simplified? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, |
||
# : "sum", "bar": "min"}, then it will take "bar" first because it | ||
# b is before f | ||
func = OrderedDict(kwargs.items()) | ||
kwargs = {} | ||
|
||
result, how = self._aggregate(func, *args, **kwargs) | ||
if result is None: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1367,3 +1367,96 @@ def test_consistency_of_aggregates_of_columns_with_missing_values(self, df, meth | |
tm.assert_series_equal( | ||
none_in_first_column_result, none_in_second_column_result | ||
) | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you create a new test file for these, maybe test_apply_relabling.py (alt we can make a pandas/tests/frame/apply/..... subdir) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok with moving existing functions later, but pls create a new file / subdir now. |
||
class TestDataFrameNamedAggregate: | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# GH 26513 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. usually these go inside the test func There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, sorry, i put this outside was because i wanted it for all functions in the entire class, since they are all coming from the same issue. I changed and put them under each function. |
||
def test_agg_relabel(self): | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [1, 2, 3, 4], "C": [3, 4, 5, 6]}) | ||
|
||
# simplest case with one column, one func | ||
result = df.agg(foo=("B", "sum")) | ||
expected = pd.DataFrame({"B": [10]}, index=pd.Index(["foo"])) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# test on same column with different methods | ||
result = df.agg(foo=("B", "sum"), bar=("B", "min")) | ||
expected = pd.DataFrame({"B": [10, 1]}, index=pd.Index(["foo", "bar"])) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
# test on multiple columns with multiple methods | ||
result = df.agg( | ||
foo=("A", "sum"), | ||
bar=("B", "mean"), | ||
cat=("A", "min"), | ||
dat=("B", "max"), | ||
f=("A", "max"), | ||
g=("C", "min"), | ||
) | ||
expected = pd.DataFrame( | ||
{ | ||
"A": [6.0, np.nan, 1.0, np.nan, 2.0, np.nan], | ||
"B": [np.nan, 2.5, np.nan, 4.0, np.nan, np.nan], | ||
"C": [np.nan, np.nan, np.nan, np.nan, np.nan, 3.0], | ||
}, | ||
index=pd.Index(["foo", "bar", "cat", "dat", "f", "g"]), | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# test on partial, functools or more complex cases | ||
result = df.agg(foo=("A", np.mean), bar=("A", "mean"), cat=("A", min)) | ||
expected = pd.DataFrame( | ||
{"A": [1.5, 1.5, 1.0]}, index=pd.Index(["foo", "bar", "cat"]) | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
result = df.agg( | ||
foo=("A", min), | ||
bar=("A", np.min), | ||
cat=("B", max), | ||
dat=("C", "min"), | ||
f=("B", np.sum), | ||
) | ||
expected = pd.DataFrame( | ||
{ | ||
"A": [1.0, 1.0, np.nan, np.nan, np.nan], | ||
"B": [np.nan, np.nan, 10.0, np.nan, 4.0], | ||
"C": [np.nan, np.nan, np.nan, 3.0, np.nan], | ||
}, | ||
index=pd.Index(["foo", "bar", "cat", "dat", "f"]), | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_agg_namedtuple(self): | ||
df = pd.DataFrame({"A": [0, 1], "B": [1, 2]}) | ||
result = df.agg( | ||
foo=pd.NamedAgg("B", "sum"), | ||
bar=pd.NamedAgg("B", min), | ||
cat=pd.NamedAgg(column="B", aggfunc="count"), | ||
fft=pd.NamedAgg("B", aggfunc="max"), | ||
) | ||
|
||
expected = pd.DataFrame( | ||
{"B": [3, 1, 2, 2]}, index=pd.Index(["foo", "bar", "cat", "fft"]) | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
result = df.agg( | ||
foo=pd.NamedAgg("A", "min"), | ||
bar=pd.NamedAgg(column="B", aggfunc="max"), | ||
cat=pd.NamedAgg(column="A", aggfunc="max"), | ||
) | ||
expected = pd.DataFrame( | ||
{"A": [0.0, np.nan, 1.0], "B": [np.nan, 2.0, np.nan]}, | ||
index=pd.Index(["foo", "bar", "cat"]), | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_agg_raises(self): | ||
df = pd.DataFrame({"A": [0, 1], "B": [1, 2]}) | ||
msg = "Must provide" | ||
|
||
with pytest.raises(TypeError, match=msg): | ||
df.agg() |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -750,3 +750,34 @@ def test_apply_scaler_on_date_time_index_aware_series(self): | |
series = tm.makeTimeSeries(nper=30).tz_localize("UTC") | ||
result = pd.Series(series.index).apply(lambda x: 1) | ||
tm.assert_series_equal(result, pd.Series(np.ones(30), dtype="int64")) | ||
|
||
|
||
class TestNamedAggregation: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same can you move to a new file There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. will do in followup PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you do here |
||
def test_relabel_no_duplicated_method(self): | ||
# this is to test there is no duplicated method used in agg | ||
df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [1, 2, 3, 4]}) | ||
|
||
result = df["A"].agg(foo="sum") | ||
expected = df["A"].agg({"foo": "sum"}) | ||
tm.assert_series_equal(result, expected) | ||
|
||
result = df.B.agg(foo="min", bar="max") | ||
expected = df.B.agg({"foo": "min", "bar": "max"}) | ||
tm.assert_series_equal(result, expected) | ||
|
||
result = df.B.agg(foo=sum, bar=min, cat="max") | ||
expected = df.B.agg({"foo": sum, "bar": min, "cat": "max"}) | ||
tm.assert_series_equal(result, expected) | ||
|
||
def test_relabel_duplicated_method(self): | ||
# this is to test with nested renaming, duplicated method can be used | ||
# if they are assigned with different new names | ||
df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [1, 2, 3, 4]}) | ||
|
||
result = df.A.agg(foo="sum", bar="sum") | ||
expected = pd.Series([6, 6], index=["foo", "bar"], name="A") | ||
tm.assert_series_equal(result, expected) | ||
|
||
result = df.B.agg(foo=min, bar="min") | ||
expected = pd.Series([1, 1], index=["foo", "bar"], name="B") | ||
tm.assert_series_equal(result, expected) |
Uh oh!
There was an error while loading. Please reload this page.