-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
CLN: Clean groupby/test_function.py #32027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsaxton lgtm pending green. one comment, not blocker but I think may be beneficial to reduce the cognitive load.
@pytest.mark.parametrize("dtype", [np.int32, np.int64, np.float32, np.float64]) | ||
def test_cummin_cummax(dtype): | ||
min_val = ( | ||
np.iinfo(dtype).min if np.dtype(dtype).kind == "i" else np.finfo(dtype).min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for readability, could just include num_mins and num_max in parameterisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this could be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would now be more difficult if a fixture is used for dtype. (as requested) ( could create a composable min and max fixtures but probably OTT for one test)
@@ -685,59 +685,53 @@ def test_numpy_compat(func): | |||
getattr(g, func)(foo=1) | |||
|
|||
|
|||
def test_cummin_cummax(): | |||
@pytest.mark.parametrize("dtype", [np.int32, np.int64, np.float32, np.float64]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have fixture for these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which fixture could I use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could try any_real_dtype and see if the tests pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it breaks for the unsigned real dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, not sure we want to create another fixture, add a skip in the test for the unsigned or keep the dtype as direct parameterisation and include the min and max in the parameterisation. @jreback?
tm.assert_frame_equal(result, expected) | ||
result = df.groupby("A").B.apply(lambda x: x.cummin()).to_frame() | ||
tm.assert_frame_equal(result, expected) | ||
# cummin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test might be simpler if you break it into multiple tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the test is doing quite a lot. Would it be better to break up in this or a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just as easy to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broke into two tests, I'm not sure if this looks cleaner though. With a fixture on exact configuration of dtypes it would be nice to split into multiple but without that it gets pretty messy.
|
||
result = f(numeric_only=False) | ||
tm.assert_index_equal(result.columns, expected_columns) | ||
f = getattr(df.groupby("group"), "sum") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call this directly
df.groupby('group').sum()
result = f() | ||
tm.assert_index_equal(result.columns, expected_columns_numeric) | ||
|
||
result = f(numeric_only=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I think it's a bit easier to read / understand if we just call these functions directly instead of assigning to f then calling, so changed that as well
@pytest.mark.parametrize( | ||
"dtype, min_val, max_val", | ||
[ | ||
(np.int32, np.iinfo(np.int32).min, np.iinfo(np.int32).max), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_val, max_val can all be computed inside the function
@@ -751,6 +729,65 @@ def test_cummin_cummax(): | |||
expected = base_df.groupby("A").B.apply(lambda x: x.cummin()).to_frame() | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
# Test nan in entire column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you split to another test rather than make this test very long
tm.assert_series_equal(result, expected) | ||
|
||
|
||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simply make this a fixture at the top of the file rather than repeating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment, otherwsie lgtm. ping on green.
params=[np.int32, np.int64, np.float32, np.float64], | ||
ids=["np.int32", "np.int64", "np.float32", "np.float64"], | ||
) | ||
def numpy_dtypes(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you name this something else, maybe numpy_dtypes_for_minmax; also you can include in the return the iinfo or finfo result (e.g. you would return a tuple); i find this a bit more obvious than doing it in the test function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated based on comments and green
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of returning tuples and unpacking in tests in scenarios where composable fixtures could be used instead and could have more utility. However, naming the fixture numpy_dtypes_for_minmax
somewhat limits the reuse of composible fixtures.
diff --git a/pandas/tests/groupby/test_function.py b/pandas/tests/groupby/test_function.py
index 6205dfb87..348a6c189 100644
--- a/pandas/tests/groupby/test_function.py
+++ b/pandas/tests/groupby/test_function.py
@@ -35,15 +35,31 @@ def numpy_dtypes_for_minmax(request):
Fixture of numpy dtypes with min and max values used for testing
cummin and cummax
"""
- dtype = request.param
- min_val = (
- np.iinfo(dtype).min if np.dtype(dtype).kind == "i" else np.finfo(dtype).min
- )
- max_val = (
- np.iinfo(dtype).max if np.dtype(dtype).kind == "i" else np.finfo(dtype).max
+ return request.param
+
+
+@pytest.fixture()
+def dtype_min(numpy_dtypes_for_minmax):
+ """
+ Fixture to return minimum value of dtype.
+ """
+ return (
+ np.iinfo(numpy_dtypes_for_minmax).min
+ if np.dtype(numpy_dtypes_for_minmax).kind == "i"
+ else np.finfo(numpy_dtypes_for_minmax).min
)
- return (dtype, min_val, max_val)
+
+@pytest.fixture()
+def dtype_max(numpy_dtypes_for_minmax):
+ """
+ Fixture to return maximum value of dtype.
+ """
+ return (
+ np.iinfo(numpy_dtypes_for_minmax).max
+ if np.dtype(numpy_dtypes_for_minmax).kind == "i"
+ else np.finfo(numpy_dtypes_for_minmax).max
+ )
@pytest.mark.parametrize("agg_func", ["any", "all"])
@@ -704,9 +720,8 @@ def test_numpy_compat(func):
reason="https://github.com/pandas-dev/pandas/issues/31992",
strict=False,
)
-def test_cummin(numpy_dtypes_for_minmax):
- dtype = numpy_dtypes_for_minmax[0]
- min_val = numpy_dtypes_for_minmax[1]
+def test_cummin(numpy_dtypes_for_minmax, dtype_min):
+ dtype = numpy_dtypes_for_minmax
# GH 15048
base_df = pd.DataFrame(
@@ -723,8 +738,8 @@ def test_cummin(numpy_dtypes_for_minmax):
tm.assert_frame_equal(result, expected)
# Test w/ min value for dtype
- df.loc[[2, 6], "B"] = min_val
- expected.loc[[2, 3, 6, 7], "B"] = min_val
+ df.loc[[2, 6], "B"] = dtype_min
+ expected.loc[[2, 3, 6, 7], "B"] = dtype_min
result = df.groupby("A").cummin()
tm.assert_frame_equal(result, expected)
expected = df.groupby("A").B.apply(lambda x: x.cummin()).to_frame()
@@ -772,9 +787,8 @@ def test_cummin_all_nan_column():
reason="https://github.com/pandas-dev/pandas/issues/31992",
strict=False,
)
-def test_cummax(numpy_dtypes_for_minmax):
- dtype = numpy_dtypes_for_minmax[0]
- max_val = numpy_dtypes_for_minmax[2]
+def test_cummax(numpy_dtypes_for_minmax, dtype_max):
+ dtype = numpy_dtypes_for_minmax
# GH 15048
base_df = pd.DataFrame(
@@ -791,8 +805,8 @@ def test_cummax(numpy_dtypes_for_minmax):
tm.assert_frame_equal(result, expected)
# Test w/ max value for dtype
- df.loc[[2, 6], "B"] = max_val
- expected.loc[[2, 3, 6, 7], "B"] = max_val
+ df.loc[[2, 6], "B"] = dtype_max
+ expected.loc[[2, 3, 6, 7], "B"] = dtype_max
result = df.groupby("A").cummax()
tm.assert_frame_equal(result, expected)
expected = df.groupby("A").B.apply(lambda x: x.cummax()).to_frame()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take the min and max out of the fixture I'd just say compute inside the test, since the min and max are each only used in one test respectively, so there isn't much value in having them in a fixture anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear. I'm not a fan but no need to change. see #32027 (comment)
Thanks @dsaxton |
Some small cleanups (removing unnecessary for loops / adding parameterization)