-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix segfault in GroupBy.count and DataFrame.count #32842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 12 commits
13be18e
36513c3
d6df0f0
049b897
1e755b3
de75248
3bea5af
432cd68
7e379b1
a660d0f
52a6f99
6c75c89
375a4d1
60289c2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7890,34 +7890,30 @@ def _count_level(self, level, axis=0, numeric_only=False): | |
f"Can only count levels on hierarchical {self._get_axis_name(axis)}." | ||
) | ||
|
||
if frame._is_mixed_type: | ||
# Since we have mixed types, calling notna(frame.values) might | ||
# upcast everything to object | ||
mask = notna(frame).values | ||
else: | ||
# But use the speedup when we have homogeneous dtypes | ||
mask = notna(frame.values) | ||
# Mask NaNs: Mask rows or columns where the index level is NaN, and all | ||
# values in the DataFrame that are NaN | ||
values_mask = notna(frame.values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is the existing version with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel I understood @jreback 's earlier comment such that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. those two are in fact the same, but performance should be better for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Running some tests shows that In [1]: import pandas as pd
In [2]: from pandas.core.dtypes.missing import notna
In [3]: df = pd.DataFrame({'a': range(10), 'b': ['foo'] * 10})
In [4]: %timeit notna(df.values)
113 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit notna(df).values
546 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: df = pd.DataFrame({'a': range(1000000), 'b': ['foo'] * 1000000})
In [7]: %timeit notna(df.values)
163 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit notna(df).values
40.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit notna(df.values)
158 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit notna(df).values
39.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
||
|
||
index_mask = notna(count_axis.get_level_values(level=level)) | ||
if axis == 1: | ||
# We're transposing the mask rather than frame to avoid potential | ||
# upcasts to object, which induces a ~20x slowdown | ||
mask = mask.T | ||
mask = index_mask & values_mask | ||
else: | ||
mask = index_mask.reshape(-1, 1) & values_mask | ||
|
||
if isinstance(level, str): | ||
level = count_axis._get_level_number(level) | ||
|
||
level_name = count_axis._names[level] | ||
level_index = count_axis.levels[level]._shallow_copy(name=level_name) | ||
level_codes = ensure_int64(count_axis.codes[level]) | ||
counts = lib.count_level_2d(mask, level_codes, len(level_index), axis=0) | ||
|
||
result = DataFrame(counts, index=level_index, columns=agg_axis) | ||
counts = lib.count_level_2d(mask, level_codes, len(level_index), axis=axis) | ||
|
||
if axis == 1: | ||
# Undo our earlier transpose | ||
return result.T | ||
result = DataFrame(counts, index=agg_axis, columns=level_index) | ||
else: | ||
return result | ||
result = DataFrame(counts, index=level_index, columns=agg_axis) | ||
|
||
return result | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I amended the masking logic in this function to mask all rows/columns that |
||
|
||
def _reduce( | ||
self, op, name, axis=0, skipna=True, numeric_only=None, filter_type=None, **kwds | ||
|
Uh oh!
There was an error while loading. Please reload this page.