-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: fix dtype of all-NaN MultiIndex level #17934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -288,6 +288,10 @@ def __init__(self, values, categories=None, ordered=None, dtype=None, | |
self._dtype = dtype | ||
return | ||
|
||
# null_mask indicates missing values we want to exclude from inference. | ||
# This means: only missing values in list-likes (not arrays/ndframes). | ||
null_mask = np.array(False) | ||
|
||
# sanitize input | ||
if is_categorical_dtype(values): | ||
|
||
|
@@ -316,13 +320,14 @@ def __init__(self, values, categories=None, ordered=None, dtype=None, | |
if not isinstance(values, np.ndarray): | ||
values = _convert_to_list_like(values) | ||
from pandas.core.series import _sanitize_array | ||
# On list with NaNs, int values will be converted to float. Use | ||
# "object" dtype to prevent this. In the end objects will be | ||
# casted to int/... in the category assignment step. | ||
if len(values) == 0 or isna(values).any(): | ||
# By convention, empty lists result in object dtype: | ||
if len(values) == 0: | ||
sanitize_dtype = 'object' | ||
else: | ||
sanitize_dtype = None | ||
null_mask = isna(values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you can move the
to line 291 (where you set it to: as we always need to check this (whether its an array or list-like anyhow) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Well, no: conceptually, because if the user passes an I can:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is: unless you have in mind future refactorings for which the location of missing values in an array matters. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you make the change as discussed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IOW compute null_mask at the top There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
As I explained above, computing |
||
if null_mask.any(): | ||
values = [values[idx] for idx in np.where(~null_mask)[0]] | ||
values = _sanitize_array(values, None, dtype=sanitize_dtype) | ||
|
||
if dtype.categories is None: | ||
|
@@ -370,6 +375,12 @@ def __init__(self, values, categories=None, ordered=None, dtype=None, | |
"mean to use\n'Categorical.from_codes(codes, " | ||
"categories)'?", RuntimeWarning, stacklevel=2) | ||
|
||
if null_mask.any(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add a comment here |
||
# Reinsert -1 placeholders for previously removed missing values | ||
full_codes = - np.ones(null_mask.shape, dtype=codes.dtype) | ||
full_codes[~null_mask] = codes | ||
codes = full_codes | ||
|
||
self._dtype = dtype | ||
self._codes = coerce_indexer_dtype(codes, dtype.categories) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adding a lot of complexity (this whole) PR, pls see if you can simplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeff, if I could simplify I would have done it already.
But notice this PR is actually simplifying the code path, potentially avoiding casting data from
list
toobject
ndarray toint64
ndarray (skipping the middle step). And the way missing values are treated seems to me much cleaner and clearer than before: all the type inference is just done after removing them, so there is less dtype guesswork.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will have another look
can u run the category asv and report any changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BENCHMARKS NOT SIGNIFICANTLY CHANGED
- should I copypaste the entire output?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, just want to make sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to add a benchmark for categorical creation (or a long list) with 2 cases (all nulls) and some non-null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: as expected, this only makes a difference if there are many NaNs. My guess is that it's taking half the time because it's checking the "nan-ity" of each value once instead than twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, that's fine, just wanted to be sure