-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Ensure dataframe preserves categorical indices with categorial series #57635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Ensure dataframe preserves categorical indices with categorial series #57635
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix is too specific. This should be fixed where the Index[object]
is being created
Hi @mroeschke, Unit Tests failures seem unrelated (when running them locally everything is ok). |
pandas/core/indexes/base.py
Outdated
if isinstance(self, CategoricalIndex) and isinstance(other, CategoricalIndex): | ||
both_categories = self.categories | ||
# if ordered and unordered, we set categories to be unordered | ||
ordered = False if self.ordered != other.ordered else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do tests fail if you just do:
both_categories = union_categorical([self.categories, other.categories])
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke, switching to using that piece of code, you get:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke, any suggestion on how you think we should handle the PR at this point? Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I guess it should have been
both_categories = union_categorical([self, other])
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)
The main point here is that ideally there shouldn't be an re-invention of categorical union logic in union
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke, thanks for your guidance. Taking into account your suggestion, and my previous comments, we'd need to change union
and _union
. Does it sound good to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke, I've just tried your suggestion:
if isinstance(self.dtype, CategoricalDtype) and isinstance(
other.dtype, CategoricalDtype
):
both_categories = union_categoricals([self, other])
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)
If I run it with this change with the following example:
s1 = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"], ordered=False))
s2 = pd.Series([3, 4], index=pd.CategoricalIndex(["b", "c"], ordered=False))
pd.DataFrame([s1, s2]).columns
union_categoricals
returns:
['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
which is the expected behaviour of that function. However, it triggers the following error when setting the categories in self.set_categories(both_categories)
:
ValueError: Categorical categories must be unique
We only need to change the code to:
if isinstance(self.dtype, CategoricalDtype) and isinstance(
other.dtype, CategoricalDtype
):
both_categories = union_categoricals([self, other]).categories
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)
to make it work for the example above, however, some tests won't pass, e.g.:
FAILED
pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_same_columns_type[CategoricalIndex1] - TypeError: Cannot use sort_categories=True with ordered Categoricals
FAILED
pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_different_columns_types[CategoricalIndex-CategoricalIndex] - TypeError: Categorical.ordered must be the same
leading to the need of handling the order as I did in the last commits.
Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I've just pushed another commit with, what I think, it is a better way on handling it. It does follow the original logic when both CategoricalIndex
differ. Please let me know what you think.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
Dear @mroeschke, could you please give me some feedback? I'd like to understand what's missing in my last changes. |
Ensure that when constructing a DataFrame with a list of Series with different CategoricalIndexes, the resulting columns are categorical.