Skip to content

BUG: Ensure dataframe preserves categorical indices with categorial series #57635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jmarintur
Copy link
Contributor

Ensure that when constructing a DataFrame with a list of Series with different CategoricalIndexes, the resulting columns are categorical.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is too specific. This should be fixed where the Index[object] is being created

@jmarintur
Copy link
Contributor Author

Hi @mroeschke, Unit Tests failures seem unrelated (when running them locally everything is ok).

if isinstance(self, CategoricalIndex) and isinstance(other, CategoricalIndex):
both_categories = self.categories
# if ordered and unordered, we set categories to be unordered
ordered = False if self.ordered != other.ordered else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do tests fail if you just do:

both_categories = union_categorical([self.categories, other.categories])
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, switching to using that piece of code, you get:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, any suggestion on how you think we should handle the PR at this point? Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I guess it should have been

both_categories = union_categorical([self, other])
self = self.set_categories(both_categories)
other = other.set_categories(both_categories)

The main point here is that ideally there shouldn't be an re-invention of categorical union logic in union

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, thanks for your guidance. Taking into account your suggestion, and my previous comments, we'd need to change union and _union. Does it sound good to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke, I've just tried your suggestion:

if isinstance(self.dtype, CategoricalDtype) and isinstance(
    other.dtype, CategoricalDtype
):
    both_categories = union_categoricals([self, other])
    self = self.set_categories(both_categories)
    other = other.set_categories(both_categories)

If I run it with this change with the following example:

s1 = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"], ordered=False))
s2 = pd.Series([3, 4], index=pd.CategoricalIndex(["b", "c"], ordered=False))
pd.DataFrame([s1, s2]).columns

union_categoricals returns:

['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

which is the expected behaviour of that function. However, it triggers the following error when setting the categories in self.set_categories(both_categories):

ValueError: Categorical categories must be unique

We only need to change the code to:

if isinstance(self.dtype, CategoricalDtype) and isinstance(
    other.dtype, CategoricalDtype
):
    both_categories = union_categoricals([self, other]).categories
    self = self.set_categories(both_categories)
    other = other.set_categories(both_categories)

to make it work for the example above, however, some tests won't pass, e.g.:

FAILED
pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_same_columns_type[CategoricalIndex1] - TypeError: Cannot use sort_categories=True with ordered Categoricals
FAILED 
pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_different_columns_types[CategoricalIndex-CategoricalIndex] - TypeError: Categorical.ordered must be the same

leading to the need of handling the order as I did in the last commits.

Any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I've just pushed another commit with, what I think, it is a better way on handling it. It does follow the original logic when both CategoricalIndex differ. Please let me know what you think.

@jmarintur jmarintur requested a review from mroeschke April 4, 2024 09:30
@mroeschke mroeschke added Categorical Categorical Data Type DataFrame DataFrame data structure Constructors Series/DataFrame/Index/pd.array Constructors labels Apr 23, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this May 31, 2024
@jmarintur
Copy link
Contributor Author

Dear @mroeschke, could you please give me some feedback? I'd like to understand what's missing in my last changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame([Series with different CategoricalIndexes]) results in non-Categorical columns
2 participants