BUG: Ensure dataframe preserves categorical indices with categorial series #57635

jmarintur · 2024-02-26T22:23:09Z

Ensure that when constructing a DataFrame with a list of Series with different CategoricalIndexes, the resulting columns are categorical.
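
A minimal sketch of the scenario described above, for readers who want to check the behaviour on their own pandas version. The sample values are illustrative only, and the script reports the resulting column dtype rather than assuming one, since the result depends on the installed version.

```python
import pandas as pd

# Two Series whose CategoricalIndexes overlap but are not identical.
s1 = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"]))
s2 = pd.Series([3, 4], index=pd.CategoricalIndex(["b", "c"]))

# Each Series becomes a row; the columns come from combining the two indexes.
df = pd.DataFrame([s1, s2])

# Report what the constructor produced instead of assuming a dtype.
columns_are_categorical = isinstance(df.columns.dtype, pd.CategoricalDtype)
print(df.columns)
print("categorical columns:", columns_are_categorical)
```

If the last line prints False, the columns were built as a non-categorical Index, which is the behaviour this PR set out to change.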

mroeschke

This fix is too specific. It should be fixed where the Index[object] is being created.

pandas/core/frame.py

jmarintur · 2024-03-04T07:24:31Z

Hi @mroeschke, the unit test failures seem unrelated; everything passes when I run the tests locally.

mroeschke · 2024-03-07T20:26:16Z

pandas/core/indexes/base.py

+        if isinstance(self, CategoricalIndex) and isinstance(other, CategoricalIndex):
+            both_categories = self.categories
+            # if ordered and unordered, we set categories to be unordered
+            ordered = False if self.ordered != other.ordered else None


Do tests fail if you just do:

    both_categories = union_categoricals([self.categories, other.categories])
    self = self.set_categories(both_categories)
    other = other.set_categories(both_categories)

Hi @mroeschke, switching to using that piece of code, you get:

Hi @mroeschke, any suggestion on how you think we should handle the PR at this point? Thank you!

Sorry, I guess it should have been:

    both_categories = union_categoricals([self, other])
    self = self.set_categories(both_categories)
    other = other.set_categories(both_categories)

The main point here is that, ideally, union should not re-invent the categorical union logic.

Hi @mroeschke, thanks for your guidance. Taking into account your suggestion and my previous comments, we'd need to change both union and _union. Does that sound good to you?

Hi @mroeschke, I've just tried your suggestion:

    if isinstance(self.dtype, CategoricalDtype) and isinstance(
        other.dtype, CategoricalDtype
    ):
        both_categories = union_categoricals([self, other])
        self = self.set_categories(both_categories)
        other = other.set_categories(both_categories)

If I run the following example with this change:

    s1 = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"], ordered=False))
    s2 = pd.Series([3, 4], index=pd.CategoricalIndex(["b", "c"], ordered=False))
    pd.DataFrame([s1, s2]).columns

union_categoricals returns:

    ['a', 'b', 'b', 'c']
    Categories (3, object): ['a', 'b', 'c']

which is the expected behaviour of that function. However, it triggers the following error when setting the categories in self.set_categories(both_categories):

ValueError: Categorical categories must be unique
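
The failure can be reproduced outside the union code path. The sketch below assumes two small unordered CategoricalIndexes like the ones in the example; the variable names are illustrative. It shows that the Categorical returned by union_categoricals has unique categories but repeated values, which is why passing the whole object as the new categories is rejected.

```python
import pandas as pd
from pandas.api.types import union_categoricals

left = pd.CategoricalIndex(["a", "b"])
right = pd.CategoricalIndex(["b", "c"])

# union_categoricals concatenates the values and unions the categories.
combined = union_categoricals([left, right])

values_unique = pd.Index(combined).is_unique       # the values repeat "b"
categories_unique = combined.categories.is_unique  # the categories do not

# Passing the whole Categorical as the new categories is rejected
# because its values contain a duplicate.
error_message = None
try:
    left.set_categories(combined)
except ValueError as exc:
    error_message = str(exc)

print(values_unique, categories_unique)
print(error_message)
```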

Changing the code to pass only the categories makes the example above work:

    if isinstance(self.dtype, CategoricalDtype) and isinstance(
        other.dtype, CategoricalDtype
    ):
        both_categories = union_categoricals([self, other]).categories
        self = self.set_categories(both_categories)
        other = other.set_categories(both_categories)

However, some tests then fail, e.g.:

    FAILED pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_same_columns_type[CategoricalIndex1] - TypeError: Cannot use sort_categories=True with ordered Categoricals
    FAILED pandas/tests/reshape/concat/test_append.py::TestAppend::test_append_different_columns_types[CategoricalIndex-CategoricalIndex] - TypeError: Categorical.ordered must be the same

This is why the ordered flag needs to be handled explicitly, as I did in the last commits.

Any thoughts?
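
The second kind of failure reported above (mixing ordered and unordered inputs) can be seen in isolation with the sketch below. The use of ignore_order=True is shown only as an assumption about one possible way to relax the check; it is an existing parameter of union_categoricals, but it is not necessarily what this PR did or what the reviewers would accept.

```python
import pandas as pd
from pandas.api.types import union_categoricals

ordered_idx = pd.CategoricalIndex(["a", "b"], ordered=True)
unordered_idx = pd.CategoricalIndex(["b", "c"], ordered=False)

# Mixing ordered and unordered inputs is rejected by default.
mixed_error = None
try:
    union_categoricals([ordered_idx, unordered_idx])
except TypeError as exc:
    mixed_error = str(exc)

# ignore_order=True drops the ordering and returns an unordered result.
relaxed = union_categoricals([ordered_idx, unordered_idx], ignore_order=True)

print(mixed_error)
print(list(relaxed.categories), relaxed.ordered)
```

Note that this relaxed union always yields an unordered result, so ordering information from either input is lost.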

Hi, I've just pushed another commit with what I think is a better way of handling it. It follows the original logic when the two CategoricalIndexes differ. Please let me know what you think.

mroeschke · 2024-05-31T18:53:20Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

jmarintur · 2024-06-01T11:49:49Z

Dear @mroeschke, could you please give me some feedback? I'd like to understand what's missing in my last changes.

Ensure dataframe preserves categorical index in constructor with categorical series

869df9e

mroeschke requested changes Feb 26, 2024

mroeschke reviewed Feb 26, 2024

pandas/core/frame.py (outdated)

jmarin added 5 commits March 2, 2024 01:10

Modify union to properly handle categoricalIndex

2e4fb2f

Modify union to properly handle categoricalIndex

4ebc935

merge branch

d1082ea

Handling properly all cases and adapt tests accordingly

8afb172

Type: ignore[attr-define] when self and other are CategoricalIndex

6a0b1af

mroeschke reviewed Mar 5, 2024

pandas/core/indexes/base.py (outdated)

mroeschke reviewed Mar 5, 2024

pandas/core/indexes/base.py (outdated)

mroeschke reviewed Mar 5, 2024

pandas/core/indexes/base.py (outdated)

jmarin and others added 2 commits March 6, 2024 22:02

Use union_categoricals instead of union1d from numpy

f5e4148

Merge branch 'main' into ensure-df-series-categorical

5c43a80

mroeschke reviewed Mar 7, 2024

pandas/core/indexes/base.py (outdated)

mroeschke reviewed Mar 7, 2024

jmarin added 3 commits March 7, 2024 23:30

Change from CategoricalIndex to CategoricalDtype check

cb3e6b6

Merge branch 'ensure-df-series-categorical' of github.com:jmarintur/pandas into ensure-df-series-categorical

e0df58f

Improve code to handle it in the original conditional

b332805

jmarintur requested a review from mroeschke April 4, 2024 09:30

mroeschke added the Categorical (Categorical Data Type), DataFrame (DataFrame data structure), and Constructors (Series/DataFrame/Index/pd.array Constructors) labels Apr 23, 2024

mroeschke closed this May 31, 2024
