Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
s = pd.Series(['a', 'b'], dtype='category')
print(s)
s = s.cat.add_categories(['a'])
Issue Description
This code raises:
ValueError: new categories must not include old categories: {'a'}
This exception makes it unnecessarily hard to build reliable code on top of this primitive: code that "seems to work" stops working as soon as, for one reason or another, the item being passed happens to already be in the categories. My initial use case was simply to apply fillna() to a Series that sometimes, but not always, contains NA values:
import pandas as pd
default = 'b'
s = pd.Series(['a', None], dtype='category')
s.fillna(default)
This failed with:
TypeError: Cannot setitem on a Categorical with a new category (b), set the categories first
This is not very polymorphism-friendly, as it leaks the fact that the Series is categorical even though the only reason I used that dtype is to save memory, but I can live with that.
Onto patching the category:
default = 'b'
s = pd.Series(['a', None], dtype='category')
s = s.cat.add_categories([default])
s.fillna(default)
Now the code seems to work, except that it contains a landmine ready to blow as soon as the user naively provides a default of 'a':
ValueError: new categories must not include old categories: {'a'}
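The landmine can be reproduced in one place with a minimal sketch. The helper name `fill_default_naive` is hypothetical and only wraps the two calls from the snippet above; it is not part of pandas.

```python
import pandas as pd


def fill_default_naive(s, default):
    # Hypothetical helper: always add the category, then fill.
    s = s.cat.add_categories([default])
    return s.fillna(default)


s = pd.Series(['a', None], dtype='category')

# 'b' is not yet a category, so adding it and filling succeeds.
ok = fill_default_naive(s, 'b')

# 'a' is already a category, so add_categories() raises.
try:
    fill_default_naive(s, 'a')
    error = None
except ValueError as exc:
    error = str(exc)

print(ok.tolist())
print(error)
```

The same helper therefore succeeds or fails depending only on whether the default happens to be an existing category.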
Expected Behavior
add_categories() should Just Work™ to allow building reliable libraries on top of pandas.
One might object that:
- I should not blindly use fillna(). Yes, I could check s.isna().any(), but it would duplicate the NA detection, which can be costly on large dataframes, as well as being unnecessary cruft.
- I can work around by checking if the value is in the category. Yes I can, and I already have a module dedicated to combinators, or essentially replacements for pandas functions that are not reliable and need extra care, and more generally for functions that are dependently typed (e.g. DataFrame.groupby where the return type is T or tuple(T) depending on len(by), leading to similar bugs where an innocuous change in input can wreak havoc on the helper). The smaller this module is, the better.
- I should read the documentation. Yes, this behavior is documented, but this is an orthogonal concern to the other points.
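For reference, the membership-check workaround from the second objection might look like the sketch below. `add_missing_categories` is a hypothetical name for the kind of wrapper such a helper module would contain; it uses only the public `Series.cat` accessor and returns the Series unchanged when nothing is missing.

```python
import pandas as pd


def add_missing_categories(s, new_categories):
    # Hypothetical wrapper: only add the categories that are not present yet.
    old = set(s.cat.categories)
    missing = [x for x in new_categories if x not in old]
    if not missing:
        # Nothing to add, so avoid calling add_categories() at all.
        return s
    return s.cat.add_categories(missing)


s = pd.Series(['a', None], dtype='category')

# Works whether or not the default is already a category.
filled_b = add_missing_categories(s, ['b']).fillna('b')
filled_a = add_missing_categories(s, ['a']).fillna('a')

print(filled_b.tolist())
print(filled_a.tolist())
```

This is exactly the sort of boilerplate that every caller has to carry as long as add_categories() raises on existing categories.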
On the bright side of things:
- Not raising an exception would probably be considered backward-compatible (not strictly speaking, but it's unlikely someone really relied on that).
- The change is actually trivial:
Current code:
def add_categories(self, new_categories, inplace=no_default):
    ...
    if len(already_included) != 0:
        raise ValueError(
            f"new categories must not include old categories: {already_included}"
        )
    new_categories = list(self.dtype.categories) + list(new_categories)
New code:
def add_categories(self, new_categories, inplace=no_default):
    ...
    old = set(self.dtype.categories)
    new_categories = list(self.dtype.categories) + [x for x in new_categories if x not in old]
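To illustrate what the proposed filtering does in isolation, here is a pure-Python sketch of the merge step on plain lists. The function name `merge_categories` is hypothetical and stands in for the two lines of the proposed change; it is not pandas code.

```python
def merge_categories(old_categories, new_categories):
    # Keep the existing categories in order, then append only the unseen ones.
    old = set(old_categories)
    return list(old_categories) + [x for x in new_categories if x not in old]


# An already-present category is silently skipped instead of raising.
merged = merge_categories(['a', 'b'], ['a', 'c'])

# Adding only existing categories leaves the list unchanged.
unchanged = merge_categories(['a', 'b'], ['a'])

print(merged)
print(unchanged)
```

Like the proposed snippet, this sketch does not deduplicate within new_categories itself; it only filters out values that are already existing categories.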
Installed Versions
Pandas v1.4.0
Python 3.10