Description
First of all, I have to admit that I have some trouble finding a catchy title for this issue. Better suggestions are welcome...
There is some issue with crosstab, when applied on a DataFrame which has been derived from a DataFrame consisting of categorized columns.
Let's consider the following example df:
import pandas as pd
df = pd.DataFrame({'col0':list('abbabcabcab'), 'col1':[1,1,1,1,2,3,2,2,3,3,1],
'crit':[1,1,1,1,1,0,0,0,0,0,0]})
We apply some filter according to some criterion which happens to eliminate all occurrences of 'c' in 'col0' (and '3' in 'col1'). Then we build a crosstab:
df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1)
The result is as expected:
col1 1 2
col0
a 2 0
b 2 1
Now we try the same on categorized columns:
for col in df.columns:
df[col] = df[col].astype('category')
df_filtered = df[df.crit == 1]
pd.crosstab(df_filtered.col0, df_filtered.col1
In this case, value 'c' and '3' are again listed, although they shouldn't be:
col1 1 2 3
col0
a 2 0 0
b 2 1 0
c 0 0 0
My guess is that the reason for this behavior can be found in the lookup-table of the category, which does not account for values that are not represented any longer:
df_filtered.col0.cat.categories
Output:
Index(['a', 'b', 'c'], dtype='object'
So, either this lookup-table has to be fixed upon filtering, or pd.crosstab has to respect this condition.