Description
Code Sample, a copy-pastable example if possible
I found that passing two pandas Series of the same length but with different indices to crosstab() produces an incorrect cross tabulation.
import numpy as np
import pandas as pd

# a Series of length 15, with index = 0,2,4,6,8,...
x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0,30,2))
# a Series of length 15, with index = 0,1,2,3,4,...
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0,15,1))
# convert either one to a numpy array, eliminating its index
print(pd.crosstab(np.array(x), y, margins=True))
# pass them to crosstab() as is, keeping their (different) indices
print(pd.crosstab(x, y, margins=True))
The output:
col_0  0  1  All
row_0
1      4  1    5
2      1  4    5
3      1  4    5
All    6  9   15

col_0  0  1  All
row_0
1      2  3    5
2      1  2    3
All    3  5    8
Problem description
The second output is wrong (note the margins and the total count of 8): when crosstab() receives x and y as Series, it only looks at the elements whose index labels appear in both, which can silently omit some, or even all, values.
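The count of 8 can be reproduced by hand. The sketch below reuses x and y from the code sample and assumes that the "common indices" are simply the intersection of the two index objects; it is an illustration of the observed behavior, not a description of pandas internals.

```python
import pandas as pd

x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# Labels present in both indices: the even numbers 0, 2, ..., 14
common = x.index.intersection(y.index)
print(len(common))  # 8

# Restricting both Series to the shared labels leaves 8 of the 15 pairs,
# and none of the rows where x == 3 survive
x_common = x.loc[common].tolist()
y_common = y.loc[common].tolist()
print(x_common)  # [1, 1, 1, 1, 1, 2, 2, 2]
print(y_common)  # [0, 0, 1, 1, 1, 1, 1, 0]
```

These eight pairs give exactly the counts in the second table: (2, 3) for row 1 and (1, 2) for row 2.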
On the other hand, if either x or y is passed in as a numpy array (i.e., without any index), the numpy array "adopts" the index of the other Series, so the values are paired by position and the result is correct.
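As a possible workaround when positional pairing is what is intended, the indices can be removed or reset before calling crosstab(). This is only a sketch using x and y from the code sample; the variable names by_position and by_reset are my own.

```python
import numpy as np
import pandas as pd

x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# Option 1: drop both indices by converting to numpy arrays
by_position = pd.crosstab(np.asarray(x), np.asarray(y), margins=True)

# Option 2: give both Series the same default index 0..14
by_reset = pd.crosstab(
    x.reset_index(drop=True),
    y.reset_index(drop=True),
    margins=True,
)

# Both variants count all 15 pairs
print(by_position.loc["All", "All"])
print(by_reset.loc["All", "All"])
```

Option 2 keeps the inputs as Series, which may be preferable if their names are meant to label the rows and columns.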
I am not sure whether this behavior is by design. If it is, perhaps a warning could be raised telling users that the cross tabulation may be missing values?
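To make the suggestion concrete, here is roughly what such a check could look like in user code. crosstab_checked is a hypothetical wrapper, not a pandas function, and the warning text is my own; it only handles the simple case of two Series arguments.

```python
import warnings

import pandas as pd


def crosstab_checked(index, columns, **kwargs):
    """Hypothetical wrapper: warn when two Series do not share an index."""
    if isinstance(index, pd.Series) and isinstance(columns, pd.Series):
        if not index.index.equals(columns.index):
            shared = len(index.index.intersection(columns.index))
            warnings.warn(
                f"indices differ; only {shared} shared labels can be matched "
                f"(inputs have {len(index)} and {len(columns)} rows)",
                UserWarning,
                stacklevel=2,
            )
    return pd.crosstab(index, columns, **kwargs)


x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# Emits a UserWarning mentioning the 8 shared labels, then tabulates as usual
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = crosstab_checked(x, y, margins=True)
messages = [str(w.message) for w in caught]
print(messages)
```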
Output of pd.show_versions()
python: 3.6.3.final.0
pandas: 0.22.0
numpy: 1.13.3