crosstab's dependency on common index produces undesirable error #20496

Open
@jsh9

Code Sample, a copy-pastable example if possible

I found that when two pandas Series of the same length but with different indices are passed to crosstab(), the resulting cross tabulation is incorrect.

import numpy as np
import pandas as pd

# a Series of length 15, with index = 0,2,4,6,8,...
x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0,30,2))

# a Series of length 15, with index = 0,1,2,3,4,...
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0,15,1))

# convert either one to a numpy array, eliminating its index
print(pd.crosstab(np.array(x), y, margins=True))

# pass them to crosstab() as is, keeping their (different) indices
print(pd.crosstab(x, y, margins=True))

The output:

col_0  0  1  All
row_0           
1      4  1    5
2      1  4    5
3      1  4    5
All    6  9   15

col_0  0  1  All
row_0           
1      2  3    5
2      1  2    3
All    3  5    8

Problem description

The second output is wrong (look at the margins and the total count, 8 instead of 15): when crosstab() receives x and y as Series, it aligns them on their index labels and counts only the labels present in both, so some (or even all) values can be silently omitted.
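To illustrate where the total of 8 comes from, here is a minimal sketch that computes the labels shared by the two indices and tabulates only those rows. It reuses x and y from the code sample; the names common and aligned are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# Labels present in both indices: 0, 2, 4, ..., 14
common = x.index.intersection(y.index)
print(len(common))  # 8

# Restrict both Series to the shared labels, then tabulate by position.
# np.asarray drops the index, so no further alignment takes place.
aligned = pd.crosstab(
    np.asarray(x.loc[common]), np.asarray(y.loc[common]), margins=True
)
print(aligned)
```

The grand total of this table equals the number of shared labels, which matches the total in the second output above.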

On the other hand, if either x or y is passed as a NumPy array (i.e., without an index), the array "adopts" the index of the other Series and the values are paired by position, which gives the correct result.
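As a possible workaround when the values are meant to be paired by position, both indices can be dropped before calling crosstab(). This is only a sketch under that assumption; x_pos, y_pos and table are names chosen for this example.

```python
import pandas as pd

x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# reset_index(drop=True) replaces each index with the default 0..n-1 labels,
# so both Series share the same index and no rows are lost to alignment.
x_pos = x.reset_index(drop=True)
y_pos = y.reset_index(drop=True)

table = pd.crosstab(x_pos, y_pos, margins=True)
print(table)
```

This should only be used when position really is the intended pairing; if the index labels carry meaning, dropping them would pair unrelated values.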

I am not sure whether this behavior is by design. If it is, maybe a warning could be raised telling users that values whose index labels do not match are dropped from the cross tabulation?
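A rough sketch of what such a warning could look like, written as a user-side wrapper. Note that crosstab_checked is a hypothetical helper invented for this example and is not part of pandas; it only handles the simple case of two Series.

```python
import warnings
import pandas as pd

def crosstab_checked(index, columns, **kwargs):
    """Hypothetical wrapper (not a pandas API): warn if the indices differ."""
    if isinstance(index, pd.Series) and isinstance(columns, pd.Series):
        if not index.index.equals(columns.index):
            n_common = len(index.index.intersection(columns.index))
            warnings.warn(
                f"Inputs have different indices; only the {n_common} rows "
                f"with shared labels will be counted.",
                UserWarning,
                stacklevel=2,
            )
    return pd.crosstab(index, columns, **kwargs)

x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0, 30, 2))
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0, 15, 1))

# Emits a UserWarning, then returns the same table as pd.crosstab(x, y, ...)
result = crosstab_checked(x, y, margins=True)
print(result)
```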

Output of pd.show_versions()

python: 3.6.3.final.0
pandas: 0.22.0
numpy: 1.13.3

Labels: Docs, Reshaping
