Description
axis=0 or axis=1, which is it?
I've always found it hard to remember which axis (0/"index" vs. 1/"columns") does what for various operations. I suppose some people find it intuitive, while others (like me) find it confusing and inconsistent.
Case in point, DataFrame.sum vs. DataFrame.drop: if I want column sums, I need axis=0 ...
>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.sum(axis=0)
a     6
b    15
dtype: int64
... but if I want to drop a column, I need axis=1:
>>> df.drop("a", axis=1)
   b
0  4
1  5
2  6
There's an analogous discrepancy in numpy (which is probably where pandas inherited it from?):
>>> import numpy as np
>>> a = df.to_numpy()
>>> a
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.sum(a, axis=0) # sum columns
array([ 6, 15])
>>> np.delete(a, 0, axis=1) # delete first column
array([[4],
       [5],
       [6]])
I just intuitively conceptualize these operations as working along the same axis, so it's hard for me to internalize that the value of the axis parameter is different in each case. Apparently, I'm not the only person to find this confusing (quoting from the article: "For example, in the np.sum() function, the axis parameter behaves in a way that many people think is counter intuitive").
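One possible way to reconcile the two calls, offered as a sketch rather than as the official rationale: in both cases axis names the dimension that gets changed. sum collapses that dimension entirely, while delete removes a single entry along it. Comparing shapes makes this visible (the variable names below are only for illustration):

```python
import numpy as np

a = np.array([[1, 4], [2, 5], [3, 6]])  # shape (3, 2): 3 rows, 2 columns

# sum collapses the named axis: axis 0 (length 3) disappears,
# leaving one value per column
col_sums = np.sum(a, axis=0)

# delete removes one entry along the named axis: axis 1 shrinks from 2 to 1,
# so a column goes away
without_first_col = np.delete(a, 0, axis=1)

print(a.shape, col_sums.shape, without_first_col.shape)
```

Read this way, "column sums" need axis=0 because the rows are what gets collapsed, and dropping a column needs axis=1 because the column dimension is what gets shorter.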
At the same time, I can imagine that some people find this behavior completely natural (at the very least those who designed the API). And I understand that changing this in pandas while keeping the status quo in numpy would introduce a (probably) worse inconsistency, so I'm not suggesting that.
Suggestion for improvement
What I am suggesting is reviewing the documentation of functions/methods using the axis= keyword argument and (where applicable) improving the description of what it controls in each case. Pandas is typically used interactively, so documentation is easily accessible. If it contains useful hints on what each axis value does (and possibly why), it's not such a big problem if this behavior goes against some people's expectations.
Examples
For example, based on the current master docs, the description of the axis parameter for drop does a good job at this:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
This makes it reasonably clear to me that if I specify 0, I'll be removing rows, whereas 1 will result in removing columns.
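As an aside, drop also accepts index= and columns= keywords, which avoid the axis question altogether. A minimal sketch using the same example frame as above:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# Equivalent spellings: the keyword names what is being dropped
dropped_by_axis = df.drop("a", axis=1)
dropped_by_keyword = df.drop(columns="a")

print(dropped_by_keyword)
```

There is no comparable keyword for reductions like sum, which is part of why the wording of the axis description matters more there.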
By contrast, the description of the axis parameter for sum is somewhat too generic:
axis : {index (0), columns (1)}
Axis for the function to be applied on.
Based on this, I could conclude (and have repeatedly done so) that if I want column sums, I need to "apply the function on columns", hence axis=1 (which is wrong, cf. above).
A revised description could look something like the following:
axis : {index (0), columns (1)}
Whether to collapse the index (0 or ‘index’), resulting in column sums, or the columns (1 or ‘columns’), resulting in row sums.
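To check that the proposed wording matches the actual behavior, here is a short sketch using the string aliases on the same example frame (variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# Collapsing the index leaves one value per column: column sums
col_sums = df.sum(axis="index")

# Collapsing the columns leaves one value per row: row sums
row_sums = df.sum(axis="columns")

print(col_sums)
print(row_sums)
```

The result of axis="index" is labeled by the column names a and b, and the result of axis="columns" is labeled by the original row index, which is consistent with describing the argument as the axis being collapsed.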