
Improve docs on what the axis= kwarg does in individual functions/methods #29203

Closed
@dlukes


axis=0 or axis=1, which is it?

I've always found it hard to remember which axis (0/"index" vs. 1/"columns") does what for various operations. I suppose some people find it intuitive, while others (like me) find it confusing and inconsistent.

Case in point, DataFrame.sum vs. DataFrame.drop: if I want column sums, I need axis=0...

>>> import pandas as pd
>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
>>> df
   a  b
0  1  4
1  2  5
2  3  6
>>> df.sum(axis=0)
a     6
b    15
dtype: int64

... but if I want to drop a column, I need axis=1:

>>> df.drop("a", axis=1)
   b
0  4
1  5
2  6

There's an analogous discrepancy in numpy (which is probably where pandas inherited it from):

>>> import numpy as np
>>> a = df.to_numpy()
>>> a
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.sum(a, axis=0)  # sum columns
array([ 6, 15])
>>> np.delete(a, 0, axis=1)  # delete first column
array([[4],
       [5],
       [6]])

I just intuitively conceptualize these operations as working along the same axis, so it's hard for me to internalize that the value of the axis parameter is different in each case. Apparently, I'm not the only person to find this confusing (quoting from the article: "For example, in the np.sum() function, the axis parameter behaves in a way that many people think is counter intuitive").
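One way to reconcile the two calls is to read axis= as naming the dimension of the shape that the operation acts on, rather than the rows or columns I end up with. The following is only a sketch of that mental model (my own reading, not wording taken from the numpy or pandas docs). It rebuilds the array from the example above and compares shapes:

```python
import numpy as np

# Same data as the example above: 3 rows (axis 0) by 2 columns (axis 1).
a = np.array([[1, 4], [2, 5], [3, 6]])

# sum collapses axis 0 (length 3), leaving one value per column.
col_sums = np.sum(a, axis=0)

# delete removes an entry along axis 1, so that axis shrinks from 2 to 1.
without_first_col = np.delete(a, 0, axis=1)

print(a.shape, col_sums.shape, without_first_col.shape)
# prints: (3, 2) (2,) (3, 1)
```

Read this way, both calls act on the axis they name: sum makes axis 0 disappear, delete makes axis 1 shorter. The results just happen to be described in terms of "columns" in both cases.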

At the same time, I can imagine that some people find this behavior completely natural (at the very least those who designed the API). And I understand that changing this in pandas while keeping the status quo in numpy would introduce a (probably) worse inconsistency, so I'm not suggesting that.

Suggestion for improvement

What I am suggesting is reviewing the documentation of functions/methods using the axis= keyword argument and (where applicable) improving the description of what it controls in each case. Pandas is typically used interactively, so documentation is easily accessible. If it contains useful hints on what each axis value does (and possibly why), it's not such a big problem if this behavior goes against some people's expectations.

Examples

For example, based on the current master docs, the description of the axis parameter for drop does a good job of this:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

This makes it reasonably clear to me that if I specify 0, I'll be removing rows, whereas 1 will result in removing columns.
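For what it's worth, drop also accepts the string aliases and the index=/columns= keywords, which sidestep the numeric axis entirely. A small sketch using the df from the example above (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# Three spellings of "drop column a".
by_number = df.drop("a", axis=1)
by_alias = df.drop("a", axis="columns")
by_keyword = df.drop(columns="a")

# Two spellings of "drop the row labelled 0".
row_by_number = df.drop(0, axis=0)
row_by_keyword = df.drop(index=0)

print(by_number.equals(by_alias) and by_alias.equals(by_keyword))
print(row_by_number.equals(row_by_keyword))
```

The keyword forms state which set of labels is being searched, which is the same information the drop docstring conveys in words.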

By contrast, the description of the axis parameter for sum is somewhat too generic:

axis : {index (0), columns (1)}
Axis for the function to be applied on.

Based on this, I could conclude (and have repeatedly done so) that if I want column sums, I need to "apply the function on columns", hence axis=1 (which is wrong, cf. above).

A revised description could look something like the following:

axis : {index (0), columns (1)}
Whether to collapse the index (0 or ‘index’), resulting in column sums, or the columns (1 or ‘columns’), resulting in row sums.
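To make the proposed wording concrete, here is a quick check with the same df as above. It is just an illustration of the behavior described, not suggested docstring content:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))

# axis=0 / "index": the index is collapsed, one sum per column remains.
col_sums = df.sum(axis="index")

# axis=1 / "columns": the columns are collapsed, one sum per row remains.
row_sums = df.sum(axis="columns")

print(col_sums.to_dict())  # {'a': 6, 'b': 15}
print(row_sums.tolist())   # [5, 7, 9]
```

In both cases the labels of the result come from the axis that was not named, which is what the word "collapse" in the revised description is meant to signal.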
