
Description
I've searched the documentation and old issues and I can't find anything on the reason for this unusual choice. The original historic PR was #2922.
One user raised the question before in #14900 and was aggressively shut down by jeff without an answer.
It has caused bugs in my code in the past, and I've seen it do the same for other people.
It is the cause of subtle bugs seen in the wild, e.g. #26959 (comment). Its inconsistency with python slice conventions is confusing to newbies, who ask on SO but are given no explanation.
Every pandas tutorial has to mention this special case:
.loc includes the last value with slice notation. In other data containers such as Python lists, the last value is excluded.
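
A minimal sketch of the difference that quote describes, using a toy Series whose integer labels happen to coincide with its positions (the data is made up for illustration):

```python
import pandas as pd

values = [10, 20, 30, 40, 50]
s = pd.Series(values, index=[0, 1, 2, 3, 4])

# Plain Python list: the stop position is excluded.
list_slice = values[1:3]

# Positional indexer: same half-open convention as the list.
iloc_slice = s.iloc[1:3].tolist()

# Label-based indexer: the stop label (3) is included.
loc_slice = s.loc[1:3].tolist()

print(list_slice)  # [20, 30]
print(iloc_slice)  # [20, 30]
print(loc_slice)   # [20, 30, 40]
```

The same `1:3` yields two elements positionally and three by label, which is the special case every tutorial has to call out.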
Users often need to slice something with closed='left' behavior, and try to add it in similar situations where it isn't available.
This behavior is fully documented. That's not the problem. For example,
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says
Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See "Slicing with labels".
The "Slicing with labels" section documents the behavior, but gives no reason why the choice to break with python conventions was taken.
Another user who asked this question, in #16571 (comment), was given a workaround which is much more cumbersome than the convenience of .loc.
The answer points to DataFrame.between_time, which has kwds for requesting this behavior but, infuriatingly, accepts only times and not datetimes.
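
To make the comparison concrete, here is a rough sketch of what such workarounds look like next to `.loc`. The frame and timestamps are invented for illustration, and the `inclusive="left"` keyword on `between_time` is assumed to be available (it is in recent pandas versions; older releases spelled this differently):

```python
import pandas as pd

# Five hourly rows on a single day (toy data).
idx = pd.date_range("2018-01-01 00:00", periods=5, freq="60min")
df = pd.DataFrame({"x": range(5)}, index=idx)

start = pd.Timestamp("2018-01-01 01:00")
stop = pd.Timestamp("2018-01-01 03:00")

# .loc is right-inclusive: the row at 03:00 is returned.
closed = df.loc[start:stop]

# Workaround 1: an explicit boolean mask gives a half-open interval.
half_open = df.loc[(df.index >= start) & (df.index < stop)]

# Workaround 2: between_time can exclude the right edge, but it only
# understands times of day, so it cannot express a datetime range.
by_time = df.between_time("01:00", "03:00", inclusive="left")

print(closed["x"].tolist())     # [1, 2, 3]
print(half_open["x"].tolist())  # [1, 2]
print(by_time["x"].tolist())    # [1, 2]
```

With data spanning several days, `between_time` would match 01:00 to 03:00 on every day, which is why it is not a substitute for a half-open datetime slice.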
A few related cookbook entries were added at
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes
which explain:
There are 2 explicit slicing methods, with a third general case
Positional-oriented (Python slicing style : exclusive of end)
Label-oriented (Non-Python slicing style : inclusive of end)
This again documents how loc works, but again offers no reason why pandas originally broke with python conventions.
In every case I've seen someone ask "why is label-based slicing right-inclusive?" the answer has always been "because it's label based, not position based", which doesn't really explain anything.
The same issue exists with string-based partial time slicing, such as df.loc[:"2018"], which will include rows with a 2018-01-01 prefix.
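
A small sketch of the partial-string case, on a made-up sorted DatetimeIndex. The explicit comparison against a Timestamp is just one possible way to get the exclusive version:

```python
import pandas as pd

idx = pd.to_datetime(
    ["2017-12-31", "2018-01-01", "2018-06-15", "2018-12-31", "2019-01-01"]
)
s = pd.Series(range(5), index=idx)

# Partial-string slicing treats "2018" as the whole year and includes it.
upto_2018 = s.loc[:"2018"]

# Getting "everything before 2018" needs an explicit comparison instead.
before_2018 = s[s.index < pd.Timestamp("2018-01-01")]

print(upto_2018.tolist())    # [0, 1, 2, 3]
print(before_2018.tolist())  # [0]
```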
So, over the years several people have found this undesirable and/or have tried to find out why, but after an hour's worth of googling and reading, I can't find that an explanation has ever been given.
I'm not saying there's no good reason. I understand that indexing can get complicated, that mixed cases are tricky, etc. But I want to understand why this was necessary or desirable, and why making .loc pythonic was unfeasible.
I'll be opening a very small POC PR for discussion in a few minutes, which adds a new indexer called locs. It is far from complete, but it seems to do exactly what I want for the single-index case, in what I've tried so far. It passes all the equivalent tests loc does.
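
To illustrate the kind of behavior I mean (this is not the code in the PR), here is a rough sketch of half-open label slicing. The helper name `loc_half_open` is hypothetical, and it assumes a sorted, unique, single-level index:

```python
import pandas as pd


def loc_half_open(obj, start, stop):
    """Hypothetical helper: label slice over [start, stop).

    Assumes obj.index is sorted and unique; not part of pandas.
    """
    index = obj.index
    # Position of the first label >= start.
    lo = int(index.searchsorted(start, side="left"))
    # Position of the first label >= stop, so stop itself is excluded.
    hi = int(index.searchsorted(stop, side="left"))
    return obj.iloc[lo:hi]


s = pd.Series([1, 2, 3, 4], index=[100, 200, 300, 400])

closed = s.loc[200:400]                  # right-inclusive
half_open = loc_half_open(s, 200, 400)   # right-exclusive

print(closed.index.tolist())     # [200, 300, 400]
print(half_open.index.tolist())  # [200, 300]
```

For the sorted single-index case this is only a couple of lines on top of the existing API, which is part of why I'm asking what makes the general case hard.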
To summarize:
- It's been asked before, but not answered.
- The behavior is documented, but is surprising and never explained.
- It's not immediately obvious why it's impossible to implement the pythonic version.
So I ask, why is .loc right-inclusive?