ENH: fastpath indexer API proposal (draft) #6328

Closed
@immerrr

The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is pretty obvious so it should've been considered before,
but I still think pandas as it is right now can benefit from it.

My main complaint about pandas when using it in a non-interactive way is that
lookups are significantly slower than with ndarray containers. I do realize
that this happens because of the many ways indexing can be done, but at some
point I really started thinking about ditching pandas in some
performance-critical paths of my project and replacing it with the dreadful
dict/ndarray combo. Not only does writing arr = df.values[df.index.get_loc(key)]
get old pretty fast, but it's also slower when the frame contains different
dtypes, and then you need to go deeper to fix that.
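
For illustration, the manual workaround described above might look roughly like the following. The frame contents and the label "y" are made up for the example; only `Index.get_loc`, `DataFrame.values`, and `DataFrame.loc` are existing pandas API.

```python
import numpy as np
import pandas as pd

# Hypothetical single-dtype frame with a unique label index
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

# Regular label lookup: goes through the general-purpose indexing machinery
row_via_loc = df.loc["y"].to_numpy()

# Manual workaround: resolve the label to a position once,
# then index the underlying ndarray directly
pos = df.index.get_loc("y")
row_via_values = df.values[pos]

print(np.array_equal(row_via_loc, row_via_values))
```

Both paths return the same row here; the second skips most of the dispatching but has to be spelled out by hand at every call site.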

Now I thought: what if this slowdown could be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and would convey a
message to pandas indexing facilities, like "trust me, I've done all the
preprocessing, just look it up already"? I'm talking about something like this
(the names are arbitrary and chosen for illustrative purposes only):

masked_rows = df.fastloc[pd.bool_slice[bool_array]]
# or
masked_rows = df.fastloc[pd.bool_series_slice[bool_series]]
# or
rows_3_and_10 = df.fastloc[pd.pos_slice[3, 10]]
# or
rows_3_through_10 = df.fastloc[pd.range_slice[3:10]]
# or
rows_for_two_days = df.fastloc[pd.tpos_slice['2014-01-01', '2014-01-08']]
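
One possible way to get the bracket syntax used in these examples is a small factory object whose `__getitem__` wraps the key in a marker object, similar in spirit to `np.s_`. This is only a sketch: `PosIndexer`, `_PosSliceFactory`, and the module-level `pos_slice` are hypothetical names, not pandas API.

```python
class PosIndexer:
    """Hypothetical marker: holds integer positions assumed to be pre-validated."""

    def __init__(self, positions):
        self.positions = positions


class _PosSliceFactory:
    """Hypothetical factory so that pos_slice[3, 10] builds a PosIndexer."""

    def __getitem__(self, key):
        # obj[3, 10] passes the tuple (3, 10); obj[3] passes the bare int 3
        if isinstance(key, tuple):
            return PosIndexer(list(key))
        return PosIndexer([key])


# Stand-in for what the proposal spells as pd.pos_slice
pos_slice = _PosSliceFactory()

indexer = pos_slice[3, 10]
print(indexer.positions)  # [3, 10]
```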

Given that the actual slice objects would have a common base class, the
implementation could be as easy as:

class FastLocAttribute(object):
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")

        # open to custom FastpathIndexer implementations
        return smth.getitem(self._container)
        # or, better encapsulated but not so open:
        # return self._container._index_method[type(smth)](smth)
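
To make the dispatch concrete, here is a minimal self-contained sketch of the "open" variant. Everything named `FastpathIndexer`, `BoolIndexer`, `PosIndexer`, and `FastLocAttribute` is hypothetical; the only real pandas/NumPy calls are `DataFrame.take` and `np.flatnonzero`. It shows the shape of the API only and makes no claim about actual speedups.

```python
import numpy as np
import pandas as pd


class FastpathIndexer:
    """Hypothetical common base class for pre-validated indexers."""

    def __init__(self, key):
        self.key = key

    def getitem(self, container):
        raise NotImplementedError


class BoolIndexer(FastpathIndexer):
    """Key is promised to be a boolean ndarray of matching length."""

    def getitem(self, container):
        # Convert the mask to positions, then take rows by position
        return container.take(np.flatnonzero(self.key))


class PosIndexer(FastpathIndexer):
    """Key is promised to be a sequence of integer positions."""

    def getitem(self, container):
        return container.take(self.key)


class FastLocAttribute:
    def __init__(self, container):
        self._container = container

    def __getitem__(self, indexer):
        if not isinstance(indexer, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        # No guessing about the key type: the indexer knows what to do
        return indexer.getitem(self._container)


df = pd.DataFrame({"a": range(5), "b": range(5, 10)})
fastloc = FastLocAttribute(df)  # stand-in for a df.fastloc accessor

mask = np.array([True, False, True, False, False])
masked_rows = fastloc[BoolIndexer(mask)]
rows_1_and_3 = fastloc[PosIndexer([1, 3])]
```

Passing a plain list or array to `fastloc[...]` raises `TypeError`, which is the point: the caller has to state what kind of key it is.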

Cons:

  • a change in public API
  • one more lookup type
  • inconvenient to use interactively

Pros:

  • adheres to the Zen of Python (explicit is better than implicit)
  • when used in programs, most of the time you know what the indexing
    object will look like and how you want to use its contents (e.g. no guessing
    whether np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
  • lengthier than existing lookup schemes but still shorter than jumping through
    the hoops of NDFrame and Index internals to avoid premature
    pessimization (also, more reliable w.r.t. new releases)
  • fastpath indexing API could be used in pandas internally for the speed (and
    clarity, as in "interesting, what does this function pass to df.loc[...],
    let's find this out")
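
The mask-versus-positions ambiguity mentioned in the pros can be seen with plain NumPy, where the same four values select different elements depending only on dtype. The array contents are arbitrary example data.

```python
import numpy as np

data = np.array([10, 20, 30, 40])
key = np.array([0, 1, 0, 1])

# Integer dtype: treated as positions to take
as_positions = data[key]

# Boolean dtype: treated as a mask
as_mask = data[key.astype(bool)]

print(as_positions)  # [10 20 10 20]
print(as_mask)       # [20 40]
```

An explicit fastpath indexer would let the caller state which of the two meanings is intended instead of relying on dtype inspection.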

Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
