Skip to content

loc very slow on sorted, non-unique index with list of labels ar argument #9466

Closed
@toobaz

Description

@toobaz
In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

... makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things...

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True

(As a sidenote: the reason why I'm doing calls such as df.loc[[a_label]] is that df.loc[a_label] will return sometimes a Series, sometimes a DataFrame. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label] - but much faster than the above df.loc[[a_label]].)

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingRelated to indexing on series/frames, not to indexes themselvesPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions