
PERF: .ix performance on series #5567

Closed
@l736x

Series.xs, when presented with a MultiIndex, should use the DataFrame logic (which uses the levels to avoid scanning the entire index). The xs logic needs to move to core/generic.py so that both Series and DataFrame can use it.
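
To illustrate what "using the levels" can mean, here is a minimal sketch, not the actual pandas implementation. It assumes a lexsorted MultiIndex and a recent pandas that exposes `MultiIndex.codes` (older versions called this `labels`). The helper name `xs_first_level` is hypothetical.

```python
import numpy as np
import pandas as pd

# Small sorted MultiIndex so the example runs quickly.
mi = pd.MultiIndex.from_product([range(100), range(100)])
s = pd.Series(np.arange(len(mi), dtype=float), index=mi)


def xs_first_level(series, key):
    """Hypothetical helper: select all rows whose first-level label is `key`.

    Assumes the index is lexsorted, so the first-level codes are
    non-decreasing and a binary search finds the matching block.
    """
    index = series.index
    # Translate the label into its integer code via the (small) level.
    code = index.levels[0].get_loc(key)
    codes = np.asarray(index.codes[0])
    # Binary search on the codes instead of comparing every tuple.
    start = codes.searchsorted(code, side="left")
    stop = codes.searchsorted(code, side="right")
    # Drop the first level, as a cross-section would.
    return series.iloc[start:stop].droplevel(0)


sub = xs_first_level(s, 99)
```

The point of the sketch is that the lookup touches one level and one integer array rather than a million tuples; whether pandas takes this exact route for a Series is the subject of this issue.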


I stumbled on a weird issue.
The first time I access a location in a series with ix, there is a huge overhead.
This is not the case if I convert the series into a DataFrame and then access it.
I use a MultiIndex below because it makes the effect more visible, but the same problem is present for regular indexes.

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])

In [4]: s = pd.Series(randn(1000000), index=mi)

In [5]: %time s.ix[999]
CPU times: user 652 ms, sys: 4 ms, total: 656 ms
Wall time: 656 ms
Out[5]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64

In [6]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 249 us
Out[6]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64
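
For anyone who wants to reproduce the first-access overhead outside IPython, the session above can be sketched as a standalone script. This is an approximation under stated assumptions: it uses `.loc` because `.ix` was removed in pandas 1.0, it uses a smaller index so it runs quickly, and the helper name `time_first_two_lookups` is hypothetical. Timings depend on the machine and pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def time_first_two_lookups(n=200):
    """Hypothetical helper: time two consecutive lookups of the same key."""
    mi = pd.MultiIndex.from_tuples([(x, y) for x in range(n) for y in range(n)])
    s = pd.Series(np.random.randn(n * n), index=mi)
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        result = s.loc[n - 1]  # same access pattern as s.ix[999] above
        timings.append(time.perf_counter() - start)
    return result, timings


result, (first, second) = time_first_two_lookups()
print(f"first access:  {first * 1e3:.3f} ms")
print(f"second access: {second * 1e3:.3f} ms")
```

If the first number is much larger than the second, the overhead described here is still present in the installed version.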

The overhead seems tied to the index: recreating the series with the same index does not reproduce it, but sorting the index does:

In [7]: s = pd.Series(randn(1000000), index=mi)

In [8]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 247 us
Out[8]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [9]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 202 us
Out[9]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [10]: s = s.sort_index()

In [11]: %time s.ix[999]
CPU times: user 856 ms, sys: 32 ms, total: 888 ms
Wall time: 890 ms
Out[11]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [12]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 213 us
Out[12]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

And now the weird part. I convert the series into a DataFrame:

In [16]: s = s.sort_index()

In [17]: %time pd.DataFrame(s).ix[999]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.78 ms
Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

In [18]: %time pd.DataFrame(s).ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 397 us
Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

There is still a little overhead, but nothing compared to the previous case.

It might seem like a minor problem, but for large time series the lag becomes on the order of a second and can cost a lot of performance.

Lastly, I'm sorry, but I don't have easy access to the current dev version, so the problem may already be solved (although I'd be curious to know where it comes from).

Edit: could this be linked to #4198?

Labels: Indexing (Related to indexing on series/frames, not to indexes themselves); Performance (Memory or execution speed performance)