Description
Series.xs, when presented with a MultiIndex, should use the DataFrame logic (whereby it can use the levels to avoid having to scan the entire set). The xs logic needs to move to core/generic.py so both Series and DataFrame can use it.
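To make the intent concrete, here is a minimal sketch of what the shared logic could look like (the helper name xs_via_levels and its placement are hypothetical; the real refactor would live in core/generic.py). The key point is that MultiIndex.get_loc_level resolves the key through the levels and returns both the positions and the residual index, with no full scan of the labels:

```python
import numpy as np
import pandas as pd

def xs_via_levels(obj, key, level=0):
    # Hypothetical shared cross-section helper: resolves `key` through
    # the MultiIndex levels (get_loc_level) instead of scanning every
    # label, and works for Series and DataFrame alike.
    loc, new_index = obj.index.get_loc_level(key, level=level)
    out = obj.iloc[loc]       # loc is a slice (sorted index) or boolean mask
    out.index = new_index     # the remaining levels after dropping `key`'s level
    return out

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(100) for y in range(100)])
s = pd.Series(np.random.randn(10000), index=mi)
print(xs_via_levels(s, 99).head())
```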
I stumbled on a weird issue.
The first time I access a Series location with ix, there is a huge overhead.
This is not the case if I convert the Series into a DataFrame and then access it.
I use a MultiIndex below because it makes the effect more visible, but the same problem is present for regular indexes.
In [1]: import pandas as pd; import numpy as np
In [2]: pd.__version__
Out[2]: '0.12.0'
In [3]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])
In [4]: s = pd.Series(np.random.randn(1000000), index=mi)
In [5]: %time s.ix[999]
CPU times: user 652 ms, sys: 4 ms, total: 656 ms
Wall time: 656 ms
Out[5]:
0 -0.271328
...
999 -0.832013
Length: 1000, dtype: float64
In [6]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 249 us
Out[6]:
0 -0.271328
...
999 -0.832013
Length: 1000, dtype: float64
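For reference, a self-contained version of the measurement (note: .ix was the accessor at the time; on a modern install, where .ix is gone, .loc shows the same first-access effect):

```python
import time
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(1000000), index=mi)

t0 = time.perf_counter()
s.loc[999]                    # first access: pays the one-time setup cost
print("first:  %.4fs" % (time.perf_counter() - t0))

t0 = time.perf_counter()
s.loc[999]                    # second access: hits the cached structures
print("second: %.4fs" % (time.perf_counter() - t0))
```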
The behavior seems tied to the index object: recreating the Series (reusing the same mi) does not reproduce the overhead, but sorting the index does:
In [7]: s = pd.Series(np.random.randn(1000000), index=mi)
In [8]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 247 us
Out[8]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [9]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 202 us
Out[9]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [10]: s = s.sort_index()
In [11]: %time s.ix[999]
CPU times: user 856 ms, sys: 32 ms, total: 888 ms
Wall time: 890 ms
Out[11]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [12]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 213 us
Out[12]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
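My guess (an assumption, not verified against the 0.12 source) is that the first access pays a one-time cost to build the index's lookup machinery (hash tables / sortedness metadata), which is then cached on the index object. That would fit the pattern above: sort_index returns a Series with a brand-new index object, so whatever lazy structures the old index had built are discarded, and the one-time cost is paid again on the next lookup. A quick identity check along those lines:

```python
import numpy as np
import pandas as pd

# Deliberately unsorted so sort_index really builds a new index.
mi = pd.MultiIndex.from_tuples([(x, y) for x in range(9, -1, -1) for y in range(10)])
s = pd.Series(np.random.randn(100), index=mi)
s.loc[5]                      # warms the lookup caches on this index object
s2 = s.sort_index()
print(s2.index is s.index)    # False: a new index object, cold caches again
```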
And now the weird part: I convert the Series into a DataFrame:
In [16]: s = s.sort_index()
In [17]: %time pd.DataFrame(s).ix[999]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.78 ms
Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0 1000 non-null values
dtypes: float64(1)
In [18]: %time pd.DataFrame(s).ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 397 us
Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0 1000 non-null values
dtypes: float64(1)
There is still a little overhead, but nothing compared to the previous case.
It might seem an innocuous problem, but for large time series the lag grows to the order of a second and can eat up a lot of performance.
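Until the Series path reuses the DataFrame logic, a practical workaround (my suggestion, not an official API) is to pay the one-time cost outside the hot path with a throwaway lookup right after building or sorting the index:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(1000000), index=mi).sort_index()

# Throwaway lookup: triggers the index setup once, up front,
# so later accesses in the timing-critical code are all fast.
_ = s.loc[0]
```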
Lastly, I'm sorry, but I don't have easy access to the current dev version, so the problem might already be solved (although I'd be curious to know where it comes from).
Edit: could this be linked to #4198?