Description
Series.xs, when presented with a MultiIndex, should use the DataFrame logic (whereby it can use the levels to avoid having to scan the entire set). The xs logic needs to move to core/generic.py so both Series and DataFrame can use it.
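To make the intent concrete, here is a minimal sketch of what the shared logic could look like (the helper name xs_via_levels and its placement are hypothetical; the real refactor would live in core/generic.py). The key point is that MultiIndex.get_loc_level resolves the key through the levels and returns both the positions and the residual index, with no full scan of the labels:

```python
import numpy as np
import pandas as pd

def xs_via_levels(obj, key, level=0):
    # Hypothetical shared cross-section helper: resolves `key` through
    # the MultiIndex levels (get_loc_level) instead of scanning every
    # label, and works for Series and DataFrame alike.
    loc, new_index = obj.index.get_loc_level(key, level=level)
    out = obj.iloc[loc]       # loc is a slice (sorted index) or boolean mask
    out.index = new_index     # the remaining levels after dropping `key`'s level
    return out

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(100) for y in range(100)])
s = pd.Series(np.random.randn(10000), index=mi)
print(xs_via_levels(s, 99).head())
```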
I stumbled on a weird issue.
The first time I access a Series location with ix, there is a huge overhead.
This is not the case if I convert the Series into a DataFrame and then access it.
I use a MultiIndex below because it makes the effect more visible, but the same problem is present for regular indexes.
In [1]: import pandas as pd; import numpy as np
In [2]: pd.__version__
Out[2]: '0.12.0'
In [3]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])
In [4]: s = pd.Series(np.random.randn(1000000), index=mi)
In [5]: %time s.ix[999]
CPU times: user 652 ms, sys: 4 ms, total: 656 ms
Wall time: 656 ms
Out[5]:
0 -0.271328
...
999 -0.832013
Length: 1000, dtype: float64
In [6]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 249 us
Out[6]:
0 -0.271328
...
999 -0.832013
Length: 1000, dtype: float64
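For reference, a self-contained version of the measurement (note: .ix was the accessor at the time; on a modern install, where .ix is gone, .loc shows the same first-access effect):

```python
import time
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(1000000), index=mi)

t0 = time.perf_counter()
s.loc[999]                    # first access: pays the one-time setup cost
print("first:  %.4fs" % (time.perf_counter() - t0))

t0 = time.perf_counter()
s.loc[999]                    # second access: hits the cached structures
print("second: %.4fs" % (time.perf_counter() - t0))
```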
The behavior seems tied to the index object: recreating the Series (reusing the same mi) does not reproduce the overhead, but sorting the index does:
In [7]: s = pd.Series(np.random.randn(1000000), index=mi)
In [8]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 247 us
Out[8]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [9]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 202 us
Out[9]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [10]: s = s.sort_index()
In [11]: %time s.ix[999]
CPU times: user 856 ms, sys: 32 ms, total: 888 ms
Wall time: 890 ms
Out[11]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
In [12]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 213 us
Out[12]:
0 -1.207499
...
999 -0.370578
Length: 1000, dtype: float64
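My guess (an assumption, not verified against the 0.12 source) is that the first access pays a one-time cost to build the index's lookup machinery (hash tables / sortedness metadata), which is then cached on the index object. That would fit the pattern above: sort_index returns a Series with a brand-new index object, so whatever lazy structures the old index had built are discarded, and the one-time cost is paid again on the next lookup. A quick identity check along those lines:

```python
import numpy as np
import pandas as pd

# Deliberately unsorted so sort_index really builds a new index.
mi = pd.MultiIndex.from_tuples([(x, y) for x in range(9, -1, -1) for y in range(10)])
s = pd.Series(np.random.randn(100), index=mi)
s.loc[5]                      # warms the lookup caches on this index object
s2 = s.sort_index()
print(s2.index is s.index)    # False: a new index object, cold caches again
```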
And now the weird part: I convert the Series into a DataFrame:
In [16]: s = s.sort_index()
In [17]: %time pd.DataFrame(s).ix[999]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.78 ms
Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0 1000 non-null values
dtypes: float64(1)
In [18]: %time pd.DataFrame(s).ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 397 us
Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0 1000 non-null values
dtypes: float64(1)
There is still a little overhead, but nothing compared to the previous case.
It might seem an innocuous problem, but for large time series the lag grows to the order of a second and can eat up a lot of performance.
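Until the Series path reuses the DataFrame logic, a practical workaround (my suggestion, not an official API) is to pay the one-time cost outside the hot path with a throwaway lookup right after building or sorting the index:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(1000000), index=mi).sort_index()

# Throwaway lookup: triggers the index setup once, up front,
# so later accesses in the timing-critical code are all fast.
_ = s.loc[0]
```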
Lastly, I'm sorry, but I don't have easy access to the current dev version, so the problem might already be solved (although I'd be curious to know where it comes from).
Edit: could this be linked to #4198?