Found when tracking down what was going on with this question about performance.
First the case that makes sense:
>>> s = pd.Series(range(10**3), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>>
>>> orig_sum_type = (s+s).dtype.type
>>> orig_sum_type
<type 'numpy.int32'>
>>> orig_sum_type in pd.lib._TYPE_MAP
True
Now let's increase the length of the series.
>>> s = pd.Series(range(10**5), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>>
>>> new_sum_type = (s+s).dtype.type
>>> new_sum_type
<type 'numpy.int32'>
>>> new_sum_type in pd.lib._TYPE_MAP
False
.. wait, what?
>>> orig_sum_type, new_sum_type
(<type 'numpy.int32'>, <type 'numpy.int32'>)
>>> orig_sum_type == new_sum_type
False
>>> orig_sum_type is new_sum_type
False
>>> np.int32 is orig_sum_type
True
>>> np.int32 is new_sum_type
False
We've now got a new numpy.int32 type floating around, not equal to the one in numpy. The crossover seems to be at 10k:
>>> def find_first():
...     for i in range(1, 10**5):
...         s = pd.Series(range(i), dtype=np.int32)
...         if (s+s).dtype.type not in pd.lib._TYPE_MAP:
...             return i
...
>>> find_first()
10001
ISTM that because this dtype's type isn't recognized as being in _TYPE_MAP, infer_dtype can't take its early exit upon recognizing an integer dtype, and that slows things down considerably.
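
For what it's worth, here is a minimal sketch of why a lookup keyed on type objects would miss in this situation. It uses no pandas or numpy internals: Int32A, Int32B, TYPE_MAP and fast_path are hypothetical stand-ins, and I'm only assuming that _TYPE_MAP behaves like an ordinary dict keyed by type object. Two distinct classes can share a name and print identically, but they hash and compare by identity, so only one of them hits the table.

```python
# Hypothetical stand-ins: two distinct classes that share the same name,
# mimicking two type objects that both print as numpy.int32.
Int32A = type("int32", (int,), {})
Int32B = type("int32", (int,), {})

# A toy lookup table keyed by type object (assumed to resemble _TYPE_MAP).
TYPE_MAP = {Int32A: "integer"}


def fast_path(tp):
    """Return the mapped kind if tp is a key, else None (i.e. fall to the slow path)."""
    return TYPE_MAP.get(tp)


same_name = Int32A.__name__ == Int32B.__name__  # the names are identical
hit = fast_path(Int32A)    # the registered type object is found
miss = fast_path(Int32B)   # the look-alike is not, despite the same name
print(same_name, hit, miss)
```

If that is roughly what is happening, a check based on a property of the dtype rather than on the identity of its type object would presumably not be affected, but that part is a guess.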