strange dtype behaviour as function of series length #7332

Closed
@dsm054

Found when tracking down what was going on with this question about performance.

First the case that makes sense:

>>> s = pd.Series(range(10**3), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> orig_sum_type = (s+s).dtype.type
>>> orig_sum_type
<type 'numpy.int32'>
>>> orig_sum_type in pd.lib._TYPE_MAP
True

Now let's increase the length of the series.

>>> s = pd.Series(range(10**5), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> new_sum_type = (s+s).dtype.type
>>> new_sum_type
<type 'numpy.int32'>
>>> new_sum_type in pd.lib._TYPE_MAP
False

.. wait, what?

>>> orig_sum_type, new_sum_type
(<type 'numpy.int32'>, <type 'numpy.int32'>)
>>> orig_sum_type == new_sum_type
False
>>> orig_sum_type is new_sum_type
False
>>> np.int32 is orig_sum_type
True
>>> np.int32 is new_sum_type
False

We've now got a second numpy.int32 type object floating around, one that is neither identical nor equal to the one numpy exposes as np.int32. The crossover seems to be at 10k elements:

>>> def find_first():
...     for i in range(1, 10**5):
...         s = pd.Series(range(i), dtype=np.int32)
...         if (s+s).dtype.type not in pd.lib._TYPE_MAP:
...             return i
...
>>> find_first()
10001
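
To illustrate why the membership test fails even though both types print identically, here is a minimal sketch in plain Python. It does not reproduce the pandas/numpy bug itself; `make_type` and `type_map` are hypothetical stand-ins for "whatever created the second int32 type" and for a type-keyed lookup table such as `_TYPE_MAP`.

```python
# Two distinct class objects can share a name, yet a dict keyed by one of
# them will not contain the other: classes hash and compare by identity.

def make_type():
    # Hypothetical stand-in for code that creates a fresh "int32" type object.
    return type("int32", (int,), {})

first = make_type()
second = make_type()

# Hypothetical stand-in for a lookup table keyed by type objects.
type_map = {first: "integer"}

print(first.__name__ == second.__name__)  # the names match
print(first is second)                    # but the objects differ
print(first in type_map)                  # the original type is found
print(second in type_map)                 # the look-alike is not
```

This is the same pattern as `orig_sum_type` versus `new_sum_type` above: an identical repr says nothing about object identity, and the dict lookup only cares about the latter.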

ISTM that because the dtype's type isn't found in _TYPE_MAP, infer_dtype can't take its early exit for a recognized integer dtype and falls through to the slow path, which slows things down considerably.
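
For comparison, an early exit does not have to depend on type-object identity at all. The sketch below is not pandas' actual `infer_dtype`; `quick_infer` is a hypothetical helper that branches on the dtype's one-character `kind` code instead of looking the scalar type up in a dict. Whether `kind` is still reported correctly for the stray type in this issue is an assumption I haven't checked.

```python
import numpy as np

def quick_infer(values):
    """Hypothetical early-exit check keyed on dtype.kind, not on type identity."""
    arr = np.asarray(values)
    kind = arr.dtype.kind
    if kind in "iu":      # signed or unsigned integer
        return "integer"
    if kind == "f":       # floating point
        return "floating"
    if kind == "b":       # boolean
        return "boolean"
    return None           # caller would fall back to scanning the elements

a = np.arange(10**5, dtype=np.int32)
print(quick_infer(a + a))
```

The trade-off is that `kind` is coarser than a per-type map, so anything that needs to distinguish, say, int32 from int64 would still have to compare dtypes rather than kinds.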

Labels: Bug, Performance (memory or execution speed performance)
