PERF: NaT groups cause wrong path in grouping #11010

Closed
@jreback

xref #10625

before this patch:

In [1]: from string import ascii_lowercase
   ...: import numpy as np
   ...: import pandas as pd
   ...: from pandas import DataFrame, date_range
In [2]: np.random.seed(2718281)
In [3]: n = 1 << 21
In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')
In [5]: df = DataFrame({
   ...:         '1st':np.random.choice(list(ascii_lowercase), n),
   ...:         '2nd':np.random.randint(0, 5, n),
   ...:         '3rd':np.random.choice(dr, n)})

In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan
In [7]: gr = df.groupby(['1st', '2nd'])

In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 13.3 ms per loop

In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch (gr.count() is roughly 10x slower):

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop

In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'
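
To rerun the comparison outside IPython, a minimal standalone sketch along these lines may help. `make_frame` and `time_count` are hypothetical helper names, not pandas APIs. The frame is deliberately much smaller than the one above so the script finishes quickly, and `freq="min"` is used as the newer spelling of `'T'`, so absolute timings will not match the figures reported here. Only the relative difference between two pandas builds is meaningful.

```python
import timeit
from string import ascii_lowercase

import numpy as np
import pandas as pd


def make_frame(n, seed=2718281):
    """Build a frame shaped like the one in the report (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    dr = pd.date_range("2015-08-30", periods=max(n // 10, 1), freq="min")
    df = pd.DataFrame({
        "1st": rng.choice(list(ascii_lowercase), n),
        "2nd": rng.randint(0, 5, n),
        "3rd": rng.choice(dr.values, n),
    })
    # Blank out roughly a tenth of the datetimes; these rows become NaT.
    df.loc[rng.choice(n, n // 10), "3rd"] = pd.NaT
    return df


def time_count(df, repeat=3, number=5):
    """Return the best observed time in seconds for one groupby count()."""
    gr = df.groupby(["1st", "2nd"])
    best = min(timeit.repeat(gr.count, repeat=repeat, number=number))
    return best / number


# Assumption: a small n keeps the run short; raise it to approach 1 << 21.
df = make_frame(1 << 14)
print("pandas %s: %.3f ms per count()" % (pd.__version__, 1e3 * time_count(df)))
```

Running the same script under each build and comparing the printed figures should show whether the NaT column sends `count()` down the slower path.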

Labels: Datetime, Groupby, Missing-data, Performance