Performance of pandas.algos.groupby_int64 #14293

Closed
@mrocklin

For dask.dataframe shuffle operations (groupby.apply, merge) running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code, it appears to build dynamic, pure-Python objects from within Cython. I'm curious whether there are ways to accelerate this function, particularly in multi-threaded situations (by releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each group's members list in the result, and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.
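
To make the idea concrete, here is a rough sketch of the count-then-fill approach in plain Python/NumPy. It is not the pandas implementation: the function name `group_members_preallocated` is hypothetical, and it assumes the labels have already been converted to dense integer codes in `0..ngroups-1`. In Cython, the two loops could operate on typed buffers only, which is what would make a `nogil` section plausible.

```python
import numpy as np


def group_members_preallocated(index, codes, ngroups):
    """Group `index` values by integer `codes` using pre-allocated arrays.

    Hypothetical helper: assumes `codes` are dense group codes in
    0..ngroups-1 (an assumption, not what groupby_int64 receives today).
    """
    index = np.asarray(index, dtype=np.int64)
    codes = np.asarray(codes, dtype=np.int64)

    # Pass 1: count how many members each group has.
    counts = np.bincount(codes, minlength=ngroups)

    # Turn counts into start offsets into one flat, pre-allocated buffer.
    starts = np.zeros(ngroups + 1, dtype=np.int64)
    starts[1:] = np.cumsum(counts)
    members = np.empty(len(index), dtype=np.int64)

    # Pass 2: write each index value into its group's slot.
    # Only integer arrays are touched here, no Python lists or dicts.
    cursor = starts[:-1].copy()
    for i in range(len(index)):
        g = codes[i]
        members[cursor[g]] = index[i]
        cursor[g] += 1

    # Build the Python-level result once, at the end, as array views.
    return {g: members[starts[g]:starts[g + 1]] for g in range(ngroups)}


result = group_members_preallocated([10, 11, 12, 13, 14], [1, 0, 1, 2, 0], 3)
```

The only Python objects created are the final dict and the array views, so the per-element work stays in fixed-size integer buffers.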

Thoughts?


Labels: Groupby, Performance