PERF: groupby with many empty groups memory blowup #30552

Closed
@jbrockmendel

Description


Suppose we have a Categorical with many unused categories:

import pandas as pd

cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})
gb = df.groupby(["A", "B", "C"])

>>> gb.size()  # memory balloons to 9+ GB before I kill it

There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups. With the default observed=False, the grouper materializes the full cartesian product of A's 10**5 categories with the 24 unique values each of B and C: roughly 57.6 million potential groups for 24 rows of data.
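One way to sidestep the blowup, assuming the caller only wants groups that actually occur in the data, is to pass observed=True, which pandas supports for categorical groupers and which restricts the result to observed category combinations. A minimal sketch:

import pandas as pd

cat = pd.Categorical(range(24), categories=range(10**5))
df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

# Only the 24 observed (A, B, C) combinations are materialized,
# not the ~5.8e7 implied by the unused categories.
sizes = df.groupby(["A", "B", "C"], observed=True).size()
print(len(sizes))  # 24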

Without a Categorical, grouping on a large cross-product that implies just as many empty groups works fine:

df = pd.DataFrame({n: range(12) for n in range(8)})
gb = df.groupby(list(range(7)))
gb.size()  # <-- works fine
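For contrast, a quick check (a sketch re-running the example above) confirms that plain, non-categorical keys only produce observed combinations:

import pandas as pd

df = pd.DataFrame({n: range(12) for n in range(8)})
gb = df.groupby(list(range(7)))

# Non-categorical keys only materialize combinations present in the data:
# 12 observed groups, not the 12**7 (~3.6e7) the cross-product implies.
print(len(gb.size()))  # 12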

Metadata

Labels: Categorical (Categorical Data Type), Groupby, Performance (Memory or execution speed performance)
