Result of groupby -> select -> nlargest duplicates index levels #17477

Open
@JamesOwers

To reproduce:

import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index2 = pd.MultiIndex.from_tuples(tuples, names=['1st', '2nd'])
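# Only the first six labels are used, giving a 6x6 frame of random values.
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index2[:6])
# Move the '1st' column level into the row index, then take the largest
# values of column 'one' within each ('1st', 'first') group.
df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3)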

Output:

1st  first  first  second  1st
bar  bar    bar    two     bar    1.985984
                   one     bar   -0.175834
     baz    baz    two     bar    1.089335
                   one     bar   -1.447996
     foo    foo    one     bar    1.154403
                   two     bar   -0.112992
baz  bar    bar    one     baz    0.528254
                   two     baz   -2.052733
     baz    baz    one     baz   -0.447332
                   two     baz   -0.818858
     foo    foo    one     baz   -0.939259
                   two     baz   -1.282821
foo  bar    bar    two     foo    2.075440
                   one     foo   -0.423804
     baz    baz    one     foo    0.250846
                   two     foo    0.177904
     foo    foo    two     foo   -0.468794
                   one     foo   -0.627487
Name: one, dtype: float64

The expected result can be obtained by dropping the two prepended group levels from the index:

df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3).\
    reset_index(level=[0, 1], drop=True)
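
The workaround above hard-codes the positions of the two prepended levels. A minimal sketch of a more general version follows, assuming the extra levels always sit at the front of the index and that `Series.droplevel` is available (it is not in 0.20.3; it was added in a later pandas release). The helper name `drop_prepended_levels`, the seed, and the use of `from_arrays` are illustrative choices, not part of the original report.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the report, but seeded so runs are repeatable.
rng = np.random.RandomState(0)
labels = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(labels, names=['first', 'second'])
columns = pd.MultiIndex.from_arrays(labels, names=['1st', '2nd'])
df = pd.DataFrame(rng.randn(6, 6), index=index, columns=columns)

stacked = df.stack('1st')


def drop_prepended_levels(result, original):
    """Drop leading index levels that groupby added to `result`.

    Hypothetical helper: assumes any levels beyond those of `original`
    were prepended at the front of the index.
    """
    extra = result.index.nlevels - original.index.nlevels
    if extra <= 0:
        return result
    return result.droplevel(list(range(extra)))


top = stacked.groupby(level=['1st', 'first'])['one'].nlargest(3)
cleaned = drop_prepended_levels(top, stacked)

print(list(top.index.names))      # index names as returned by nlargest
print(list(cleaned.index.names))  # index names after dropping the extras
```

Counting levels rather than naming them avoids ambiguity when level names are duplicated, which is exactly the situation this issue describes.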

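This is a problem because the resulting index contains redundant levels: the group keys '1st' and 'first' are prepended even though both are already levels of the original index. The behaviour is also inconsistent with grouping by all of the index levels, where no levels are duplicated: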

df.stack('1st').\
    groupby(level=['1st', 'first', 'second'])['one'].\
    nlargest(3)

Output:

first  second  1st
bar    one     bar   -0.175834
               baz    0.528254
               foo   -0.423804
       two     bar    1.985984
               baz   -2.052733
               foo    2.075440
baz    one     bar   -1.447996
               baz   -0.447332
               foo    0.250846
       two     bar    1.089335
               baz   -0.818858
               foo    0.177904
foo    one     bar    1.154403
               baz   -0.939259
               foo   -0.627487
       two     bar   -0.112992
               baz   -1.282821
               foo   -0.468794
Name: one, dtype: float64
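
To compare the two cases without reading the printed frames by eye, the index level names of each result can be inspected directly. This is a small sketch under the same setup as the report, with an arbitrary seed added for repeatability; the variable names `partial` and `full` are illustrative. Per the outputs shown in this report, pandas 0.20.3 gives five level names for the partial grouping and three for the full grouping; other versions may behave differently, so no output is asserted here.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seed chosen arbitrarily
labels = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(labels, names=['first', 'second'])
columns = pd.MultiIndex.from_arrays(labels, names=['1st', '2nd'])
df = pd.DataFrame(rng.randn(6, 6), index=index, columns=columns)
stacked = df.stack('1st')

# Group by two of the three index levels (the case that duplicates levels).
partial = stacked.groupby(level=['1st', 'first'])['one'].nlargest(3)
# Group by all three index levels (the case reported as behaving as expected).
full = stacked.groupby(level=['1st', 'first', 'second'])['one'].nlargest(3)

for label, result in [('partial', partial), ('full', full)]:
    print(label, list(result.index.names), result.index.nlevels)
```

Both results hold the same 18 values, since no group has more than three rows; only the shape of the index differs.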

Pandas version 0.20.3

Labels: Bug, Groupby, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
