inconsistent name handling in value_counts, part 2 #11579

Closed
@corr724

I was happy to see in the release notes for 0.17.0 that value_counts no longer discards the series name, but the implementation wasn't what I expected.

0.17.0 gives

>>> import pandas as pd
>>> series = pd.Series([1731, 364, 813, 1731], name='user_id')
>>> series.value_counts()
1731    2
813     1
364     1
Name: user_id, dtype: int64

which keeps the series name as the name of the result but leaves the index name unset.

In my opinion the old series name belongs in the index, not in the series name:

>>> series.value_counts()
user_id
1731    2
813     1
364     1
dtype: int64

Why:

  • It's logical: the user_id has moved to the index, and the values now represent occurrence counts
  • This would be consistent with how .groupby().size() behaves
  • Adding a missing index name is cumbersome and requires creating a temporary variable
  • Series names are often discarded by later operations, while index names tend to stick around: for example, pd.DataFrame({'n': series.value_counts(), 'has_duplicates': series.value_counts() > 1}) should really have user_id as its index name
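To make the second and third points concrete, here is a minimal sketch comparing `.groupby().size()` with the workaround of naming the index by hand. It assumes only the `series` from the example above; the variable names `by_group` and `counts` are illustrative. The explicit assignment is used so the result does not depend on what `value_counts` itself does with the names in a given pandas version.

```python
import pandas as pd

series = pd.Series([1731, 364, 813, 1731], name='user_id')

# groupby().size() labels the resulting index with the name of the grouping key.
by_group = series.groupby(series).size()
print(by_group.index.name)

# Workaround for value_counts: a temporary variable is needed so that
# the index name can be assigned after the fact.
counts = series.value_counts()
counts.index.name = series.name
print(counts.index.name)
```

Both `print` calls should show the same index name, which is the behavior this issue asks `value_counts` to provide directly.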

There are three options:

  • result.name = None and result.index.name = series.name
  • result.name = series.name and result.index.name = series.name
  • result.name = 'size' or 'count' and result.index.name = series.name
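The three options can be sketched with a small wrapper. `value_counts_named` and its `option` parameter are hypothetical names used only for illustration and are not part of pandas; the label `'count'` for the third option is one of the two candidates mentioned above.

```python
import pandas as pd


def value_counts_named(series, option):
    """Hypothetical helper (not a pandas API) that applies one of the three naming options."""
    result = series.value_counts()
    # All three options move the original series name onto the index.
    result.index.name = series.name
    if option == 1:
        result.name = None            # option 1: the result itself is unnamed
    elif option == 2:
        result.name = series.name     # option 2: keep the series name as well
    elif option == 3:
        result.name = 'count'         # option 3: a fixed label ('size' would also fit)
    else:
        raise ValueError("option must be 1, 2 or 3")
    return result


series = pd.Series([1731, 364, 813, 1731], name='user_id')
for option in (1, 2, 3):
    result = value_counts_named(series, option)
    print(option, repr(result.name), repr(result.index.name))
```

The names are set explicitly in every branch, so the sketch behaves the same regardless of which names `value_counts` assigns by default.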

The first option seems the most elegant to me, but @sinhrks, who reported #10150, apparently expected result.name to be filled, so there may be use cases where the second option is useful.

Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug
