Updated value_counts documentation and implementation and added single label subset test #50955

Merged 14 commits on Feb 20, 2023
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -1088,6 +1088,7 @@ Removal of prior version deprecations/changes
- Arguments after ``expr`` in :meth:`DataFrame.eval` and :meth:`DataFrame.query` are keyword-only (:issue:`47587`)
- Removed :meth:`Index._get_attributes_dict` (:issue:`50648`)
- Removed :meth:`Series.__array_wrap__` (:issue:`50648`)
+- Changed behavior of :meth:`.DataFrame.value_counts` to return a :class:`Series` with :class:`MultiIndex` for any list-like (one element or not) but an :class:`Index` for a single label (:issue:`50829`)

.. ---------------------------------------------------------------------------
.. _whatsnew_200.performance:
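
The whatsnew entry in this hunk distinguishes a single label from a list-like `subset`. A minimal sketch of that difference, using illustrative data modeled on the docstring example (the variable names `by_label` and `by_list` are not from the PR):

```python
import pandas as pd

# Illustrative data, modeled on the docstring example in this PR.
df = pd.DataFrame(
    {
        "first_name": ["John", "Anne", "John", "Beth"],
        "middle_name": ["Smith", None, None, "Louise"],
    }
)

# A single label gives a Series indexed by a plain (non-multi) Index.
by_label = df.value_counts("first_name")

# A one-element list still gives a MultiIndex, here with a single level.
by_list = df.value_counts(["first_name"])

print(type(by_label.index).__name__)
print(type(by_list.index).__name__, by_list.index.nlevels)
```

Code that needs a flat index regardless of how `subset` was passed can check `isinstance(result.index, pd.MultiIndex)` before indexing into the result.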
18 changes: 13 additions & 5 deletions pandas/core/frame.py
@@ -6957,7 +6957,7 @@ def value_counts(

Parameters
----------
-subset : list-like, optional
+subset : label or list of labels, optional
Columns to use when counting unique combinations.
normalize : bool, default False
Return proportions rather than frequencies.
@@ -6981,9 +6981,10 @@ def value_counts(
Notes
-----
The returned Series will have a MultiIndex with one level per input
-column. By default, rows that contain any NA values are omitted from
-the result. By default, the resulting Series will be in descending
-order so that the first element is the most frequently-occurring row.
+column but an Index (non-multi) for a single label. By default, rows
+that contain any NA values are omitted from the result. By default,
+the resulting Series will be in descending order so that the first
+element is the most frequently-occurring row.

Examples
--------
@@ -7049,6 +7050,13 @@ def value_counts(
John Smith 1
NaN 1
Name: count, dtype: int64

+>>> df.value_counts("first_name")
+first_name
+John    2
+Anne    1
+Beth    1
+Name: count, dtype: int64
"""
if subset is None:
subset = self.columns.tolist()
@@ -7063,7 +7071,7 @@ def value_counts(
counts /= counts.sum()

# Force MultiIndex for single column
-if len(subset) == 1:
+if is_list_like(subset) and len(subset) == 1:
Member commented:

@mroeschke - this will make value_counts return a MultiIndex for any list-like (one element or not) but an Index (non-multi) for a single label. Wanted a 2nd opinion to make sure that is desirable behavior.

Member replied:

I think that's reasonable behavior. Might be good to document that return difference

counts.index = MultiIndex.from_arrays(
[counts.index], names=[counts.index.name]
)
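
The guard and the `MultiIndex.from_arrays` call above can be sketched in isolation. This is an approximation, not the code in `frame.py`: `wrap_single_column_index` is a hypothetical helper, and it uses the public `pandas.api.types.is_list_like` on the assumption that it behaves like the internal import.

```python
import pandas as pd
from pandas.api.types import is_list_like


def wrap_single_column_index(counts, subset):
    """Hypothetical helper mirroring the guard shown in the diff."""
    # Strings and scalars are not list-like, so len() is never called on them.
    if is_list_like(subset) and len(subset) == 1:
        counts = counts.copy()
        # Promote the flat Index to a one-level MultiIndex, keeping its name.
        counts.index = pd.MultiIndex.from_arrays(
            [counts.index], names=[counts.index.name]
        )
    return counts


counts = pd.Series(
    [2, 1, 1],
    index=pd.Index(["John", "Anne", "Beth"], name="first_name"),
    name="count",
)

from_label = wrap_single_column_index(counts, "first_name")  # stays flat
from_list = wrap_single_column_index(counts, ["first_name"])  # one-level MultiIndex
from_int = wrap_single_column_index(counts, 0)  # int label: guard short-circuits
```

Because `is_list_like` is evaluated first, `len(subset)` is only reached for list-like inputs, so a bare integer label no longer hits `len()` at all.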
20 changes: 20 additions & 0 deletions pandas/tests/frame/methods/test_value_counts.py
@@ -1,4 +1,5 @@
import numpy as np
+import pytest

import pandas as pd
import pandas._testing as tm
@@ -155,3 +156,22 @@ def test_data_frame_value_counts_dropna_false(nulls_fixture):
)

tm.assert_series_equal(result, expected)


+@pytest.mark.parametrize("columns", (["first_name", "middle_name"], [0, 1]))
+def test_data_frame_value_counts_subset(nulls_fixture, columns):
+    # GH 50829
+    df = pd.DataFrame(
+        {
+            columns[0]: ["John", "Anne", "John", "Beth"],
+            columns[1]: ["Smith", nulls_fixture, nulls_fixture, "Louise"],
+        },
+    )
+    result = df.value_counts(columns[0])
+    expected = pd.Series(
+        data=[2, 1, 1],
+        index=pd.Index(["John", "Anne", "Beth"], name=columns[0]),
+        name="count",
+    )
+
+    tm.assert_series_equal(result, expected)
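
The `[0, 1]` parametrization covers integer column labels. Python's `len()` raises `TypeError` on an int, which is presumably the case the `is_list_like` guard protects. A standalone sketch of that case outside pytest, assuming pandas 2.0 or later (where this change is available) and using `None` in place of `nulls_fixture`:

```python
import pandas as pd

# Integer column labels, as in the second parametrization of the test.
df = pd.DataFrame(
    {
        0: ["John", "Anne", "John", "Beth"],
        1: ["Smith", None, None, "Louise"],
    }
)

# Passing the bare integer label counts values in column 0 only.
result = df.value_counts(0)

print(result)
print(type(result.index).__name__, result.index.name)
```

Only column `0` is considered here, so the missing values in column `1` do not cause any rows to be dropped.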