Implement DataFrame.value_counts #27350

Closed
wants to merge 20 commits into from
Changes from 9 commits
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -46,7 +46,7 @@ Other API changes
^^^^^^^^^^^^^^^^^

- :meth:`pandas.api.types.infer_dtype` will now return "integer-na" for integer and ``np.nan`` mix (:issue:`27283`)
-
- Added :meth:`pandas.core.frame.DataFrame.value_counts` (:issue:`5377`).
-

.. _whatsnew_1000.deprecations:
46 changes: 46 additions & 0 deletions pandas/core/frame.py
@@ -8394,6 +8394,52 @@ def isin(self, values):
self.columns,
)

    def value_counts(self):
        """
        The number of times each unique row appears in the DataFrame.

        Rows that contain any NaN value are omitted from the results.
Contributor

Please add a versionadded tag.

Author

What version should I enter for the tag?

Member

1.0.0

Author

Done.


        Returns
        -------
        counts : Series

        See Also
        --------
        Series.value_counts: Equivalent method on Series.
Contributor

Series.value_counts has more options (dropna, for example); these need to be implemented here as well.

Author

There's no option in groupby to keep rows containing a NaN. How do I go about implementing that case?

Member

I would be OK with raising a NotImplementedError for that case

Author

Added. This changed the method pretty significantly. PTAL.

Author

The single-column case now works, but the code raises NotImplementedError for the multi-column case.
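
The behavior described in this thread could look roughly like the sketch below. It is not the code from this PR: `frame_value_counts` is a hypothetical free function, and the choice to delegate the single-column case to `Series.value_counts` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def frame_value_counts(df, dropna=True):
    """Hypothetical helper illustrating the thread; not the PR's code."""
    if len(df.columns) == 1:
        # Single column: Series.value_counts already supports dropna.
        # sort_index() orders the result by value instead of by count.
        return df.iloc[:, 0].value_counts(dropna=dropna).sort_index()
    if not dropna:
        # Multi-column frames cannot keep NaN rows in this sketch.
        raise NotImplementedError(
            "dropna=False is not supported for multi-column frames"
        )
    # Default path: group on every column; rows with NaN keys are dropped.
    return df.groupby(df.columns.tolist()).size()


df = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [1, 1, 2]})
print(frame_value_counts(df[["a"]], dropna=False))
print(frame_value_counts(df))
```

Calling `frame_value_counts(df, dropna=False)` on the two-column frame raises `NotImplementedError`, which matches the compromise the Member suggested.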


        Examples
        --------

        >>> df = pd.DataFrame({'num_legs': [2, 4, 4], 'num_wings': [2, 0, 0]},
        ...                   index=['falcon', 'dog', 'cat'])
        >>> df
                num_legs  num_wings
        falcon         2          2
        dog            4          0
        cat            4          0

        >>> df.value_counts()
        num_legs  num_wings
        2         2            1
        4         0            2
        dtype: int64

        >>> df1col = df[['num_legs']]
Contributor

Is the 2nd example showing how this works for a Series?

Author

>>> type(df[['num_legs']]) 
pandas.core.frame.DataFrame
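
The distinction the Author is pointing at can be checked with a small standalone snippet. The variable names here are illustrative only: indexing with a list of labels keeps the result a DataFrame, while indexing with a single label returns a Series.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})

one_col_frame = df[["num_legs"]]  # list of labels -> DataFrame with one column
one_col_series = df["num_legs"]   # single label -> Series

print(type(one_col_frame).__name__)   # DataFrame
print(type(one_col_series).__name__)  # Series
```

So the second docstring example exercises the DataFrame method on a one-column frame rather than `Series.value_counts`.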

        >>> df1col
                num_legs
        falcon         2
        dog            4
        cat            4

        >>> df1col.value_counts()
        num_legs
        2    1
        4    2
        dtype: int64
        """
        return self.groupby(self.columns.tolist()).size()

    # ----------------------------------------------------------------------
    # Add plotting methods to DataFrame
    plot = CachedAccessor("plot", pandas.plotting.PlotAccessor)
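
As a standalone illustration of the one-line implementation in this hunk, the sketch below applies the same `groupby(...).size()` call to a frame that contains a NaN. The extra `unknown` row is an assumption added to show the NaN behavior the docstring describes; it is not part of the PR.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4, np.nan], "num_wings": [2, 0, 0, 1]},
    index=["falcon", "dog", "cat", "unknown"],
)

# Group on every column and count the rows in each group.
# groupby drops group keys that contain NaN by default, so the
# "unknown" row does not appear in the result.
counts = df.groupby(df.columns.tolist()).size()
print(counts)
```

The result is a Series indexed by a MultiIndex with one level per column, so three of the four rows are counted across two distinct groups.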
29 changes: 29 additions & 0 deletions pandas/tests/frame/test_analytics.py
@@ -2766,3 +2766,32 @@ def test_multiindex_column_lookup(self):
        result = df.nlargest(3, ("x", "b"))
        expected = df.iloc[[3, 2, 1]]
        tm.assert_frame_equal(result, expected)

Contributor

Go ahead and make a new file, test_value_counts.py, and put it at pandas/tests/frame/analytics/test_value_counts.py (we will split / move analytics later).

    def test_data_frame_value_counts(self):
Contributor

Can you split up this test? Roughly one test per "thing" you're testing (single column, raising for unsupported keyword, etc.)

Author

Done.

        # Multi column data frame.
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
            index=["falcon", "dog", "cat"],
        )
        actual = df.value_counts()
Contributor

Use `result` rather than `actual`.

Author

Done

        expected = pd.Series(
            data=[1, 2],
            index=pd.MultiIndex.from_arrays(
                [(2, 4), (2, 0)], names=["num_legs", "num_wings"]
            ),
        )
        tm.assert_series_equal(actual, expected)

        # Single column data frame.
        df_single_col = df[["num_legs"]]
        actual = df_single_col.value_counts()
        expected = pd.Series(
            data=[1, 2], index=pd.Int64Index(data=[2, 4], name="num_legs")
        )
        tm.assert_series_equal(actual, expected)

        # Empty data frame.
        df_no_cols = pd.DataFrame()
        actual = df_no_cols.value_counts()
        expected = pd.Series([], dtype=np.int64)
        tm.assert_series_equal(actual, expected)
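
One way the review requests on this test (one test per case, `result` instead of `actual`) might be applied is sketched below. This is an illustration, not the PR's follow-up commit. The `value_counts` function is a hypothetical stand-in that repeats the one-line implementation under review so the snippet runs on its own, `make_frame` is a hypothetical helper, and `pd.Index` is used in place of `pd.Int64Index`, which newer pandas releases no longer provide. The empty-frame case is left out of the sketch.

```python
import pandas as pd
import pandas.testing as tm


def value_counts(df):
    # Hypothetical stand-in for the method under review:
    # group on every column and count the rows per group.
    return df.groupby(df.columns.tolist()).size()


def make_frame():
    # Hypothetical helper building the frame used by the original test.
    return pd.DataFrame(
        {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
        index=["falcon", "dog", "cat"],
    )


def test_value_counts_multi_column():
    result = value_counts(make_frame())
    expected = pd.Series(
        data=[1, 2],
        index=pd.MultiIndex.from_arrays(
            [(2, 4), (2, 0)], names=["num_legs", "num_wings"]
        ),
    )
    tm.assert_series_equal(result, expected)


def test_value_counts_single_column():
    result = value_counts(make_frame()[["num_legs"]])
    expected = pd.Series(data=[1, 2], index=pd.Index([2, 4], name="num_legs"))
    tm.assert_series_equal(result, expected)
```

Keeping each case in its own function means a failure report names the case that broke, which is the point of the Contributor's request.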