ENH add cumcount groupby method #5510

Merged 2 commits on Nov 14, 2013
13 changes: 13 additions & 0 deletions doc/source/groupby.rst
@@ -705,3 +705,16 @@ can be used as group keys. If so, the order of the levels will be preserved:
    factor = qcut(data, [0, .25, .5, .75, 1.])

    data.groupby(factor).mean()
+
+Enumerate group items
+~~~~~~~~~~~~~~~~~~~~~
+
+To see the order in which each row appears within its group, use the
+``cumcount`` method:
+
+.. ipython:: python
+
+   df = pd.DataFrame(list('aaabba'), columns=['A'])
+   df
+
+   df.groupby('A').cumcount()
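As a usage sketch beyond the documentation example in this hunk: because the result of `cumcount` is aligned with the original index, it can be stored as a column and used to filter rows. The column name `pos` and the cutoff of two rows per group are illustrative assumptions, not part of the PR.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaabba")})

# Position of each row within its group, in original row order.
df["pos"] = df.groupby("A").cumcount()

# Keep at most the first two rows of every group (illustrative cutoff).
first_two = df[df["pos"] < 2]
```

Here `first_two` keeps the rows labelled 0, 1, 3 and 4: the first two `a` rows and both `b` rows.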
1 change: 1 addition & 0 deletions doc/source/release.rst
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ New features
- ``to_csv()`` now outputs datetime objects according to a specified format
string via the ``date_format`` keyword (:issue:`4313`)
- Added ``LastWeekOfMonth`` DateOffset (:issue:`4637`)
- Added ``cumcount`` groupby method (:issue:`4646`)
- Added ``FY5253``, and ``FY5253Quarter`` DateOffsets (:issue:`4511`)
- Added ``mode()`` method to ``Series`` and ``DataFrame`` to get the
statistical mode(s) of a column/series. (:issue:`5367`)
Expand Down
45 changes: 43 additions & 2 deletions pandas/core/groupby.py
@@ -468,6 +468,7 @@ def ohlc(self):
         Compute sum of values, excluding missing values

         For multiple groupings, the result index will be a MultiIndex
+
         """
         return self._cython_agg_general('ohlc')

@@ -480,9 +481,49 @@ def picker(arr):
                 return np.nan
         return self.agg(picker)

+    def cumcount(self):
+        '''
+        Number each item in each group from 0 to the length of that group - 1.
+
+        Essentially this is equivalent to
+
+        >>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
+
+        Example
+        -------
+
+        >>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
+        >>> df
+           A
+        0  a
+        1  a
+        2  a
+        3  b
+        4  b
+        5  a
+        >>> df.groupby('A').cumcount()
+        0    0
+        1    1
+        2    2
+        3    0
+        4    1
+        5    3
+        dtype: int64
+
+        '''
+        index = self.obj.index
+        cumcounts = np.zeros(len(index), dtype='int64')
+        for v in self.indices.values():
+            cumcounts[v] = np.arange(len(v), dtype='int64')
+        return Series(cumcounts, index)


     def _try_cast(self, result, obj):
-        """ try to cast the result to our obj original type,
-        we may have roundtripped thru object in the mean-time """
+        """
+        try to cast the result to our obj original type,
+        we may have roundtripped thru object in the mean-time
+
Contributor

mind removing this blank line if you're editing this?

Contributor Author

I added it in specifically, pep8 says it should be there, right?

Contributor

okay, no idea.

+        """
         if obj.ndim > 1:
             dtype = obj.values.dtype
         else:
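The loop in `cumcount` in this file relies on the `indices` attribute of a groupby object, a dict mapping each group key to the integer positions of its rows. A standalone sketch of the same idea, written against the public pandas API, might look like the following. `cumcount_sketch` and `cumcount_reference` are hypothetical helper names introduced here for illustration; the pure-Python reference is included only for cross-checking.

```python
import numpy as np
import pandas as pd


def cumcount_sketch(obj, keys):
    """Number rows within each group of obj.groupby(keys), starting at 0.

    Hypothetical helper mirroring the loop in the PR; not pandas API.
    """
    out = np.zeros(len(obj), dtype="int64")
    # .indices maps group key -> array of integer row positions, in row order.
    for positions in obj.groupby(keys).indices.values():
        out[positions] = np.arange(len(positions), dtype="int64")
    return pd.Series(out, index=obj.index)


def cumcount_reference(labels):
    """Plain-Python reference: count earlier occurrences of each label."""
    seen = {}
    result = []
    for label in labels:
        result.append(seen.get(label, 0))
        seen[label] = seen.get(label, 0) + 1
    return result


df = pd.DataFrame({"A": list("aaabba")})
sketch = cumcount_sketch(df, "A")
```

Because the counts are written by position rather than by label, this approach does not depend on the index being unique.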
53 changes: 52 additions & 1 deletion pandas/tests/test_groupby.py
@@ -2560,6 +2560,57 @@ def test_groupby_with_empty(self):
         grouped = series.groupby(grouper)
         assert next(iter(grouped), None) is None

+    def test_cumcount(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'])
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3])
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
Contributor

maybe test for what happens if you have empty DataFrame? grouped Series? cumcount on something that's not a column (i.e., passed into the object) and maybe one different dtype for good measure?

Contributor Author

yup, tests are incredibly light. sg is a grouped Series.

Will add empty, it does work.

Not sure what you mean by not a column.... :S

Contributor

df.groupby([1, 1, 3, 5, 6])

Contributor

not that that matters for your implementation, but might be good to have if we replace with something faster for some reason

Contributor Author

replaced with something faster, and added this test.
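The reviewer's point about grouping by something that is not a column can be illustrated with a key list passed directly to `groupby`. This sketch reuses the key list from the comment above with a five-row frame made up for illustration; the variable names are assumptions, not taken from the PR.

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaaba")})  # illustrative data, five rows
keys = [1, 1, 3, 5, 6]                   # key list from the review comment

# The keys are matched to rows by position; none of them names a column.
by_keys = df.groupby(keys).cumcount()

# The same numbering is produced for a grouped Series.
by_keys_series = df["A"].groupby(keys).cumcount()
```

Only the key `1` occurs twice, so the second row is numbered 1 and every other row is numbered 0.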


+    def test_cumcount_empty(self):
+        ge = DataFrame().groupby()
+        se = Series().groupby()
+
+        e = Series(dtype='int')  # edge case, as this is usually considered float
+
+        assert_series_equal(e, ge.cumcount())
+        assert_series_equal(e, se.cumcount())
+
+    def test_cumcount_dupe_index(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], index=[0] * 5)
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=[0] * 5)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_cumcount_mi(self):
+        mi = MultiIndex.from_tuples([[0, 1], [1, 2], [2, 2], [2, 2], [1, 0]])
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], index=mi)
+        g = df.groupby('A')
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=mi)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
+
+    def test_cumcount_groupby_not_col(self):
+        df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'], index=[0] * 5)
+        g = df.groupby([0, 0, 0, 1, 0])
+        sg = g.A
+
+        expected = Series([0, 1, 2, 0, 3], index=[0] * 5)
+
+        assert_series_equal(expected, g.cumcount())
+        assert_series_equal(expected, sg.cumcount())
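The index-related tests in this hunk check that `cumcount` numbers rows by position and returns a result carrying the caller's index unchanged, even when labels repeat or the index is a `MultiIndex`. A small sketch of that behaviour, using made-up data similar to the tests (variable names are illustrative):

```python
import pandas as pd

values = list("aaaba")

# Duplicate index labels: every row is labelled 0.
dup = pd.DataFrame({"A": values}, index=[0] * 5)
dup_counts = dup.groupby("A").cumcount()

# MultiIndex containing a repeated tuple.
mi = pd.MultiIndex.from_tuples([(0, 1), (1, 2), (2, 2), (2, 2), (1, 0)])
multi = pd.DataFrame({"A": values}, index=mi)
multi_counts = multi.groupby("A").cumcount()
```

In both cases the values are `[0, 1, 2, 0, 3]`, and the result's index matches the input's index label for label.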


     def test_filter_series(self):
         import pandas as pd
         s = pd.Series([1, 3, 20, 5, 22, 24, 7])
@@ -3180,7 +3231,7 @@ def test_tab_completion(self):
             'min','name','ngroups','nth','ohlc','plot', 'prod',
             'size','std','sum','transform','var', 'count', 'head', 'describe',
             'cummax', 'dtype', 'quantile', 'rank', 'cumprod', 'tail',
-            'resample', 'cummin', 'fillna', 'cumsum'])
+            'resample', 'cummin', 'fillna', 'cumsum', 'cumcount'])
         self.assertEqual(results, expected)

 def assert_fp_equal(a, b):