BUG: GroupBy.get_group doesn't work with TimeGrouper #6914

Merged 1 commit on Apr 28, 2014
1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -399,6 +399,7 @@ Bug Fixes
- Better error message when passing a frequency of 'MS' in ``Period`` construction (GH5332)
- Bug in `Series.__unicode__` when `max_rows` is `None` and the Series has more than 1000 rows. (:issue:`6863`)
- Bug in ``groupby.get_group`` where a datelike wasn't always accepted (:issue:`5267`)
- Bug in ``GroupBy.get_group`` raising ``AttributeError`` when the grouper was created by ``TimeGrouper`` (:issue:`6914`)
- Bug in ``DatetimeIndex.tz_localize`` and ``DatetimeIndex.tz_convert`` affecting NaT (:issue:`5546`)
- Bug in arithmetic operations affecting NaT (:issue:`6873`)
- Bug in ``Series.str.extract`` where the resulting ``Series`` from a single
12 changes: 12 additions & 0 deletions pandas/core/groupby.py
@@ -2,6 +2,7 @@
from functools import wraps
import numpy as np
import datetime
import collections

from pandas.compat import(
zip, builtins, range, long, lrange, lzip,
@@ -1556,6 +1557,17 @@ def apply(self, f, data, axis=0):

return result_keys, result_values, mutated

@cache_readonly
def indices(self):
indices = collections.defaultdict(list)

i = 0
for label, bin in zip(self.binlabels, self.bins):
if i < bin:
indices[label] = list(range(i, bin))
i = bin
Contributor:
Does this assume each group is contiguous / sorted?

I have a feeling there is a more efficient way to get this out, @jreback?

Contributor:

I think this is what _groupby_indices does (a cython routine). Also add an example with unsorted bins for testing.

Member Author:

I understand that binlabels and bins are always sorted before being passed to BinGrouper. Is that incorrect?

Contributor:

I think so. In any event, does _groupby_indices not work?

Member Author:

Below is the _groupby_indices result. It returns the correct indices even when the input is not sorted:

>>> import pandas.algos as _algos
>>> import pandas.core.common as com
>>> values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([0, 1, 2, 3, 4]), 'C': array([7]), 'B': array([5, 6])}

>>> values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([3, 4, 5, 6, 7]), 'C': array([2]), 'B': array([0, 1])}

But BinGrouper doesn't know the actual data index by itself, so I'm not sure which results are actually correct. Should it return the same indices regardless of bin order?
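For reference, what the cython routine computes can be sketched in pure Python — a minimal, illustrative equivalent (the function name and defaultdict approach here are ours, not the actual cython implementation):

```python
from collections import defaultdict

def groupby_indices(values):
    # Collect, for each distinct value, the positions where it occurs.
    # The result is independent of whether the input is sorted.
    indices = defaultdict(list)
    for i, v in enumerate(values):
        indices[v].append(i)
    return dict(indices)

# Unsorted input still yields the right positions per label.
print(groupby_indices(['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']))
# {'B': [0, 1], 'C': [2], 'A': [3, 4, 5, 6, 7]}
```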

Member Author:

Yes, they differ. BinGrouper.indices must be a dict whose keys are Timestamps. The logic cannot be replaced with _groupby_indices.

>>> b.indices
defaultdict(<type 'list'>, {Timestamp('2013-12-31 00:00:00', offset='M'): [6, 7], Timestamp('2013-10-31 00:00:00', offset='M'): [2, 3, 4, 5], Timestamp('2013-01-31 00:00:00', offset='M'): [0, 1]})

Contributor:

use _get_indices_dict; don't reinvent the wheel here

Member Author:

I think using _get_indices_dict is not easy, because BinGrouper doesn't hold information that can be passed to the function as-is.

>>> import pandas as pd
>>> import datetime
>>> from pandas.core.groupby import GroupBy, _get_indices_dict

>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
...                    'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
...                    'Quantity': [1,3,5,1,8,1,9,3],
...                    'Date' : [
...                     datetime.datetime(2013,1,1,13,0), datetime.datetime(2013,1,1,13,5),
...                     datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
...                     datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
...                     datetime.datetime(2013,12,2,12,0), datetime.datetime(2013,12,2,14,0),]})

>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.grouper.bins
[2 2 2 2 2 2 2 2 2 6 6 8]
>>> grouped.grouper.binlabels
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-31, ..., 2013-12-31]
Length: 12, Freq: M, Timezone: None

Thus, I have to convert it with logic similar to the current implementation.

>>> import numpy as np
>>> indices = []
>>> i = 0
>>> for j, bin in enumerate(grouped.grouper.bins):
...     if i < bin:
...         indices.extend([j] * (bin - i))
...         i = bin
>>> indices = np.array(indices)
>>> print(indices)
[ 0  0  9  9  9  9 11 11]

And _get_indices_dict returns keys as tuples, so further conversion is required.

>>> _get_indices_dict([indices], [grouped.grouper.binlabels])
{(numpy.datetime64('2013-10-31T09:00:00.000000000+0900'),): array([2, 3, 4, 5]), (numpy.datetime64('2013-12-31T09:00:00.000000000+0900'),): array([6, 7]), (numpy.datetime64('2013-01-31T09:00:00.000000000+0900'),): array([0, 1])}

Do you have better logic?
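The edge-to-codes conversion in the snippet above can be run standalone; a minimal sketch (the helper name is ours; bins are cumulative, exclusive end positions, as in BinGrouper):

```python
def bins_to_codes(bins):
    # Expand cumulative bin edges into one group code per row,
    # where the code is the position of the bin a row falls into.
    # Edges that do not advance correspond to empty bins.
    codes = []
    i = 0
    for j, edge in enumerate(bins):
        if i < edge:
            codes.extend([j] * (edge - i))
            i = edge
    return codes

# The bins from the dataframe above: 12 monthly edges over 8 rows.
print(bins_to_codes([2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6, 8]))
# [0, 0, 9, 9, 9, 9, 11, 11]
```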

Contributor:

Can we not do something like (?):

{g.grouper.levels[k] for k, v in pd.core.groupby._groupby_indices(b.bins).iteritems()}

may be faster...

Member Author:

Thanks, but it failed the tests. I think simply passing bins to the existing method can't work, because bins correspond to the frequencies to be split, not to the index, so their length can differ from the index.

Using the dataframe in the example above, the returned values differ from what is expected.

>>> list(pd.core.groupby._groupby_indices(grouped.grouper.bins).iteritems())
[(8, array([11])), (2, array([0, 1, 2, 3, 4, 5, 6, 7, 8])), (6, array([ 9, 10]))]

return indices
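Standalone, the bin-to-indices mapping the new `indices` property implements can be sketched like this (`binlabels`/`bins` stand in for BinGrouper's attributes; `bins[k]` is the cumulative, exclusive end position of the k-th bin):

```python
from collections import defaultdict

def bin_indices(binlabels, bins):
    # Map each bin label to the row positions it covers. Labels whose
    # edge does not advance are empty bins and get no entry, mirroring
    # the defaultdict behaviour of the patch above.
    indices = defaultdict(list)
    i = 0
    for label, edge in zip(binlabels, bins):
        if i < edge:
            indices[label] = list(range(i, edge))
            i = edge
    return indices

# Three monthly labels over five rows; 'Feb' is empty (edge stays at 2).
print(dict(bin_indices(['Jan', 'Feb', 'Mar'], [2, 2, 5])))
# {'Jan': [0, 1], 'Mar': [2, 3, 4]}
```

This assumes, as discussed in the thread, that binlabels and bins arrive sorted and that each group is contiguous.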

@cache_readonly
def ngroups(self):
return len(self.binlabels)
49 changes: 49 additions & 0 deletions pandas/tests/test_groupby.py
@@ -3140,6 +3140,55 @@ def test_timegrouper_with_reg_groups(self):
result2 = df.groupby([pd.TimeGrouper(freq=freq), 'user_id'])['whole_cost'].sum()
assert_series_equal(result2, expected)

def test_timegrouper_get_group(self):
# GH 6914

df_original = DataFrame({
'Buyer': 'Carl Joe Joe Carl Joe Carl'.split(),
'Quantity': [18,3,5,1,9,3],
'Date' : [datetime(2013,9,1,13,0), datetime(2013,9,1,13,5),
datetime(2013,10,1,20,0), datetime(2013,10,3,10,0),
datetime(2013,12,2,12,0), datetime(2013,9,2,14,0),]})
df_reordered = df_original.sort(columns='Quantity')

# single grouping
expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]],
df_original.iloc[[4]]]
dt_list = ['2013-09-30', '2013-10-31', '2013-12-31']

for df in [df_original, df_reordered]:
grouped = df.groupby(pd.Grouper(freq='M', key='Date'))
for t, expected in zip(dt_list, expected_list):
dt = pd.Timestamp(t)
result = grouped.get_group(dt)
assert_frame_equal(result, expected)

# multiple grouping
expected_list = [df_original.iloc[[1]], df_original.iloc[[3]],
df_original.iloc[[4]]]
g_list = [('Joe', '2013-09-30'), ('Carl', '2013-10-31'), ('Joe', '2013-12-31')]

for df in [df_original, df_reordered]:
grouped = df.groupby(['Buyer', pd.Grouper(freq='M', key='Date')])
for (b, t), expected in zip(g_list, expected_list):
dt = pd.Timestamp(t)
result = grouped.get_group((b, dt))
assert_frame_equal(result, expected)

# with index
df_original = df_original.set_index('Date')
df_reordered = df_original.sort(columns='Quantity')

expected_list = [df_original.iloc[[0, 1, 5]], df_original.iloc[[2, 3]],
df_original.iloc[[4]]]

for df in [df_original, df_reordered]:
grouped = df.groupby(pd.Grouper(freq='M'))
for t, expected in zip(dt_list, expected_list):
dt = pd.Timestamp(t)
result = grouped.get_group(dt)
assert_frame_equal(result, expected)

def test_cumcount(self):
df = DataFrame([['a'], ['a'], ['a'], ['b'], ['a']], columns=['A'])
g = df.groupby('A')