BUG: TimeGrouper outputs different result by column order #6908

sinhrks · 2014-04-18T12:34:42Z

TimeGrouper may output incorrect results depending on the order of values in the target column. The problem seems to have two causes:

TimeGrouper._get_time_bins and related methods expect sorted input values.
BinGrouper.get_iterator expects sorted input data.

>>> from datetime import datetime
>>> import pandas as pd
>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
                   'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                   'Quantity': [1,3,5,1,8,1,9,3],
                   'Date' : [
                    datetime(2013,1,1,13,0), datetime(2013,1,1,13,5),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,12,2,12,0), datetime(2013,12,2,14,0),]})

# correct
>>> df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2013-10-31 Carl          6
           Joe           9
2013-12-31 Carl          3
           Joe           9

[6 rows x 1 columns]

>>> df_sorted = df.sort('Quantity')          # leaves the "Date" column unsorted
# incorrect
>>> df_sorted.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
2013-10-31 Carl          1
           Joe           1
           Mark          3
2013-12-31 Carl          8
           Joe          17

[6 rows x 1 columns]

>>> df_sorted.groupby([pd.Grouper(freq='1M',key='Date', sort=True),'Buyer']).sum()
# same incorrect result

# correct
>>> df.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2014-01-31 Carl          9
           Joe          18

[4 rows x 1 columns]

# incorrect
>>> df_sorted.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
2014-01-31 Carl          9
           Joe          18
           Mark          3

[4 rows x 1 columns]
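The expectation behind the examples above is that the grouped sums should not depend on row order. A minimal sketch of that check follows, assuming a pandas version that includes this fix. The helper name `monthly_totals` is hypothetical, `sort_values` stands in for the older `df.sort`, and an offset object is passed as `freq` because the string aliases for month end differ between pandas versions.

```python
from datetime import datetime

import pandas as pd

# Same data as in the report, minus the unused 'Branch' column.
df = pd.DataFrame({
    "Buyer": "Carl Mark Carl Carl Joe Joe Joe Carl".split(),
    "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
    "Date": [
        datetime(2013, 1, 1, 13, 0), datetime(2013, 1, 1, 13, 5),
        datetime(2013, 10, 1, 20, 0), datetime(2013, 10, 2, 10, 0),
        datetime(2013, 10, 1, 20, 0), datetime(2013, 10, 2, 10, 0),
        datetime(2013, 12, 2, 12, 0), datetime(2013, 12, 2, 14, 0),
    ],
})


def monthly_totals(frame):
    """Sum Quantity per (month end, Buyer). Hypothetical helper for illustration."""
    grouper = pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())
    return frame.groupby([grouper, "Buyer"])["Quantity"].sum()


in_order = monthly_totals(df)
# Sorting by Quantity leaves the Date column out of order.
shuffled = monthly_totals(df.sort_values("Quantity"))

# Raises AssertionError if the result depends on row order.
pd.testing.assert_series_equal(in_order, shuffled)
```

On a pandas version affected by this bug, the final assertion would be expected to fail, which is the behaviour shown in the "incorrect" outputs above.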

jreback · 2014-04-18T12:43:43Z

Does this close #6764?

jreback · 2014-04-18T12:45:57Z

pandas/tests/test_groupby.py

-        expected = DataFrame({
-            'Buyer': 'Carl Joe Mark Carl Joe'.split(),
-            'Quantity': [6,8,3,4,10],
-            'Date' : [


Why are all of the tests changed? (I know they are just moved.) It is very hard to tell what actually changed.
Best to leave the original tests alone unless the change is small, and just add new ones.

It's fine what you did, just FYI for the future.

My intention was to run exactly the same tests on both the original and the sorted data. It would also be possible to prepare duplicate tests by copy and paste. Which is preferable?

No, this is fine.
I prefer this to copy/paste.

sinhrks · 2014-04-18T12:53:10Z

It seems so. I've commented on #6764, and will add a test for this.

jreback · 2014-04-18T13:01:52Z

Just run a quick perf check. If anything shows up, post it and investigate.

It's almost certainly fine, but good to check anyhow.

jreback · 2014-04-18T23:18:27Z

Can you add a release note?

I think there was a related issue; you can simply add this onto it (for pd.Grouper), or after it.

sinhrks · 2014-04-19T01:21:43Z

Thanks, I've added a release note and a test for #6764.

My initial implementation had an unnecessary sort, which made time series performance worse. That is now fixed, and the updated vbench result follows. No test looks consistently worse.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
....
indexing_dataframe_boolean_rows_object       |   1.0853 |   0.9326 |   1.1637 |
frame_ctor_dtindex_YearEnd(2)                |   0.9511 |   0.8060 |   1.1799 |
write_store_table_mixed                      |  76.4666 |  64.6210 |   1.1833 |
indexing_dataframe_boolean                   |  24.6797 |  20.7040 |   1.1920 |
eval_frame_and_all_threads                   |  41.8170 |  30.7407 |   1.3603 |
-------------------------------------------------------------------------------
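For readers without vbench, a rough stand-in for this kind of check can be sketched with the standard library's `timeit`. This is not the benchmark suite used above: the helper names `make_frame` and `time_groupby`, the synthetic data, and the sizes are all assumptions chosen for illustration, and the timings it produces are not comparable to the table.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n=10_000, seed=0):
    """Build a synthetic frame with an unsorted Date column (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    minutes = rng.integers(0, 365 * 24 * 60, size=n)
    dates = pd.Timestamp("2013-01-01") + pd.to_timedelta(minutes, unit="min")
    return pd.DataFrame({
        "Date": dates,
        "Buyer": rng.choice(["Carl", "Joe", "Mark"], size=n),
        "Quantity": rng.integers(1, 10, size=n),
    })


def time_groupby(frame, repeat=3):
    """Return the best wall-clock time in seconds for one grouped sum."""
    grouper = pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())

    def run():
        return frame.groupby([grouper, "Buyer"])["Quantity"].sum()

    return min(timeit.repeat(run, number=1, repeat=repeat))


frame = make_frame()
unsorted_s = time_groupby(frame)
sorted_s = time_groupby(frame.sort_values("Date"))
print(f"unsorted: {unsorted_s:.4f}s  pre-sorted: {sorted_s:.4f}s")
```

Running the same script on the base and head commits and comparing the two numbers gives a crude version of the head/base ratio column.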

jreback reviewed Apr 18, 2014
sinhrks mentioned this pull request Apr 18, 2014

BUG: multiple grouping with a TimeGrouper requires sort #6764

Closed

jreback added this to the 0.14.0 milestone Apr 18, 2014

jreback added Bug labels Apr 18, 2014

BUG: TimeGrouper outputs different result by column order

2cf3eb6

jreback added a commit that referenced this pull request Apr 19, 2014

Merge pull request #6908 from sinhrks/grouper

21565a3

jreback merged commit 21565a3 into pandas-dev:master Apr 19, 2014

sinhrks deleted the grouper branch April 19, 2014 14:36
