Skip to content

BUG: TimeGrouper outputs different result by column order #6908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 19, 2014

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 18, 2014

closes #6764

TimeGrouper may output incorrect results depending on the target column order. The problem seems to be caused by 2 parts.

  • TimeGrouper._get_time_bins and related methods expects sorted values input.
  • BinGrouper.get_iterator expects sorted data input.
>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
                   'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                   'Quantity': [1,3,5,1,8,1,9,3],
                   'Date' : [
                    datetime(2013,1,1,13,0), datetime(2013,1,1,13,5),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,12,2,12,0), datetime(2013,12,2,14,0),]})

# correct
>>> df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2013-10-31 Carl          6
           Joe           9
2013-12-31 Carl          3
           Joe           9

[6 rows x 1 columns]

>>> df_sorted = df.sort('Quantity')          # change "Date" column unsorted
# incorrect
>>> df_sorted.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
2013-10-31 Carl          1
           Joe           1
           Mark          3
2013-12-31 Carl          8
           Joe          17

[6 rows x 1 columns]

>>> df_sorted.groupby([pd.Grouper(freq='1M',key='Date', sort=True),'Buyer']).sum()
# same incorrect result
# correct
>>> df.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
           Mark          3
2014-01-31 Carl          9
           Joe          18

[4 rows x 1 columns]

# incorrect
>>> df_sorted.groupby([pd.Grouper(freq='6M',key='Date'),'Buyer']).sum()
                  Quantity
Date       Buyer          
2013-01-31 Carl          1
2014-01-31 Carl          9
           Joe          18
           Mark          3

[4 rows x 1 columns]

@jreback
Copy link
Contributor

jreback commented Apr 18, 2014

this close #6764 ?

expected = DataFrame({
'Buyer': 'Carl Joe Mark Carl Joe'.split(),
'Quantity': [6,8,3,4,10],
'Date' : [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are all of the tests changed? (I know they are just moved) but very hard to tell of actual changes
best just to leave original tests unless small change and just add new ones

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's fine what u did just FYI for future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intension was to perform exactly the same tests for both original and sorted data. But it is possible to prepare duplicate tests by copy & paste. Which is prefferable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is fine
prefer this to copy/paste

@sinhrks
Copy link
Member Author

sinhrks commented Apr 18, 2014

It seems do. I've commented on #6764, and will add a test for this.

@jreback
Copy link
Contributor

jreback commented Apr 18, 2014

just run a quick perf check - anything shows up post and investigate

almost certainly fine - good to check anyhow

@jreback jreback added this to the 0.14.0 milestone Apr 18, 2014
@jreback
Copy link
Contributor

jreback commented Apr 18, 2014

Can u add in a release note

I think their was a releated issue u can simply add this onto (for pd.Grouper) (or after)

@sinhrks
Copy link
Member Author

sinhrks commented Apr 19, 2014

Thanks, I've added release note and test for #6764.

My initial implementation had unnecessary sort and it makes time series performance worse. Now it is fixed and following is an updates vbench result. No test looks consntantly get worse.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
....
indexing_dataframe_boolean_rows_object       |   1.0853 |   0.9326 |   1.1637 |
frame_ctor_dtindex_YearEnd(2)                |   0.9511 |   0.8060 |   1.1799 |
write_store_table_mixed                      |  76.4666 |  64.6210 |   1.1833 |
indexing_dataframe_boolean                   |  24.6797 |  20.7040 |   1.1920 |
eval_frame_and_all_threads                   |  41.8170 |  30.7407 |   1.3603 |
-------------------------------------------------------------------------------

jreback added a commit that referenced this pull request Apr 19, 2014
BUG: TimeGrouper outputs different result by column order
@jreback jreback merged commit 21565a3 into pandas-dev:master Apr 19, 2014
@sinhrks sinhrks deleted the grouper branch April 19, 2014 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: multiple grouping with a TimeGrouper requires sort
2 participants