
Grouping a 2-row DataFrame by time and columns doesn't work as expected #11185

Closed

Description

@JCalderan

Hi guys,

Working with pandas is great; however, I might have noticed a bug while grouping a 2-row DataFrame by time and columns:

>>> import pandas as pd
>>> import numpy as np
>>> from datetime import datetime

>>> freq = 's'
>>> t1 = np.datetime64(datetime.utcnow(), freq)
>>> index = pd.date_range(start=t1, periods=2, freq=freq)
# DatetimeIndex(['2015-09-24 08:55:27', '2015-09-24 08:55:28'], dtype='datetime64[ns]', freq='S', tz=None)
>>> df = pd.DataFrame([['A', 10], ['B', 15]], columns=['metric', 'values'], index=index)
#                     metric  values
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
# here the grouping should output something similar to the input DataFrame,
# since each row is already an individual group with regard to the parameters of the groupby call.
>>> grouped.mean()
#                                                     values
# <pandas.tseries.resample.TimeGrouper object at ...      10
# metric                                                  15
#
# notice how the index is broken: a TimeGrouper object is the first index value,
# while the second value is the name of the column used to create the second group...
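# (for reference, I would have expected a MultiIndex here, shaped like the
#  3-row output further down, i.e. something along these lines:)
#                             values
#                     metric
# 2015-09-24 08:55:27 A           10
# 2015-09-24 08:55:28 B           15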
# now let's try to add another row: a new second, a new metric
>>> df_2 = pd.DataFrame([['C', 0]], columns=df.columns, index=[df.index.shift(-1, freq)[0]])
>>> df_2 = df_2.append(df)
#                     metric  values
#2015-09-24 08:55:26      C       0
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df_2.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
>>> grouped.mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
#2015-09-24 08:55:27 A           10
#2015-09-24 08:55:28 B           15
# works as expected with 3 rows!
# let's try with 1 row:
>>> df_2.iloc[0:1].groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
# works as expected too!

I have tried grouping by key instead of level, and using another frequency for aggregating (building the DataFrame with freq='s', then aggregating with freq='T'), but the result is the same.
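For completeness, here is roughly what I mean by grouping by key instead of level (a reconstructed sketch, not my exact session; the column name 'ts' only exists because I reset the index here):

>>> df_key = df.reset_index().rename(columns={'index': 'ts'})
>>> df_key.groupby([pd.Grouper(key='ts', freq=freq), 'metric']).mean()
# same broken result as above on 0.16.2: the first index value is a
# TimeGrouper object instead of the timestamp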

Did I miss something?

Please note that using the resampling API provides the expected result, but I think the grouping API should provide consistent results:

>>> df.groupby(['metric']).resample(how='mean', freq=freq)
#                             values
# metric                            
# A      2015-09-24 08:55:27      10
# B      2015-09-24 08:55:28      15
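For comparison, grouping on the raw index values instead of a Grouper also gives me the output I would expect (again only a sketch; this skips the time bucketing that Grouper/freq would do, so it is only equivalent because the index is already aligned to the frequency):

>>> df.groupby([df.index, 'metric']).mean()
#                             values
#                     metric
# 2015-09-24 08:55:27 A           10
# 2015-09-24 08:55:28 B           15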

Here are the dependencies I have installed with pandas (working on Ubuntu 12.04.5 LTS):

>>> from pandas.util.print_versions import show_versions
>>> show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.5.0-54-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
