Closed
Description
Hi guys,
Working with pandas is great, but I may have found a bug when grouping a two-row DataFrame by time and by a column:
>>> import pandas as pd
>>> import numpy as np
>>> from datetime import datetime
>>> freq = 's'
>>> t1 = np.datetime64(datetime.utcnow(), freq)
>>> index = pd.date_range(start=t1, periods=2, freq=freq)
# DatetimeIndex(['2015-09-24 08:55:27', '2015-09-24 08:55:28'], dtype='datetime64[ns]', freq='S', tz=None)
>>> df = pd.DataFrame([['A', 10], ['B', 15]], columns=['metric', 'values'], index=index)
#                      metric  values
# 2015-09-24 08:55:27       A      10
# 2015-09-24 08:55:28       B      15
>>> grouped = df.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
# Here the grouping should output something similar to the input DataFrame,
# since each row is already its own group given the parameters passed to groupby.
>>> grouped.mean()
#                                                     values
# <pandas.tseries.resample.TimeGrouper object at ...      10
# metric                                                  15
#
# Notice how the index is broken: a TimeGrouper object appears as the first index value,
# while the second value is the name of the column used as the second grouping key.
# Now let's add another row: a new second, a new metric.
>>> df_2 = pd.DataFrame([['C', 0]], columns=df.columns, index=[df.index.shift(-1, freq)[0]])
>>> df_2 = df_2.append(df)
#                      metric  values
# 2015-09-24 08:55:26       C       0
# 2015-09-24 08:55:27       A      10
# 2015-09-24 08:55:28       B      15
>>> grouped = df_2.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
>>> grouped.mean()
#                             values
#                     metric
# 2015-09-24 08:55:26 C            0
# 2015-09-24 08:55:27 A           10
# 2015-09-24 08:55:28 B           15
# Works as expected with 3 rows!
# Let's try with 1 row:
>>> df_2.iloc[0:1].groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
#                             values
#                     metric
# 2015-09-24 08:55:26 C            0
# Works as expected too!
I have tried grouping by key instead of by level, and aggregating at a different frequency (building the DataFrame with freq='s', then aggregating with freq='T'), but the result is the same.
Did I miss something?
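For anyone trying to reproduce this, here is a minimal sketch of the two-row case with fixed timestamps (copied from the output above) instead of `datetime.utcnow()`, so that every run sees the same data. It covers both variants mentioned: grouping on the index level, and grouping on a timestamp column at minute frequency. The column name `time` and the variable names are placeholders chosen for this example, and `freq='min'` is used as the minute alias in place of `'T'`. On a pandas release where the grouping behaves as expected, both results should have a two-level index with one row per (time, metric) pair.

```python
import pandas as pd

# Fixed timestamps keep the reproduction deterministic.
index = pd.date_range(start="2015-09-24 08:55:27", periods=2, freq="s")
df = pd.DataFrame({"metric": ["A", "B"], "values": [10, 15]}, index=index)

# Variant 1: group on the index level with one-second bins, plus the 'metric' column.
by_level = df.groupby([pd.Grouper(level=0, freq="s"), "metric"])["values"].mean()

# Variant 2: move the index into a column (hypothetical name "time") and
# group by key with one-minute bins instead.
df_key = df.rename_axis("time").reset_index()
by_key = df_key.groupby([pd.Grouper(key="time", freq="min"), "metric"])["values"].mean()

# A healthy result has a (time, metric) MultiIndex with two rows in both cases.
print(by_level.index.nlevels, len(by_level))
print(by_key.index.nlevels, len(by_key))
```

If the first index level shows up as a `TimeGrouper` object rather than timestamps, the installed version is affected by the behaviour described here.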
Please note that the resampling API gives the expected result, but I think the grouping API should give consistent results:
>>> df.groupby(['metric']).resample(how='mean', freq=freq)
#                             values
# metric
# A      2015-09-24 08:55:27      10
# B      2015-09-24 08:55:28      15
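The `how=` keyword used above belongs to the resample API of the pandas version listed below; later releases deprecated it in favour of calling the aggregation as a method on the resampler. As a rough sketch of the same workaround in that newer style, assuming fixed timestamps taken from the example output and a variable name (`resampled`) picked only for illustration:

```python
import pandas as pd

# Same two-row frame as in the report, with fixed timestamps for reproducibility.
index = pd.date_range(start="2015-09-24 08:55:27", periods=2, freq="s")
df = pd.DataFrame({"metric": ["A", "B"], "values": [10, 15]}, index=index)

# Resample inside each 'metric' group, then aggregate with a method call
# rather than the how= keyword.
resampled = df.groupby("metric")["values"].resample("s").mean()

# The result is indexed by (metric, timestamp), one row per input row.
print(resampled)
```

Note that the index levels come out in the opposite order (metric first, then time) compared with the `pd.Grouper` approach, as in the output shown above.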
Here are the dependencies I have installed with pandas (working on Ubuntu 12.04.5 LTS):
>>> from pandas.util.print_versions import show_versions
>>> show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.5.0-54-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None