Description
xref #10668 (for more examples)
Hello!
I'm running into some odd behavior trying to group rows of a pandas dataframe by ID and then selecting out max/min datetimes (w/ timezones). This is with python 2.7, pandas 0.18.1 and numpy 1.11.1 (I saw in earlier posts a similar problem was apparently fixed w/ pandas 0.15).
Specifically, if I try:
print orders.groupby('OrderID')['start_time'].agg(np.min).iloc[:5]
I get:
OrderID
O161101XVS100000044 2016-11-01 12:03:12.920000-04:00
O161101XVS100000047 2016-11-01 12:03:36.693000-04:00
O161101XVS100000098 2016-11-01 12:09:08.330000-04:00
O161101XVS100000122 2016-11-01 12:09:59.950000-04:00
O161101XVS100000152 2016-11-01 12:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]
Where the raw data had times closer to 8 am (US/Eastern). In other words, it reverted back to UTC times, even though it says it's eastern times, and has UTC-4 offset.
But if I instead try:
print orders.groupby('OrderID')['start_time'].agg(lambda x: np.min(x)).iloc[:5]
I now get:
OrderID
O161101XVS100000044 2016-11-01 08:03:12.920000-04:00
O161101XVS100000047 2016-11-01 08:03:36.693000-04:00
O161101XVS100000098 2016-11-01 08:09:08.330000-04:00
O161101XVS100000122 2016-11-01 08:09:59.950000-04:00
O161101XVS100000152 2016-11-01 08:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]
Which is the behavior I intended. This second method is vastly slower, and I would have assumed the two approaches would yield identical results ...