Description
I am working through logs of web requests, and when I want to find the most common, say, user agent string for a (disguised) user, I run something like the following:
from pandas import Series, DataFrame, Timestamp

tdf = DataFrame({'day': {0: Timestamp('2015-02-24 00:00:00'), 1: Timestamp('2015-02-24 00:00:00'),
                         2: Timestamp('2015-02-24 00:00:00'), 3: Timestamp('2015-02-24 00:00:00'),
                         4: Timestamp('2015-02-24 00:00:00')},
                 'userAgent': {0: 'some UA string', 1: 'some UA string', 2: 'some UA string',
                               3: 'another UA string', 4: 'some UA string'},
                 'userId': {0: '17661101', 1: '17661101', 2: '17661101', 3: '17661101', 4: '17661101'}})

def most_common_values(df):
    return Series({c: s.value_counts().index[0] for c, s in df.iteritems()})

tdf.groupby('day').apply(most_common_values)
Note that in this (admittedly unusual) example, the rows are nearly identical (only one userAgent value differs). I'm not sure whether that is necessary to reproduce the issue. I'm also obscuring the exact purpose of this code, but it reproduces the bug: the 'userId' comes back as a Timestamp, not a string. The conversion happens after most_common_values returns, since the function itself returns the userId as a string, not as a Timestamp. If we change the value of the userId to an int:
tdf['userId'] = tdf.userId.astype(int)
or if the value of the associated integer is small enough:
tdf['userId'] = '15320104'
then the results are what we'd expect: the most common value is returned with its original type.
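A quick way to see where the conversion happens is to compare the type returned by calling most_common_values directly with the type that comes out of the groupby. This is a minimal sketch, not part of the original report: the frame is rebuilt from lists for brevity, and `.items()` is used instead of `.iteritems()` on the assumption that a newer pandas without `iteritems` may be installed. The second printed type depends on the pandas version, so no output is claimed for it.

```python
import pandas as pd

# Same data as in the report, built from lists instead of nested dicts.
tdf = pd.DataFrame({
    'day': [pd.Timestamp('2015-02-24')] * 5,
    'userAgent': ['some UA string', 'some UA string', 'some UA string',
                  'another UA string', 'some UA string'],
    'userId': ['17661101'] * 5,
})

def most_common_values(df):
    # Assumption: .items() stands in for .iteritems() from the report.
    return pd.Series({c: s.value_counts().index[0] for c, s in df.items()})

# Calling the function directly, with no groupby involved.
direct = most_common_values(tdf)

# Going through groupby/apply, as in the report.
grouped = tdf.groupby('day').apply(most_common_values)

print(type(direct['userId']))            # the function's own return type
print(type(grouped['userId'].iloc[0]))   # the type after groupby/apply
```

If the first type is a string and the second is a Timestamp, the conversion is happening when groupby combines the per-group results, not inside the function.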
I imagine that, for some reason, something like a dateutil parser is being called on strings by default, but that probably shouldn't be happening...
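The dateutil guess would also fit the observation that a "small enough" value avoids the problem. The sketch below is only a hypothesis check, not a confirmed explanation of what pandas does internally: it shows how dateutil reads the two strings from this report and compares the parsed years against the lower bound of a nanosecond-resolution pandas Timestamp. The variable names are made up for illustration.

```python
from dateutil import parser
import pandas as pd

# dateutil reads an 8-digit string as YYYYMMDD.
parsed_failing = parser.parse('17661101')   # the userId that comes back as a Timestamp
parsed_working = parser.parse('15320104')   # the userId that stays a string

print(parsed_failing)
print(parsed_working)

# Nanosecond-resolution Timestamps cannot represent dates much before 1677,
# so only one of the two parsed values would fit in a Timestamp.
lower_bound_year = pd.Timestamp.min.year
print(lower_bound_year)
print(parsed_failing.year > lower_bound_year)
print(parsed_working.year > lower_bound_year)
```

If a date parser is being applied to object columns, this would mean '17661101' converts cleanly while '15320104' falls outside the representable range and is left alone.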