Pandas attempts to convert some strings to timestamps when grouping by a timestamp and aggregating? #10078

Closed
@marcuscollins

I am working through logs of web requests, and when I want to find the most common, say, user agent string for a (disguised) user, I run something like the following:

from pandas import Series, DataFrame, Timestamp

tdf = DataFrame({'day': {0: Timestamp('2015-02-24 00:00:00'),  1: Timestamp('2015-02-24 00:00:00'),
                                      2: Timestamp('2015-02-24 00:00:00'), 3: Timestamp('2015-02-24 00:00:00'),
                                      4: Timestamp('2015-02-24 00:00:00')},
                            'userAgent': {0: 'some UA string', 1: 'some UA string', 2: 'some UA string',
                                                 3: 'another UA string', 4: 'some UA string'},
                             'userId': {0: '17661101',  1: '17661101', 2: '17661101', 3: '17661101', 4: '17661101'}})

def most_common_values(df):
    return Series({c: s.value_counts().index[0] for c,s in df.iteritems()})

tdf.groupby('day').apply(most_common_values)

Note that in this (admittedly unusual) example, all of the rows have the same day and userId. I'm not sure whether that is necessary to reproduce the issue. I'm also obscuring the exact purpose of this code, but it reproduces the bug: the 'userId' comes back as a Timestamp, not a string. The conversion happens after the function most_common_values returns, since the function itself returns the userId as a string, not as a timestamp. If we change the value of the userId to an int:

tdf['userId'] = tdf.userId.astype(int)

or if the value of the associated integer is small enough:

tdf['userId'] = '15320104'

then the results are what we'd expect: the most common value is returned with its original type.
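
The observation that the function itself still returns a string can be checked by calling it directly, without groupby. The following is a minimal sketch, assuming a recent pandas: it uses `DataFrame.items()` because `iteritems()` was removed in pandas 2.0, and it builds the frame from plain lists rather than the nested dicts above. The data is the same as in the report.

```python
from pandas import DataFrame, Series, Timestamp

# Same data as the reproduction above, written with plain lists.
tdf = DataFrame({
    'day': [Timestamp('2015-02-24')] * 5,
    'userAgent': ['some UA string', 'some UA string', 'some UA string',
                  'another UA string', 'some UA string'],
    'userId': ['17661101'] * 5,
})

def most_common_values(df):
    # value_counts() sorts by descending count, so index[0] is the most common value.
    return Series({c: s.value_counts().index[0] for c, s in df.items()})

# Call the function directly, bypassing groupby/apply entirely.
result = most_common_values(tdf)
for column, value in result.items():
    print(column, type(value).__name__, repr(value))
```

If 'userId' prints as a str here but shows up as a Timestamp after `tdf.groupby('day').apply(most_common_values)`, the conversion is happening when groupby combines the per-group results, not inside the function.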

I imagine that, for some reason, something like a dateutil parser is being called on strings by default, but that probably shouldn't be happening...
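
One way to see why a date parser could explain both observations is sketched below, using only the standard library. This is a guess at the mechanism, not something confirmed in this report: `datetime.strptime` with `%Y%m%d` stands in for whatever parser pandas may apply, and the helper names `compact_date` and `fits_ns_range` are hypothetical. The range bounds come from the fact that pandas Timestamps are stored as 64-bit nanosecond offsets from 1970-01-01, which covers roughly the years 1677 to 2262.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
# Largest offset representable as int64 nanoseconds, rounded down to microseconds.
MAX_OFFSET = timedelta(microseconds=(2**63 - 1) // 1000)
NS_MIN = EPOCH - MAX_OFFSET  # falls in September 1677
NS_MAX = EPOCH + MAX_OFFSET  # falls in April 2262

def compact_date(text):
    """Read text as a YYYYMMDD date; return None if it does not parse (hypothetical helper)."""
    try:
        return datetime.strptime(text, '%Y%m%d')
    except ValueError:
        return None

def fits_ns_range(dt):
    """True if dt lies inside the nanosecond-resolution range described above."""
    return dt is not None and NS_MIN <= dt <= NS_MAX

for user_id in ['17661101', '15320104', 'some UA string']:
    parsed = compact_date(user_id)
    print(repr(user_id), parsed, fits_ns_range(parsed))
```

Under this assumption, '17661101' reads as 1 November 1766, which fits in the Timestamp range, while '15320104' reads as 4 January 1532, which does not. That would be consistent with the first value being converted and the second being left as a string.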
