Description
Hello there, it's me, the bug hunter, again :)
I have this massive 200-million-row dataset, and I encountered some very annoying behavior. I wonder if this is a bug.
I load my csv with:

import pandas as pd

mylog = pd.read_csv('/mydata.csv',
                    names=['mydatetime', 'var2', 'var3', 'var4'],
                    dtype={'mydatetime': str},
                    skiprows=1)
and the mydatetime column really looks like regular timestamps (tz-aware):
mylog.mydatetime.head()
Out[22]:
0 2019-03-03T20:58:38.000-0500
1 2019-03-03T20:58:38.000-0500
2 2019-03-03T20:58:38.000-0500
3 2019-03-03T20:58:38.000-0500
4 2019-03-03T20:58:38.000-0500
Name: mydatetime, dtype: object
Now, I take extra care in converting these strings into proper timestamps:
mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'], errors='coerce',
                                     format='%Y-%m-%dT%H:%M:%S.%f%z',
                                     infer_datetime_format=True, cache=True)
That takes a looong time to process, but seems OK. The output is
mylog.mydatetime.head()
Out[23]:
0 2019-03-03 20:58:38-05:00
1 2019-03-03 20:58:38-05:00
2 2019-03-03 20:58:38-05:00
3 2019-03-03 20:58:38-05:00
4 2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object
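One thing I notice in the output above: the dtype still says object, not something like datetime64[ns, pytz.FixedOffset(-300)]. My guess (unverified) is that each element ended up as an individual tz-aware Timestamp rather than the whole column becoming a tz-aware datetime column; something like

type(mylog['mydatetime'].iloc[0])

should confirm that.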
What is puzzling is that, so far, I thought I had full control of my dtypes. However, the simple

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date, errors='coerce')

fails with:
File "pandas/_libs/tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
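Since I can't share the file, here is my attempt at a minimal, self-contained version of the same steps, using made-up strings copied from the head() output above. I have not verified that this tiny version hits the identical traceback, but it follows exactly the same calls as my real code:

import pandas as pd

# toy stand-in for the real 200-million-row column
toy = pd.Series(['2019-03-03T20:58:38.000-0500'] * 5)
toy = pd.to_datetime(toy, errors='coerce',
                     format='%Y-%m-%dT%H:%M:%S.%f%z',
                     infer_datetime_format=True, cache=True)
print(toy.dtype)  # on my real data this comes back as object, not datetime64[ns, ...]
pd.to_datetime(toy.dt.date, errors='coerce')  # the call that blows up for me on the real data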
The only way I found to get past this error was by running
mylog['myday'] = pd.to_datetime(mylog['mydatetime'].apply(lambda x: x.date()))
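The error message itself hints at utc=True, so I suppose something along these lines would also do the trick (assuming all my offsets really are -05:00, as they appear to be; pytz.FixedOffset(-300) is that offset in minutes), but the .apply version above is the only one I have actually run:

import pytz

# convert to UTC first (as the error suggests), then back to the fixed -05:00 offset, then take the day
as_utc = pd.to_datetime(mylog['mydatetime'], errors='coerce', utc=True)
mylog['myday'] = pd.to_datetime(as_utc.dt.tz_convert(pytz.FixedOffset(-300)).dt.date, errors='coerce')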
Is this a bug? Before upgrading to pandas 0.24.1 I was not getting the tz error above. What do you think? I can't share the data, but I am happy to try some things to help you out!
Thanks!