Description
I think there may be a bug with the row-wise handling of numpy.timedelta64 data types when using DataFrame.apply. As a check, the problem does not appear when using DataFrame.applymap. The problem may be related to #4532, but I'm unsure. I've included an example below.
This is only a minor problem for my use case, which is cross-checking timestamps from a counter/timer card. I can easily work around the issue with DataFrame.itertuples, etc.
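A minimal sketch of that workaround, assuming a recent pandas. The frame is a stand-in rebuilt from the values printed in the example below, and the name row_types is only illustrative. As far as I understand, itertuples reads each column on its own, so values are not first combined into a single row with one dtype.

```python
import pandas as pd

# Stand-in for the frame built in the example below (values copied from its printed output).
df = pd.DataFrame({
    "datetimes": pd.to_datetime([
        "2014-05-31 01:23:50.013434500",
        "2014-05-31 01:24:00.013283500",
        "2014-05-31 01:24:10.013132500",
    ]),
})
df["differential_timedeltas"] = df["datetimes"].diff()

# Collect the type name of each field, row by row, without going through apply(axis=1).
row_types = [
    (type(row.datetimes).__name__, type(row.differential_timedeltas).__name__)
    for row in df.itertuples(index=False)
]

for pair in row_types:
    print(pair)
```

If the second entry of any pair for rows 1 and 2 is a timestamp type, the workaround does not help on that installation.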
Thank you for your time and for making such a useful package!
Example
Version
Import and check versions.
$ date
Thu Jul 17 16:28:38 CDT 2014
$ conda update pandas
Fetching package metadata: ..
# All requested packages already installed.
# packages in environment at /Users/harrold/anaconda:
#
pandas 0.14.1 np18py27_0
$ ipython
Python 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Jul 2 2014, 15:36:00)
Type "copyright", "credits" or "license" for more information.
IPython 2.1.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from __future__ import print_function
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Darwin
OS-release: 11.4.2
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: 0.999
httplib2: 0.8
apiclient: 1.2
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
Create test data
Using a subset of the original raw data as an example.
In [5]: datetime_start = np.datetime64(u'2014-05-31T01:23:19.9600345Z')
In [6]: timedeltas_elapsed = [30053400, 40053249, 50053098]
Compute datetimes from elapsed timedeltas, then create differential timedeltas from datetimes. All elements are either type numpy.datetime64 or numpy.timedelta64.
In [7]: df = pd.DataFrame(dict(datetimes = timedeltas_elapsed))
In [8]: df = df.applymap(lambda elt: np.timedelta64(elt, 'us'))
In [9]: df = df.applymap(lambda elt: np.datetime64(datetime_start + elt))
In [10]: df['differential_timedeltas'] = df['datetimes'] - df['datetimes'].shift()
In [11]: print(df)
datetimes differential_timedeltas
0 2014-05-31 01:23:50.013434500 NaT
1 2014-05-31 01:24:00.013283500 00:00:09.999849
2 2014-05-31 01:24:10.013132500 00:00:09.999849
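For comparison, the same frame can probably be built without applymap. This is a sketch assuming a recent pandas; it uses pd.to_timedelta and Series.diff in place of the element-wise lambdas above, and it drops the trailing 'Z' from the start time so the timestamps stay timezone-naive. The names start and elapsed are illustrative.

```python
import pandas as pd

# Start time from the example above, without the trailing 'Z' (assumption: naive timestamps are fine).
start = pd.Timestamp("2014-05-31T01:23:19.9600345")

# Elapsed counts are interpreted as microseconds, as in the original np.timedelta64(elt, 'us').
elapsed = pd.to_timedelta([30053400, 40053249, 50053098], unit="us")

df = pd.DataFrame({"datetimes": start + elapsed})

# diff() is equivalent to subtracting the shifted column.
df["differential_timedeltas"] = df["datetimes"].diff()

print(df)
print(df.dtypes)
```

The column dtypes printed at the end give a quick check that one column is datetime-like and the other timedelta-like before any row-wise processing.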
Expected behavior
With element-wise handling using DataFrame.applymap, all elements are correctly identified as datetimes (timestamps) or timedeltas.
In [12]: print(df.applymap(lambda elt: type(elt)))
datetimes differential_timedeltas
0 <class 'pandas.tslib.Timestamp'> <type 'numpy.timedelta64'>
1 <class 'pandas.tslib.Timestamp'> <type 'numpy.timedelta64'>
2 <class 'pandas.tslib.Timestamp'> <type 'numpy.timedelta64'>
Bug
With row-wise handling using DataFrame.apply, every non-null element is reported as type pandas.tslib.Timestamp (the missing value is reported as pandas.tslib.NaTType). I expected 'differential_timedeltas' to be type numpy.timedelta64 or another type of timedelta, not a type of datetime (timestamp).
In [13]: # For 'datetimes':
In [14]: print(df.apply(lambda row: type(row['datetimes']), axis=1))
0 <class 'pandas.tslib.Timestamp'>
1 <class 'pandas.tslib.Timestamp'>
2 <class 'pandas.tslib.Timestamp'>
dtype: object
In [15]: # For 'differential_timedeltas':
In [16]: print(df.apply(lambda row: type(row['differential_timedeltas']), axis=1))
0 <class 'pandas.tslib.NaTType'>
1 <class 'pandas.tslib.Timestamp'>
2 <class 'pandas.tslib.Timestamp'>
dtype: object
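To check whether a given pandas installation shows the behavior above, the comparison can be wrapped in a small helper. This is only a sketch: rowwise_types is a hypothetical name, not a pandas function, and the frame is rebuilt from the printed values rather than from the original raw data. It reports what it observes and makes no claim about which versions are affected.

```python
import pandas as pd


def rowwise_types(frame, column):
    """Return the type name seen for `column` in each row under apply(axis=1)."""
    return frame.apply(lambda row: type(row[column]).__name__, axis=1).tolist()


# Frame rebuilt from the values printed in the example (assumption: equivalent to the original).
df = pd.DataFrame({
    "datetimes": pd.to_datetime([
        "2014-05-31 01:23:50.013434500",
        "2014-05-31 01:24:00.013283500",
        "2014-05-31 01:24:10.013132500",
    ]),
})
df["differential_timedeltas"] = df["datetimes"].diff()

observed = rowwise_types(df, "differential_timedeltas")

# The report above corresponds to a timedelta column showing up as Timestamp.
affected = "Timestamp" in observed

print("pandas", pd.__version__)
print("row-wise types:", observed)
print("affected" if affected else "not affected")
```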