Skip to content

BUG: Arithmetic, timezone and offsets operations affecting to NaT #6873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 13, 2014

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 11, 2014

NaT affected by some datetime related ops unexpectedly.

Arithmetic

Applying arithmetic ops to NaT is not handled properly. Based on numpy results, I understand that results should be all NaT as long as valid data is passed.

# current results
>>> pd.NaT + pd.offsets.Hour(1)
2262-04-11 01:12:43.145224192
>>> pd.NaT - pd.offsets.Hour(1))
OverflowError: Python int too large to convert to C long
>>> pd.NaT - pd.Timestamp('2011-01-01')
-734779 days, 0:00:00

# numpy
>>> np.datetime64('nat') + np.timedelta64(1, 'h')
NaT
>>> np.datetime64('nat') - np.timedelta64(1, 'h')
NaT
>>> np.datetime64('nat') - np.datetime64('2011-01-01')
NaT

Timezone

Closes #5546.

# current results
>>> idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-02 00:00', pd.NaT])
>>> idx.tz_localize('Asia/Tokyo')
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00+09:00, ..., 2262-04-10 00:12:43.145224192+09:00]
Length: 3, Freq: None, Timezone: Asia/Tokyo

>>> idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-02 00:00', pd.NaT], tz='US/Eastern')
>>> idx.tz_convert('Asia/Tokyo')
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 14:00:00+09:00, ..., 2262-04-11 14:12:43.145224192+09:00]

Note I fixed DatetimeIndex, and I leave NatType still doesn't have tz_localize and tz_convert methods Timestamp has. Is it should be added?

Offsets

These have apply method which accepts Timestamp, but it cannot handle Nat.

# current result
>>> pd.offsets.Hour(1).apply(pd.NaT)
2262-04-11 01:12:43.145224192

@sinhrks
Copy link
Member Author

sinhrks commented Apr 11, 2014

Followings are results after the fix.

>>> pd.NaT + pd.offsets.Hour(1)
NaT
>>> pd.NaT - pd.offsets.Hour(1)
NaT
>>> pd.NaT - pd.Timestamp('2011-01-01')
NaT

>>> idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-02 00:00', pd.NaT])
>>> idx.tz_localize('Asia/Tokyo')
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00+09:00, ..., NaT]
Length: 3, Freq: None, Timezone: Asia/Tokyo

>>> idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-02 00:00', pd.NaT], tz='US/Eastern')
>>> idx.tz_convert('Asia/Tokyo')
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 14:00:00+09:00, ..., NaT]
Length: 3, Freq: None, Timezone: Asia/Tokyo

>>> pd.offsets.Hour(1).apply(pd.NaT)
NaT

@jreback jreback added Bug and removed Timeseries labels Apr 12, 2014
@@ -611,7 +611,8 @@ def __sub__(self, other):
def _add_delta(self, delta):
if isinstance(delta, (Tick, timedelta)):
inc = offsets._delta_to_nanoseconds(delta)
new_values = (self.asi8 + inc).view(_NS_DTYPE)
new_values = np.array([i + inc if i != tslib.iNaT else i for i in self.asi8])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be very slow, instead do something like this:

mask = self.asi8 == tslib.iNaT
new_values = (self.asi8 + inc).view(_NS_DTYPE)
new_values[mask] = tslib.iNaT

@jreback
Copy link
Contributor

jreback commented Apr 12, 2014

pls rebase

needs a perf check - run a vbench and post/investigate >1.20 ratio

@jreback jreback added this to the 0.14.0 milestone Apr 12, 2014
@sinhrks
Copy link
Member Author

sinhrks commented Apr 12, 2014

Followings are vbench results. There seems to be no test constantly get worse, but is there anything should be checked?

First time

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
....
frame_get_numeric_data                       |   0.2447 |   0.2143 |   1.1416 |
frame_fancy_lookup_all                       |  37.5026 |  32.1034 |   1.1682 |
timestamp_ops_diff2                          |  64.3330 |  54.9770 |   1.1702 |
timeseries_large_lookup_value                |   0.1390 |   0.0484 |   2.8719 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [9a66e5c] : BUG: Arithmetic, timezone and offsets operations affecting to NaT
Base   [c70b4ae] : BLD: change 2.7 build to use scipy 0.13.3

Second time:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
....
packers_write_pack                           |  12.9484 |  11.5690 |   1.1192 |
frame_dtypes                                 |   0.3934 |   0.3263 |   1.2056 |
frame_boolean_row_select                     |   0.5393 |   0.4443 |   1.2137 |
frame_to_string_floats                       |  82.4350 |  66.6873 |   1.2361 |
frame_assign_timeseries_index                |   2.0270 |   1.6250 |   1.2474 |
timedelta_convert_int                        |   0.4474 |   0.3517 |   1.2721 |
frame_get_numeric_data                       |   0.2870 |   0.2170 |   1.3227 |

if offset is FY5253 or offset is FY5253Quarter:
offset = offset(n=1, startingMonth=1, weekday=1,
qtr_with_extra_week=1, variation='last')
elif offset is WeekOfMonth or offset is LastWeekOfMonth:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can yum make a helper function of this offset creation so that u can use it here and in the apply_out_of_range ?

@sinhrks
Copy link
Member Author

sinhrks commented Apr 13, 2014

Got it. Modified and rebased.

jreback added a commit that referenced this pull request Apr 13, 2014
BUG: Arithmetic, timezone and offsets operations affecting to NaT
@jreback jreback merged commit 7a4514c into pandas-dev:master Apr 13, 2014
@jreback
Copy link
Contributor

jreback commented Apr 13, 2014

thanks

@sinhrks sinhrks deleted the nat_convert branch April 13, 2014 11:37
@jreback jreback mentioned this pull request Apr 13, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Timedelta Timedelta data type Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tz_localize does not handle NaT
2 participants