-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: hashing datetime64 objects #50960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 59 commits
7761ecd
610b0c6
0e140ae
919383c
ae8c0bb
92a39eb
2f67805
0635f86
229ab72
6e96805
24fda82
058b666
7398991
3fdf564
6e4836e
f55337a
a97dfc9
1338ca2
9fb1987
818682c
74ab540
037ba05
7a4b1ab
47e5247
dcd09dd
32d479b
d47cfd8
c091317
1ce791e
704fb69
6d962b0
f838953
0d8500a
58a29b6
95069e0
7362f3e
dd08670
998a4cc
b75730b
5c57a5e
6b4460f
c94609b
afe9493
4fecc97
b4cc41e
c620339
3633653
c55f182
6e2bbf0
23c2826
143b3a3
ffb8365
1fdfd64
5513721
875d6af
a29a56a
40e6e17
15a701c
af25f40
9d5cb46
d5a031d
bd7d432
394d86e
1766bc3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,7 +7,10 @@ | |
typedef npy_complex64 khcomplex64_t; | ||
typedef npy_complex128 khcomplex128_t; | ||
|
||
// get pandas_datetime_to_datetimestruct | ||
#include <../../tslibs/src/datetime/np_datetime.h> | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
#include "datetime.h" | ||
|
||
// khash should report usage to tracemalloc | ||
#if PY_VERSION_HEX >= 0x03060000 | ||
|
@@ -305,6 +308,7 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key); | |
#define _PandasHASH_XXROTATE(x) ((x << 13) | (x >> 19)) /* Rotate left 13 bits */ | ||
#endif | ||
|
||
|
||
Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) { | ||
Py_ssize_t i, len = Py_SIZE(key); | ||
PyObject **item = key->ob_item; | ||
|
@@ -315,9 +319,7 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) { | |
if (lane == (Py_uhash_t)-1) { | ||
return -1; | ||
} | ||
acc += lane * _PandasHASH_XXPRIME_2; | ||
acc = _PandasHASH_XXROTATE(acc); | ||
acc *= _PandasHASH_XXPRIME_1; | ||
acc = tuple_update_uhash(acc, lane); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, missed that you removed this here. If we duplicate the code, maybe just don't change it here? |
||
} | ||
|
||
/* Add input length, mangled to keep the historical value of hash(()). */ | ||
|
@@ -351,6 +353,10 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key) { | |
else if (PyTuple_CheckExact(key)) { | ||
hash = tupleobject_hash((PyTupleObject*)key); | ||
} | ||
else if (PyObject_TypeCheck(key, &PyDatetimeArrType_Type)) { | ||
// GH#50690 | ||
hash = np_datetime64_object_hash((PyDatetimeScalarObject *)key); | ||
} | ||
else { | ||
hash = PyObject_Hash(key); | ||
} | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -25,8 +25,9 @@ This file is derived from NumPy 1.7. See NUMPY_LICENSE.txt | |||||
#include <numpy/arrayobject.h> | ||||||
#include <numpy/arrayscalars.h> | ||||||
#include <numpy/ndarraytypes.h> | ||||||
#include "np_datetime.h" | ||||||
|
||||||
#include "np_datetime.h" | ||||||
#include "datetime.h" | ||||||
|
||||||
const int days_per_month_table[2][12] = { | ||||||
{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}, | ||||||
|
@@ -1033,3 +1034,101 @@ PyArray_DatetimeMetaData | |||||
get_datetime_metadata_from_dtype(PyArray_Descr *dtype) { | ||||||
return (((PyArray_DatetimeDTypeMetaData *)dtype->c_metadata)->meta); | ||||||
} | ||||||
|
||||||
|
||||||
// we could use any hashing algorithm, this is the original CPython's for tuples | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have these same defines in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yah I moved some of it rather than duplicating, likely could go further down that path. I'd like to get this in soonish so prefer to do bigger refactors separately There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think with C its generally really hard to track down the effects of copy/pasting defines across headers and implementations though. Maybe we just create a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moving it into its own file seems nice probably. I wouldn't care too much, since this is only a problem if you would include the other header here (even then it might not be since it matches). On a general note, it might be good to add include guards to your headers:
(some pattern there, google style suggests full paths and at least a partial path would make sense I think) |
||||||
|
||||||
#if SIZEOF_PY_UHASH_T > 4 | ||||||
#define _PandasHASH_XXPRIME_1 ((Py_uhash_t)11400714785074694791ULL) | ||||||
#define _PandasHASH_XXPRIME_2 ((Py_uhash_t)14029467366897019727ULL) | ||||||
#define _PandasHASH_XXPRIME_5 ((Py_uhash_t)2870177450012600261ULL) | ||||||
#define _PandasHASH_XXROTATE(x) ((x << 31) | (x >> 33)) /* Rotate left 31 bits */ | ||||||
#else | ||||||
#define _PandasHASH_XXPRIME_1 ((Py_uhash_t)2654435761UL) | ||||||
#define _PandasHASH_XXPRIME_2 ((Py_uhash_t)2246822519UL) | ||||||
#define _PandasHASH_XXPRIME_5 ((Py_uhash_t)374761393UL) | ||||||
#define _PandasHASH_XXROTATE(x) ((x << 13) | (x >> 19)) /* Rotate left 13 bits */ | ||||||
#endif | ||||||
|
||||||
|
||||||
Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane) { | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
The static inline should be the same as pandas inline. However, you can of course also just use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In other PRs I've asked us to move away from specifying inline explicitly instead allowing the compiler to choose for us. Is that the wrong approach? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIRC its a compiler hint mainly but I am not sure where it matters in many places. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea on second read this might be a place where it actually could inline. Thought it was using the Python runtime at first but I see it doesn't now, so it might have a chance gcc has |
||||||
acc += lane * _PandasHASH_XXPRIME_2; | ||||||
acc = _PandasHASH_XXROTATE(acc); | ||||||
acc *= _PandasHASH_XXPRIME_1; | ||||||
return acc; | ||||||
} | ||||||
|
||||||
// https://github.com/pandas-dev/pandas/pull/50960 | ||||||
Py_hash_t | ||||||
hash_datetime_from_struct(npy_datetimestruct* dts) { | ||||||
/* | ||||||
* If we cannot cast to datetime, use the datetime struct values directly | ||||||
* and mix them similar to a tuple. | ||||||
*/ | ||||||
|
||||||
Py_uhash_t acc = _PandasHASH_XXPRIME_5; | ||||||
#if 64 <= SIZEOF_PY_UHASH_T | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year); | ||||||
#else | ||||||
/* Mix lower and uper bits of the year if int64 is larger */ | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)(dts->year >> SIZEOF_PY_UHASH_T)); | ||||||
#endif | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->month); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->day); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->min); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->sec); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->us); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->ps); | ||||||
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->as); | ||||||
Comment on lines
+1080
to
+1083
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could bunch everything below second (or even minute or more?) into a single 64bit number (then unfortunately the same trick to split it up if necessary on the platform). Not sure that is worthwhile, probably a tiny bit faster, but I am mainly wondering if it might generalize a bit nicer if we use a simpler scheme that doesn't require the full datetime struct in principle. |
||||||
/* should be a need to mix length, as it is fixed anyway? */ | ||||||
if (acc == (Py_uhash_t)-1) { | ||||||
acc = (Py_uhash_t)-2; | ||||||
} | ||||||
return acc; | ||||||
} | ||||||
|
||||||
|
||||||
// TODO(jbrockmendel): same thing for timedelta64 objects | ||||||
Py_hash_t np_datetime64_object_hash(PyDatetimeScalarObject* key) { | ||||||
// GH#50690 numpy's hash implementation does not preserve comparabity | ||||||
// either across resolutions or with standard library objects. | ||||||
// See also Timestamp.__hash__ | ||||||
|
||||||
NPY_DATETIMEUNIT unit = (NPY_DATETIMEUNIT)key->obmeta.base; | ||||||
npy_datetime value = key->obval; | ||||||
npy_datetimestruct dts; | ||||||
|
||||||
if (value == NPY_DATETIME_NAT) { | ||||||
// np.datetime64("NaT") in any reso | ||||||
return NPY_DATETIME_NAT; | ||||||
} | ||||||
|
||||||
pandas_datetime_to_datetimestruct(value, unit, &dts); | ||||||
|
||||||
if ((dts.year > 0) && (dts.year <= 9999) && (dts.ps == 0) && (dts.as == 0)) { | ||||||
// we CAN cast to pydatetime, so use that hash to ensure we compare | ||||||
// as matching standard library datetimes (and pd.Timestamps) | ||||||
if (PyDateTimeAPI == NULL) { | ||||||
/* delayed import, may be nice to move to import time */ | ||||||
PyDateTime_IMPORT; | ||||||
if (PyDateTimeAPI == NULL) { | ||||||
return -1; | ||||||
} | ||||||
} | ||||||
|
||||||
PyObject* dt; | ||||||
Py_hash_t hash; | ||||||
|
||||||
dt = PyDateTime_FromDateAndTime( | ||||||
dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us); | ||||||
if (dt == NULL) { | ||||||
return -1; | ||||||
} | ||||||
hash = PyObject_Hash(dt); | ||||||
Py_DECREF(dt); | ||||||
return hash; | ||||||
} | ||||||
|
||||||
return hash_datetime_from_struct(&dts); | ||||||
} |
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -22,6 +22,7 @@ This file is derived from NumPy 1.7. See NUMPY_LICENSE.txt | |||
#endif // NPY_NO_DEPRECATED_API | ||||
|
||||
#include <numpy/ndarraytypes.h> | ||||
#include <numpy/arrayscalars.h> | ||||
|
||||
typedef struct { | ||||
npy_int64 days; | ||||
|
@@ -116,4 +117,8 @@ PyArray_DatetimeMetaData get_datetime_metadata_from_dtype( | |||
PyArray_Descr *dtype); | ||||
|
||||
|
||||
Py_hash_t np_datetime64_object_hash(PyDatetimeScalarObject* key); | ||||
Py_hash_t hash_datetime_from_struct(npy_datetimestruct* dts); | ||||
Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane); | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
no need to make this public. |
||||
|
||||
#endif // PANDAS__LIBS_TSLIBS_SRC_DATETIME_NP_DATETIME_H_ |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,6 +53,9 @@ static int pandas_datetime_exec(PyObject *module) { | |
capi->get_datetime_iso_8601_strlen = get_datetime_iso_8601_strlen; | ||
capi->make_iso_8601_datetime = make_iso_8601_datetime; | ||
capi->make_iso_8601_timedelta = make_iso_8601_timedelta; | ||
capi->hash_datetime_from_struct = hash_datetime_from_struct; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You also need to define a macro for this later in the file There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
didnt seem to do it |
||
capi->np_datetime64_object_hash = np_datetime64_object_hash; | ||
capi->tuple_update_uhash = tuple_update_uhash; | ||
|
||
PyObject *capsule = PyCapsule_New(capi, PandasDateTime_CAPSULE_NAME, | ||
pandas_datetime_destructor); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ See NUMPY_LICENSE.txt for the license. | |
#endif // NPY_NO_DEPRECATED_API | ||
|
||
#include <numpy/ndarraytypes.h> | ||
#include <numpy/arrayscalars.h> | ||
#include "np_datetime.h" | ||
#include "np_datetime_strings.h" | ||
#include "date_conversions.h" | ||
|
@@ -55,6 +56,9 @@ typedef struct { | |
int (*make_iso_8601_datetime)(npy_datetimestruct *, char *, int, int, | ||
NPY_DATETIMEUNIT); | ||
int (*make_iso_8601_timedelta)(pandas_timedeltastruct *, char *, size_t *); | ||
Py_hash_t (*hash_datetime_from_struct)(npy_datetimestruct* dts); | ||
Py_hash_t (*np_datetime64_object_hash)(PyDatetimeScalarObject* key); | ||
Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane); | ||
} PandasDateTime_CAPI; | ||
|
||
// The capsule name appears limited to module.attributename; see bpo-32414 | ||
|
@@ -107,6 +111,12 @@ static PandasDateTime_CAPI *PandasDateTimeAPI = NULL; | |
(base)) | ||
#define make_iso_8601_timedelta(tds, outstr, outlen) \ | ||
PandasDateTimeAPI->make_iso_8601_timedelta((tds), (outstr), (outlen)) | ||
#define hash_datetime_from_struct(dts) \ | ||
PandasDateTimeAPI->hash_datetime_from_struct((dts)) | ||
#define np_datetime64_object_hash(dts) \ | ||
PandasDateTimeAPI->np_datetime64_object_hash((key)) | ||
#define tuple_update_uhash(acc, lane) \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK need to be a bit careful here. Remember that C has no concept of namespaces - if you define With the capsule, the function that you ultimately include as part of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. tuple_update_hash is accessed in khash_python.h, is only defined in np_datetime.c AFAICT There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Header files are loosely just copied into whatever they are included into. So you are doing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do they have a different notion of what tuple_update_uhash is? It's only defined in one place. Opened #51951 to revert the capsuling. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You have a declaration for I understand it is frustrating but the implementation here is not correct. Have you checked out any books on C programming? K&R is a nice book for a quick intro - might be worth reading through that or another book before coming back to this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah sorry not khash_python looks like np_datetime There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, so if i remove it from np_datetime.h i get build failure locally There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's right. The way the includes happen and the way you have things structured now there you are either providing conflicting declarations for I really think you'd be better off getting an intro to C book and learning a bit more about what headers/modules do with respect to the different compilation phases. Of course always good to get hands-on experience, but doing some of the basics first before trying to dive in would probably make the process easier There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will I appreciate your encouragement but I have a long reading list and dayjob responsibilities. Can you just tell me how to fix the [expletive deleted] thing that was working a week ago? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. Move your hashing functions to the capsule as static functions |
||
PandasDateTimeAPI->tuple_update_uhash((acc), (lane)) | ||
#endif /* !defined(_PANDAS_DATETIME_IMPL) */ | ||
|
||
#ifdef __cplusplus | ||
|
Uh oh!
There was an error while loading. Please reload this page.