-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: fix merging on datetimelike columns to not use object-dtype factorizer #53231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
d5c7604
971f267
caec63a
35a650d
47e6bd7
52c768b
fafb61d
d46d7c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -88,6 +88,7 @@ | |
from pandas.core.arrays import ( | ||
ArrowExtensionArray, | ||
BaseMaskedArray, | ||
DatetimeArray, | ||
ExtensionArray, | ||
) | ||
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray | ||
|
@@ -103,7 +104,6 @@ | |
if TYPE_CHECKING: | ||
from pandas import DataFrame | ||
from pandas.core import groupby | ||
from pandas.core.arrays import DatetimeArray | ||
|
||
_factorizers = { | ||
np.int64: libhashtable.Int64Factorizer, | ||
|
@@ -2355,12 +2355,14 @@ def _factorize_keys( | |
rk = extract_array(rk, extract_numpy=True, extract_range=True) | ||
# TODO: if either is a RangeIndex, we can likely factorize more efficiently? | ||
|
||
if isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype): | ||
if (isinstance(lk, DatetimeArray) and isinstance(rk, DatetimeArray)) and ( | ||
lk._is_tzawareness_compat(rk) | ||
): | ||
# Extract the ndarray (UTC-localized) values | ||
# Note: we dont need the dtypes to match, as these can still be compared | ||
lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk) | ||
lk = cast("DatetimeArray", lk)._ndarray | ||
rk = cast("DatetimeArray", rk)._ndarray | ||
lk, rk = lk._ensure_matching_resos(rk) | ||
lk = cast(DatetimeArray, lk)._ndarray | ||
rk = cast(DatetimeArray, rk)._ndarray | ||
|
||
elif ( | ||
isinstance(lk.dtype, CategoricalDtype) | ||
|
@@ -2388,6 +2390,13 @@ def _factorize_keys( | |
# "_values_for_factorize" | ||
rk, _ = rk._values_for_factorize() # type: ignore[union-attr] | ||
|
||
if needs_i8_conversion(lk.dtype) and lk.dtype == rk.dtype: | ||
# GH#23917 TODO: Needs tests for non-matching dtypes | ||
# GH#23917 TODO: needs tests for case where lk is integer-dtype | ||
# and rk is datetime-dtype | ||
lk = np.asarray(lk, dtype=np.int64) | ||
rk = np.asarray(rk, dtype=np.int64) | ||
Comment on lines
+2396
to
+2401
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This part was (accidentally?) removed in #49876. So there won't be any tests failing from leaving this out (because falling back to the object factorizer also gives correct results), but it is necessary to ensure we use the faster factorizer for datetimelikes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i expect that the relevant cases should go through the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not fully:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. makes sense, thanks. |
||
|
||
klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk) | ||
|
||
rizer = klass(max(len(lk), len(rk))) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is checking this way more performant?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, but that was also not the reason for using this: what we want to check here is that we have either two DatetimeTZDtypes, or two np.dtype("M8").
So initially I expanded the original
if isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
with something likeor (isinstance(lk.dtype, np.dtype) and lk.dtype.kind == "M" and isinstance(rk.dtype, np.dtype) and rk.dtype.kind == "M")
. But this got a bit long, and so this seemed simpler (and since we are calling a specific DatetimeArray method in the if-block, it also make sense code-logic-wise to check for it).(the performance benefit of the PR comes from avoiding using the ObjectFactorizer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I prefer the status quo here largely bc it avoids introducing a new method. Not a huge deal though.
BTW new optimized check for
isinstance(lk.dtype, np.dtype) and lk.dtype.kind == "M"
islib.is_np_dtype(lk.dtype, "M")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, with
is_np_dtype
that can be shorter, so reverted back to the original dtype checks (and now with the added dtype check for numpy dtypes), avoiding the new method