Commit 7a019bf

PERF: Improve performance when hashing a nullable extension array (#56508)
* PERF: Improve performance when hashing a nullable extension array
* Fixup
* Add whatsnew
1 parent 90f3c10 commit 7a019bf

3 files changed: +13 -2 lines changed

doc/source/whatsnew/v2.2.0.rst

Lines changed: 2 additions & 1 deletion
@@ -354,8 +354,8 @@ See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for mor
 
 Other API changes
 ^^^^^^^^^^^^^^^^^
+- The hash values of nullable extension dtypes changed to improve the performance of the hashing operation (:issue:`56507`)
 - ``check_exact`` now only takes effect for floating-point dtypes in :func:`testing.assert_frame_equal` and :func:`testing.assert_series_equal`. In particular, integer dtypes are always checked exactly (:issue:`55882`)
--
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_220.deprecations:
@@ -515,6 +515,7 @@ Performance improvements
 - Performance improvement in :meth:`Series.value_counts` and :meth:`Series.mode` for masked dtypes (:issue:`54984`, :issue:`55340`)
 - Performance improvement in :meth:`.DataFrameGroupBy.nunique` and :meth:`.SeriesGroupBy.nunique` (:issue:`55972`)
 - Performance improvement in :meth:`.SeriesGroupBy.idxmax`, :meth:`.SeriesGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.DataFrameGroupBy.idxmin` (:issue:`54234`)
+- Performance improvement when hashing a nullable extension array (:issue:`56507`)
 - Performance improvement when indexing into a non-unique index (:issue:`55816`)
 - Performance improvement when indexing with more than 4 keys (:issue:`54550`)
 - Performance improvement when localizing time to UTC (:issue:`55241`)

pandas/core/arrays/base.py

Lines changed: 1 addition & 1 deletion
@@ -1994,7 +1994,7 @@ def _hash_pandas_object(
         ... hash_key="1000000000000000",
         ... categorize=False
         ... )
-        array([11381023671546835630, 4641644667904626417], dtype=uint64)
+        array([ 6238072747940578789, 15839785061582574730], dtype=uint64)
         """
         from pandas.core.util.hashing import hash_array
 

pandas/core/arrays/masked.py

Lines changed: 10 additions & 0 deletions
@@ -88,6 +88,7 @@
 )
 from pandas.core.indexers import check_array_indexer
 from pandas.core.ops import invalid_comparison
+from pandas.core.util.hashing import hash_array
 
 if TYPE_CHECKING:
     from collections.abc import (
@@ -914,6 +915,15 @@ def _concat_same_type(
         mask = np.concatenate([x._mask for x in to_concat], axis=axis)
         return cls(data, mask)
 
+    def _hash_pandas_object(
+        self, *, encoding: str, hash_key: str, categorize: bool
+    ) -> npt.NDArray[np.uint64]:
+        hashed_array = hash_array(
+            self._data, encoding=encoding, hash_key=hash_key, categorize=categorize
+        )
+        hashed_array[self.isna()] = hash(self.dtype.na_value)
+        return hashed_array
+
     def take(
         self,
         indexer,
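
The idea behind the new `_hash_pandas_object` in `masked.py` can be illustrated without pandas: hash the underlying data buffer regardless of the mask, then overwrite every masked slot with a single constant NA hash. The sketch below is a dependency-free illustration under stated assumptions, not the pandas implementation. `hash_value`, `hash_masked`, and `NA_HASH` are hypothetical names, and keyed BLAKE2b stands in for pandas' `hash_array`; the commit itself uses `hash(self.dtype.na_value)` for the NA slots.

```python
import hashlib
import struct

# Hypothetical constant used for every missing slot (assumption: the real
# code uses hash(self.dtype.na_value) instead).
NA_HASH = 0x9E3779B97F4A7C15


def hash_value(value: int, hash_key: bytes) -> int:
    """Keyed 64-bit hash of one integer (stand-in for pandas' hash_array)."""
    payload = struct.pack("<q", value)  # signed 64-bit, little-endian
    digest = hashlib.blake2b(payload, digest_size=8, key=hash_key).digest()
    return int.from_bytes(digest, "little")


def hash_masked(data, mask, hash_key=b"0123456789123456"):
    """Hash a (data, mask) pair the way the diff above does conceptually."""
    # Step 1: hash the raw data buffer, ignoring the mask entirely.
    hashed = [hash_value(v, hash_key) for v in data]
    # Step 2: overwrite masked positions with one shared NA hash.
    for i, is_na in enumerate(mask):
        if is_na:
            hashed[i] = NA_HASH
    return hashed


# Whatever filler value sits under a masked slot does not affect the result.
a = hash_masked([1, 2, 999], [False, False, True])
b = hash_masked([1, 2, 0], [False, False, True])
print(a == b)
```

Because step 2 replaces the masked slots, two arrays that differ only in the filler values under the mask hash identically, which is the property a nullable array needs.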
