Open
Description
Research
-
I have searched the [pandas] tag on StackOverflow for similar questions.
-
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/77784124/the-concept-behind-how-hashes-are-combined-in-pandas
Question about pandas
I came across the combine_hash_arrays
functions in pandas library here. How does it keep good collision rate after combination ? Does anyone have the mathematical proof for this code ?
The below code is retrieved from pandas implementation.
def combine_hash_arrays(
arrays: Iterator[np.ndarray], num_items: int
) -> npt.NDArray[np.uint64]:
"""
Parameters
----------
arrays : Iterator[np.ndarray]
num_items : int
Returns
-------
np.ndarray[uint64]
Should be the same as CPython's tupleobject.c
"""
try:
first = next(arrays)
except StopIteration:
return np.array([], dtype=np.uint64)
arrays = itertools.chain([first], arrays)
mult = np.uint64(1000003)
out = np.zeros_like(first) + np.uint64(0x345678)
last_i = 0
for i, a in enumerate(arrays):
inverse_i = num_items - i
out ^= a
out *= mult
mult += np.uint64(82520 + inverse_i + inverse_i)
last_i = i
assert last_i + 1 == num_items, "Fed in wrong num_items"
out += np.uint64(97531)
return out
I have already checked whether they work well by checking their combined hashes' collision rate.