
Cannot hash table with index containing mixed types including non utf-8 bytes strings #27215

Open
@stestagg


Example

import pandas
from pandas.util import hash_pandas_object
hash_pandas_object(pandas.DataFrame({'a': [1,2]}, index=[1, b'\xff1']), encoding='latin1')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Problem description

This is pretty niche :)

If a table passed to hash_pandas_object has an index with mixed types, then this branch is followed:

vals = hashing.hash_object_array(vals.astype(str).astype(object),
                                 hash_key, encoding)

This calls vals.astype(str), where vals is a numpy array (I'm assuming so that the values can be converted to useful Python objects).

As shown here, this does not work if the array contains bytes values that cannot be decoded as ASCII:

>>> np.array([b'\xff']).astype(str)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
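For comparison, here is a minimal sketch of the same conversion done with an explicit encoding. It assumes `np.char.decode` is an acceptable stand-in for the implicit ASCII decode. The variable names are illustrative only, and the `astype(str)` failure is caught rather than assumed, since behaviour may differ between numpy versions.

```python
import numpy as np

raw = np.array([b"\xff"])  # bytes array, dtype 'S1'

# The implicit conversion decodes as ASCII and is expected to fail here.
try:
    raw.astype(str)
    astype_error = None
except UnicodeDecodeError as exc:
    astype_error = exc

# Decoding with an explicit single-byte encoding succeeds for any byte value.
decoded = np.char.decode(raw, "latin1")

print(astype_error)
print(decoded)
```

The point is only that the bytes are convertible once an encoding is supplied; nothing here is taken from the pandas source.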

There is an encoding argument that can be passed to hash_pandas_object, but it is not used when converting the values to str.
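One possible direction, sketched under assumptions rather than as the actual fix: decode bytes elements with the caller's encoding before the values are stringified. The helper name `stringify_with_encoding` is hypothetical and does not exist in pandas.

```python
import numpy as np

def stringify_with_encoding(vals, encoding="utf8"):
    """Hypothetical helper: return an object array of str, decoding
    bytes elements with `encoding` and calling str() on everything else."""
    out = np.empty(len(vals), dtype=object)
    for i, v in enumerate(vals):
        out[i] = v.decode(encoding) if isinstance(v, bytes) else str(v)
    return out

# Same mixed-type values as the index in the example above.
mixed = np.array([1, b"\xff1"], dtype=object)
converted = stringify_with_encoding(mixed, encoding="latin1")
print(converted)
```

An array like `converted` contains only str objects, so it should be hashable without the ASCII decode step. Note the trade-off: with this approach 1, '1' and b'1' would all map to the same string, so whether that is acceptable depends on how the mixed-type branch is meant to behave.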

Expected Output

1          XXXXXXXXXXXXXXXXXXXXXX
b'\xff1'   XXXXXXXXXXXXXXXXXXXXXX
dtype: uint64

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 5.0.0
pip: 19.1.1
setuptools: 40.8.0
Cython: 0.29.11
numpy: 1.18.0.dev0+13de4d8
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: 2.1.2
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


Labels: Bug, Strings (String extension data type and string data), hashing (hash_pandas_object)
