Description
Example
import pandas
from pandas.util import hash_pandas_object
hash_pandas_object(pandas.DataFrame({'a': [1,2]}, index=[1, b'\xff1']), encoding='latin1')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Problem description
This is pretty niche :)
If a table passed to hash_pandas_object has an index with mixed types, then this branch is followed:
pandas/pandas/core/util/hashing.py
Lines 297 to 298 in 4e185fc
which calls vals.astype(str) (I'm assuming so that the values can be converted to useful Python objects), where vals is a numpy array.
As shown here, this does not work if the array contains byte values that are not ASCII-compatible:
>>> import numpy as np
>>> np.array([b'\xff']).astype(str)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
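The error message suggests that the conversion amounts to an ASCII decode of each bytes value. Under that assumption, the same failure can be sketched in plain Python, along with the decode that the encoding passed in the example ('latin1') would allow. This is an illustration of the mechanism, not the code path pandas or numpy actually runs.

```python
# The bytes label from the example above.
raw = b"\xff1"

# Assumption: the str conversion behaves like an ASCII decode,
# which is what the reported error message indicates.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the encoding the caller asked for succeeds,
# because latin1 maps every byte value to a code point.
latin1_text = raw.decode("latin1")

print(type(ascii_error).__name__)
print(repr(latin1_text))
```

Byte 0xff is outside the ASCII range (0 to 127), so the first decode fails at position 0, matching the reported error.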
There is an encoding argument that can be passed to hash_pandas_object, but it is not used when converting the values to str.
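Until the encoding is honored in that branch, one possible workaround is to decode the bytes labels yourself before hashing. The sketch below is only that: decode_bytes_labels is a hypothetical helper, not part of pandas, and it assumes that turning every label into a str is acceptable for your use case.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def decode_bytes_labels(index, encoding="utf8"):
    """Hypothetical helper: decode bytes labels, stringify everything else."""
    return pd.Index(
        [v.decode(encoding) if isinstance(v, bytes) else str(v) for v in index],
        dtype=object,
    )


df = pd.DataFrame({"a": [1, 2]}, index=[1, b"\xff1"])

# Replace the mixed-type index with an all-str index using the same
# encoding that is later passed to hash_pandas_object.
df.index = decode_bytes_labels(df.index, encoding="latin1")

hashes = hash_pandas_object(df, encoding="latin1")
print(hashes)
```

Note that this changes the labels (the integer 1 becomes the string '1'), so the resulting hashes will not match hashes computed from the original labels.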
Expected Output
1 XXXXXXXXXXXXXXXXXXXXXX
b'\xff1' XXXXXXXXXXXXXXXXXXXXXX
dtype: uint64
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 5.0.0
pip: 19.1.1
setuptools: 40.8.0
Cython: 0.29.11
numpy: 1.18.0.dev0+13de4d8
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: 2.1.2
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2019.1
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None