Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
                       y=[1, 2, 3, 4, 5])).set_index('x')
df                  # 1
df.index.isna()     # 2
df.to_hdf('test.h5', key='test')
df2 = pd.read_hdf('test.h5')
df2                 # 3
df2.index.isna()    # 4
Issue Description
After saving to and reloading from HDF5, output 1 does not match output 3, and output 2 does not match output 4.
In effect, the string 'nan' in the index is converted to NaN during the HDF5 save/reload round trip.
The problem does not seem to occur if I remove set_index('x'), i.e. when the data is stored in a column rather than in the index.
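Based on the observation above that the column form is unaffected, one possible stopgap is to move the index into a column before writing and restore it after reading. This is only a sketch: the helper names `to_hdf_index_as_column` and `read_hdf_restore_index` are hypothetical (not pandas API), it assumes the index has a name, and it relies on the column behaviour described in this report rather than on any documented guarantee.

```python
import os
import tempfile

import pandas as pd


def to_hdf_index_as_column(frame, path, key):
    """Hypothetical helper: move the (named) index into a column, then write."""
    frame.reset_index().to_hdf(path, key=key, mode="w")


def read_hdf_restore_index(path, key, index_name):
    """Hypothetical helper: read the frame and turn the column back into the index."""
    return pd.read_hdf(path, key=key).set_index(index_name)


df = pd.DataFrame(
    dict(x=["a", "nan", "NA", "na", "NaN"], y=[1, 2, 3, 4, 5])
).set_index("x")

try:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.h5")
        to_hdf_index_as_column(df, path, key="test")
        df3 = read_hdf_restore_index(path, key="test", index_name="x")
    # Inspect whether any index label came back as missing.
    print(df3.index.isna())
except ImportError:
    # pandas raises ImportError when the optional PyTables dependency is absent.
    df3 = None
    print("PyTables is not installed; HDF5 round trip skipped")
```

This does not address the underlying behaviour for string indexes; it only avoids storing the affected labels in the index.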
Expected Behavior
The DataFrame reloaded from the saved HDF5 file should match the DataFrame that was written, including its index.
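The expectation can be written as an explicit check with `pd.testing.assert_frame_equal`. The sketch below is illustrative only: `roundtrip_diff` and `hdf_roundtrip` are hypothetical helper names, the file is written to a temporary directory, and the result depends on the pandas and PyTables versions installed.

```python
import os
import tempfile

import pandas as pd


def roundtrip_diff(before, after):
    """Return None if the frames match, otherwise the assertion message."""
    try:
        pd.testing.assert_frame_equal(before, after)
    except AssertionError as exc:
        return str(exc)
    return None


def hdf_roundtrip(frame, key="test"):
    """Write `frame` to a temporary HDF5 file and read it back (needs PyTables)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.h5")
        frame.to_hdf(path, key=key, mode="w")
        return pd.read_hdf(path, key=key)


df = pd.DataFrame(
    dict(x=["a", "nan", "NA", "na", "NaN"], y=[1, 2, 3, 4, 5])
).set_index("x")

try:
    diff = roundtrip_diff(df, hdf_roundtrip(df))
    print("round trip OK" if diff is None else diff)
except ImportError:
    # pandas raises ImportError when the optional PyTables dependency is absent.
    print("PyTables is not installed; HDF5 round trip skipped")
```

If the behaviour reported here is present, the printed message should describe a difference in the index rather than "round trip OK".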
Installed Versions
INSTALLED VERSIONS
commit : f2c8480
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.45.1.el7.x86_64
Version : #1 SMP Wed Oct 13 17:20:51 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 1.2.3
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.2
sqlalchemy : 1.4.6
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1