
Description
Code Sample, a copy-pastable example if possible
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f']})
In [2]: df.dtypes
Out[2]:
A object
B object
dtype: object
In [3]: df1 = df.convert_dtypes()
In [4]: df1.dtypes
Out[4]:
A string # <------ expected
B string # <------ expected
dtype: object
In [5]: df2 = df1.convert_dtypes()
In [6]: df2.dtypes
Out[6]:
A object # <------ NOT expected
B object # <------ NOT expected
dtype: object
In [7]: df3 = df1.copy(); df3['A'] = 'test'  # df3 was never defined above; assumed here to be a copy of df1
In [8]: df3
Out[8]:
A B
0 test d # <------ expected
1 test e # <------ expected
2 test f # <------ expected
In [9]: df3.dtypes
Out[9]:
A object # <------ NOT expected (but understandable)
B string # <------ expected
dtype: object
In [10]: df4 = df3.convert_dtypes()
In [11]: df4
Out[11]:
A B
0 test b'd' # <------ NOT expected (Why are they bytes now???)
1 test b'e' # <------ NOT expected
2 test b'f' # <------ NOT expected
In [12]: df4.dtypes
Out[12]:
A string # <------ expected
B object # <------ NOT expected
dtype: object
Problem description
The documentation for DataFrame.convert_dtypes() claims that it will 'Convert columns to best possible dtypes using dtypes supporting pd.NA'. However, this does not appear to be the case. As you can see in Out[6] above, if the dtypes are already optimal, they will be converted back to objects.
Even worse, if some of the dtypes are optimal, but other dtypes are objects, they will be switched -- the optimal dtypes will become objects, and the objects will become optimal.
Some mysterious additional details:
- Assigning a string to an 'optimal' column changes the dtype to object (Out[9]).
- If one column is a string and the other an object, df.convert_dtypes() will not only cause the string column to become object and the object column to become string, it will also change the strings in the newly created object column into bytes (Out[11]).
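The mixed-dtype case can be checked without reading the repr by eye. The following is a minimal sketch, not part of the original report: `element_types`, `base`, and `mixed` are illustrative names, and `mixed` is assumed to start as a copy of the already-converted frame. On a version without the bug, every value would be expected to remain a Python str after conversion.

```python
import pandas as pd


def element_types(df):
    """Map each column name to the set of Python type names of its values."""
    return {col: {type(v).__name__ for v in df[col]} for col in df.columns}


# Start from a frame whose columns were already converted once.
base = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"]}).convert_dtypes()

# Assign a scalar string to one column, as in In [7] of the report.
mixed = base.copy()
mixed["A"] = "test"

# Convert again and inspect both the dtypes and the underlying value types.
converted = mixed.convert_dtypes()
print(mixed.dtypes)
print(converted.dtypes)
print(element_types(converted))
```

If `element_types(converted)` reports `bytes` for column B, the behaviour described in Out[11] is being reproduced.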
Expected Output
Anything which can be converted to one of the new nullable datatypes should be converted to one of the new nullable datatypes. Anything which is already a nullable datatype should remain as it is.
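The expectation above amounts to convert_dtypes() being idempotent: a second call should change nothing. Here is a small sketch of how that could be checked; `dtype_report` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def dtype_report(df):
    """Return the dtypes before and after one and two convert_dtypes() passes."""
    once = df.convert_dtypes()
    twice = once.convert_dtypes()
    return pd.DataFrame(
        {
            "original": df.dtypes.astype(str),
            "first_pass": once.dtypes.astype(str),
            "second_pass": twice.dtypes.astype(str),
        }
    )


df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"]})
report = dtype_report(df)

# True when the second pass leaves every dtype as the first pass produced it.
stable = bool((report["first_pass"] == report["second_pass"]).all())
print(report)
print("idempotent:", stable)
```

On the version reported here, the second pass would be expected to differ from the first (Out[6]), so `stable` would be False.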
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 4.54.2
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0