
BUG: DataFrame.convert_dtypes fails on column that is already "string" dtype #31731

Closed
@ghost

Description

Code Sample, a copy-pastable example if possible

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f']})
In [2]: df.dtypes
Out[2]:
A    object
B    object
dtype: object

In [3]: df1 = df.convert_dtypes()

In [4]: df1.dtypes
Out[4]:
A    string                 # <------  expected
B    string                 # <------  expected
dtype: object

In [5]: df2 = df1.convert_dtypes()

In [6]: df2.dtypes
Out[6]:
A    object                 # <------ NOT expected
B    object                 # <------ NOT expected
dtype: object

In [7]: df3 = df1.copy(); df3['A'] = 'test'

In [8]: df3
Out[8]:
      A  B
0  test  d                 # <------  expected
1  test  e                 # <------  expected
2  test  f                 # <------  expected

In [9]: df3.dtypes
Out[9]:
A    object                 # <------ NOT expected (but understandable)
B    string                 # <------  expected
dtype: object

In [10]: df4 = df3.convert_dtypes()

In [11]: df4
Out[11]:
      A     B
0  test  b'd'                 # <------ NOT expected (Why are they bytes now???)
1  test  b'e'                 # <------ NOT expected
2  test  b'f'                 # <------ NOT expected

In [12]: df4.dtypes
Out[12]:
A    string                 # <------  expected
B    object                 # <------ NOT expected
dtype: object

Problem description

The documentation for DataFrame.convert_dtypes() says that it will 'Convert columns to best possible dtypes using dtypes supporting pd.NA'. That is not what happens here. As Out[6] above shows, columns that already have the "string" dtype are converted back to object.

Even worse, when some columns are already "string" and others are object, the two are swapped: the "string" columns become object, and the object columns become "string".

Some mysterious additional details:

  • Assigning a plain Python string to a "string" column changes that column's dtype to object. (Out[9])
  • If one column is "string" and the other is object, df.convert_dtypes() not only turns the "string" column into object and the object column into "string", it also changes the values in the newly created object column from str to bytes. (Out[11])
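The mixed-dtype case can also be checked with a standalone script instead of the IPython session. This is only a sketch: df3 is taken to be a copy of df1, which matches the dtypes shown in Out[9], and the names types_before and types_after are made up for illustration. It prints the Python types of the values in column B before and after the second conversion, so a str to bytes change like the one in Out[11] would be visible directly.

```python
import pandas as pd

# Rebuild the frames from the transcript.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"]})
df1 = df.convert_dtypes()

# Assumption: df3 starts as a copy of df1, then column A is overwritten.
df3 = df1.copy()
df3["A"] = "test"
df4 = df3.convert_dtypes()

# Collect the Python types of the values in B before and after converting.
types_before = {type(v).__name__ for v in df3["B"]}
types_after = {type(v).__name__ for v in df4["B"]}

print("B dtype before:", df3["B"].dtype, "value types:", types_before)
print("B dtype after: ", df4["B"].dtype, "value types:", types_after)
```

If types_after contains 'bytes', the behaviour from Out[11] is present in the installed pandas version.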

Expected Output

Any column that can be converted to one of the new nullable dtypes should be converted to it. Any column that already has a nullable dtype should remain as it is.
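One way to phrase this expectation as a check is that calling convert_dtypes() a second time should change nothing. The helper below is a hypothetical sketch, not part of pandas; the name is_convert_dtypes_stable is made up for this report.

```python
import pandas as pd


def is_convert_dtypes_stable(frame: pd.DataFrame) -> bool:
    """Return True if a second convert_dtypes() call changes nothing."""
    once = frame.convert_dtypes()
    twice = once.convert_dtypes()
    # Compare dtypes column by column, then compare the values.
    same_dtypes = all(a == b for a, b in zip(once.dtypes, twice.dtypes))
    return bool(same_dtypes and once.equals(twice))


df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"]})
stable = is_convert_dtypes_stable(df)
print("convert_dtypes is stable on this frame:", stable)
```

With the behaviour shown in Out[6], this would be expected to print False, since the second call changes the dtypes.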

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 4.54.2
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0
