read_csv can't roundtrip with UTF16/32 encodings #24130

Closed
@WillAyd

Writing to a text-mode temporary file and reading it back with read_csv works fine with UTF-8:

In [1]: import tempfile

In [2]: import pandas as pd

In [4]: with tempfile.TemporaryFile(mode='w+', encoding='utf8') as outfile:
   ...:     outfile.write('foo')
   ...:     outfile.seek(0)
   ...:     pd.read_csv(outfile, encoding='utf8')

The same round trip fails with UTF-16 and UTF-32:

In [4]: with tempfile.TemporaryFile(mode='w+', encoding='utf16') as outfile: 
   ...:     outfile.write('foo') 
   ...:     outfile.seek(0) 
   ...:     pd.read_csv(outfile, encoding='utf16') 
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x6f in position 2: truncated data

In [4]: with tempfile.TemporaryFile(mode='w+', encoding='utf32') as outfile: 
   ...:     outfile.write('foo') 
   ...:     outfile.seek(0) 
   ...:     pd.read_csv(outfile, encoding='utf32') 
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-2: truncated data

I believe this is strictly a problem with the C parser.
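
The two messages above are consistent with one possible explanation, offered here as an assumption rather than a confirmed diagnosis: the text-mode handle gives the parser the already-decoded string 'foo', which ends up as three UTF-8 bytes that are then decoded a second time with the requested codec. The standard-library sketch below does not involve pandas at all, and `decode_error` is a hypothetical helper written for this illustration. It only shows what happens when those three bytes are decoded as UTF-16-LE or UTF-32-LE.

```python
def decode_error(data: bytes, encoding: str):
    """Return the UnicodeDecodeError raised when decoding data, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err
    return None


# Assumption: the parser ends up holding the UTF-8 bytes of the text,
# not the UTF-16/UTF-32 bytes that were written to disk.
raw = 'foo'.encode('utf8')  # b'foo', three bytes

err16 = decode_error(raw, 'utf-16-le')  # three bytes is not a multiple of two
err32 = decode_error(raw, 'utf-32-le')  # three bytes is less than one four-byte unit

print(err16)
print(err32)
print(decode_error(raw, 'utf8'))  # decoding as UTF-8 succeeds, so this prints None
```

If the printed messages match the ones reported above, that supports (but does not prove) the idea that the encoding is being applied twice when the input is a text-mode handle.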

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b78aa8d
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+1223.gb78aa8d85
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.2.0
