read_csv() fails to properly decode latin-1 (i.e. non-utf8) encoded file from URL #10424

Closed
@BotoKopo

Description

Problem

A colleague and I ran into this problem while working on data available on an FTP (or HTTP) server. The server is on an internal network, so we're sorry that we can't point to a proper example file.

Reading a CSV file (with read_csv) that is encoded in something other than utf-8 (like latin-1) and has a special character in its header fails to decode the header properly when the file is accessed through a URL (http or ftp). It works when the file is local, and it works for a utf-8 file whether local or remote.
The result looks as if the file was decoded twice.

An example should make this clearer.

Let's say we have 2 CSV files on a remote server, data.latin1.csv and data.utf8.csv, encoded in latin-1 and utf-8 respectively, and both containing:

a,b°
1.1,2.2

Then the following code:

import pandas as pd

# Placeholder URL: the real files sit on an internal server.
path = "ftp://sorry/I/cant/supply/such/a/path/for/the/example/data.encoding.csv"

for enc in ('latin1', 'utf8'):
    # Build the URL of the file stored in this encoding.
    f = path.replace('encoding', enc)
    data = pd.read_csv(f, encoding=enc)
    # Report the second column name and its length in characters.
    print("encoding {0} : non-ascii={1} , length={2}".format(
        enc, data.columns[1].encode('utf8'), len(data.columns[1])))

will give :

encoding latin1 : non-ascii=b° , length=3
encoding utf8 : non-ascii=b° , length=2

This was tested with Python 2.7.6 + Pandas 0.13.1 and with Python 3.4.0 + Pandas 0.15.2, with the same result.

The same operation on local files gives the appropriate result, i.e. the same as the 'utf8' output above (this REALLY IS a matter of URL + latin1, or any encoding other than utf-8). It looks like the data was decoded twice. The output length points the same way: one character became two, as if the byte sequence for '°' had been treated as "normal" characters and converted a second time.
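To illustrate what "decoded twice" would mean, here is a minimal sketch in plain Python, outside pandas. It assumes (this is only a guess at the mechanism, not something traced in pandas) that the already decoded text gets turned back into utf-8 bytes and is then decoded again as latin-1. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch of a double decoding, under the assumption described above.
header = u"b\u00b0"                          # the column name "b°"

raw_latin1 = header.encode("latin-1")        # bytes as stored in data.latin1.csv
decoded_once = raw_latin1.decode("latin-1")  # correct: one decoding step

# Assumed faulty path: the text is re-encoded as utf-8 (2 bytes for "°"),
# then each byte is decoded again as a separate latin-1 character.
decoded_twice = decoded_once.encode("utf-8").decode("latin-1")

print(len(decoded_once), len(decoded_twice))
```

Under that assumption the lengths are 2 and 3, which matches the length=2 and length=3 values reported above.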

With the Python engine of read_csv(), the same test raises an error instead ("UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 3: ordinal not in range(128)").
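The "position 3" in that message is consistent with the header line "a,b°", where '°' sits at index 3. The sketch below reproduces the same kind of error in plain Python. It does not go through the pandas code path, so it only shows what the message says, not where pandas triggers it.

```python
# -*- coding: utf-8 -*-
# Encoding the already decoded header line with the ascii codec fails
# on the first non-ascii character.
header_line = u"a,b\u00b0"   # "a,b°"

try:
    header_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start is the index of the offending character in header_line.
print(error.encoding, error.start)
```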

In pandas code

Now, having had a look at Pandas' code, I would focus on 2 points in pandas.io.parsers:

  • when the file is a URL, the data is opened through urllib (or urllib2), then read and decoded (according to the requested encoding), and the result is fed into a StringIO stream (cf. pandas.io.common.maybe_read_encoded_stream()),
  • as far as I could trace it, the file seems to be decoded again later, especially for the 'c' engine in the pandas.io.parsers.CParserWrapper.read() method (in fact by _parser.read() at the end, which is the C parser).

This would explain the double decoding when the file is a URL, and the normal decoding when the file is local.

Furthermore, in pandas.io.common, when replacing (in maybe_read_encoded_stream() function) :

from pandas.compat import StringIO
...
reader = StringIO(reader.read().decode(encoding, errors))

by :

from pandas.compat import StringIO, BytesIO
...
reader = BytesIO(reader.read())

this problem seems to be solved (which is logical when we look at what StringIO/BytesIO point to, depending on the Python version, and which kind of data they handle).
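The difference between the two variants can be sketched with the standard library only (Python 3). Everything here is hypothetical: fetch_as_text() and fetch_as_bytes() mimic the two lines above, an io.BytesIO stands in for the urllib response, and parse_header() is a made-up stand-in for a parser that decodes whatever it reads. It assumes text is turned back into utf-8 bytes first, which is a guess, not the actual C parser logic.

```python
import io

def fetch_as_text(response, encoding):
    # Mimics the current line: decode first, hand text downstream.
    return io.StringIO(response.read().decode(encoding))

def fetch_as_bytes(response):
    # Mimics the proposed line: hand the raw bytes downstream.
    return io.BytesIO(response.read())

def parse_header(stream, encoding):
    # Hypothetical parser stand-in: it always decodes with `encoding`.
    data = stream.read()
    if isinstance(data, str):
        # Assumption: text is converted back to utf-8 bytes before decoding.
        data = data.encode("utf-8")
    return data.decode(encoding).splitlines()[0].split(",")

# The example file content, stored as latin-1 bytes.
payload = "a,b\u00b0\n1.1,2.2\n".encode("latin-1")

via_text = parse_header(fetch_as_text(io.BytesIO(payload), "latin-1"), "latin-1")
via_bytes = parse_header(fetch_as_bytes(io.BytesIO(payload)), "latin-1")

print([len(name) for name in via_text], [len(name) for name in via_bytes])
```

In this toy model only the bytes variant yields a 2-character second column name, because the data is decoded exactly once.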

So it seems to me that the problem is located at that point, and it would then be a bug.
However, it could be a feature ;-) as I don't know whether there could be side-effects for other cases than the one discussed here, especially if StringIO was intentionally used for a purpose I can't figure out.

Labels: Bug, IO CSV, IO Network, Unicode
