read_csv from Google Cloud Storage ignores encoding #32392

Closed
@EgorBEremeev

Code Sample, a copy-pastable example if possible

    import pandas as pd
    dataframe = pd.read_csv('gs://mybucket/my_file', encoding='cp1251')

Problem description

Reading CSV files with an encoding other than UTF-8, such as cp1251, from Google Cloud Storage fails with the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The stack trace from the Google Cloud Functions environment:

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 60, in load_csv_to_bq
        na_filter=False)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
        parser = TextFileReader(fp_or_buf, **kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
        self._make_engine(self.engine)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
      File "pandas/_libs/parsers.pyx", line 748, in pandas._libs.parsers.TextReader._get_header
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
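
For illustration, the same error message can be reproduced without GCS or pandas. This is a minimal sketch using only the standard library; the sample text "Да" is my own choice (its first cp1251 byte happens to be 0xc4) and is not taken from the actual file.

```python
# Encode a short Cyrillic string as cp1251; "Д" maps to the single byte 0xc4.
raw = "Да".encode("cp1251")

message = None
try:
    # Decoding cp1251 bytes as UTF-8 is what effectively happens
    # when the requested encoding is not applied.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the encoding the file was written in succeeds.
decoded = raw.decode("cp1251")

print(message)
print(decoded)
```

The byte 0xc4 is a valid UTF-8 lead byte, but the byte that follows is not a continuation byte, which is why the decoder reports "invalid continuation byte" at position 0.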

It appears that pandas ignores the encoding parameter because pandas.io.gcs.get_filepath_or_buffer passes mode='rb' to GCSFileSystem.open(filepath_or_buffer, mode).

Tracing back to where the mode parameter is first set, we arrive at this signature in pandas/io/common.py:

    def get_filepath_or_buffer(
        filepath_or_buffer, encoding=None, compression=None, mode=None
    ):
        ...  # body omitted

The default mode=None takes effect because the call to get_filepath_or_buffer() made from here does not pass a value for mode:

pandas/pandas/io/parsers.py

Lines 430 to 432 in 29d6b02

    fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
        filepath_or_buffer, encoding, compression
    )

However, in the current gcsfs master, GCSFileSystem.open() has been removed and fsspec.AbstractFileSystem.open() is used instead. That method now applies the passed encoding when reading or writing in text mode:

        if "b" not in mode:
            mode = mode.replace("t", "") + "b"

            text_kwargs = {
                k: kwargs.pop(k)
                for k in ["encoding", "errors", "newline"]
                if k in kwargs
            }
            return io.TextIOWrapper(
                self.open(path, mode, block_size, **kwargs), **text_kwargs
            )
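
To make the effect of that branch concrete, here is a minimal standard-library sketch of the same pattern. The function `open_buffer` and its in-memory `raw` argument are hypothetical stand-ins for a filesystem `open()`; this is not the fsspec or gcsfs implementation.

```python
import io


def open_buffer(raw: bytes, mode: str = "rb", **kwargs):
    """Hypothetical stand-in for a filesystem open() over in-memory bytes."""
    if "b" not in mode:
        # Text mode: open the underlying stream in binary mode and wrap it,
        # forwarding only the text-related keyword arguments.
        binary_mode = mode.replace("t", "") + "b"
        text_kwargs = {
            k: kwargs.pop(k)
            for k in ["encoding", "errors", "newline"]
            if k in kwargs
        }
        return io.TextIOWrapper(
            open_buffer(raw, binary_mode, **kwargs), **text_kwargs
        )
    # Binary mode: the caller receives undecoded bytes, so any encoding
    # argument has no effect at this level.
    return io.BytesIO(raw)


data = "Да".encode("cp1251")
as_text = open_buffer(data, "r", encoding="cp1251").read()
as_bytes = open_buffer(data, "rb").read()
```

In this sketch the encoding is honored only when the mode lacks "b", which matches the observation that a hard-coded 'rb' leaves the decoding to the caller.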

Expected Output

The encoding value passed to pd.read_csv() is applied while reading from GCS, and the CSV files are read successfully.

My suggestion: for read_csv() we need to pass mode='r', and for to_csv() (see #26124) we need to pass mode='w', in the call to get_filepath_or_buffer(). However, I'm not sure where in the code it is best to implement this change.
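
Until such a change is made, one possible workaround is to decode the bytes explicitly and hand the parser a text buffer. The sketch below assumes the raw bytes have already been fetched from GCS by some other means; the helper `decode_csv_bytes` and the sample data are hypothetical, and the standard-library csv module stands in for the parser so the example runs without pandas.

```python
import csv
import io


def decode_csv_bytes(raw: bytes, encoding: str) -> io.StringIO:
    """Hypothetical helper: decode raw CSV bytes into a text buffer."""
    return io.StringIO(raw.decode(encoding))


# Sample data standing in for the contents of a cp1251-encoded file.
raw_csv = "имя,город\nАнна,Москва\n".encode("cp1251")

buffer = decode_csv_bytes(raw_csv, "cp1251")
rows = list(csv.reader(buffer))
```

A text buffer like this one could be passed to pd.read_csv() in place of the gs:// path, at the cost of holding the whole file in memory.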

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.13.1
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Labels: IO CSV (read_csv, to_csv)
