Description
Code Sample, a copy-pastable example if possible
import pandas as pd
dataframe = pd.read_csv('gs://mybucket/my_file', encoding='cp1251')
Problem description
Reading CSV files with an encoding other than UTF-8 (for example cp1251) from Google Cloud Storage fails with the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
The stack trace from the Google Cloud Functions environment:
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 60, in load_csv_to_bq
na_filter=False)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 748, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
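The error message itself can be reproduced without GCS or pandas. The sketch below assumes, purely for illustration, that the file starts with the Cyrillic word "Дата"; the real file contents are not known, and `try_decode` is a hypothetical helper. In cp1251 that word begins with byte 0xc4, which a UTF-8 decoder reads as the lead byte of a two-byte sequence and then rejects.

```python
# Hypothetical sample: the word "Дата" encoded as cp1251 (first byte is 0xc4).
raw = b"\xc4\xe0\xf2\xe0"


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning the error text instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


decoded_ok = try_decode(raw, "cp1251")   # decodes cleanly with the right codec
decoded_bad = try_decode(raw, "utf-8")   # fails on the very first byte
print(decoded_bad)
```

The message produced for the UTF-8 attempt has the same shape as the one in the traceback, which is consistent with the bytes being decoded as UTF-8 regardless of the `encoding` argument.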
It looks like pandas ignores the encoding parameter because pandas.io.gcs.get_filepath_or_buffer passes mode = 'rb' to the call GCSFileSystem.open(filepath_or_buffer, mode).
Tracing the mode parameter back to where it is first set, we stop at this definition in pandas.io.common.py:
def get_filepath_or_buffer(
    filepath_or_buffer, encoding=None, compression=None, mode=None
)
The call to get_filepath_or_buffer() made at lines 430 to 432 in 29d6b02 does not pass a value for mode, so the default mode=None applies.
However, in the current gcsfs master, GCSFileSystem.open() has been removed and fsspec.AbstractFileSystem.open() is used instead. It now applies the passed encoding when reading or writing text:
if "b" not in mode:
    mode = mode.replace("t", "") + "b"
    text_kwargs = {
        k: kwargs.pop(k)
        for k in ["encoding", "errors", "newline"]
        if k in kwargs
    }
    return io.TextIOWrapper(
        self.open(path, mode, block_size, **kwargs), **text_kwargs
    )
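To make the effect of that branch concrete, here is a minimal model of it using only the standard library. `open_sketch` is a hypothetical stand-in, not the fsspec implementation, and an in-memory `io.BytesIO` plays the role of the remote file. It shows that the `encoding` argument is only consulted when the mode has no "b" in it.

```python
import io


def open_sketch(raw: bytes, mode: str = "rb", **kwargs):
    """Simplified stand-in for a filesystem open(); not the real fsspec code."""
    if "b" not in mode:
        # Text mode: open the underlying binary stream, then decode on top of it.
        binary_mode = mode.replace("t", "") + "b"
        text_kwargs = {
            k: kwargs.pop(k)
            for k in ["encoding", "errors", "newline"]
            if k in kwargs
        }
        return io.TextIOWrapper(
            open_sketch(raw, binary_mode, **kwargs), **text_kwargs
        )
    # Binary mode: the bytes are handed back untouched, encoding is never used.
    return io.BytesIO(raw)


# Hypothetical sample line "Дата,1" encoded as cp1251.
raw = b"\xc4\xe0\xf2\xe0,1\n"

binary_result = open_sketch(raw, "rb").read()                   # bytes
text_result = open_sketch(raw, "r", encoding="cp1251").read()   # decoded str
```

With mode "rb" the caller receives undecoded bytes, so any decoding is left to whatever reads the stream afterwards; with mode "r" the text arrives already decoded as cp1251.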
Expected Output
The encoding value passed to pd.read_csv() is applied while reading from GCS, and the CSV files are read.
My suggestion is that for read_csv() we need to pass mode='r', and for to_csv() (see #26124) mode='w', in the call to get_filepath_or_buffer(). However, I'm not sure where in the code it is best to implement this change.
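Until the mode is passed through, one possible caller-side workaround is to open the object in binary mode and do the decoding before the parser sees it. The sketch below is an assumption about how that could look, not a tested fix: `as_text` is a hypothetical helper, `io.BytesIO` stands in for the handle a GCS filesystem object would return in 'rb' mode, and the standard-library `csv.reader` stands in for the parser.

```python
import csv
import io


def as_text(binary_handle, encoding: str) -> io.TextIOWrapper:
    """Wrap an already-open binary handle so readers receive decoded text."""
    # newline="" leaves line endings for the CSV parser to interpret.
    return io.TextIOWrapper(binary_handle, encoding=encoding, newline="")


# Stand-in for a binary handle to a cp1251-encoded CSV file (hypothetical data).
fake_gcs_handle = io.BytesIO(b"name,value\r\n\xc4\xe0\xf2\xe0,1\r\n")

rows = list(csv.reader(as_text(fake_gcs_handle, "cp1251")))
```

The wrapped handle is an ordinary text file-like object, so it could presumably be passed to pd.read_csv() in place of the gs:// URL.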
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.13.1
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None