Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
out_path_spec = 'file:///workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv'
df = pd.DataFrame(range(5))
df.to_csv(out_path_spec)
Issue Description
I am using a Python framework that provides multiple fsspec backends, including file://. Looking at the pandas code, however, I suspect the failure can also happen with backends other than file://. I am using pandas 2.0.3, but the code in main is the same.
I want to create a new file named /workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv. The directory that will contain the file already exists:
ls -l /workspace/dlio/dlio_benchmark/data/default/train/0/
total 0
Here is a sample backtrace, where 'out_path_spec' contains the CSV path mentioned above:
df.to_csv(out_path_spec, compression=compression)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 3772, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/formats/format.py", line 1188, in to_csv
csv_formatter.save()
File "/usr/local/lib/python3.8/dist-packages/pandas/io/formats/csvs.py", line 242, in save
with get_handle(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 719, in get_handle
ioargs = _get_filepath_or_buffer(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 371, in _get_filepath_or_buffer
with urlopen(req_info) as req:
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 271, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 1489, in file_open
return self.open_local_file(req)
File "/usr/lib/python3.8/urllib/request.py", line 1528, in open_local_file
raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv'>
The problem occurs in io/common.py, in the function _get_filepath_or_buffer(). The code calls is_url() on the path. Since 'file' is a valid scheme in _VALID_URLS, is_url() returns True. That causes the code to call urlopen() on the path, which fails because the file does not exist yet: I am in the process of creating it.
This logic seems incorrect to me. When df.to_csv() is called, the fsspec path most likely does not exist yet, so trying to open the URL seems guaranteed to fail.
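The mechanism can be reproduced without pandas. The following is a minimal sketch using only the standard library. It assumes, based on the traceback above, that the write path ends up in `urllib.request.urlopen()`. The helper name `try_read_open` and the temporary directory are illustrative and not part of pandas.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def try_read_open(url: str) -> str:
    """Attempt a read-style open of `url` and return a short status string."""
    try:
        with urllib.request.urlopen(url):
            return "opened"
    except urllib.error.URLError as exc:
        return f"URLError: {exc.reason}"


with tempfile.TemporaryDirectory() as tmp_dir:
    # The directory exists, but the target file has not been created yet.
    target = Path(tmp_dir) / "img_00_of_64.csv"
    target_url = target.as_uri()  # a file:// URL

    # "file" is the scheme that the issue says is_url() accepts.
    scheme = urlparse(target_url).scheme

    # urlopen() is a read operation, so it fails on a missing file.
    status_before = try_read_open(target_url)

    # Once the file exists, the same call succeeds.
    target.write_text("0\n")
    status_after = try_read_open(target_url)

print(scheme, status_before, status_after, sep="\n")
```

The first attempt fails in the same way as the traceback, which suggests the failure comes from treating a write target as something to read, not from the path itself.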
Expected Behavior
The df.to_csv() call should succeed, and the file (or the corresponding object in whatever fsspec backend is in use) should be created.
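Until the expected behavior is available, one possible stopgap for the file:// case only is to convert the URL to a plain local path before handing it to pandas. This is a sketch of a workaround, not a fix for the underlying logic. The helper name `local_path_from_file_url` is hypothetical, and other fsspec schemes are passed through unchanged.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_file_url(path_spec: str) -> str:
    """Return a local filesystem path for a file:// URL.

    Anything that is not a file:// URL is returned unchanged.
    Hypothetical helper for illustration; not part of pandas or fsspec.
    """
    parsed = urlparse(path_spec)
    if parsed.scheme != "file":
        return path_spec
    # url2pathname() undoes percent-encoding and applies platform path rules.
    return url2pathname(parsed.path)


out_path_spec = "file:///workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv"
local_path = local_path_from_file_url(out_path_spec)
print(local_path)

# Intended usage with the example from this report (not executed here):
# df.to_csv(local_path)
```

With a plain path the `is_url()` branch described above should not be taken, so pandas opens the file for writing directly.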
Installed Versions
INSTALLED VERSIONS
commit : 0f43794
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-425.3.1.el8.x86_64
Version : #1 SMP Wed Nov 9 20:13:27 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 45.2.0
pip : 23.3.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None