Skip to content

BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based (#55828) #56309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -572,6 +572,7 @@ I/O
- Bug in :meth:`DataFrame.to_hdf` and :func:`read_hdf` with ``datetime64`` dtypes with non-nanosecond resolution failing to round-trip correctly (:issue:`55622`)
- Bug in :meth:`pandas.read_excel` with ``engine="odf"`` (``ods`` files) when string contains annotation (:issue:`55200`)
- Bug in :meth:`pandas.read_excel` with an ODS file without cached formatted cell for float values (:issue:`55219`)
- Bug where :meth:`DataFrame.to_csv` would raise a ``URLError`` when specifying local file scheme paths to not-yet-created files (:issue:`55828`)
- Bug where :meth:`DataFrame.to_json` would raise an ``OverflowError`` instead of a ``TypeError`` with unsupported NumPy types (:issue:`55403`)

Period
Expand Down
13 changes: 13 additions & 0 deletions pandas/io/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -382,6 +382,19 @@ def _get_filepath_or_buffer(
# urlopen function defined elsewhere in this module
import urllib.request

# Fix for GH #55828
parsed_url = parse_url(filepath_or_buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe is_url should not be true for fsspec urls. So that might be a much nicer way of fixing this issue (I think @krehm was also hinting at that in the issue) - I'm not familiar with the urllib regex, we might need to exclude more fsspec URLs from it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein it did seem to me that any fsspec url in is_url is guaranteed to fail in this case, which seemed like a logic flaw to me. But I'm not familiar with the urllib code either, so was hesitant to specify a particular solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap between urllib/fsspec is at the moment:

>>> set(uses_relative + uses_netloc + uses_params).intersection(fsspec.available_protocols())
{'sftp', 'ftp', 'file', 'git', 'http', 'https'}

Could have something like this:

_VALID_URLS = set(uses_relative + uses_netloc + uses_params).difference( fsspec.available_protocols())
_VALID_URLS.update(("http", "https"))
_VALID_URLS.discard("")

Technically this is a behavior change for sftp, git, ... (might be okay, probably not used frequently?). fsspec should have available_protocols since early 2022 fsspec/filesystem_spec#913 might need to double check whether we need to bump the minimum version of fsspec. This might make the regex in is_fsspec_url obsolete.

@mroeschke

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein I didn't want to touch is_url as its used in a few other places and I wasn't sure if it would break anything. Is it okay to do so?

These are the places is_url is used without a corresponding is_fsspec_url call:

pandas.io.html._LxmlFrameParser._build_doc
pandas.io.html._read
pandas.io.formats.html.HTMLFormatter._write_cell

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking that!

Do you think it is possible to replace if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer): with if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer) and not is_fsspec_url(filepath_or_buffer):?

if parsed_url.scheme == "file":
file_path = urllib.request.url2pathname(parsed_url.path)
file_path = os.path.normpath(file_path)
return IOArgs(
filepath_or_buffer=open(file_path, "rb"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this still respect mode?

encoding=encoding,
compression=compression,
should_close=True,
mode=fsspec_mode,
)

# assuming storage_options is to be interpreted as headers
req_info = urllib.request.Request(filepath_or_buffer, headers=storage_options)
with urlopen(req_info) as req:
Expand Down