Commit df1d440

Storage options (#35381)
1 parent 9a8152c commit df1d440

File tree

18 files changed: +502 additions, -54 deletions

doc/source/user_guide/io.rst

Lines changed: 52 additions & 9 deletions
@@ -1649,29 +1649,72 @@ options include:
 Specifying any of the above options will produce a ``ParserWarning`` unless the
 python engine is selected explicitly using ``engine='python'``.
 
-Reading remote files
-''''''''''''''''''''
+.. _io.remote:
+
+Reading/writing remote files
+''''''''''''''''''''''''''''
 
-You can pass in a URL to a CSV file:
+You can pass in a URL to read or write remote files with many of pandas' IO
+functions - the following example shows reading a CSV file:
 
 .. code-block:: python
 
    df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
                     sep='\t')
 
-S3 URLs are handled as well but require installing the `S3Fs
+All URLs which are not local files or HTTP(S) are handled by
+`fsspec`_, if installed, and its various filesystem implementations
+(including Amazon S3, Google Cloud, SSH, FTP, webHDFS...).
+Some of these implementations require additional packages to be
+installed; for example, S3 URLs require the `s3fs
 <https://pypi.org/project/s3fs/>`_ library:
 
 .. code-block:: python
 
-   df = pd.read_csv('s3://pandas-test/tips.csv')
+   df = pd.read_json('s3://pandas-test/adatafile.json')
+
+When dealing with remote storage systems, you might need
+extra configuration with environment variables or config files in
+special locations. For example, to access data in your S3 bucket,
+you will need to define credentials in one of the several ways listed in
+the `S3Fs documentation
+<https://s3fs.readthedocs.io/en/latest/#credentials>`_. The same is true
+for several of the storage backends, and you should follow the links
+at `fsimpl1`_ for implementations built into ``fsspec`` and `fsimpl2`_
+for those not included in the main ``fsspec``
+distribution.
+
+You can also pass parameters directly to the backend driver. For example,
+if you do *not* have S3 credentials, you can still access public data by
+specifying an anonymous connection, such as
+
+.. versionadded:: 1.2.0
+
+.. code-block:: python
+
+   pd.read_csv("s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
+               "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
+               storage_options={"anon": True})
+
+``fsspec`` also allows complex URLs, for accessing data in compressed
+archives, local caching of files, and more. To locally cache the above
+example, you would modify the call to
+
+.. code-block:: python
 
-If your S3 bucket requires credentials you will need to set them as environment
-variables or in the ``~/.aws/credentials`` config file, refer to the `S3Fs
-documentation on credentials
-<https://s3fs.readthedocs.io/en/latest/#credentials>`_.
+   pd.read_csv("simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
+               "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
+               storage_options={"s3": {"anon": True}})
 
+where we specify that the "anon" parameter is meant for the "s3" part of
+the implementation, not for the caching implementation. Note that this
+caches to a temporary directory for the duration of the session only, but
+you can also specify a permanent store.
 
+.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/
+.. _fsimpl1: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
+.. _fsimpl2: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations
 
 Writing out data
 ''''''''''''''''
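The nested ``storage_options={"s3": {"anon": True}}`` form above routes options to one layer of a chained URL. As an illustrative sketch only (this is not fsspec's actual implementation, and the helper names are made up), the routing idea can be shown like this:

```python
from typing import Any, Dict, List


def chain_protocols(url: str) -> List[str]:
    # "simplecache::s3://bucket/key" -> ["simplecache", "s3"]
    return [part.split("://")[0] for part in url.split("::")]


def route_options(url: str, storage_options: Dict[str, Any]) -> Dict[str, Any]:
    # Hand each protocol in the chain only the options keyed under its name.
    return {proto: storage_options.get(proto, {}) for proto in chain_protocols(url)}


opts = route_options(
    "simplecache::s3://ncei-wcsd-archive/data/processed/some.csv",
    {"s3": {"anon": True}},
)
# opts == {"simplecache": {}, "s3": {"anon": True}}
```

The key point the doc makes: the "anon" parameter reaches only the "s3" layer, while the caching layer gets nothing.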

doc/source/whatsnew/v1.2.0.rst

Lines changed: 14 additions & 0 deletions
@@ -13,6 +13,20 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
+Passing arguments to fsspec backends
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Many read/write functions have acquired the ``storage_options`` optional argument,
+to pass a dictionary of parameters to the storage backend. This allows, for
+example, passing credentials to S3 and GCS storage. The details of what
+parameters can be passed to which backends can be found in the documentation
+of the individual storage backends (detailed from the fsspec docs for
+`builtin implementations`_ and linked to `external ones`_). See
+Section :ref:`io.remote`.
+
+.. _builtin implementations: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
+.. _external ones: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations
+
 .. _whatsnew_120.binary_handle_to_csv:
 
 Support for binary file handles in ``to_csv``

pandas/_typing.py

Lines changed: 3 additions & 0 deletions
@@ -106,3 +106,6 @@
     List[AggFuncTypeBase],
     Dict[Label, Union[AggFuncTypeBase, List[AggFuncTypeBase]]],
 ]
+
+# for arbitrary kwargs passed during reading/writing files
+StorageOptions = Optional[Dict[str, Any]]
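A minimal sketch of how this alias reads at a call site; ``read_stub`` is a hypothetical function for illustration, not part of pandas:

```python
from typing import Any, Dict, Optional

# Mirror of the alias added in pandas/_typing.py
StorageOptions = Optional[Dict[str, Any]]


def read_stub(path: str, storage_options: StorageOptions = None) -> Dict[str, Any]:
    # Normalize None to an empty dict before handing options to a backend.
    return dict(storage_options or {})


assert read_stub("s3://bucket/key", {"anon": True}) == {"anon": True}
assert read_stub("local.csv") == {}
```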

pandas/conftest.py

Lines changed: 22 additions & 0 deletions
@@ -1224,3 +1224,25 @@ def sort_by_key(request):
     Tests None (no key) and the identity key.
     """
     return request.param
+
+
+@pytest.fixture()
+def fsspectest():
+    pytest.importorskip("fsspec")
+    from fsspec import register_implementation
+    from fsspec.implementations.memory import MemoryFileSystem
+    from fsspec.registry import _registry as registry
+
+    class TestMemoryFS(MemoryFileSystem):
+        protocol = "testmem"
+        test = [None]
+
+        def __init__(self, **kwargs):
+            self.test[0] = kwargs.pop("test", None)
+            super().__init__(**kwargs)
+
+    register_implementation("testmem", TestMemoryFS, clobber=True)
+    yield TestMemoryFS()
+    registry.pop("testmem", None)
+    TestMemoryFS.test[0] = None
+    TestMemoryFS.store.clear()

pandas/core/frame.py

Lines changed: 34 additions & 3 deletions
@@ -55,6 +55,7 @@
     Label,
     Level,
     Renamer,
+    StorageOptions,
     ValueKeyFunc,
 )
 from pandas.compat import PY37
@@ -2058,6 +2059,7 @@ def to_stata(
         version: Optional[int] = 114,
         convert_strl: Optional[Sequence[Label]] = None,
         compression: Union[str, Mapping[str, str], None] = "infer",
+        storage_options: StorageOptions = None,
     ) -> None:
         """
         Export DataFrame object to Stata dta format.
@@ -2134,6 +2136,16 @@ def to_stata(
 
             .. versionadded:: 1.1.0
 
+        storage_options : dict, optional
+            Extra options that make sense for a particular storage connection, e.g.
+            host, port, username, password, etc., if using a URL that will
+            be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+            will be raised if providing this argument with a local path or
+            a file-like buffer. See the fsspec and backend storage implementation
+            docs for the set of allowed keys and values.
+
+            .. versionadded:: 1.2.0
+
         Raises
         ------
         NotImplementedError
@@ -2194,6 +2206,7 @@ def to_stata(
             write_index=write_index,
             variable_labels=variable_labels,
             compression=compression,
+            storage_options=storage_options,
             **kwargs,
         )
         writer.write_file()
@@ -2246,9 +2259,10 @@ def to_feather(self, path, **kwargs) -> None:
     )
     def to_markdown(
         self,
-        buf: Optional[IO[str]] = None,
-        mode: Optional[str] = None,
+        buf: Optional[Union[IO[str], str]] = None,
+        mode: str = "wt",
         index: bool = True,
+        storage_options: StorageOptions = None,
         **kwargs,
     ) -> Optional[str]:
         if "showindex" in kwargs:
@@ -2266,9 +2280,14 @@ def to_markdown(
         result = tabulate.tabulate(self, **kwargs)
         if buf is None:
             return result
-        buf, _, _, _ = get_filepath_or_buffer(buf, mode=mode)
+        buf, _, _, should_close = get_filepath_or_buffer(
+            buf, mode=mode, storage_options=storage_options
+        )
         assert buf is not None  # Help mypy.
+        assert not isinstance(buf, str)
         buf.writelines(result)
+        if should_close:
+            buf.close()
         return None
 
     @deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
@@ -2279,6 +2298,7 @@ def to_parquet(
         compression: Optional[str] = "snappy",
         index: Optional[bool] = None,
         partition_cols: Optional[List[str]] = None,
+        storage_options: StorageOptions = None,
         **kwargs,
     ) -> None:
         """
@@ -2327,6 +2347,16 @@ def to_parquet(
 
             .. versionadded:: 0.24.0
 
+        storage_options : dict, optional
+            Extra options that make sense for a particular storage connection, e.g.
+            host, port, username, password, etc., if using a URL that will
+            be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+            will be raised if providing this argument with a local path or
+            a file-like buffer. See the fsspec and backend storage implementation
+            docs for the set of allowed keys and values.
+
+            .. versionadded:: 1.2.0
+
         **kwargs
             Additional arguments passed to the parquet library. See
             :ref:`pandas io <io.parquet>` for more details.
@@ -2373,6 +2403,7 @@ def to_parquet(
             compression=compression,
             index=index,
             partition_cols=partition_cols,
+            storage_options=storage_options,
             **kwargs,
         )
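The ``should_close`` flag added to ``to_markdown`` follows a common ownership rule: close a handle only if you opened it yourself, and leave caller-supplied buffers alone. A self-contained sketch of that rule (``get_handle_stub`` is a hypothetical stand-in, not pandas' actual helper):

```python
import io


def get_handle_stub(buf_or_path, mode="wt"):
    # Return (handle, should_close): we own handles we opened from a path.
    if isinstance(buf_or_path, str):
        return open(buf_or_path, mode), True
    return buf_or_path, False


def write_text(result, buf_or_path):
    buf, should_close = get_handle_stub(buf_or_path)
    buf.writelines(result)
    if should_close:
        buf.close()  # only close what we opened ourselves


sink = io.StringIO()
write_text("| a |\n", sink)
# the caller-owned buffer stays open and holds the text
```

Without the flag, writing to a remote URL would leak the fsspec-opened handle, while closing unconditionally would break callers who pass their own open buffer.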

pandas/core/generic.py

Lines changed: 43 additions & 1 deletion
@@ -40,6 +40,7 @@
     Label,
     Level,
     Renamer,
+    StorageOptions,
     TimedeltaConvertibleTypes,
     TimestampConvertibleTypes,
     ValueKeyFunc,
@@ -2058,6 +2059,7 @@ def to_json(
         compression: Optional[str] = "infer",
         index: bool_t = True,
         indent: Optional[int] = None,
+        storage_options: StorageOptions = None,
     ) -> Optional[str]:
         """
         Convert the object to a JSON string.
@@ -2141,6 +2143,16 @@ def to_json(
 
             .. versionadded:: 1.0.0
 
+        storage_options : dict, optional
+            Extra options that make sense for a particular storage connection, e.g.
+            host, port, username, password, etc., if using a URL that will
+            be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+            will be raised if providing this argument with a local path or
+            a file-like buffer. See the fsspec and backend storage implementation
+            docs for the set of allowed keys and values.
+
+            .. versionadded:: 1.2.0
+
         Returns
         -------
         None or str
@@ -2319,6 +2331,7 @@ def to_json(
             compression=compression,
             index=index,
             indent=indent,
+            storage_options=storage_options,
         )
 
     def to_hdf(
@@ -2633,6 +2646,7 @@ def to_pickle(
         path,
         compression: Optional[str] = "infer",
         protocol: int = pickle.HIGHEST_PROTOCOL,
+        storage_options: StorageOptions = None,
     ) -> None:
         """
         Pickle (serialize) object to file.
@@ -2653,6 +2667,16 @@ def to_pickle(
 
         .. [1] https://docs.python.org/3/library/pickle.html.
 
+        storage_options : dict, optional
+            Extra options that make sense for a particular storage connection, e.g.
+            host, port, username, password, etc., if using a URL that will
+            be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+            will be raised if providing this argument with a local path or
+            a file-like buffer. See the fsspec and backend storage implementation
+            docs for the set of allowed keys and values.
+
+            .. versionadded:: 1.2.0
+
         See Also
         --------
         read_pickle : Load pickled pandas object (or any object) from file.
@@ -2686,7 +2710,13 @@ def to_pickle(
         """
         from pandas.io.pickle import to_pickle
 
-        to_pickle(self, path, compression=compression, protocol=protocol)
+        to_pickle(
+            self,
+            path,
+            compression=compression,
+            protocol=protocol,
+            storage_options=storage_options,
+        )
 
     def to_clipboard(
         self, excel: bool_t = True, sep: Optional[str] = None, **kwargs
@@ -3031,6 +3061,7 @@ def to_csv(
         escapechar: Optional[str] = None,
         decimal: Optional[str] = ".",
         errors: str = "strict",
+        storage_options: StorageOptions = None,
     ) -> Optional[str]:
         r"""
         Write object to a comma-separated values (csv) file.
@@ -3142,6 +3173,16 @@ def to_csv(
 
             .. versionadded:: 1.1.0
 
+        storage_options : dict, optional
+            Extra options that make sense for a particular storage connection, e.g.
+            host, port, username, password, etc., if using a URL that will
+            be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
+            will be raised if providing this argument with a local path or
+            a file-like buffer. See the fsspec and backend storage implementation
+            docs for the set of allowed keys and values.
+
+            .. versionadded:: 1.2.0
+
         Returns
         -------
         None or str
@@ -3194,6 +3235,7 @@ def to_csv(
             doublequote=doublequote,
             escapechar=escapechar,
             decimal=decimal,
+            storage_options=storage_options,
         )
         formatter.save()
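The new docstrings repeat that an error is raised when ``storage_options`` accompanies a local path or file-like buffer. A rough sketch of such a guard, under the assumption (mine, not the commit's) that fsspec-style URLs are detected by a protocol prefix other than plain HTTP(S); ``validate_storage_options`` and ``is_fsspec_url_stub`` are hypothetical names:

```python
def is_fsspec_url_stub(path) -> bool:
    # Treat "proto://..." as fsspec territory, except plain HTTP(S).
    return (
        isinstance(path, str)
        and "://" in path
        and not path.startswith(("http://", "https://"))
    )


def validate_storage_options(path, storage_options):
    # storage_options only make sense when fsspec will handle the path.
    if storage_options and not is_fsspec_url_stub(path):
        raise ValueError("storage_options passed with non-fsspec path or buffer")


validate_storage_options("s3://bucket/data.csv", {"anon": True})  # accepted
try:
    validate_storage_options("data.csv", {"anon": True})
    raised = False
except ValueError:
    raised = True
```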
