cc @mrocklin for dask.dataframe visibility
I'm one of the developers of https://github.com/rapidsai/cudf, and we're working on adding GPU-accelerated file readers and writers to our library. Most of the standard formats are covered quite nicely by the Pandas API, but ORC isn't. Before we go off and define our own API, I wanted to open a discussion about what that API should look like, so that we stay consistent with Pandas and the Pandas-like community.
At the top level, I imagine it would look almost identical to the Parquet API, something like the following:
def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an orc object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : string
        File path
    columns : list, default=None
        If not None, only these columns will be read from the file.
    engine : {'auto', 'pyarrow'}, default 'auto'
        Orc library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    **kwargs
        Any additional kwargs are passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...
def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary orc format.

    This function writes the dataframe as an `orc file
    <https://orc.apache.org/>`_. You can choose different orc
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or Root Directory path. Will be used as Root Directory
        path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        Orc library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no compression.
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file output.
        If ``False``, they will not be written to the file. If ``None``,
        the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional arguments passed to the orc library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
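To make the `engine='auto'` behavior described in both docstrings concrete, here is a minimal sketch of how the resolution step might work. Everything in it is an assumption for discussion: `DEFAULT_ORC_ENGINE` is a hypothetical stand-in for the proposed ``io.orc.engine`` option, `resolve_engine` is a hypothetical helper, and the reader body simply delegates to `pyarrow.orc.ORCFile`. It leaves out `**kwargs` forwarding and is not meant as the final implementation.

```python
# Hypothetical stand-in for the proposed ``io.orc.engine`` option.
DEFAULT_ORC_ENGINE = "pyarrow"

# Engines this sketch knows about; a GPU-backed engine could be added here later.
SUPPORTED_ENGINES = ("pyarrow",)


def resolve_engine(engine="auto"):
    """Map 'auto' to the configured default and validate the engine name."""
    if engine == "auto":
        engine = DEFAULT_ORC_ENGINE
    if engine not in SUPPORTED_ENGINES:
        valid = ("auto",) + SUPPORTED_ENGINES
        raise ValueError(f"engine must be one of {valid}, got {engine!r}")
    return engine


def read_orc(path, engine="auto", columns=None):
    """Sketch of the proposed reader; only the 'pyarrow' engine exists here."""
    engine = resolve_engine(engine)
    # Imported lazily so pyarrow is only required when an ORC file is read.
    import pyarrow.orc as orc

    # ORCFile.read accepts an optional list of column names to load.
    table = orc.ORCFile(path).read(columns=columns)
    return table.to_pandas()
```

With this shape, `read_orc("data.orc", columns=["a", "b"])` would go through the default engine, and an unknown engine name fails early with a `ValueError` before any file is opened.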