
DISCUSS: What would an ORC reader/writer API look like? #25229

Closed
@kkraus14

Description


cc @mrocklin for dask.dataframe visibility

I'm one of the developers of https://github.com/rapidsai/cudf, and we're working on adding GPU-accelerated file readers/writers to our library. Most of the standard formats are covered quite nicely by the Pandas API, but ORC isn't. Before we go off and define our own API, I wanted to open a discussion on what that API should look like, so we can stay consistent with the Pandas and Pandas-like community.

At the top level, I imagine it would look almost identical to the Parquet API, something like the following:

def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an ORC object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : str
        File path.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    columns : list, default None
        If not None, only these columns will be read from the file.
    **kwargs
        Additional keyword arguments passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...

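For context, here is a minimal sketch of what a pyarrow-backed read path could look like. The helper name `_read_orc_pyarrow` is hypothetical, and this skips the engine-dispatch and option-handling machinery; it only assumes pyarrow's existing `pyarrow.orc.ORCFile` interface:

import pyarrow.orc


def _read_orc_pyarrow(path, columns=None, **kwargs):
    # Open the ORC file and materialize it as an Arrow Table,
    # optionally restricted to the requested columns.
    orc_file = pyarrow.orc.ORCFile(path)
    table = orc_file.read(columns=columns)
    # Convert the Arrow Table to a pandas DataFrame.
    return table.to_pandas()

The `read_orc` entry point would then just resolve the engine (via ``io.orc.engine`` when 'auto') and dispatch to a helper like this, mirroring how `read_parquet` dispatches today.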

def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary ORC format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or root directory path. Will be used as the root
        directory path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'zlib', 'lz4', 'zstd', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no compression.
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file output.
        If ``False``, they will not be written to the file. If ``None``,
        the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional arguments passed to the orc library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...

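And a corresponding sketch of the write path. This one is illustrative only: at the time of writing, pyarrow's ORC support is read-only, so the `pyarrow.orc.write_table` call and the `compression` forwarding below are assumptions about a future engine API, and `partition_cols` handling is omitted:

import pyarrow
import pyarrow.orc


def _to_orc_pyarrow(df, fname, compression='snappy', index=None, **kwargs):
    # Convert the DataFrame to an Arrow Table. ``preserve_index``
    # mirrors the proposed ``index`` parameter (None lets the
    # engine decide whether to keep the index).
    table = pyarrow.Table.from_pandas(df, preserve_index=index)
    # Hand the Table to the engine's ORC writer (assumed API; a
    # real implementation would also handle partition_cols here).
    pyarrow.orc.write_table(table, fname, compression=compression)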