diff --git a/protocol/API.md b/protocol/API.md
new file mode 100644
index 00000000..4faf4c4c
--- /dev/null
+++ b/protocol/API.md
@@ -0,0 +1,72 @@
+# API of the `__dataframe__` protocol
+
+Specification for objects to be accessed, for the purpose of dataframe
+interchange between libraries, via the `__dataframe__` method on a library's
+data frame object.
+
+For guiding requirements, see {ref}`design-requirements`.
+
+
+## Concepts in this design
+
+1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
+   only thing that actually maps to a 1-D array in the sense that it could be
+   converted to NumPy, CuPy, et al.
+2. A `Column` class. A *column* has a single dtype. It can consist
+   of multiple *chunks*. A single chunk of a column (which may be the whole
+   column if ``num_chunks == 1``) is itself modeled as a `Column` instance, and
+   contains 1 data *buffer* and (optionally) one *mask* for missing data.
+3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
+   which are identified with names that are unique strings. All the data
+   frame's rows are the same length. It can consist of multiple *chunks*. A
+   single chunk of a data frame is itself modeled as a `DataFrame` instance.
+4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
+5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
+   to a *data frame* or a *column*.
+
+Note that the only way to access these objects is through a call to
+`__dataframe__` on a data frame object. This is NOT meant as public API;
+the instances of the different classes here only serve to describe the API of
+what is returned by a call to `__dataframe__`. They are the concepts needed
+to capture the memory layout and data access of a data frame.
+
+
+## Design decisions
+
+1. Use a separate column abstraction in addition to a dataframe interface.
+
+   Rationales:
+
+   - This is how it works in R, Julia and Apache Arrow.
+   - Semantically, most existing applications and users treat a column like a 1-D array
+   - We should be able to connect a column to the array data interchange mechanism(s)
+
+   Note that this does not imply a library must have such a public user-facing
+   abstraction (ex. ``pandas.Series``) - it can only be accessed via
+   ``__dataframe__``.
+
+2. Use methods and properties on an opaque object rather than returning
+   hierarchical dictionaries describing memory.
+
+   This is better for implementations that may rely on, for example, lazy
+   computation.
+
+3. No row names. If a library uses row names, use a regular column for them.
+
+   See discussion at
+   [wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241).
+   Optional row names are not a good idea, because people will assume they're
+   present (see the cuDF experience: it was forced to add them because pandas
+   has them). Requiring row names seems worse than leaving them out. Note that
+   row labels could be added in the future - right now there are no clear
+   requirements for more complex row labels that cannot be represented by a
+   single column. These do exist; for example, Modin has table- and tree-based
+   row labels.
+
+## Interface
+
+
+```{literalinclude} dataframe_protocol.py
+---
+language: python
+---
+```
diff --git a/protocol/Makefile b/protocol/Makefile
new file mode 100644
index 00000000..d4bb2cbb
--- /dev/null
+++ b/protocol/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.
$(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/protocol/images/dataframe_conceptual_model.png b/protocol/_static/images/dataframe_conceptual_model.png similarity index 100% rename from protocol/images/dataframe_conceptual_model.png rename to protocol/_static/images/dataframe_conceptual_model.png diff --git a/protocol/conf.py b/protocol/conf.py new file mode 100644 index 00000000..54ec3717 --- /dev/null +++ b/protocol/conf.py @@ -0,0 +1,146 @@ +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +# import os +# import sys +# sys.path.insert(0, os.path.abspath('.')) + +import sphinx_material + +# -- Project information ----------------------------------------------------- + +project = 'Python dataframe interchange protocol' +copyright = '2021, Consortium for Python Data API Standards' +author = 'Consortium for Python Data API Standards' + +# The full version, including alpha/beta/rc tags +release = '2021-DRAFT' + + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. 
+extensions = [
+    'myst_parser',
+    'sphinx.ext.extlinks',
+    'sphinx.ext.intersphinx',
+    'sphinx.ext.todo',
+    'sphinx_markdown_tables',
+    'sphinx_copybutton',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# MyST options
+myst_heading_anchors = 3
+myst_enable_extensions = ["colon_fence"]
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+extensions.append("sphinx_material")
+html_theme_path = sphinx_material.html_theme_path()
+html_context = sphinx_material.get_html_context()
+html_theme = 'sphinx_material'
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+
+# -- Material theme options (see theme.conf for more information) ------------
+html_show_sourcelink = False
+html_sidebars = {
+    "**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]
+}
+
+html_theme_options = {
+
+    # Set the name of the project to appear in the navigation.
+    'nav_title': 'Python dataframe interchange protocol',
+
+    # Set your GA account ID to enable tracking
+    #'google_analytics_account': 'UA-XXXXX',
+
+    # Specify a base_url used to generate sitemap.xml. If not
+    # specified, then no sitemap will be built.
+    #'base_url': 'https://project.github.io/project',
+
+    # Set the color and the accent color (see
+    # https://material.io/design/color/the-color-system.html)
+    'color_primary': 'indigo',
+    'color_accent': 'green',
+
+    # Set the repo location to get a badge with stats
+    #'repo_url': 'https://github.com/project/project/',
+    #'repo_name': 'Project',
+
+    "html_minify": False,
+    "html_prettify": True,
+    "css_minify": True,
+    "logo_icon": "",
+    "repo_type": "github",
+    "touch_icon": "images/apple-icon-152x152.png",
+    "theme_color": "#2196f3",
+    "master_doc": False,
+
+    # Visible levels of the global TOC; -1 means unlimited
+    'globaltoc_depth': 2,
+    # If False, expand all TOC entries
+    'globaltoc_collapse': True,
+    # If True, show hidden TOC entries
+    'globaltoc_includehidden': True,
+
+    "nav_links": [
+        {"href": "index", "internal": True, "title": "Dataframe interchange protocol"},
+        {
+            "href": "https://data-apis.org",
+            "internal": False,
+            "title": "Consortium for Python Data API Standards",
+        },
+    ],
+    "heroes": {
+        "index": "A protocol for zero-copy data interchange between Python dataframe libraries",
+        #"customization": "Configuration options to personalize your site.",
+    },
+
+    #"version_dropdown": True,
+    #"version_json": "_static/versions.json",
+    "table_classes": ["plain"],
+}
+
+
+todo_include_todos = True
+#html_favicon = "images/favicon.ico"
+
+html_use_index = True
+html_domain_indices = True
+
+extlinks = {
+    "duref": (
+        "http://docutils.sourceforge.net/docs/ref/rst/" "restructuredtext.html#%s",
+        "",
+    ),
+    "durole": ("http://docutils.sourceforge.net/docs/ref/rst/" "roles.html#%s", ""),
+    "dudir": ("http://docutils.sourceforge.net/docs/ref/rst/" "directives.html#%s", ""),
+}
diff --git a/protocol/dataframe_protocol.py b/protocol/dataframe_protocol.py
index 6b0c0f3f..8bbf3327 100644
--- a/protocol/dataframe_protocol.py
+++ b/protocol/dataframe_protocol.py
@@ -1,70 +1,3 @@
-"""
-Specification for objects to be accessed, for the purpose of dataframe
-interchange between libraries, via the ``__dataframe__`` method on a libraries' -data frame object. - -For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35 - - -Concepts in this design ------------------------ - -1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the - only thing that actually maps to a 1-D array in a sense that it could be - converted to NumPy, CuPy, et al. -2. A `Column` class. A *column* has a single dtype. It can consist - of multiple *chunks*. A single chunk of a column (which may be the whole - column if ``num_chunks == 1``) is modeled as again a `Column` instance, and - contains 1 data *buffer* and (optionally) one *mask* for missing data. -3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*, - which are identified with names that are unique strings. All the data - frame's rows are the same length. It can consist of multiple *chunks*. A - single chunk of a data frame is modeled as again a `DataFrame` instance. -4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*. -5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied - to a *data frame* or a *column*. - -Note that the only way to access these objects is through a call to -``__dataframe__`` on a data frame object. This is NOT meant as public API; -only think of instances of the different classes here to describe the API of -what is returned by a call to ``__dataframe__``. They are the concepts needed -to capture the memory layout and data access of a data frame. - - -Design decisions ----------------- - -**1. Use a separate column abstraction in addition to a dataframe interface.** - -Rationales: -- This is how it works in R, Julia and Apache Arrow. 
-- Semantically most existing applications and users treat a column similar to a 1-D array -- We should be able to connect a column to the array data interchange mechanism(s) - -Note that this does not imply a library must have such a public user-facing -abstraction (ex. ``pandas.Series``) - it can only be accessed via ``__dataframe__``. - -**2. Use methods and properties on an opaque object rather than returning -hierarchical dictionaries describing memory** - -This is better for implementations that may rely on, for example, lazy -computation. - -**3. No row names. If a library uses row names, use a regular column for them.** - -See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241 -Optional row names are not a good idea, because people will assume they're present -(see cuDF experience, forced to add because pandas has them). -Requiring row names seems worse than leaving them out. - -Note that row labels could be added in the future - right now there's no clear -requirements for more complex row labels that cannot be represented by a single -column. These do exist, for example Modin has has table and tree-based row -labels. - -""" - - class Buffer: """ Data in the buffer is guaranteed to be contiguous in memory. diff --git a/protocol/dataframe_protocol_summary.md b/protocol/design_requirements.md similarity index 76% rename from protocol/dataframe_protocol_summary.md rename to protocol/design_requirements.md index 9b6647a4..25406141 100644 --- a/protocol/dataframe_protocol_summary.md +++ b/protocol/design_requirements.md @@ -1,94 +1,7 @@ -# The `__dataframe__` protocol - -This document aims to describe the scope of the dataframe interchange protocol, -as well as its essential design requirements/principles and the functionality -it needs to support. - - -## Purpose of `__dataframe__` - -The purpose of `__dataframe__` is to be a _data interchange_ protocol. 
I.e., -a way to convert one type of dataframe into another type (for example, -convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into -a Vaex dataframe). - -Currently (June 2020) there is no way to do this in an -implementation-independent way. - -The main use case this protocol intends to enable is to make it possible to -write code that can accept any type of dataframe instead of being tied to a -single type of dataframe. To illustrate that: - -```python -def somefunc(df, ...): - """`df` can be any dataframe supporting the protocol, rather than (say) - only a pandas.DataFrame""" - # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)` - # note: this should throw a TypeError if it cannot be done without a device - # transfer (e.g. move data from GPU to CPU) - add `force=True` in that case - new_pandas_df = pd.from_dataframe(df) - # From now on, use Pandas dataframe internally -``` - -### Non-goals - -Providing a _complete, standardized dataframe API_ is not a goal of the -`__dataframe__` protocol. Instead, this is a goal of the full dataframe API -standard, which the Consortium for Python Data API Standards aims to provide -in the future. When that full API standard is implemented by dataframe -libraries, the example above can change to: - -```python -def get_df_module(df): - """Utility function to support programming against a dataframe API""" - if hasattr(df, '__dataframe_namespace__'): - # Retrieve the namespace - pdx = df.__dataframe_namespace__() - else: - # Here we can raise an exception if we only want to support compliant dataframes, - # or convert to our default choice of dataframe if we want to accept (e.g.) 
dicts - pdx = pd - df = pd.DataFrame(df) - - return pdx, df - - -def somefunc(df, ...): - """`df` can be any dataframe conforming to the dataframe API standard""" - pdx, df = get_df_module(df) - # From now on, use `df` methods and `pdx` functions/objects -``` - - -### Constraints - -An important constraint on the `__dataframe__` protocol is that it should not -make achieving the goal of the complete standardized dataframe API more -difficult to achieve. - -There is a small concern here. Say that a library adopts `__dataframe__` first, -and it goes from supporting only Pandas to officially supporting other -dataframes like `modin.pandas.DataFrame`. At that point, changing to -supporting the full dataframe API standard as a next step _implies a -backwards compatibility break_ for users that now start relying on Modin -dataframe support. E.g., the second transition will change from returning a -Pandas dataframe from `somefunc(df_modin)` to returning a Modin dataframe -later. It must be made very clear to libraries accepting `__dataframe__` that -this is a consequence, and that that should be acceptable to them. - - -### Progression / timeline - -- **Current status**: most dataframe-consuming libraries work _only_ with - Pandas, and rely on many Pandas-specific functions, methods and behavior. -- **Status after `__dataframe__`**: with minor code changes (as in first - example above), libraries can start supporting all conforming dataframes, - convert them to Pandas dataframes, and still rely on the same - Pandas-specific functions, methods and behavior. -- **Status after standard dataframe API adoption**: libraries can start - supporting all conforming dataframes _without converting to Pandas or - relying on its implementation details_. At this point, it's possible to - "program to an interface" rather than to a specific library like Pandas. 
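Before the design concepts below, a concrete sketch of what the interchange object exposes may help. This is illustrative only: the method names (`num_columns`, `column_names`, `get_column_by_name`, `get_buffers`, ...) follow the draft interface in `dataframe_protocol.py`, and pandas is assumed here as a producer implementing the protocol (its implementation landed in pandas 1.5, i.e. after this draft):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# The interchange object is obtained via __dataframe__; it is not the
# dataframe itself, only a description of its memory layout and metadata.
dfx = df.__dataframe__()

print(dfx.num_columns())         # 2
print(dfx.num_rows())            # 3
print(list(dfx.column_names()))  # ['a', 'b']

# A column describes a single dtype; here a single chunk backed by one buffer.
col = dfx.get_column_by_name("a")
print(col.num_chunks())          # 1 for a small in-memory pandas column
print(col.dtype)                 # (kind, bit width, format string, byte order)

# The data buffer is the only thing that maps to a 1-D array.
buf, buf_dtype = col.get_buffers()["data"]
print(buf.bufsize)               # 24: three int64 values
```

Note that nothing here is pandas-specific from the consumer's point of view: the same calls work against any producer implementing the protocol.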
+# Design concepts and requirements
+
+This document aims to describe the design requirements and principles of the
+dataframe interchange protocol, and the functionality it needs to support.
 
 ## Conceptual model of a dataframe
@@ -97,7 +10,7 @@
 For a protocol to exchange dataframes between libraries, we need both a model
 of what we mean by "dataframe" conceptually for the purposes of the protocol,
 and a model of how the data is represented in memory:
 
-![Conceptual model of a dataframe, containing chunks, columns and 1-D arrays](images/dataframe_conceptual_model.png)
+![Conceptual model of a dataframe, containing chunks, columns and 1-D arrays](_static/images/dataframe_conceptual_model.png)
 
 The smallest building blocks are **1-D arrays** (or "buffers"), which are
 contiguous in memory and contain data with the same dtype. A **column**
@@ -107,46 +20,64 @@
 A column or a dataframe can be "chunked"; a **chunk** is a subset of a column
 or dataframe that contains a set of (neighboring) rows.
 
+(design-requirements)=
+
 ## Protocol design requirements
 
 1. Must be a standard Python-level API that is unambiguously specified, and
    not rely on implementation details of any particular dataframe library.
+
 2. Must treat dataframes as an ordered collection of columns (which are
    conceptually 1-D arrays with a dtype and missing data support).
+
    _Note: this relates to the API for `__dataframe__`, and does not imply that
    the underlying implementation must use columnar storage!_
+
 3. Must allow the consumer to select a specific set of columns for conversion.
+
 4. Must allow the consumer to access the following "metadata" of the dataframe:
    number of rows, number of columns, column names, column data types.
+
   _Note: this implies that a data type specification needs to be created._
   _Note: column names are required; they must be strings and unique. If a
   dataframe doesn't have them, dummy ones like `'0', '1', ...` can be used._
+
 5. Must include device support.
+
+6. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
+   and provide an explicit way to force such transfers (e.g. a `force=` or
+   `copy=` keyword that the caller can set to `True`).
+
+7. Must be zero-copy wherever possible.
+
+8. Must support missing values (`NA`) for all supported dtypes.
+
+9. Must support string, categorical and datetime dtypes.
+
+10. Must allow the consumer to inspect the representation for missing values
+    that the producer uses for each column or data type. Sentinel values, bit
+    masks, and boolean masks must be supported. Must also be able to define
+    whether the semantic meaning of `NaN` and `NaT` is "not-a-number/datetime"
+    or "missing".
+
+    _Rationale: this enables the consumer to control how conversion happens,
+    for example if the producer uses `-128` as a sentinel value in an `int8`
+    column while the consumer uses a separate bit mask, that information
+    allows the consumer to make this mapping._
+
+11. Must allow the producer to describe its memory layout in sufficient
+    detail. In particular, for missing data and data types that may have
+    multiple in-memory representations (e.g., categorical), those
+    representations must all be describable in order to let the consumer map
+    that to the representation it uses.
+
+    _Rationale: prescribing a single in-memory representation in this
+    protocol would lead to unnecessary copies being made if that
+    representation isn't the native one a library uses._
+
+    _Note: the memory layout is columnar. Row-major dataframes can use this
+    protocol, but not in a zero-copy fashion (see requirement 2 above)._
+
+12. Must support chunking, i.e. accessing the data in "batches" of rows.
+    There must be metadata the consumer can access to learn in how many
+    chunks the data is stored. The consumer may also convert the data in
+    its columns to chunks of a shorter length. Such a request may not be one
+    that forces the producer to concatenate data that is already stored in
+    separate chunks.
+
+    _Rationale: support for chunking is more efficient for libraries that
+    natively store chunks, and it is needed for dataframes that do not fit in
+    memory (e.g. dataframes stored on disk or lazily evaluated)._
+
+13. May (desired, not required) support `__dlpack__` as the array interchange
+    protocol at the individual buffer level for dtypes that are supported by
+    DLPack.
+
+    _Rationale: there is a connection between dataframe and array interchange
+    protocols. If we treat a dataframe as a set of columns which each are a set
+    of 1-D arrays (there may be more than one in the case of using masks for
+    missing data, or in the future for nested dtypes), it may be expected that
+    there is a connection to be made with the array data interchange method.
+    The array interchange is based on DLPack; its major limitation from the
+    point of view of dataframes is the lack of support of all required data
+    types (string, categorical, datetime) and missing data._
+
 
 We'll also list some things that were discussed but are not requirements:
 
 1. Object dtype does not need to be supported
+
 2. Nested/structured dtypes within a single column do not need to be
    supported.
+
   _Rationale: not used a lot, additional design complexity not justified.
   May be added in the future (does have support in the Arrow C Data
   Interface). Also note that Arrow and NumPy structured dtypes have
   different memory layouts, e.g. a `(float, int)` dtype would be stored as
   two separate child arrays in Arrow and as a single `f0, i0, f1, i1, ...`
   interleaved array in NumPy._
+
 3. Extension dtypes, i.e. a way to extend the set of dtypes that is
    explicitly supported, are out of scope.
+
   _Rationale: complex to support, not used enough to justify that
   complexity._
+
 4. Support for strided storage in buffers.
+ _Rationale: this is supported by a subset of dataframes only, mainly those that use NumPy arrays. In many real-world use cases, strided arrays will force a copy at some point, so requiring contiguous memory layout (and hence an extra copy at the moment `__dataframe__` is used) is considered a good trade-off for reduced implementation complexity._ + 5. "virtual columns", i.e. columns for which the data is not yet in memory because it uses lazy evaluation, are not supported other than through letting the producer materialize the data in memory when the consumer calls `__dataframe__`. + _Rationale: the full dataframe API will support this use case by "programming to an interface"; this data interchange protocol is fundamentally built around describing data in memory_. @@ -189,25 +142,13 @@ We'll also list some things that were discussed but are not requirements: ### To be decided -_The connection between dataframe and array interchange protocols_. If we -treat a dataframe as a set of columns which each are a set of 1-D arrays -(there may be more than one in the case of using masks for missing data, or -in the future for nested dtypes), it may be expected that there is a -connection to be made with the array data interchange method. The array -interchange is based on DLPack; its major limitation from the point of view -of dataframes is the lack of support of all required data types (string, -categorical, datetime) and missing data. A requirement could be added that -`__dlpack__` should be supported in case the data types in a column are -supported by DLPack. Missing data via a boolean mask as a separate array -could also be supported. - _Should there be a standard `from_dataframe` constructor function?_ This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely `from_dlpack`. 
Adding at least a recommendation on syntax for this function makes sense, e.g., simply `from_dataframe(df)`. -Discussion at https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685903651 -is relevant. +Discussion at +[dataframe-api/issues/29](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685903651) is relevant. ## Frequently asked questions @@ -220,7 +161,7 @@ except `__dataframe__` is a Python-level rather than C-level interface. The data types format specification of that interface is something that could be used unchanged. -The main limitation is to be that it does not have device support +The main limitation seems to be that Arrow does not have device support -- `@kkraus14` will bring this up on the Arrow dev mailing list. Another identified issue is that the "deleter" on the Arrow C struct is present at the column level, and there are use cases for having it at the buffer level @@ -271,6 +212,11 @@ they are the ones that need such a particular format. So, it can call the constructor it needs. For example, `x = np.asarray(df['colname'])` (where `df` supports `__dataframe__`). +A related question is: can `__array__` and/or `__arrow_array__` be used at the +column level? This is more reasonable, but probably does lead to more +complexity for very limited gains - for an issue with discussion on that, see +[dataframe-api/issues/48](https://github.com/data-apis/dataframe-api/issues/48). + ### Does an interface describing memory work for virtual columns? diff --git a/protocol/index.rst b/protocol/index.rst new file mode 100644 index 00000000..4fc44ade --- /dev/null +++ b/protocol/index.rst @@ -0,0 +1,14 @@ +Python dataframe interchange protocol +===================================== + +Contents +-------- + +.. 
toctree::
+   :caption: Context
+   :maxdepth: 1
+
+   purpose_and_scope
+   design_requirements
+   API
+
diff --git a/protocol/make.bat b/protocol/make.bat
new file mode 100644
index 00000000..2119f510
--- /dev/null
+++ b/protocol/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.http://sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
diff --git a/protocol/purpose_and_scope.md b/protocol/purpose_and_scope.md
new file mode 100644
index 00000000..bf5d5a61
--- /dev/null
+++ b/protocol/purpose_and_scope.md
@@ -0,0 +1,296 @@
+# Purpose and scope
+
+```{note}
+
+This document is ready for wider community feedback, but still contains a
+number of TODOs, and is expected to change and evolve before a first official
+release. At least two independent implementations are also needed, in order
+to validate the design and find potential issues.
+```
+
+## Introduction
+
+Python users today have a number of great choices for dataframe libraries,
+from Pandas and cuDF to Vaex, Koalas, Modin, Ibis, and more. Combining
+multiple types of dataframes in a larger application or analysis workflow, or
+developing a library which uses dataframes as a data structure, presents a
+challenge though.
Those libraries all have different APIs, and there is no standard way +of converting one type of dataframe into another. + + +### This dataframe protocol + +The purpose of this dataframe protocol (`__dataframe__`) is to enable _data +interchange_. I.e., a way to convert one type of dataframe into another type +(for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF +dataframe into a Vaex dataframe). + +Currently (July 2021) there is no way to do this in an +implementation-independent way. + +A main use case this protocol intends to enable is to make it possible to +write code that can accept any type of dataframe instead of being tied to a +single type of dataframe. To illustrate that: + +```python +def somefunc(df, ...): + """`df` can be any dataframe supporting the protocol, rather than (say) + only a pandas.DataFrame""" + # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)` + # note: this should throw a TypeError if it cannot be done without a device + # transfer (e.g. move data from GPU to CPU) - add `force=True` in that case + new_pandas_df = pd.from_dataframe(df) + # From now on, use Pandas dataframe internally +``` + +It is important to note that providing a _complete, standardized dataframe API_ +is not a goal of the `__dataframe__` protocol. Instead, this is a goal of the +full dataframe API standard, which the Consortium for Python Data API Standards +aims to develop in the future. When that full API standard is implemented by +dataframe libraries, the example above can change to: + +```python +def get_df_module(df): + """Utility function to support programming against a dataframe API""" + if hasattr(df, '__dataframe_namespace__'): + # Retrieve the namespace + pdx = df.__dataframe_namespace__() + else: + # Here we can raise an exception if we only want to support compliant dataframes, + # or convert to our default choice of dataframe if we want to accept (e.g.) 
dicts
+        pdx = pd
+        df = pd.DataFrame(df)
+
+    return pdx, df
+
+
+def somefunc(df, ...):
+    """`df` can be any dataframe conforming to the dataframe API standard"""
+    pdx, df = get_df_module(df)
+    # From now on, use `df` methods and `pdx` functions/objects
+```
+
+
+### History
+
+Dataframe libraries exist in several programming languages, such as
+[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
+[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
+[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.
+
+In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/).
+pandas was initially developed at a hedge fund, with a focus on
+[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
+It was open sourced in 2009, and since then its popularity has grown, expanding
+into many domains outside time series and financial data. While still rich in
+time series functionality, today it is considered a general-purpose dataframe
+library. The original `Panel` class that gave the library its name was
+deprecated in 2017 and removed in 2019, to focus on the main `DataFrame` class.
+
+Internally, pandas is implemented (mostly) on top of NumPy, which is used to
+store the data and to perform many of the operations. Some parts of pandas are
+written in Cython.
+
+Other libraries emerged in recent years to address some of the limitations of
+pandas. In most cases, though, these libraries implemented a public API very
+similar to pandas, to make the transition easier. The next section provides a
+short description of the main dataframe libraries in Python.
+
+#### Python dataframe libraries
+
+[Dask](https://dask.org/) is a task scheduler built in Python, which implements a
+dataframe interface.
+Dask dataframe uses pandas internally in the workers, and provides
+an API similar to pandas, adapted to its distributed and lazy nature.
+
+[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to
+create memory maps that avoid loading data sets into memory. Some parts of Vaex are
+implemented in C++.
+
+[Modin](https://github.com/modin-project/modin) is a distributed dataframe
+library originally built on [Ray](https://github.com/ray-project/ray), but built
+in a more modular way that allows it to also use Dask as a scheduler, or to
+replace the pandas-like public API with a SQLite-like one.
+
+[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top
+of Apache Arrow and RAPIDS. It provides an API similar to pandas.
+
+[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a
+dataframe library that uses Spark as a backend. PySpark's public API is based on
+the original Spark API, and not on pandas.
+
+[Koalas](https://github.com/databricks/koalas) is a dataframe library built on
+top of PySpark that provides a pandas-like API.
+
+[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends.
+It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API into
+SQL statements, executed by the backends. It supports conventional DBMSs, as well
+as big data systems such as Apache Impala or BigQuery.
+
+#### History of this dataframe protocol
+
+While there is no dataframe protocol like the one described in this document in
+Python yet, there is a long history of _array_ interchange protocols - the
+Python buffer protocol, various NumPy protocols like `__array_interface__`,
+DLPack, and more.
+
+A number of people have discussed creating a similar protocol for dataframes.
Such discussions gained momentum when Gael Varoquaux discussed the possibility
of a dataframe interchange protocol last year in a
[Discourse thread](https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/26).
In response, Wes McKinney implemented an initial
[prototype](https://github.com/wesm/dataframe-protocol/pull/1). The
conversation and prototype generated a number of good ideas and stimulating
discussions; however, the topic was complex enough to necessitate a more
comprehensive approach, including collecting requirements and use cases from
a large set of stakeholders. This protocol is a natural follow-up to those
early discussions, and is taking exactly such a comprehensive approach.


(Scope)=

## Scope (includes out-of-scope / non-goals)

This section outlines what is in scope and out of scope for this dataframe
interchange protocol.

### In scope

The scope of the dataframe interchange protocol includes:

- Functionality which needs to be included in a dataframe library for it to
  support this protocol.
- Names of the relevant methods and functions.
- Function signatures, including type annotations.
- Semantics of functions and methods.
- Data type and device support.
- Memory ownership and lifetime.
- Basic dataframe metadata.


### Out of scope

1. Providing a full dataframe API is out of scope.

   _Rationale: this is a much larger undertaking._

2. Non-Python API standardization (e.g., C/C++ APIs) is out of scope.

3. Standardization of these dtypes is out of scope: object dtype,
   nested/structured dtypes, and custom dtypes via an extension mechanism.

   _Rationale: object dtypes are inefficient and may contain anything (so they
   are hard to support in a sensible way); nested/structured dtypes may be
   supported in the future but are not used that much and are complex to
   implement; custom dtypes would add design complexity that is not justified._

4. Strided data storage, i.e.
data that is regularly laid out but not
   contiguous in memory, is out of scope.

   _Rationale: not all libraries support strided data (e.g., Apache Arrow).
   Adding support for it to avoid copies may not have many real-world benefits._

5. "Virtual columns", i.e. columns for which the data is not yet in memory
   because of lazy evaluation, are not supported other than through
   letting the producer materialize the data in memory when the consumer
   calls `__dataframe__`.

   _Rationale: the full dataframe API will support this use case by
   "programming to an interface"; this data interchange protocol is
   fundamentally built around describing data in memory._


**Non-goals** for the API standard include:

- Providing a full dataframe API to enable "programming to an API".

### Constraints

An important constraint on the `__dataframe__` protocol is that it should not
make the goal of a complete standardized dataframe API more difficult to
achieve.

There is a small concern here. Say that a library adopts `__dataframe__` first,
and it goes from supporting only pandas to officially supporting other
dataframes like `modin.pandas.DataFrame`. At that point, moving on to support
the full dataframe API standard as a next step _implies a backwards
compatibility break_ for users that by then rely on the Modin dataframe
support. E.g., `somefunc(df_modin)` would change from returning a pandas
dataframe to returning a Modin dataframe. It must be made very clear to
libraries adopting `__dataframe__` that this is a consequence, and that it
should be acceptable to them.


### Progression / timeline

- **Current status**: most dataframe-consuming libraries work _only_ with
  pandas, and rely on many pandas-specific functions, methods and behavior.
- **Status after `__dataframe__`**: with minor code changes (as in the first
  example above), libraries can start supporting all conforming dataframes,
  convert them to pandas dataframes, and still rely on the same
  pandas-specific functions, methods and behavior.
- **Status after standard dataframe API adoption**: libraries can start
  supporting all conforming dataframes _without converting to pandas or
  relying on its implementation details_. At this point, it's possible to
  "program to an interface" rather than to a specific library like pandas.


## Stakeholders

Dataframes are a key element of data science workflows and applications. Hence
there are many stakeholders for a dataframe protocol like this. The _direct_
stakeholders of this standard are the authors/maintainers of Python dataframe
libraries. There are many more types of _indirect_ stakeholders though,
including:

- maintainers of libraries and other programs which depend on dataframe libraries
  (called "dataframe-consuming libraries" in the rest of this document)
- Python dataframe end users
- authors of non-Python dataframe libraries

Libraries that are being most actively considered during the creation of the
first version of this protocol include:

- [pandas](https://pandas.pydata.org)
- [Dask](https://dask.org/)
- [cuDF](https://github.com/rapidsai/cudf)
- [Vaex](https://vaex.io/)
- [Modin](https://github.com/modin-project/modin)
- [Koalas](https://github.com/databricks/koalas)
- [Ibis](https://ibis-project.org/)

Other Python dataframe libraries that are currently under active development and
could adopt this protocol include:

- [PySpark](https://spark.apache.org/docs/latest/api/python/)
- [Turi Create](https://github.com/apple/turicreate)

Other relevant projects that provide "infrastructure" for dataframe libraries
to build on include:

- [Apache Arrow](https://arrow.apache.org/) and its Python bindings [PyArrow](https://arrow.apache.org/docs/python)
- [NumPy](https://numpy.org/)

There are a lot of dataframe-consuming libraries; some of the most
prominent ones include:

- [scikit-learn](https://scikit-learn.org/)
- [Matplotlib](https://matplotlib.org/)
- [Seaborn](https://seaborn.pydata.org/)
- [pyjanitor](https://pyjanitor.readthedocs.io/)

Compilers, runtimes, and dispatching layers for which this API standard may be
relevant:

- TODO


(how-to-adopt-this-protocol)=

## How to adopt this protocol

To adopt the protocol, a dataframe library must implement a method named
`__dataframe__` on its dataframe class/object.

_TODO: versioning the protocol_
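To make the adoption path concrete, here is a minimal sketch of what a producing library and a consuming library could look like. Since the protocol's exact signature and return type are still to be specified (see the versioning TODO above), everything here is illustrative only: the names `InterchangeDataFrame`, `MyDataFrame`, and `consume` are hypothetical and not part of the standard.

```python
class InterchangeDataFrame:
    """Toy stand-in for the protocol's DataFrame concept (illustrative only)."""

    def __init__(self, columns):
        # columns: dict mapping unique string column names to value lists
        self._columns = columns

    def column_names(self):
        return list(self._columns)

    def num_chunks(self):
        # A single in-memory chunk in this toy example
        return 1


class MyDataFrame:
    """A producing library's user-facing dataframe object."""

    def __init__(self, data):
        self._data = dict(data)

    def __dataframe__(self):
        # Entry point of the protocol: return the interchange object,
        # not the user-facing dataframe itself.
        return InterchangeDataFrame(self._data)


def consume(df):
    """A consumer accepts any object exposing __dataframe__."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not support the dataframe interchange protocol")
    xdf = df.__dataframe__()
    return xdf.column_names()


print(consume(MyDataFrame({"a": [1, 2], "b": [3, 4]})))  # → ['a', 'b']
```

The key point of the sketch is the separation of concerns: the consumer never touches `MyDataFrame` directly, only the opaque object returned by `__dataframe__`, which is what allows any conforming producer to be plugged in.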