From 766aa5070d5482d5e0d428781b0674211cb0b7b7 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Wed, 13 Mar 2019 09:55:44 -0500
Subject: [PATCH 1/5] Add specification for RangeIndex-as-metadata in Parquet file schema custom metadata

---
 doc/source/development/developer.rst | 66 ++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 19 deletions(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index a283920ae4377..1d84a1940de56 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -37,10 +37,17 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
 
 .. code-block:: text
 
-   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
+   {'index_columns': [<descr0>, <descr1>, ...],
     'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
     'columns': [<c0>, <c1>, ...],
-    'pandas_version': $VERSION}
+    'pandas_version': $VERSION,
+    'creator': {
+      'library': $LIBRARY,
+      'version': $LIBRARY_VERSION
+    }}
+
+The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
+dictionaries with values as described below.
 
 Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
 for each column, *including the index columns*. This has JSON form:
@@ -53,26 +60,42 @@ for each column, *including the index columns*. This has JSON form:
     'numpy_type': numpy_type,
     'metadata': metadata}
 
-.. note::
+See below for the detailed specification for these
+
+Index Metadata Descriptors
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``RangeIndex`` can be stored as metadata only, not requiring serialization. The
+descriptor format for these as is follows:
+
+.. code-block:: python
+
+    {'kind': range,
+     'name': index.name,
+     'start': index._start,
+     'stop': index._stop,
+     'step': index._step}
 
-   Every index column is stored with a name matching the pattern
-   ``__index_level_\d+__`` and its corresponding column information is can be
-   found with the following code snippet.
+Other index types must be serialized as data columns along with the other
+DataFrame columns. The metadata for these is a dict with ``kind`` field
+``'serialized'`` and ``'field_name'`` field indicating which data column
+contains the index data. For example,
 
-   Following this naming convention isn't strictly necessary, but strongly
-   suggested for compatibility with Arrow.
+.. code-block:: python
 
-   Here's an example of how the index metadata is structured in pyarrow:
+    {'kind': 'serialized',
+     'field_name': '__index_level_0__'}
 
-   .. code-block:: python
+Every index column is stored with a name matching the pattern
+``__index_level_\d+__``. Following this naming convention isn't strictly
+necessary, but strongly suggested for compatibility with Arrow and
+disambiguation. The ``'field_name'`` is the actual name of the column in the
+serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
+then it can be found in the ``name`` field of the metadata for that serialized
+data column as described below.
 
-      # assuming there's at least 3 levels in the index
-      index_columns = metadata['index_columns']  # noqa: F821
-      columns = metadata['columns']  # noqa: F821
-      ith_index = 2
-      assert index_columns[ith_index] == '__index_level_2__'
-      ith_index_info = columns[-len(index_columns):][ith_index]
-      ith_index_level_name = ith_index_info['name']
+Column Metadata
+~~~~~~~~~~~~~~~
 
 ``pandas_type`` is the logical type of the column, and is one of:
 
@@ -121,7 +144,8 @@ As an example of fully-formed metadata:
 
 .. code-block:: text
 
-   {'index_columns': ['__index_level_0__'],
+   {'index_columns': [{'kind': 'serialized',
+                       'field_name': '__index_level_0__'}],
     'column_indexes': [
         {'name': None,
          'field_name': 'None',
@@ -161,4 +185,8 @@ As an example of fully-formed metadata:
          'numpy_type': 'int64',
          'metadata': None}
     ],
-    'pandas_version': '0.20.0'}
+    'pandas_version': '0.20.0',
+    'creator': {
+      'library': 'pyarrow',
+      'version': '0.13.0'
+    }}

From 2c8431c38cf192d314c3ec33da32214eea7f378f Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Wed, 13 Mar 2019 12:50:09 -0500
Subject: [PATCH 2/5] Add string quotes to range

---
 doc/source/development/developer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 1d84a1940de56..39c88a57ce96d 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -70,7 +70,7 @@ descriptor format for these as is follows:
 
 .. code-block:: python
 
-    {'kind': range,
+    {'kind': 'range',
      'name': index.name,
      'start': index._start,
      'stop': index._stop,

From 931ca2c60ee865a055950cb9277849634eae311c Mon Sep 17 00:00:00 2001
From: Tom Augspurger
Date: Wed, 13 Mar 2019 12:43:48 -0700
Subject: [PATCH 3/5] Update doc/source/development/developer.rst

Co-Authored-By: wesm
---
 doc/source/development/developer.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 39c88a57ce96d..fafa94588aa84 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -70,6 +70,7 @@ descriptor format for these as is follows:
 
 .. code-block:: python
 
+    index = pd.RangeIndex(0, 10, 2)
     {'kind': 'range',
      'name': index.name,
      'start': index._start,

From d3cd9048819a375a828fd9f96a764b4108a82826 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 1 Apr 2019 21:10:34 -0500
Subject: [PATCH 4/5] Revert to current scheme of index column names in 'index_columns' for non-RangeIndex

---
 doc/source/development/developer.rst | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index fafa94588aa84..7717896ff8036 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -78,22 +78,16 @@ descriptor format for these as is follows:
      'step': index._step}
 
 Other index types must be serialized as data columns along with the other
-DataFrame columns. The metadata for these is a dict with ``kind`` field
-``'serialized'`` and ``'field_name'`` field indicating which data column
-contains the index data. For example,
+DataFrame columns. The metadata for these is a string indicating the name of
+the field in the data columns, for example ``'__index_level_0__'``.
 
-.. code-block:: python
-
-    {'kind': 'serialized',
-     'field_name': '__index_level_0__'}
-
-Every index column is stored with a name matching the pattern
-``__index_level_\d+__``. Following this naming convention isn't strictly
-necessary, but strongly suggested for compatibility with Arrow and
-disambiguation. The ``'field_name'`` is the actual name of the column in the
-serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
-then it can be found in the ``name`` field of the metadata for that serialized
-data column as described below.
+If an index has a non-None ``name`` attribute, and there is no other column
+with a name matching that value, then the ``index.name`` value can be used as
+the descriptor. Otherwise (for unnamed indexes and ones with names colliding
+with other column names) a disambiguating name with pattern matching
+``__index_level_\d+__`` should be used. In cases of named indexes as data
+columns, ``name`` attribute is always stored in the column descriptors as
+above.
 
 Column Metadata
 ~~~~~~~~~~~~~~~
@@ -145,8 +139,7 @@ As an example of fully-formed metadata:
 
 .. code-block:: text
 
-   {'index_columns': [{'kind': 'serialized',
-                       'field_name': '__index_level_0__'}],
+   {'index_columns': ['__index_level_0__'],
     'column_indexes': [
         {'name': None,
          'field_name': 'None',

From 10d1e86043ec3c04a2b681f8d254b560c6ca8658 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Thu, 8 Aug 2019 14:25:18 +0200
Subject: [PATCH 5/5] small clean-up

---
 doc/source/development/developer.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 7717896ff8036..923ef005d5926 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -47,9 +47,9 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
     }}
 
 The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
-dictionaries with values as described below.
+strings (referring to a column) or dictionaries with values as described below.
 
-Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
+The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
 for each column, *including the index columns*. This has JSON form:
 
 .. code-block:: text
@@ -60,7 +60,7 @@ for each column, *including the index columns*. This has JSON form:
     'numpy_type': numpy_type,
     'metadata': metadata}
 
-See below for the detailed specification for these
+See below for the detailed specification for these.
 
 Index Metadata Descriptors
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -73,9 +73,9 @@ descriptor format for these as is follows:
     index = pd.RangeIndex(0, 10, 2)
     {'kind': 'range',
      'name': index.name,
-     'start': index._start,
-     'stop': index._stop,
-     'step': index._step}
+     'start': index.start,
+     'stop': index.stop,
+     'step': index.step}
 
 Other index types must be serialized as data columns along with the other
 DataFrame columns. The metadata for these is a string indicating the name of
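
Review note: the descriptor rules that the series ends up with (after PATCH 4/5 and 5/5) can be sketched roughly as below. This is only an illustration of the wording in the patches, not the pyarrow or pandas implementation. The helper names `range_descriptor` and `serialized_descriptors` are hypothetical, and a built-in Python `range` stands in for `pd.RangeIndex`, since both expose `start`, `stop` and `step`.

```python
def range_descriptor(rng, name=None):
    """Metadata-only descriptor for a range-like index (hypothetical helper).

    `rng` is assumed to expose `start`, `stop` and `step`, as a built-in
    `range` does; it stands in for `pd.RangeIndex` here.
    """
    return {
        "kind": "range",
        "name": name,
        "start": rng.start,
        "stop": rng.stop,
        "step": rng.step,
    }


def serialized_descriptors(index_names, column_names):
    """String descriptors for index levels stored as data columns.

    A level's own name is used when it is not None and does not collide
    with a data column name; otherwise the disambiguating pattern
    `__index_level_<i>__` from the spec is used.
    """
    taken = set(column_names)
    descriptors = []
    for level, name in enumerate(index_names):
        if name is not None and name not in taken:
            descriptors.append(name)
        else:
            descriptors.append(f"__index_level_{level}__")
    return descriptors


# A range index is described by metadata alone.
range_meta = range_descriptor(range(0, 10, 2))

# Level 0 is named and unique, level 1 is unnamed, level 2 collides with a column.
serialized = serialized_descriptors(["key", None, "a"], ["a", "b"])
```

Under these assumptions the second call yields the level's own name for `key` and `__index_level_N__` names for the unnamed and colliding levels, matching the fallback described in PATCH 4/5.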