From 766aa5070d5482d5e0d428781b0674211cb0b7b7 Mon Sep 17 00:00:00 2001
From: Wes McKinney <wesm+git@apache.org>
Date: Wed, 13 Mar 2019 09:55:44 -0500
Subject: [PATCH 1/5] Add specification for RangeIndex-as-metadata in Parquet
 file schema custom metadata

---
 doc/source/development/developer.rst | 66 ++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 19 deletions(-)
diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index a283920ae4377..1d84a1940de56 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -37,10 +37,17 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
 
 .. code-block:: text
 
-   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
+   {'index_columns': [<descr0>, <descr1>, ...],
     'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
     'columns': [<c0>, <c1>, ...],
-    'pandas_version': $VERSION}
+    'pandas_version': $VERSION,
+    'creator': {
+      'library': $LIBRARY,
+      'version': $LIBRARY_VERSION
+    }}
+
+The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
+dictionaries with values as described below.
 
 Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
 for each column, *including the index columns*. This has JSON form:
@@ -53,26 +60,42 @@ for each column, *including the index columns*. This has JSON form:
     'numpy_type': numpy_type,
     'metadata': metadata}
 
-.. note::
+See below for the detailed specification for these
+
+Index Metadata Descriptors
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``RangeIndex`` can be stored as metadata only, not requiring serialization. The
+descriptor format for these is as follows:
+
+.. code-block:: python
+
+   {'kind': range,
+    'name': index.name,
+    'start': index._start,
+    'stop': index._stop,
+    'step': index._step}
 
-   Every index column is stored with a name matching the pattern
-   ``__index_level_\d+__`` and its corresponding column information is can be
-   found with the following code snippet.
+Other index types must be serialized as data columns along with the other
+DataFrame columns. The metadata for these is a dict with ``kind`` field
+``'serialized'`` and ``'field_name'`` field indicating which data column
+contains the index data. For example,
 
-   Following this naming convention isn't strictly necessary, but strongly
-   suggested for compatibility with Arrow.
+.. code-block:: python
 
-   Here's an example of how the index metadata is structured in pyarrow:
+   {'kind': 'serialized',
+    'field_name': '__index_level_0__'}
 
-    .. code-block:: python
+Every index column is stored with a name matching the pattern
+``__index_level_\d+__``. Following this naming convention isn't strictly
+necessary, but strongly suggested for compatibility with Arrow and
+disambiguation. The ``'field_name'`` is the actual name of the column in the
+serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
+then it can be found in the ``name`` field of the metadata for that serialized
+data column as described below.
 
-       # assuming there's at least 3 levels in the index
-       index_columns = metadata['index_columns']  # noqa: F821
-       columns = metadata['columns']  # noqa: F821
-       ith_index = 2
-       assert index_columns[ith_index] == '__index_level_2__'
-       ith_index_info = columns[-len(index_columns):][ith_index]
-       ith_index_level_name = ith_index_info['name']
+Column Metadata
+~~~~~~~~~~~~~~~
 
 ``pandas_type`` is the logical type of the column, and is one of:
 
@@ -121,7 +144,8 @@ As an example of fully-formed metadata:
 
 .. code-block:: text
 
-   {'index_columns': ['__index_level_0__'],
+   {'index_columns': [{'kind': 'serialized',
+                       'field_name': '__index_level_0__'}],
     'column_indexes': [
         {'name': None,
          'field_name': 'None',
@@ -161,4 +185,8 @@ As an example of fully-formed metadata:
          'numpy_type': 'int64',
          'metadata': None}
     ],
-    'pandas_version': '0.20.0'}
+    'pandas_version': '0.20.0',
+    'creator': {
+      'library': 'pyarrow',
+      'version': '0.13.0'
+    }}

From 2c8431c38cf192d314c3ec33da32214eea7f378f Mon Sep 17 00:00:00 2001
From: Wes McKinney <wesm+git@apache.org>
Date: Wed, 13 Mar 2019 12:50:09 -0500
Subject: [PATCH 2/5] Add string quotes to range

---
 doc/source/development/developer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 1d84a1940de56..39c88a57ce96d 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -70,7 +70,7 @@ descriptor format for these as is follows:
 
 .. code-block:: python
 
-   {'kind': range,
+   {'kind': 'range',
     'name': index.name,
     'start': index._start,
     'stop': index._stop,

From 931ca2c60ee865a055950cb9277849634eae311c Mon Sep 17 00:00:00 2001
From: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Date: Wed, 13 Mar 2019 12:43:48 -0700
Subject: [PATCH 3/5] Update doc/source/development/developer.rst

Co-Authored-By: wesm <wesm@users.noreply.github.com>
---
 doc/source/development/developer.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 39c88a57ce96d..fafa94588aa84 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -70,6 +70,7 @@ descriptor format for these as is follows:
 
 .. code-block:: python
 
+   index = pd.RangeIndex(0, 10, 2)
    {'kind': 'range',
     'name': index.name,
     'start': index._start,

From d3cd9048819a375a828fd9f96a764b4108a82826 Mon Sep 17 00:00:00 2001
From: Wes McKinney <wesm+git@apache.org>
Date: Mon, 1 Apr 2019 21:10:34 -0500
Subject: [PATCH 4/5] Revert to current scheme of index column names in
 'index_columns' for non-RangeIndex

---
 doc/source/development/developer.rst | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index fafa94588aa84..7717896ff8036 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -78,22 +78,16 @@ descriptor format for these as is follows:
     'step': index._step}
 
 Other index types must be serialized as data columns along with the other
-DataFrame columns. The metadata for these is a dict with ``kind`` field
-``'serialized'`` and ``'field_name'`` field indicating which data column
-contains the index data. For example,
+DataFrame columns. The metadata for these is a string indicating the name of
+the field in the data columns, for example ``'__index_level_0__'``.
 
-.. code-block:: python
-
-   {'kind': 'serialized',
-    'field_name': '__index_level_0__'}
-
-Every index column is stored with a name matching the pattern
-``__index_level_\d+__``. Following this naming convention isn't strictly
-necessary, but strongly suggested for compatibility with Arrow and
-disambiguation. The ``'field_name'`` is the actual name of the column in the
-serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
-then it can be found in the ``name`` field of the metadata for that serialized
-data column as described below.
+If an index has a non-None ``name`` attribute, and there is no other column
+with a name matching that value, then the ``index.name`` value can be used as
+the descriptor. Otherwise (for unnamed indexes and ones with names colliding
+with other column names) a disambiguating name matching the pattern
+``__index_level_\d+__`` should be used. When a named index is serialized as a
+data column, its ``name`` attribute is always stored in the column descriptor
+as described above.
 
 Column Metadata
 ~~~~~~~~~~~~~~~
@@ -145,8 +139,7 @@ As an example of fully-formed metadata:
 
 .. code-block:: text
 
-   {'index_columns': [{'kind': 'serialized',
-                       'field_name': '__index_level_0__'}],
+   {'index_columns': ['__index_level_0__'],
     'column_indexes': [
         {'name': None,
          'field_name': 'None',

From 10d1e86043ec3c04a2b681f8d254b560c6ca8658 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Date: Thu, 8 Aug 2019 14:25:18 +0200
Subject: [PATCH 5/5] small clean-up

---
 doc/source/development/developer.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index 7717896ff8036..923ef005d5926 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -47,9 +47,9 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
     }}
 
 The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
-dictionaries with values as described below.
+strings (referring to a column) or dictionaries with values as described below.
 
-Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
+The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
 for each column, *including the index columns*. This has JSON form:
 
 .. code-block:: text
@@ -60,7 +60,7 @@ for each column, *including the index columns*. This has JSON form:
     'numpy_type': numpy_type,
     'metadata': metadata}
 
-See below for the detailed specification for these
+See below for the detailed specification for these.
 
 Index Metadata Descriptors
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -73,9 +73,9 @@ descriptor format for these as is follows:
    index = pd.RangeIndex(0, 10, 2)
    {'kind': 'range',
     'name': index.name,
-    'start': index._start,
-    'stop': index._stop,
-    'step': index._step}
+    'start': index.start,
+    'stop': index.stop,
+    'step': index.step}
 
 Other index types must be serialized as data columns along with the other
 DataFrame columns. The metadata for these is a string indicating the name of