DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,10 +37,17 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a | |
|
||
.. code-block:: text | ||
|
||
{'index_columns': ['__index_level_0__', '__index_level_1__', ...], | ||
{'index_columns': [<descr0>, <descr1>, ...], | ||
'column_indexes': [<ci0>, <ci1>, ..., <ciN>], | ||
'columns': [<c0>, <c1>, ...], | ||
'pandas_version': $VERSION} | ||
'pandas_version': $VERSION, | ||
'creator': { | ||
'library': $LIBRARY, | ||
'version': $LIBRARY_VERSION | ||
}} | ||
|
||
The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are | ||
dictionaries with values as described below. | ||
|
||
Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata | ||
for each column, *including the index columns*. This has JSON form: | ||
|
@@ -53,26 +60,42 @@ for each column, *including the index columns*. This has JSON form: | |
'numpy_type': numpy_type, | ||
'metadata': metadata} | ||
|
||
.. note:: | ||
See below for the detailed specification for these | ||
|
||
Index Metadata Descriptors | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
``RangeIndex`` can be stored as metadata only, not requiring serialization. The | ||
descriptor format for these is as follows: | ||
|
||
.. code-block:: python | ||
|
||
{'kind': range, | ||
Review comment: Should the
Reply: yes, will fix |
||
'name': index.name, | ||
'start': index._start, | ||
Review comment: #25720 was just merged, prob could just change this (though of course implementing this for <0.25.x will still require this) |
||
'stop': index._stop, | ||
'step': index._step} | ||
|
||
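As a rough illustration of why a range index needs no data column, the sketch below rebuilds the index values from a descriptor alone. The helper name ``range_from_descriptor`` is hypothetical and not part of pandas or pyarrow, the built-in ``range`` stands in for ``RangeIndex`` to keep the example dependency-free, and the quoted ``'range'`` value for ``kind`` is an assumption.

```python
def range_from_descriptor(descr):
    """Rebuild index values from a 'range' descriptor (hypothetical helper)."""
    # Assumption: 'kind' holds the quoted string 'range'.
    if descr.get('kind') != 'range':
        raise ValueError('not a range descriptor: %r' % (descr,))
    # start, stop, and step fully determine the values, so nothing
    # has to be read from the serialized data columns.
    return range(descr['start'], descr['stop'], descr['step'])


descr = {'kind': 'range', 'name': None, 'start': 0, 'stop': 6, 'step': 2}
values = list(range_from_descriptor(descr))
```

Here ``values`` holds the three index labels implied by the descriptor, without any index column having been written.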
Every index column is stored with a name matching the pattern | ||
``__index_level_\d+__`` and its corresponding column information can be | ||
found with the following code snippet. | ||
Other index types must be serialized as data columns along with the other | ||
DataFrame columns. The metadata for these is a dict with ``kind`` field | ||
``'serialized'`` and ``'field_name'`` field indicating which data column | ||
contains the index data. For example, | ||
|
||
Following this naming convention isn't strictly necessary, but strongly | ||
suggested for compatibility with Arrow. | ||
.. code-block:: python | ||
|
||
Here's an example of how the index metadata is structured in pyarrow: | ||
{'kind': 'serialized', | ||
'field_name': '__index_level_0__'} | ||
|
||
.. code-block:: python | ||
Every index column is stored with a name matching the pattern | ||
``__index_level_\d+__``. Following this naming convention isn't strictly | ||
necessary, but strongly suggested for compatibility with Arrow and | ||
disambiguation. The ``'field_name'`` is the actual name of the column in the | ||
serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute, | ||
then it can be found in the ``name`` field of the metadata for that serialized | ||
data column as described below. | ||
Review comment: Is this actually correct? (although it was already there in the doc before, to be clear) But this is not what I see from a quick test:
(I remember that we had a discussion about this before, but can't directly remember the outcome of that)
Reply: It appears that the current behavior is to use the name of the index if it does not conflict with any of the data columns. So we should update these docs to reflect that |
||
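The descriptor forms above suggest how a reader might locate the index data columns. The following is a minimal sketch, not pyarrow's actual implementation: ``serialized_index_fields`` is a hypothetical helper, and treating a plain string entry as a field name is an assumption based on the pre-change ``'index_columns'`` format shown in this diff.

```python
import re

# Naming pattern suggested by the text; following it is not mandatory.
INDEX_LEVEL_PATTERN = re.compile(r'__index_level_\d+__')


def serialized_index_fields(index_columns):
    """Return the field names of index levels stored as data columns."""
    fields = []
    for descr in index_columns:
        if isinstance(descr, str):
            # Assumption: the older format lists the field name directly.
            fields.append(descr)
        elif descr.get('kind') == 'serialized':
            fields.append(descr['field_name'])
        # 'range' descriptors have no data column, so they are skipped.
    return fields


new_style = [
    {'kind': 'serialized', 'field_name': '__index_level_0__'},
    {'kind': 'range', 'name': None, 'start': 0, 'stop': 3, 'step': 1},
]
old_style = ['__index_level_0__', '__index_level_1__']

new_fields = serialized_index_fields(new_style)
old_fields = serialized_index_fields(old_style)
follows_convention = [
    bool(INDEX_LEVEL_PATTERN.fullmatch(name)) for name in new_fields
]
```

Because the naming convention is only a recommendation, the sketch reads ``'field_name'`` rather than inferring the column from the pattern; the pattern check is shown separately.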
|
||
# assuming there's at least 3 levels in the index | ||
index_columns = metadata['index_columns'] # noqa: F821 | ||
columns = metadata['columns'] # noqa: F821 | ||
ith_index = 2 | ||
assert index_columns[ith_index] == '__index_level_2__' | ||
ith_index_info = columns[-len(index_columns):][ith_index] | ||
ith_index_level_name = ith_index_info['name'] | ||
Column Metadata | ||
~~~~~~~~~~~~~~~ | ||
|
||
``pandas_type`` is the logical type of the column, and is one of: | ||
|
||
|
@@ -121,7 +144,8 @@ As an example of fully-formed metadata: | |
|
||
.. code-block:: text | ||
|
||
{'index_columns': ['__index_level_0__'], | ||
{'index_columns': [{'kind': 'serialized', | ||
'field_name': '__index_level_0__'}], | ||
'column_indexes': [ | ||
{'name': None, | ||
'field_name': 'None', | ||
|
@@ -161,4 +185,8 @@ As an example of fully-formed metadata: | |
'numpy_type': 'int64', | ||
'metadata': None} | ||
], | ||
'pandas_version': '0.20.0'} | ||
'pandas_version': '0.20.0', | ||
'creator': { | ||
'library': 'pyarrow', | ||
'version': '0.13.0' | ||
}} |