Commit 47caeb8

Draft metadata specification doc for Apache Parquet
1 parent d92f06a commit 47caeb8

2 files changed

+95
-0
lines changed

doc/source/index.rst.template

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
 comparison_with_r
 comparison_with_sql
 comparison_with_sas
+metadata
 {% endif -%}
 {% if api -%}
 api

doc/source/metadata.rst

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
.. _metadata:

.. currentmodule:: pandas

**********************************************
Storing pandas Objects in Various File Formats
**********************************************

This document provides specifications for metadata to assist with reading and
writing pandas objects to different third-party file formats.

Apache Parquet
--------------

The Apache Parquet format provides key-value metadata at the file and column
level, stored in the footer of the Parquet file:

.. code-block:: shell

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: shell

   struct KeyValue {
     1: required string key
     2: optional string value
   }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key whose value is stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...]}

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. Each has the JSON form:

.. code-block:: text

   {'name': column_name,
    'type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': type_metadata}
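
Because a Parquet ``KeyValue`` value is a string, the structure above has to be
serialized to text before it can be stored. The following is a minimal sketch,
assuming the value is encoded with Python's standard ``json`` module. The helper
name ``make_column_entry`` is hypothetical and not part of pandas, and the
``numpy_type`` key follows the fully-formed example in this document.

.. code-block:: python

   import json


   def make_column_entry(name, pandas_type, numpy_type, type_metadata=None):
       """Build the per-column metadata dictionary (hypothetical helper)."""
       return {
           "name": name,
           "type": pandas_type,
           "numpy_type": numpy_type,
           "metadata": type_metadata,
       }


   # Top-level structure: index column names plus one entry per column.
   pandas_metadata = {
       "index_columns": ["__index_level_0__"],
       "columns": [
           make_column_entry("c0", "int8", "int8"),
           make_column_entry("__index_level_0__", "int64", "int64"),
       ],
   }

   # Serialize to a string suitable for a KeyValue value.
   # Python's None is written as JSON null.
   serialized = json.dumps(pandas_metadata)

Note that the examples in this document use Python literals such as ``None``,
whereas the serialized text in this sketch contains ``null``.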

``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``

The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]``, and for ``categorical`` it may be
any of the supported integer types used to store the category codes.

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
* ``categorical``: ``{'num_categories': K}``
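
The rules above for ``pandas_type`` and ``type_metadata`` can be checked
mechanically. The following is a rough sketch rather than anything pandas
provides: ``check_column_entry`` is a hypothetical helper, and requiring
*exactly* the listed metadata keys is an assumption made here for illustration.

.. code-block:: python

   # Logical types listed above.
   PANDAS_TYPES = {
       "bool",
       "int8", "int16", "int32", "int64",
       "uint8", "uint16", "uint32", "uint64",
       "float16", "float32", "float64",
       "datetime", "datetimetz",
       "unicode", "bytes",
       "categorical",
   }

   # Keys that type_metadata carries for the two special cases.
   REQUIRED_METADATA_KEYS = {
       "datetimetz": {"timezone"},
       "categorical": {"num_categories"},
   }


   def check_column_entry(entry):
       """Return a list of problems found in one column entry (hypothetical)."""
       problems = []
       pandas_type = entry.get("type")
       if pandas_type not in PANDAS_TYPES:
           problems.append("unknown pandas_type: %r" % (pandas_type,))
           return problems
       metadata = entry.get("metadata")
       required = REQUIRED_METADATA_KEYS.get(pandas_type)
       if required is None:
           # Every other logical type carries no type metadata.
           if metadata is not None:
               problems.append("metadata must be None for %r" % (pandas_type,))
       elif not isinstance(metadata, dict) or set(metadata) != required:
           problems.append(
               "metadata for %r must have keys %s" % (pandas_type, sorted(required))
           )
       return problems

An entry that follows the specification yields an empty list, while a
``categorical`` entry whose ``metadata`` is ``None`` yields one problem.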

As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'type': 'unicode',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000}},
        {'name': 'c3',
         'type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': '__index_level_0__',
         'type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ]}
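
A reader of such metadata would typically need to tell index columns apart from
data columns before rebuilding the ``DataFrame``. The following is a minimal
sketch of that step, assuming the value was stored as JSON text. The
``split_columns`` helper is hypothetical, and the stored value is an abbreviated
version of the example above.

.. code-block:: python

   import json

   # JSON text as it might be stored under the ``pandas`` key (abbreviated).
   stored_value = json.dumps({
       "index_columns": ["__index_level_0__"],
       "columns": [
           {"name": "c0", "type": "int8", "numpy_type": "int8",
            "metadata": None},
           {"name": "c3", "type": "datetimetz",
            "numpy_type": "datetime64[ns]",
            "metadata": {"timezone": "America/Los_Angeles"}},
           {"name": "__index_level_0__", "type": "int64",
            "numpy_type": "int64", "metadata": None},
       ],
   })


   def split_columns(value):
       """Separate index entries from data entries (hypothetical helper)."""
       metadata = json.loads(value)
       index_names = set(metadata["index_columns"])
       columns = metadata["columns"]
       index_entries = [c for c in columns if c["name"] in index_names]
       data_entries = [c for c in columns if c["name"] not in index_names]
       return index_entries, data_entries


   index_entries, data_entries = split_columns(stored_value)

   # Collect the time zone recorded for each datetimetz data column.
   timezones = {
       c["name"]: c["metadata"]["timezone"]
       for c in data_entries
       if c["type"] == "datetimetz"
   }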
