|
| 1 | +.. _metadata: |
| 2 | + |
| 3 | +.. currentmodule:: pandas |
| 4 | + |
| 5 | +********************************************** |
| 6 | +Storing pandas Objects in Various File Formats |
| 7 | +********************************************** |
| 8 | + |
| 9 | +This document provides specifications for metadata to assist with reading and |
| 10 | +writing pandas objects to different third-party file formats. |
| 11 | + |
| 12 | +Apache Parquet |
| 13 | +-------------- |
| 14 | + |
| 15 | +The Apache Parquet format provides key-value metadata at the file and column |
| 16 | +level, stored in the footer of the Parquet file: |
| 17 | + |
| 18 | +.. code-block:: shell |
| 19 | +
|
| 20 | + 5: optional list<KeyValue> key_value_metadata |
| 21 | +
|
| 22 | +where ``KeyValue`` is |
| 23 | + |
| 24 | +.. code-block:: shell |
| 25 | +
|
| 26 | + struct KeyValue { |
| 27 | + 1: required string key |
| 28 | + 2: optional string value |
| 29 | + } |
| 30 | +
|
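To make the shape of ``key_value_metadata`` concrete, the following is a minimal Python model of the ``KeyValue`` struct, not the API of any Parquet library. The ``KeyValue`` dataclass, the ``lookup`` helper, and the sample ``footer`` entries are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeyValue:
    """Python stand-in for the Thrift struct (illustrative only)."""
    key: str                     # 1: required string key
    value: Optional[str] = None  # 2: optional string value


def lookup(entries: List[KeyValue], key: str) -> Optional[str]:
    """Return the value of the first entry with a matching key, else None."""
    for entry in entries:
        if entry.key == key:
            return entry.value
    return None


# Hypothetical footer contents; the 'pandas' entry holds a string payload.
footer = [KeyValue('writer', 'example'), KeyValue('pandas', '{}')]
pandas_payload = lookup(footer, 'pandas')
```

Because ``value`` is a plain string, any structured metadata has to be serialized to text before it is stored under a key.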
| 31 | +So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a |
| 32 | +``pandas`` metadata key with the value stored as: |
| 33 | + |
| 34 | +.. code-block:: text |
| 35 | +
|
| 36 | + {'index_columns': ['__index_level_0__', '__index_level_1__', ...], |
| 37 | + 'columns': [<c0>, <c1>, ...]} |
| 38 | +
|
| 39 | +Here, ``<c0>`` and so forth are dictionaries containing the metadata for each |
| 40 | +column. Each dictionary has the JSON form: |
| 41 | + |
| 42 | +.. code-block:: text |
| 43 | +
|
| 44 | + {'name': column_name, |
| 45 | + 'type': pandas_type, |
| 46 | + 'numpy_type': numpy_type, |
| 47 | + 'metadata': type_metadata} |
| 48 | +
|
| 49 | +``pandas_type`` is the logical type of the column, and is one of: |
| 50 | + |
| 51 | +* Boolean: ``'bool'`` |
| 52 | +* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'`` |
| 53 | +* Floats: ``'float16', 'float32', 'float64'`` |
| 54 | +* Datetime: ``'datetime', 'datetimetz'`` |
| 55 | +* String: ``'unicode', 'bytes'`` |
| 56 | +* Categorical: ``'categorical'`` |
| 57 | + |
| 58 | +The ``numpy_type`` is the physical storage type of the column, which is the |
| 59 | +result of ``str(dtype)`` for the underlying NumPy array that holds the data. So |
| 60 | +for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be |
| 61 | +any of the supported integer categorical types. |
| 62 | + |
| 63 | +The ``type_metadata`` is ``None`` except for: |
| 64 | + |
| 65 | +* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}`` |
| 66 | +* ``categorical``: ``{'num_categories': K}`` |
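The rules for ``pandas_type`` and ``type_metadata`` can be sketched as a small builder for one column entry. This is an illustration under stated assumptions rather than pandas code: ``PANDAS_TYPES``, ``column_metadata``, and its keyword arguments are hypothetical names, and the validation shown is not required by the specification.

```python
# Logical types listed in this specification.
PANDAS_TYPES = {
    'bool',
    'int8', 'int16', 'int32', 'int64',
    'uint8', 'uint16', 'uint32', 'uint64',
    'float16', 'float32', 'float64',
    'datetime', 'datetimetz',
    'unicode', 'bytes',
    'categorical',
}


def column_metadata(name, pandas_type, numpy_type,
                    timezone=None, num_categories=None):
    """Build one column entry (hypothetical helper, not a pandas API)."""
    if pandas_type not in PANDAS_TYPES:
        raise ValueError('unknown pandas_type: %r' % (pandas_type,))
    if pandas_type == 'datetimetz':
        if timezone is None:
            raise ValueError('datetimetz columns need a timezone')
        type_metadata = {'timezone': timezone}
    elif pandas_type == 'categorical':
        if num_categories is None:
            raise ValueError('categorical columns need num_categories')
        type_metadata = {'num_categories': num_categories}
    else:
        # Every other logical type carries no extra metadata.
        type_metadata = None
    return {'name': name,
            'type': pandas_type,
            'numpy_type': numpy_type,
            'metadata': type_metadata}


c0 = column_metadata('c0', 'int8', 'int8')
c3 = column_metadata('c3', 'datetimetz', 'datetime64[ns]',
                     timezone='America/New_York')
```

Rejecting unknown logical types up front is a design choice of this sketch; a reader of the metadata may prefer to tolerate types it does not recognize.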
| 67 | + |
| 68 | +As an example of fully-formed metadata: |
| 69 | + |
| 70 | +.. code-block:: text |
| 71 | +
|
| 72 | + {'index_columns': ['__index_level_0__'], |
| 73 | + 'columns': [ |
| 74 | + {'name': 'c0', |
| 75 | + 'type': 'int8', |
| 76 | + 'numpy_type': 'int8', |
| 77 | + 'metadata': None}, |
| 78 | + {'name': 'c1', |
| 79 | + 'type': 'unicode', |
| 80 | + 'numpy_type': 'object', |
| 81 | + 'metadata': None}, |
| 82 | + {'name': 'c2', |
| 83 | + 'type': 'categorical', |
| 84 | + 'numpy_type': 'int16', |
| 85 | + 'metadata': {'num_categories': 1000}}, |
| 86 | + {'name': 'c3', |
| 87 | + 'type': 'datetimetz', |
| 88 | + 'numpy_type': 'datetime64[ns]', |
| 89 | + 'metadata': {'timezone': 'America/Los_Angeles'}}, |
| 90 | + {'name': '__index_level_0__', |
| 91 | + 'type': 'int64', |
| 92 | + 'numpy_type': 'int64', |
| 93 | + 'metadata': None} |
| 94 | + ]} |
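Since a ``KeyValue`` value is a string, metadata of this shape has to be serialized before it is written. The sketch below assumes JSON encoding via the Python standard library and uses a shortened, made-up metadata dictionary; note that ``None`` is written as ``null`` in the serialized text.

```python
import json

# Shortened example metadata (illustrative values only).
metadata = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0', 'type': 'int8',
         'numpy_type': 'int8', 'metadata': None},
        {'name': 'c1', 'type': 'unicode',
         'numpy_type': 'object', 'metadata': None},
        {'name': '__index_level_0__', 'type': 'int64',
         'numpy_type': 'int64', 'metadata': None},
    ],
}

# Serialize to the string that would be stored under the 'pandas' key.
serialized = json.dumps(metadata)

# A reader parses the string back and separates index from data columns.
restored = json.loads(serialized)
index_names = set(restored['index_columns'])
data_columns = [col['name'] for col in restored['columns']
                if col['name'] not in index_names]
```

After the round trip, ``restored`` equals the original dictionary, and ``data_columns`` contains only the names that are not listed in ``index_columns``.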