Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
The orient="table"
option of the JSON interface enforces the TableSchema specification but only partially.
The table below shows the deviations from the specification:
Data Data type |
table-schema format |
table-schema type |
pandas Specification orient=’table’ |
pandas read_json orient=’table’ |
---|---|---|---|---|
datetime | default (datetime ISO8601 in UTC) | datetime | datetime64[ns] | ok |
number | default | number | float64 | ok |
integer | default | integer | int64 | ok |
boolean | default | boolean | bool | ok |
string | default | string | object | ok |
custom type | default | any (custom type) | category / string | ok |
string | Format not supported | |||
uri | uri | string | Format not supported | |
binary | Binary (base64 string) | string | Format not supported | |
uuid | uuid | string | Format not supported | |
date, time or datetime with parsable format | any (parsable ?) | date, time, datetime | Format partially supported | |
date, time or datetime with custom format | <PATTERN> | date, time, datetime | Format not supported | |
duration | default (lexical duration ISO8601) | duration | timedelta64[ns] | Read_json not yet implemented |
Json data | default (json) | object | Unsupported | |
Json array | default (json array) | array | Unsupported | |
date | default (date ISO8601) | date | Unsupported | |
time | default (time ISO8601) | time | Unsupported | |
year | default | year | Unsupported | |
month | default | yearmonth | Unsupported | |
Point (string) | default (string “lon, lat”) | geopoint | Unsupported | |
Point (geojson array) | array (array [lon, lat]) | geopoint | Unsupported | |
Point (json object) | object (eg {"lon": 90, "lat": 45}) | geopoint | Unsupported | |
Geometry (geojson) | default (geojson spec) | geojson | Unsupported | |
Geometry (topojson) | Topojson (topojson spec) | geojson | Unsupported | |
Everything (custom type) | (string) | Only ‘any‘ is supported |
Feature Description
proposal
To have a simple and reversible solution which respect the TableSchema specification, I propose to add an 'External Type' (as defined in some JSON export e.g. "ExtDtype":"Float32"
).
This External Type (NTVtype) can be added to the name of DataFrame Columns following NTV format (as defined in Internet-Draft).
This proposal is included in the PDEP-0012
The table below shows the correspondence between TableSchema format / type and pandas NTVtype / dtype:
format / type | NTVtype / dtype |
---|---|
default / datetime | / datetime64[ns] |
default / number | / float64 |
default / integer | / int64 |
default / boolean | / bool |
default / string | / object |
default / duration | / timedelta64[ns] |
email / string | email / string |
uri / string | uri / string |
default / object | object / object |
default / array | array / object |
default / date | date / object |
default / time | time / object |
default / year | year / int64 |
default / yearmonth | month / int64 |
array / geopoint | point / object |
default / geojson | geojson / object |
Note:
- other TableSchema format are defined and are to be studied (uuid, binary, topojson, specific format for geopoint and datation)
- the first six lines correspond to the existing
Issues
The proposal resolves or provides an alternative solution for the following issues: #12997, #14358, #16492, #35420, #35464, #36211, #39537, #49585, #50782, #51375, #52595, #53252
Examples
In [1]: from shapely.geometry import Point
from datetime import date
In [2]: df = pd.DataFrame({
'end february::date': ['date(2023,2,28)', 'date(2024,2,29)', 'date(2025,2,28)'],
'coordinates::point': ['Point([2.3, 48.9])', 'Point([5.4, 43.3])', 'Point([4.9, 45.8])'],
'contact::email': ['john.doe@table.com', 'lisa.minelli@schema.com', 'walter.white@breaking.com']
})
In [3]: df
Out[3]:
end february::date coordinates::point contact::email
0 2023-02-28 POINT (2.3 48.9) john.doe@table.com
1 2024-02-29 POINT (5.4 43.3) lisa.minelli@schema.com
2 2025-02-28 POINT (4.9 45.8) walter.white@breaking.com
JSON representation
In [4]: pprint(df.to_json(orient='table'), compact=True, width=140, sort_dicts=False)
Out[4]:
{'schema': {'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'end february', 'type': 'date'},
{'name': 'coordinates', 'type': 'geopoint', 'format': 'array'},
{'name': 'contact', 'type': 'string', 'format': 'email'}],
'primaryKey': ['index'],
'pandas_version': '1.4.0'},
'data': [{'index': 0, 'end february': '2023-02-28', 'coordinates': [2.3, 48.9], 'contact': 'john.doe@table.com'},
{'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 'lisa.minelli@schema.com'},
{'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]}
Alternative Solutions
The alternative solutions are to add a new pandas attribute ext_type
or to extend the dtype
as NTVtype
.
Those solutions have important impacts.
Additional Context
-> NTV repository : https://github.com/loco-philippe/NTV
-> PDEP-0012