Description
Code Sample, a copy-pastable example
Demonstration of the serialization issue:
import pandas as pd
df = pd.DataFrame([0.9884619112598676])
js = df.to_json(double_precision=15) # raises exception if double_precision>=16 is set
print(f"orig: {df[0][0]} ({df[0].dtypes})")
print(f"JSON: {js}")
Output:
orig: 0.9884619112598676 (float64)
JSON: {"0":{"0":0.988461911259868}}
Demonstration that deserialization silently disregards the last digit:
import numpy as np
import pandas as pd
js = '{"0":{"0":0.9884619112598676}}'
df = pd.read_json(js)
flt = np.float64("0.9884619112598676")
print(f" JSON: {js}")
print(f" numpy: {flt}")
print(f"Pandas: {df[0][0]} ({df[0].dtypes})")
Output:
JSON: {"0":{"0":0.9884619112598676}}
numpy: 0.9884619112598676
Pandas: 0.988461911259867 (float64)
Problem description
A 64-bit floating point number needs up to 17 significant decimal digits to round-trip through a textual representation without loss (e.g., see https://stackoverflow.com/questions/6118231/why-do-i-need-17-significant-digits-and-not-16-to-represent-a-double/). Pandas' ujson-based encoder and decoder cut values off at 15 digits, causing loss of precision. This introduces inconsistencies depending on how (and whether) a pandas DataFrame is serialized on its way from point A to point B (in our case, the issue cropped up while validating a REST API for a near-Earth asteroid orbit computation service).
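To make the digit-count claim concrete without involving pandas at all, here is a minimal sketch using only Python's built-in float formatting. The helper name round_trips is mine, introduced just for illustration; it formats the value with a given number of significant digits, parses the text back, and compares exactly.

```python
x = 0.9884619112598676  # the value from the report above


def round_trips(value: float, digits: int) -> bool:
    """Format with `digits` significant digits, parse back, compare exactly."""
    # round_trips is a hypothetical helper, not part of pandas or ujson.
    return float(format(value, f".{digits}g")) == value


for digits in (15, 16, 17):
    # Show the text produced at each precision and whether it survives the trip.
    print(digits, format(x, f".{digits}g"), round_trips(x, digits))
```

For this value, 15 significant digits do not reproduce the original float, while 17 do, which is the same loss that shows up in the to_json and read_json output above.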
I traced this down to an old version of ultrajsonenc.c that was imported into the Pandas code and forces this cutoff. Modern versions don't seem to have this limitation (and do away with the double_precision argument to ujson.dump altogether) -- e.g., see here.
Expected Output
Modern ujson seems to handle this fine, keeping the required precision:
import ujson
print(ujson.__version__)
print(ujson.dumps(0.9884619112598676))
>>> 4.0.1
>>> 0.9884619112598676
A solution may be to update the version shipped with Pandas.
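Until that happens, a possible stopgap is to go through the standard library's json module, which writes floats using their repr and therefore keeps enough digits to round-trip. The sketch below is only an illustration under that assumption: it uses a plain nested dict in the same column -> index -> value layout as the to_json output above rather than a DataFrame, and it is not a fix for pandas itself.

```python
import json

# Same layout as the to_json output above: {column: {index: value}}.
# A plain dict stands in for the DataFrame here (an assumption for illustration).
payload = {"0": {"0": 0.9884619112598676}}

text = json.dumps(payload)   # floats are written via repr(), keeping all needed digits
restored = json.loads(text)  # parsed back with Python's own float parser

print(text)
print(restored["0"]["0"] == payload["0"]["0"])
```

How the dict is converted to and from a DataFrame is left out on purpose, since that step depends on the index and column types in use.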
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.8.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.1
setuptools : 49.6.0.post20201009
Cython : None
pytest : 6.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None