
BUG: to_json()/read_json() can't correctly dump/load numbers requiring >15 digits of precision #38437

Closed
@mjuric

Description

Code Sample, a copy-pastable example

Demonstration of the serialization issue:

import pandas as pd

df = pd.DataFrame([0.9884619112598676])
js = df.to_json(double_precision=15)   # raises exception if double_precision>=16 is set

print(f"orig:           { df[0][0]} ({ df[0].dtypes})")
print(f"JSON: {js}")

Output:

orig:           0.9884619112598676 (float64)
JSON: {"0":{"0":0.988461911259868}}

Demonstration that deserialization silently disregards the last digit:

import numpy as np
import pandas as pd

js = '{"0":{"0":0.9884619112598676}}'
df = pd.read_json(js)
flt = np.float64("0.9884619112598676")
print(f"  JSON: {js}")
print(f" numpy:           {flt}")
print(f"Pandas:           {df[0][0]}  ({df[0].dtypes})")

Output:

  JSON: {"0":{"0":0.9884619112598676}}
 numpy:           0.9884619112598676
Pandas:           0.988461911259867  (float64)

Problem description

64-bit floating-point numbers require up to 17 significant decimal digits to round-trip through a textual representation and back (e.g., see https://stackoverflow.com/questions/6118231/why-do-i-need-17-significant-digits-and-not-16-to-represent-a-double/). pandas' vendored ujson encoder caps double_precision at 15, and its decoder likewise keeps only 15 digits, so precision is lost in both directions. As a result, a DataFrame transmitted from point A to point B via JSON no longer compares equal to the original (in our case, this cropped up while validating a REST API for a near-Earth asteroid orbit computation service).
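
For illustration (a minimal check, not part of the original report), rendering a float64 with 15, 16, and 17 significant digits and parsing it back shows exactly where the round trip breaks:

import numpy as np

x = np.float64(0.9884619112598676)

# Format x with a given number of significant digits, then parse it back.
for digits in (15, 16, 17):
    text = f"{x:.{digits - 1}e}"
    print(digits, text, np.float64(text) == x)

# 15 digits lose this value; 17 are always sufficient for any float64
# (16 happen to suffice for this particular number, but not in general).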

I traced this down to an old version of ultrajsonenc.c that was vendored into the pandas codebase and enforces this cutoff. Modern versions of ujson don't seem to have this limitation (and do away with the double_precision argument to ujson.dumps altogether) -- e.g., see the upstream ultrajson repository.
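
As an interim workaround (a sketch, untested beyond this example), serialization can be routed through the standard library's json module, which formats floats with repr() and therefore round-trips float64 values exactly:

import json

import pandas as pd

df = pd.DataFrame([0.9884619112598676])

# Serialize via the stdlib json module instead of to_json()/read_json();
# json formats floats with repr(), preserving full float64 precision.
js = json.dumps(df.to_dict())
df2 = pd.DataFrame.from_dict(json.loads(js))

print(df2["0"]["0"] == df[0][0])   # True -- no precision lost

(json.dumps stringifies the integer keys, hence the "0" labels on the way back.)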

Expected Output

Modern ujson seems to handle this fine, keeping the required precision:

import ujson
print(ujson.__version__)
print(ujson.dumps(0.9884619112598676))

>>> 4.0.1
>>> 0.9884619112598676
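
And as a quick sanity check (added here for completeness, using the same ujson as above), the full round trip appears lossless:

assert ujson.loads(ujson.dumps(0.9884619112598676)) == 0.9884619112598676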

A solution may be to update the ujson version shipped with pandas.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.8.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.1
setuptools : 49.6.0.post20201009
Cython : None
pytest : 6.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None
