BUG: incorrect reading of CSV containing large integers #52505

Closed
@fingoldo

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

scsv = """#SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL
RIH3,F,S,20230301181139587,1925036343869802844,0,96690.00000,2,,
USDRUBF,F,S,20230301181139587,2023552585717888193,0,75.10000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,1,75.00000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,B,20230301181139587,2023552585717882895,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,B,20230301181139587,2023552585717888161,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889759,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000
"""

# Read the same CSV with each parser engine and count the rows matching one deal ID.
for engine in ("c", "pyarrow", "python"):
    test = StringIO(scsv)
    orders = pd.read_csv(test, engine=engine)
    print(engine, len(orders.query("ID_DEAL == 2023552585717263360")))

Issue Description

Reading this piece of financial data gives badly incorrect results. I get the following output:

c 8
pyarrow 8
python 0

None of the engines parses the data correctly, presumably because the integers are very large. pandas emits no warning and silently corrupts the values. The issue is not new in 2.0; it was present in 1.5 as well.

This appears to be caused by float64 not having enough precision to hold all the digits. int64 would probably be fine, since the previous column (ID) is read as int64, but the ID_DEAL column has missing values and so is read as float64.
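The float64 explanation can be checked without pandas. The following is a minimal sketch in plain Python (whose `float` is an IEEE 754 double, the same representation as float64), using the four ID_DEAL values from the example above. It shows that they all collapse to a single float, which is consistent with the 8 matches reported for the c and pyarrow engines.

```python
# The four distinct ID_DEAL values that appear in the sample CSV.
ids = [
    2023552585717263358,
    2023552585717263359,
    2023552585717263360,
    2023552585717263361,
]

# Convert each integer to a Python float (IEEE 754 double precision).
as_floats = [float(i) for i in ids]

# All four integers round to the same double.
print(len(set(as_floats)))  # 1
print(int(as_floats[0]))    # 2023552585717263360

# Integers are only guaranteed to be exactly representable up to 2**53.
print(all(i > 2**53 for i in ids))  # True
```

Between 2**60 and 2**61 adjacent doubles are 256 apart, so any two IDs in that range that differ by only a few units become indistinguishable once stored as float64.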
Using the new nullable integer type more or less works, but not with pyarrow:
...
orders = pd.read_csv(test, engine=engine,dtype={"ID_DEAL": "Int64"})
...

c 2
pyarrow 8
python 2

But loss of precision should certainly be detected automatically, and either an error should be raised or the Int64 dtype suggested. Otherwise, this bug can have catastrophic consequences in decision support systems powered by pandas.
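Until something like that exists, a pre-check on the user side is possible. The sketch below is only an illustration of the idea, not pandas behaviour: `columns_at_risk` and `FLOAT64_EXACT_INT_LIMIT` are hypothetical names, and the helper uses only the standard `csv` module. It flags columns that contain both empty cells and integers larger than 2**53, on the assumption that such a column would be read as float64 by default.

```python
import csv
from io import StringIO

# Above this magnitude, not every integer is exactly representable in float64.
FLOAT64_EXACT_INT_LIMIT = 2**53


def columns_at_risk(csv_text):
    """Return columns mixing empty cells with integers too large for float64.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    reader = csv.DictReader(StringIO(csv_text))
    has_empty = set()
    has_big_int = set()
    for row in reader:
        for name, cell in row.items():
            if name is None:
                # Extra fields beyond the header; ignore them.
                continue
            if cell is None or cell == "":
                has_empty.add(name)
                continue
            try:
                value = int(cell)
            except ValueError:
                # Not an integer literal (e.g. "75.00000" or "USDRUBF").
                continue
            if abs(value) > FLOAT64_EXACT_INT_LIMIT:
                has_big_int.add(name)
    # Preserve header order in the result.
    return [
        name
        for name in (reader.fieldnames or [])
        if name in has_empty and name in has_big_int
    ]


sample = (
    "ID,ID_DEAL,PRICE\n"
    "2023552585717889863,,75.00000\n"
    "2023552585717889863,2023552585717263358,75.00000\n"
)
print(columns_at_risk(sample))  # ['ID_DEAL']
```

Columns reported this way are candidates for an explicit `dtype={"<column>": "Int64"}` in `read_csv`, as in the workaround above. ID is not reported because it has no empty cells.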

Expected Behavior

c 2
pyarrow 2
python 2

Installed Versions

INSTALLED VERSIONS
------------------
commit : 478d340
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 45 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Russian_Russia.1251
