BUG: read_csv fails with uint64 #14983

Closed
@gfyoung

Description

master at aba7d2:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a\n' + str(2**63)
>>>
>>> read_csv(StringIO(data), engine='c').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
>>>
>>> read_csv(StringIO(data), engine='python').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes

We should be able to handle uint64, and tests like this one here should not be enforcing buggy behavior.

The buggy behavior for the C engine traces to here, where we attempt to cast according to this order defined here. Note for starters that uint64 is not in that list. The try-except there catches the OverflowError raised by the int64 cast, after which we immediately convert to an object array of strings. At first, I thought inserting uint64 into the list would be good, but that can cause bad casting in the other direction, i.e. negative numbers get converted to their uint64 equivalents.
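To make the casting hazard concrete, here is a minimal pure-Python sketch. It is not the parser code: `unchecked_uint64`, `checked_uint64`, and `cast_column` are hypothetical names, and `ctypes` only stands in for a raw C cast. It shows how an unchecked uint64 cast wraps negative numbers, and how a range check before the cast avoids that.

```python
import ctypes

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1


def unchecked_uint64(value):
    # ctypes performs no overflow checking, so this mimics a raw C cast:
    # negative inputs silently wrap around modulo 2**64.
    return ctypes.c_uint64(value).value


def checked_uint64(value):
    # Hypothetical guard: refuse anything outside the uint64 range.
    if not 0 <= value <= UINT64_MAX:
        raise OverflowError("value does not fit in uint64: %d" % value)
    return value


def cast_column(tokens):
    """Try int64 first, then uint64, then fall back to the raw strings."""
    values = [int(tok) for tok in tokens]
    if all(INT64_MIN <= v <= INT64_MAX for v in values):
        return "int64", values
    try:
        return "uint64", [checked_uint64(v) for v in values]
    except OverflowError:
        # Mixed signs or values beyond uint64: keep the strings (object).
        return "object", list(tokens)
```

With this ordering, `cast_column([str(2**63)])` reports uint64, while a column mixing `-1` with `2**63` falls back to object instead of wrapping `-1` to `2**64 - 1`.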

The buggy behavior for the Python engine traces to here, where we attempt to infer the dtype here. However, as I pointed out in #14982, this function fails with uint64 because of a similar (and nonsensical) try-except for OverflowError with int64.
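One possible alternative to relying on a try-except for OverflowError is to track sign and magnitude during inference. The sketch below is only an illustration under that assumption: `infer_integer_dtype` is a hypothetical helper, not the function referenced above, and how empty input or mixed columns should behave is exactly the kind of open question raised in #14982.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1


def infer_integer_dtype(values):
    """Single pass over Python objects; returns 'int64', 'uint64' or 'object'."""
    saw_negative = False
    saw_above_int64 = False
    for v in values:
        # Anything that is not a plain integer ends integer inference.
        if isinstance(v, bool) or not isinstance(v, int):
            return "object"
        if v < 0:
            if v < INT64_MIN:
                return "object"
            saw_negative = True
        elif v > INT64_MAX:
            if v > UINT64_MAX:
                return "object"
            saw_above_int64 = True
        # A negative value and a value above int64 cannot share one dtype.
        if saw_negative and saw_above_int64:
            return "object"
    # Assumption: default to int64 when nothing forces uint64 (including
    # empty input).
    return "uint64" if saw_above_int64 else "int64"
```

The point of the two flags is that the decision never depends on an exception from a failed cast, so a value like `2**63` is classified explicitly rather than falling through to object.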

The questions that I posed in #14982 are also relevant here, since the behavior should be consistent across both engines and also performant. Patching the Python engine probably requires fixing #14982 first, and patching the C engine probably requires adding new functions to parser.pyx and tokenizer.c to parse uint64. However, in light of the questions that I posed in #14982, I'm not really sure what is best.

Labels: Bug, Dtype Conversions (unexpected or buggy dtype conversions), IO CSV (read_csv, to_csv)
