Description
master at aba7d2:
```python
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a\n' + str(2**63)
>>>
>>> read_csv(StringIO(data), engine='c').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
>>>
>>> read_csv(StringIO(data), engine='python').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
```
We should be able to handle `uint64`, and tests like this one here should not be enforcing buggy behavior.
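For context, the value in the example fits without loss in an unsigned 64-bit integer; a minimal numpy check (my own illustration, not part of the issue):

```python
import numpy as np

# 2**63 overflows int64 (whose max is 2**63 - 1) but fits in uint64,
# so the column could in principle be parsed as uint64 instead of
# falling back to an object array of strings.
value = 2**63
arr = np.array([value], dtype=np.uint64)
print(arr.dtype)             # uint64
print(int(arr[0]) == value)  # True
```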
The buggy behavior for the C engine traces to here, where we attempt to cast according to this order defined here. Note for starters that `uint64` is not in that list. This `try-except` is due to `OverflowError` with `int64`, after which we immediately convert to an `object` array of strings. At first, I thought inserting `uint64` into the list would be good, but that can cause bad casting in the other direction, i.e. negative numbers get converted to their `uint64` equivalents.
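A quick sketch of that failure mode (my own illustration, not the pandas internals): casting a negative value to `uint64` silently wraps around to a huge positive number, which is why naively adding `uint64` to the cast order is unsafe.

```python
import numpy as np

# Casting -1 to uint64 wraps around modulo 2**64, producing
# 18446744073709551615 (i.e. 2**64 - 1) with no error raised.
wrapped = np.array([-1]).astype(np.uint64)
print(int(wrapped[0]))  # 18446744073709551615
```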
The buggy behavior for the Python engine traces to here, where we attempt to infer the `dtype` here. However, as I pointed out in #14982, this function fails on `uint64` because of a similar (and nonsensical) `try-except` for `OverflowError` with `int64`.
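The `OverflowError` path can be reproduced directly; this is a hypothetical sketch of the pattern, not the actual parser code:

```python
import numpy as np

value = 2**63

# int64 cannot hold 2**63, so the conversion raises OverflowError.
# The current fallback gives up and stores strings in an object
# array, even though uint64 would represent the value exactly.
try:
    as_int = np.int64(value)
except OverflowError:
    as_uint = np.uint64(value)
print(int(as_uint) == value)  # True
```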
The questions that I posed in #14982 are also relevant here, since whatever fix we choose should be consistent across both engines and also performant. Patching the Python engine probably requires fixing #14982 first, and patching the C engine probably requires adding new functions to `parser.pyx` and `tokenizer.c` to parse `uint64`. However, in light of the questions that I posed in #14982, I'm not really sure what is best.