Description
master at aba7d2:
```python
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a\n' + str(2**63)
>>>
>>> read_csv(StringIO(data), engine='c').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
>>>
>>> read_csv(StringIO(data), engine='python').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
```
We should be able to handle `uint64`, and tests like this one here should not be enforcing buggy behavior.
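For context, the value in the example fits without loss in an unsigned 64-bit integer; a minimal numpy check (my own illustration, not part of the issue):

```python
import numpy as np

# 2**63 overflows int64 (whose max is 2**63 - 1) but fits in uint64,
# so the column could in principle be parsed as uint64 instead of
# falling back to an object array of strings.
value = 2**63
arr = np.array([value], dtype=np.uint64)
print(arr.dtype)             # uint64
print(int(arr[0]) == value)  # True
```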
The buggy behavior for the C engine traces to here, where we attempt to cast according to this order defined here. Note for starters that `uint64` is not in that list. This `try-except` is due to `OverflowError` with `int64`, after which we immediately convert to an `object` array of strings. At first, I thought inserting `uint64` into the list would be good, but that can cause bad casting in the other direction, i.e. negative numbers get converted to their `uint64` equivalents.
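A quick sketch of that failure mode (my own illustration, not the pandas internals): casting a negative value to `uint64` silently wraps around to a huge positive number, which is why naively adding `uint64` to the cast order is unsafe.

```python
import numpy as np

# Casting -1 to uint64 wraps around modulo 2**64, producing
# 18446744073709551615 (i.e. 2**64 - 1) with no error raised.
wrapped = np.array([-1]).astype(np.uint64)
print(int(wrapped[0]))  # 18446744073709551615
```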
The buggy behavior for the Python engine traces to here, where we attempt to infer the `dtype` here. However, as I pointed out in #14982, this function fails on `uint64` because of a similar (and nonsensical) `try-except` for `OverflowError` with `int64`.
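The `OverflowError` path can be reproduced directly; this is a hypothetical sketch of the pattern, not the actual parser code:

```python
import numpy as np

value = 2**63

# int64 cannot hold 2**63, so the conversion raises OverflowError.
# The current fallback gives up and stores strings in an object
# array, even though uint64 would represent the value exactly.
try:
    as_int = np.int64(value)
except OverflowError:
    as_uint = np.uint64(value)
print(int(as_uint) == value)  # True
```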
The questions that I posed in #14982 are also relevant here, since whatever fix we choose should be consistent across both engines and also performant. Patching the Python engine probably requires fixing #14982 first, and patching the C engine probably requires adding new functions to `parser.pyx` and `tokenizer.c` to parse `uint64`. However, in light of the questions that I posed in #14982, I'm not really sure what is best.