BUG: setting dtype='timestamp[ns][pyarrow]' removes precision from strings #49767

Open
@MarcoGorelli

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime as dt

import pandas as pd  # pyarrow must be installed for the 'timestamp[ns][pyarrow]' dtype

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
print(ser)
print()
ser = pd.Series(data, dtype='datetime64[ns]')
print(ser)

Issue Description

The above outputs:

0    2020-01-01 01:01:01.000001
1           1999-01-01 00:00:00
dtype: timestamp[ns][pyarrow]

0   2020-01-01 01:01:01.000001
1   1999-01-01 00:00:00.000000
dtype: datetime64[ns]

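Note how, in the pyarrow case, the two rows are formatted differently: the first keeps its fractional seconds, while the second drops them entirely. In the datetime64[ns] case, both rows are padded to the same precision.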

This causes issues when parsing with to_datetime and specifying format=:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: string_data = ser.to_csv(index=False, header=None).splitlines()
   ...: pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [2], line 3
      1 ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
      2 string_data = ser.to_csv(index=False, header=None).splitlines()
----> 3 pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1110, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1108         result = _convert_and_box_cache(argc, cache_array)
   1109     else:
-> 1110         result = convert_listlike(argc, format)
   1111 else:
   1112     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:436, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    433         return res
    435 utc = tz == "utc"
--> 436 result, tz_parsed = objects_to_datetime64ns(
    437     arg,
    438     dayfirst=dayfirst,
    439     yearfirst=yearfirst,
    440     utc=utc,
    441     errors=errors,
    442     require_iso8601=require_iso8601,
    443     allow_object=True,
    444     format=format,
    445     exact=exact,
    446 )
    448 if tz_parsed is not None:
    449     # We can take a shortcut since the datetime64 numpy array
    450     # is in UTC
    451     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2144, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2142 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2143 try:
-> 2144     result, tz_parsed = tslib.array_to_datetime(
   2145         data.ravel("K"),
   2146         errors=errors,
   2147         utc=utc,
   2148         dayfirst=dayfirst,
   2149         yearfirst=yearfirst,
   2150         require_iso8601=require_iso8601,
   2151         format=format,
   2152         exact=exact,
   2153     )
   2154     result = result.reshape(data.shape, order=order)
   2155 except OverflowError as err:
   2156     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:442, in pandas._libs.tslib.array_to_datetime()
    440 @cython.wraparound(False)
    441 @cython.boundscheck(False)
--> 442 cpdef array_to_datetime(
    443     ndarray[object] values,
    444     str errors='raise',

File ~/pandas-dev/pandas/_libs/tslib.pyx:622, in pandas._libs.tslib.array_to_datetime()
    620     continue
    621 elif is_raise:
--> 622     raise ValueError(
    623         f"time data \"{val}\" at position {i} doesn't "
    624         f"match format \"{format}\""

ValueError: time data "1999-01-01 00:00:00" at position 1 doesn't match format "%Y-%m-%d %H:%M:%S.%f"
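
For illustration, the same mismatch can be shown with the standard library alone, together with one possible workaround on the parsing side. This is only a sketch: `pad_fractional_seconds` is a hypothetical helper (not part of pandas or pyarrow), and it assumes the strings look exactly like the two rows above, so a "." only ever introduces the fractional seconds.

```python
import datetime as dt

FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def pad_fractional_seconds(text: str) -> str:
    """Hypothetical helper: append '.000000' when no fractional part is present."""
    return text if "." in text else text + ".000000"


# The two strings as they are written out for the pyarrow-backed Series.
rows = ["2020-01-01 01:01:01.000001", "1999-01-01 00:00:00"]

# The second row has no fractional part, so a strict format rejects it.
try:
    dt.datetime.strptime(rows[1], FORMAT)
    strict_parse_failed = False
except ValueError:
    strict_parse_failed = True

# After padding, every row matches the format.
parsed = [dt.datetime.strptime(pad_fractional_seconds(row), FORMAT) for row in rows]
```

Padding the strings only hides the symptom on the reading side; the inconsistency in how the values are written out is still there.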

Expected Behavior

If it printed

0    2020-01-01 01:01:01.000001
1    1999-01-01 00:00:00.000000
dtype: timestamp[ns][pyarrow]

, just like it does for datetime64[ns], then the issue would be solved.
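
As a rough sketch of the property being asked for, here is a standard-library analogy (not the pandas or pyarrow formatting code): when every value is written with the same fixed-width fractional part, a single format string can read all of them back.

```python
import datetime as dt

FORMAT = "%Y-%m-%d %H:%M:%S.%f"

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

# strftime's %f always emits six digits, so both rows get the same precision.
formatted = [value.strftime(FORMAT) for value in data]

# Because the precision is uniform, one format parses every row.
round_tripped = [dt.datetime.strptime(text, FORMAT) for text in formatted]
```

Here `formatted` holds "2020-01-01 01:01:01.000001" and "1999-01-01 00:00:00.000000", matching the expected output above.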

Installed Versions

INSTALLED VERSIONS

commit : 8020bf1
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+697.g8020bf1b25.dirty
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.56.4
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.3
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.43
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

Labels

Arrow (pyarrow functionality), Bug, ExtensionArray (Extending pandas with custom dtypes or arrays.), datetime.date (stdlib datetime.date support)