BUG: With cache, to_datetime() returns pd.NaT for inputs that produce duplicated values #42259

Closed
@zyc09

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

from pandas import NaT, to_datetime, Series, Timestamp
from pandas.testing import assert_series_equal  # public testing API

input_ser = Series([None] + [NaT] * 50 + ["2012 July 26", Timestamp("2012-07-26")], dtype="object")
expected_ser = Series([NaT] * 51 + [Timestamp("2012-07-26"), Timestamp("2012-07-26")], dtype="datetime64[ns]")
result = to_datetime(input_ser)
assert_series_equal(result, expected_ser)
# AssertionError: numpy array are different

# The actual result is [NaT] * 51 + [Timestamp("2012-07-26"), NaT],
# i.e. the assertion below passes:
# assert_series_equal(result, Series([NaT] * 51 + [Timestamp("2012-07-26"), NaT], dtype="datetime64[ns]"))

Problem description

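When the cache is used, to_datetime currently returns NaT for some valid inputs, silently dropping data. This happens when distinct inputs parse to the same value, as with "2012 July 26" and Timestamp("2012-07-26") above. The cause is a slightly erroneous deduplication step that was added to fix GH#39882 and GH#35888.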
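To make the failure mode concrete, here is a toy model in plain Python. It is not pandas' actual implementation: `parse`, `build_cache`, `convert`, and the `dedupe_on_parsed_value` flag are hypothetical names, and `None` stands in for NaT. The sketch assumes the faulty step removes cache entries whose parsed value repeats, instead of only removing repeated input keys.

```python
from datetime import datetime


def parse(value):
    """Toy parser: None and datetimes pass through, strings use one fixed format."""
    if value is None or isinstance(value, datetime):
        return value
    return datetime.strptime(value, "%Y %B %d")


def build_cache(values, dedupe_on_parsed_value):
    # Map each distinct input to its parsed result.
    cache = {}
    for v in values:
        cache.setdefault(v, parse(v))
    if dedupe_on_parsed_value:
        # Hypothetical faulty step: keep only the first key per parsed value,
        # so a second input that parses to the same value loses its entry.
        seen, pruned = set(), {}
        for key, parsed in cache.items():
            if parsed not in seen:
                seen.add(parsed)
                pruned[key] = parsed
        cache = pruned
    return cache


def convert(values, dedupe_on_parsed_value):
    cache = build_cache(values, dedupe_on_parsed_value)
    # Inputs missing from the cache fall back to None (the stand-in for NaT).
    return [cache.get(v) for v in values]


inputs = [None] * 3 + ["2012 July 26", datetime(2012, 7, 26)]
buggy = convert(inputs, dedupe_on_parsed_value=True)
fixed = convert(inputs, dedupe_on_parsed_value=False)
print(buggy[-2:])  # the string parses, the datetime is lost
print(fixed[-2:])  # both entries survive
```

In this model the string and the datetime are different cache keys that share one parsed value, which mirrors the pattern in the report: the string entry is parsed and the Timestamp entry comes back missing.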

Expected Output

to_datetime should parse every datetime correctly.
E.g. it should return Series([NaT] * 51 + [Timestamp("2012-07-26"), Timestamp("2012-07-26")], dtype="datetime64[ns]") in the example above.
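Since the title ties the problem to the cache, a possible workaround on an affected version is to pass `cache=False`. This is an assumption that follows from the title, not something confirmed in the report, and it may be slower for inputs with many repeated values. A minimal sketch:

```python
import pandas as pd

input_ser = pd.Series(
    [None] + [pd.NaT] * 50 + ["2012 July 26", pd.Timestamp("2012-07-26")],
    dtype="object",
)

# cache=False skips the unique-value cache that the title implicates.
uncached = pd.to_datetime(input_ser, cache=False)

# Count the entries that came back as NaT and look at the last two values.
n_missing = int(uncached.isna().sum())
last_two = uncached.iloc[-2:].tolist()
print(n_missing, last_two)
```

If the workaround applies, only the 51 null-like inputs should be NaT and both trailing entries should be the same Timestamp.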

Output of pd.show_versions()

pandas : 1.4.0.dev0+108.gfa6b96e128
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 57.0.0
Cython : 0.29.23
pytest : 6.2.4
hypothesis : 6.14.0
sphinx : 4.0.2
blosc : 1.10.4
feather : None
xlsxwriter : 1.4.3
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : 0.6.3
gcsfs : 2021.05.0
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : 2021.05.0
scipy : 1.7.0
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

Labels

Bug, Datetime (Datetime data dtype), Regression (Functionality that used to work in a prior pandas version)
