REGR: outer merge resulting in missing datetime values converts to int #31157

Closed
@vadella

Description

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    {
        "x": [0, 1],
        "date": pd.to_datetime(["2020-01-05T12:00", "2020-01-05T13:00"]),
    }
)
df2 = pd.DataFrame({"x": [0, 2], "y": [0, 1]})

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   x       3 non-null      int64  
     1   date    3 non-null      object 
     2   y       2 non-null      float64
    dtypes: float64(1), int64(1), object(1)
    memory usage: 96.0+ bytes
   x   date                 y
0  0   1578225600000000000  0.0
1  1   1578229200000000000  NaN
2  2  -9223372036854775808  1.0
merged["date"].map(type)
    0    <class 'int'>
    1    <class 'int'>
    2    <class 'int'>
    Name: date, dtype: object
print(pd.merge(df, df2, on="x", how="inner"))
   x                date  y
0  0 2020-01-05 12:00:00  0

Problem description

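An outer merge of two DataFrames that produces missing values in a datetime column silently converts that column to integers with object dtype, instead of keeping datetime64[ns]. The missing entry becomes -9223372036854775808, which is the integer representation of NaT. pandas 0.25.3 kept datetime64[ns] with NaT for the missing row, which is the behaviour I would expect.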

Expected Output

merged = pd.merge(df, df2, on="x", how="outer")
merged.info()
print(merged)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 3 columns):
x       3 non-null int64
date    2 non-null datetime64[ns]
y       2 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 96.0 bytes
   x                date    y
0  0 2020-01-05 12:00:00  0.0
1  1 2020-01-05 13:00:00  NaN
2  2                 NaT  1.0
merged["date"].map(type)
    0    <class 'pandas._libs.tslibs.timestamps.Timesta...
    1    <class 'pandas._libs.tslibs.timestamps.Timesta...
    2        <class 'pandas._libs.tslibs.nattype.NaTType'>
    Name: date, dtype: object

This is the output from pandas 0.25.3.

Whether the y column should be converted to float or Int64 is open for debate, but that is not the issue at hand here.
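
For anyone who wants to check an installed version against the expected output above, here is a minimal sketch of such a check. It reuses the frames from the code sample; the helper name `datetime_column_survived` is made up for illustration and is not part of pandas. It only tests that the column is still datetime-typed and that the unmatched row holds a real missing value, so it should report True on versions with the expected behaviour and False on an affected one.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Same frames as in the code sample of this report.
df = pd.DataFrame(
    {
        "x": [0, 1],
        "date": pd.to_datetime(["2020-01-05T12:00", "2020-01-05T13:00"]),
    }
)
df2 = pd.DataFrame({"x": [0, 2], "y": [0, 1]})

merged = pd.merge(df, df2, on="x", how="outer")


def datetime_column_survived(frame, column):
    """Hypothetical helper (not a pandas API).

    Returns True if `column` still has a datetime dtype and contains at
    least one missing value, i.e. the unmatched row was filled with NaT
    rather than with an integer.
    """
    col = frame[column]
    return bool(is_datetime64_any_dtype(col) and col.isna().any())


print(merged["date"].dtype)
print(datetime_column_survived(merged, "date"))
```

The check uses `is_datetime64_any_dtype` rather than comparing against the string `datetime64[ns]`, so it does not depend on the resolution a given pandas version picks when parsing the strings.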

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0rc0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

Labels: Regression (functionality that used to work in a prior pandas version)