Description
Code Sample, a copy-pastable example if possible
This is an excerpt from tests/frame/test_sorting.py:
from pandas import (DataFrame, Timestamp, NaT)
from pandas.util.testing import assert_frame_equal
# test_sort_nat_values_in_int_column(self):
# GH 14922: "sorting with large float and multiple columns incorrect"
float_values = (2.0, -1.797693e308)
# and now check if NaT is still considered as "na" for datetime64
# columns:
df = DataFrame(dict(datetime=[Timestamp("2016-01-01"), NaT],
                    float=float_values), columns=["datetime", "float"])
df_reversed = DataFrame(dict(datetime=[NaT, Timestamp("2016-01-01")],
                             float=float_values[::-1]),
                        columns=["datetime", "float"],
                        index=[1, 0])
df_sorted = df.sort_values(["datetime", "float"], na_position="first")
assert_frame_equal(df_sorted, df_reversed)
df_sorted = df.sort_values(["datetime", "float"], na_position="last")
assert_frame_equal(df_sorted, df_reversed)
Output
Traceback (most recent call last):
File "/home/mficek/repos/pandas-mficek/pandas/tests/frame/bugreport.py", line 24, in <module>
assert_frame_equal(df_sorted, df_reversed)
File "/home/mficek/repos/pandas-mficek/pandas/util/testing.py", line 1363, in assert_frame_equal
obj='{0}.index'.format(obj))
File "/home/mficek/repos/pandas-mficek/pandas/util/testing.py", line 939, in assert_index_equal
obj=obj, lobj=left, robj=right)
File "pandas/_libs/testing.pyx", line 59, in pandas._libs.testing.assert_almost_equal
File "pandas/_libs/testing.pyx", line 173, in pandas._libs.testing.assert_almost_equal
File "/home/mficek/repos/pandas-mficek/pandas/util/testing.py", line 1102, in raise_assert_detail
raise AssertionError(msg)
AssertionError: DataFrame.index are different
DataFrame.index values are different (100.0 %)
[left]: Int64Index([0, 1], dtype='int64')
[right]: Int64Index([1, 0], dtype='int64')
Problem description
I'm working on #17111, trying to change the sorting algorithm for better performance. I noticed that, according to the tests, sorting a column in which NaT is interpreted as an int should ignore the na_position argument of sort_values. This does not hold for the datetime64 dtype, where NaT should be treated as a not-a-number (missing) value.
Therefore the output above is "expected" to me, because I think sort_values simply does not consider NaT in a datetime64 column to be NaN.
In my opinion, the test suite should be changed as follows:
df_sorted = df.sort_values(["datetime", "float"], na_position="last")
assert_frame_equal(df_sorted, df)
This means, however, that version 0.20.3 would not pass.
Am I wrong? Am I missing something here?
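To make the question easier to check, here is a minimal sketch that sorts the same two-row frame with both na_position values and records the resulting index order. It only assumes that pandas is importable as pd; the names first_index and last_index are illustrative and do not come from the test suite.

```python
import pandas as pd

# Same data as in the excerpt: one real timestamp and one NaT,
# paired with a small and a very large-magnitude float.
df = pd.DataFrame(
    {
        "datetime": [pd.Timestamp("2016-01-01"), pd.NaT],
        "float": [2.0, -1.797693e308],
    },
    columns=["datetime", "float"],
)

# Sort on both columns, changing only where missing values are placed.
first = df.sort_values(["datetime", "float"], na_position="first")
last = df.sort_values(["datetime", "float"], na_position="last")

# Record the index order so the two results can be compared directly.
first_index = list(first.index)
last_index = list(last.index)

print("na_position='first':", first_index)
print("na_position='last': ", last_index)
```

If NaT is treated as missing in the datetime64 column, the two orders should differ: the NaT row (index 1) comes first with na_position="first" and last with na_position="last". The traceback above shows index [0, 1] for the "last" case on 0.20.3.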
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-87-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None