Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
  The latest pandas version my system installs (via anaconda update pandas) is 0.25.3.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df1 = pd.DataFrame(
    [
        [4, 5, 6],
        [5, 4, 6],
        [4, 5, 6],
        [3, 7, 5],
        [5, 9, 0],
        [1, 2, 3],
    ],
    columns=["a", "b", "c"],
)
df2 = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
        [1, 6, 0],
    ],
    columns=["a", "b", "c"],
)

df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
ismi = df1mi.intersection(df2mi)  # intersection of df1 and df2
# df1mi, _ = df1mi.sortlevel(sort_remaining=True)
complement = df1mi.drop(ismi).to_frame().reset_index(drop=True)

print(df1, "\n")
print(df2, "\n")
print(complement)
Problem description
The code above computes the multiset complement A - B of two multisets A and B (the only difference from the set complement is that duplicates in A that are not in B are preserved).
Without the commented-out line, pandas issues a warning (sys:1: PerformanceWarning: indexing past lexsort depth may impact performance.). More importantly, the result is wrong: row 1 of df1 is missing from the complement.
With that line uncommented, no warning is issued and the result is correct.
The bug is that drop returns a wrong result when the MultiIndex has not been sorted first.
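The difference between the two cases can be made visible by inspecting whether the index is sorted. The following is a minimal sketch, assuming that `MultiIndex.is_monotonic_increasing` is a reasonable proxy for the lexsorted state the warning refers to; the name `sorted_mi` is introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[4, 5, 6], [5, 4, 6], [4, 5, 6], [3, 7, 5], [5, 9, 0], [1, 2, 3]],
    columns=["a", "b", "c"],
)

mi = pd.MultiIndex.from_frame(df1)

# sortlevel returns a tuple: (sorted index, indexer into the original).
sorted_mi, _ = mi.sortlevel(sort_remaining=True)

# The index built directly from df1 keeps df1's row order, so it is not sorted;
# the index returned by sortlevel is.
print("as built:", mi.is_monotonic_increasing)
print("after sortlevel:", sorted_mi.is_monotonic_increasing)
```

If the first value is False and the second is True, the wrong result coincides with the unsorted state rather than with the data itself.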
Expected Output
Correct output (with explicit sort):
   a  b  c
0  3  7  5
1  5  4  6
2  5  9  0
Wrong output (without explicit sort):
   a  b  c
0  3  7  5
1  5  9  0
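The expected rows can be cross-checked without going through MultiIndex.drop at all. The sketch below uses a left merge with `indicator=True`; it is not the method from the report, only an independent check. It assumes both frames share the same columns, deduplicates df2 so the merge cannot multiply rows of df1, and keeps the rows in df1's original order rather than in sorted order.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[4, 5, 6], [5, 4, 6], [4, 5, 6], [3, 7, 5], [5, 9, 0], [1, 2, 3]],
    columns=["a", "b", "c"],
)
df2 = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [1, 6, 0]],
    columns=["a", "b", "c"],
)

# Left merge on all shared columns; the _merge column records whether each
# row of df1 found a match in df2 ("both") or not ("left_only").
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep only the rows of df1 that have no counterpart in df2.
complement = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(complement)
```

This should yield the same three rows as the correct output above, including the (5, 4, 6) row that goes missing without the explicit sort.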
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None