BUG: pd.Series.duplicated(keep='first'|'last') returns multiple duplicates #59333

Closed
@JMBurley

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
for keep_val in ['first','last']:
    print(f"{keep_val = }")
    series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])
    # Identify duplicates (erroneously finds 5 twice)
    mask = series.duplicated(keep=keep_val)
    print(series[mask])

    data = pd.Series(['1', '2', '2', '3', '4', '4', '5', '5', '5'])
    # Identify duplicates in the string series (erroneously finds '5' twice)
    mask = data.duplicated(keep=keep_val)
    print(data[mask])

Issue Description

keep='first'|'last' should return only one instance of each duplicated value.

In the above examples it returns ['2', '4', '5', '5'] not ['2', '4', '5'].

keep='last' returns the '5' at indices 6 and 7
keep='first' returns the '5' at indices 7 and 8
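The indices quoted above can be checked directly. The following is a minimal sketch using the same sample series; `flagged` is an illustrative name, not something from the report. Note that the pandas documentation describes `keep='first'` as marking every occurrence except the first as a duplicate (and `keep='last'` every occurrence except the last), so a value that appears three times is flagged twice.

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])

# Collect the index labels that duplicated() flags under each keep setting.
flagged = {
    keep: series[series.duplicated(keep=keep)].index.tolist()
    for keep in ("first", "last", False)
}

for keep, labels in flagged.items():
    print(f"keep={keep!r}: {labels}")
# keep='first': [2, 5, 7, 8]
# keep='last': [1, 4, 6, 7]
# keep=False: [1, 2, 4, 5, 6, 7, 8]
```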

Expected Behavior

series.duplicated(keep='first'|'last') should return only one instance of each duplicated value.
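For readers who want exactly one row per duplicated value, one possible approach (a sketch, not necessarily what the maintainers would recommend) is to chain `drop_duplicates()` onto the masked result. The variable name `repeated` is illustrative.

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])

# Select the repeat occurrences, then collapse them to one row per value.
repeated = series[series.duplicated(keep="first")].drop_duplicates()

print(repeated.tolist())        # [2, 4, 5]
print(repeated.index.tolist())  # [2, 5, 7]
```

This keeps the index label of the first repeat of each value; passing `keep="last"` to both calls would keep a different occurrence of each value instead.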

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2e
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.25.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.1.2
pip : 23.2.1
Cython : 3.0.0a10
pytest : 7.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.9.0
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2023.9.0
scipy : 1.11.2
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Metadata

Labels

Bug
Needs Triage: Issue that has not been reviewed by a pandas team member
