ENH: Add option for DataFrame.compare and Series.compare to overlook differences involving missing values  #59390

Closed
@qmarcou

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I'd like to use the compare() method to extract only contradictory data between datasets, not all changes in the dataset.
For instance:

import pandas as pd
a = pd.Series([1, pd.NA, pd.NA])
b = pd.Series([2, "zz", pd.NA])
a.compare(b)
Out[3]: 
   self other
0     1     2
1  <NA>    zz

While the first row is a clear-cut difference/contradiction (used to be 1 and is now 2), the second row is more ambiguous and could be interpreted either as:

  • used to be a pd.NA object and is now a str with a set value (current behavior, completely sound as a default imo)
  • used to be something we didn't know but might as well have been "zz", and is now "zz", hence no difference reported (desired optional feature)
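
The two readings can be told apart mechanically with the existing API. As a minimal sketch (the variable names `involves_na`, `contradictions`, and `ambiguous` are illustrative only, not part of pandas), the output of compare() can be split into rows where both sides hold a value and rows where exactly one side is missing:

```python
import pandas as pd

a = pd.Series([1, pd.NA, pd.NA])
b = pd.Series([2, "zz", pd.NA])

# compare() already drops rows where both sides are missing,
# so every remaining row has at least one real value.
diff = a.compare(b)

# True where exactly one side of a reported difference is missing.
involves_na = diff["self"].isna() ^ diff["other"].isna()

contradictions = diff[~involves_na]  # both sides known and different
ambiguous = diff[involves_na]        # one side unknown
```

Here `contradictions` keeps only the row at index 0, while `ambiguous` holds the row at index 1 that the requested option would suppress.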

This behavior was not discussed in #30852 or in its source issue #30429.

Feature Description

Add a new keyword argument to compare, such as skipna (the name already used by aggregation functions), such that:

a.compare(b, skipna=False)
Out: 
   self other
0     1     2
1  <NA>    zz
a.compare(b, skipna=True)
Out: 
   self other
0     1     2

skipna=False would be the default value, matching the current behavior, which is stricter and safer and therefore the sounder default.

Alternative Solutions

I can get the desired output by first erasing NA-linked differences using combine_first as a workaround:

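def skipna_compare(a: pd.Series | pd.DataFrame,
                   b: pd.Series | pd.DataFrame
                   ) -> pd.Series | pd.DataFrame:
    # Fill each side's missing values from the other side,
    # so that NA-vs-value pairs become equal and are not reported.
    filled_a = a.combine_first(b)
    filled_b = b.combine_first(a)
    return filled_a.compare(filled_b)

skipna_compare(a, b)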
Out: 
  self other
0    1     2

But I don't think this is really efficient or satisfactory.
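
Another possible workaround, sketched here on the assumption that both inputs are identically labeled (which compare() already requires), masks every position where either side is missing before comparing. The helper name `compare_ignoring_na` is hypothetical and not a pandas API. Note that where() inserts NaN at the masked positions, so it may upcast integer columns to float.

```python
import pandas as pd


def compare_ignoring_na(a, b):
    """Hypothetical helper: report only differences where both values are present.

    Works on two identically labeled Series or two identically labeled DataFrames.
    """
    # True only at positions where neither side is missing.
    both_present = a.notna() & b.notna()
    # Blank out every other position on both sides; compare() treats
    # positions that are missing on both sides as equal.
    return a.where(both_present).compare(b.where(both_present))


a = pd.Series([1, pd.NA, pd.NA])
b = pd.Series([2, "zz", pd.NA])
result = compare_ignoring_na(a, b)
```

For the example above, `result` contains only the row at index 0 (1 versus 2). Whether this is any faster than the combine_first version has not been measured here.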

Additional Context

Thanks for the amazing work!

Metadata

Assignees

No one assigned

    Labels

    Closing Candidate (May be closeable, needs more eyeballs)
    Enhancement
    Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
