Skip to content

ENH: add percentage threshold to DataFrame.dropna #35299

Closed
@erfannariman

Description

@erfannariman

Is your feature request related to a problem?
When doing feature selection for models, or with handling missing data, in most cases you want to dynamically set a threshold to drop column which exceed this. A percentage is a more general metric to use to see how many data is missing data per row or per column. Right now we have the thresh argument, which accepts an int, but in most cases this is not desirable. For example when running the code in production, the shape of the data changes over time and thus a percentage threshold would make more sense to me. Or when you don't know the data well enough, what does "a number" mean in this case. Percentage is standarized.

Besides that, the thresh right now is not flexible to go over column or index axis. For data cleaning you might want to drop rows which exceed the threshold, but for feature selection, you want to treat each column separately.

Describe the solution you'd like
Have a percentage threshold for both column and index axis.

API breaking implications
Not that I can think of, but not 100% sure.

Describe alternatives you've considered
Writing own custom functions.

Additional context

Locally I created a version where the perc argument is created (see linked draft PR):

example 1:

>>> df = pd.DataFrame({'col': ["A", "A", "B", "B"],
...                    'A': [80, np.nan, np.nan, np.nan],
...                    'B': [80, np.nan, 76, 67]})
>>> df
  col     A     B
0   A  80.0  80.0
1   A   NaN   NaN
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5)
  col     A     B
0   A  80.0  80.0
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5, axis=1)
  col     B
0   A  80.0
1   A   NaN
2   B  76.0
3   B  67.0

example 2:

>>> df = pd.DataFrame(np.random.randint(1, 10, (10,4)), columns=list('ABCD'))
>>> df.loc[3, 'B':] = np.nan
>>> df.loc[8, :'B'] = np.nan
>>> df.loc[1:6, 'C'] = np.nan
>>> df
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
3  7.0  NaN  NaN  NaN 
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
8  NaN  NaN  9.0  5.0
9  7.0  7.0  3.0  1.0

As we can see, index 3 (axis=0) has 75% missing values and column C (axis=1) has 60% missing values. With the percentage argument we can specify what the threshold is, but also consider row wise or column wise:

Example consider per row:

>>> df.dropna(perc=.4, axis=0)
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
9  7.0  7.0  3.0  1.0

Above we can see row 3 and 8 got dropped because these had > 40% missing values.

>>> df.dropna(perc=.4, axis=1)
     A    B    D
0  7.0  2.0  7.0
1  3.0  2.0  9.0
2  2.0  2.0  2.0
3  7.0  NaN  NaN
4  1.0  9.0  4.0
5  9.0  9.0  1.0
6  2.0  2.0  6.0
7  9.0  3.0  6.0
8  NaN  NaN  5.0
9  7.0  7.0  1.0

Same command but with axis=1, so we can consider percentage threshold per column, and we see that column C got dropped because it had 60% missing values

Metadata

Metadata

Assignees

No one assigned

    Labels

    API DesignEnhancementMissing-datanp.nan, pd.NaT, pd.NA, dropna, isnull, interpolateNeeds DiscussionRequires discussion from core team before further action

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions