ENH: add percentage threshold to DataFrame.dropna

**Is your feature request related to a problem?**
When doing feature selection for models, or with handling missing data, in most cases you want to dynamically set a threshold to drop column which exceed this. A percentage is a more general metric to use to see how many data is missing data per row or per column. Right now we have the `thresh` argument, which accepts an int, but in most cases this is not desirable. For example when running the code in production, the shape of the data changes over time and thus a percentage threshold would make more sense to me. Or when you don't know the data well enough, what does "a number" mean in this case. Percentage is standarized.

Besides that, the thresh right now is not flexible to go over column or index axis. For data cleaning you might want to drop rows which exceed the threshold, but for feature selection, you want to treat each column separately.

**Describe the solution you'd like**
Have a percentage threshold for both column and index axis.

**API breaking implications**
Not that I can think of, but not 100% sure.

**Describe alternatives you've considered**
Writing own custom functions.

**Additional context**

Locally I created a version where the `perc` argument is created (see linked draft PR):

**example 1**:
```python
>>> df = pd.DataFrame({'col': ["A", "A", "B", "B"],
...                    'A': [80, np.nan, np.nan, np.nan],
...                    'B': [80, np.nan, 76, 67]})
>>> df
  col     A     B
0   A  80.0  80.0
1   A   NaN   NaN
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5)
  col     A     B
0   A  80.0  80.0
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5, axis=1)
  col     B
0   A  80.0
1   A   NaN
2   B  76.0
3   B  67.0
```
---
**example 2**:
```python
>>> df = pd.DataFrame(np.random.randint(1, 10, (10,4)), columns=list('ABCD'))
>>> df.loc[3, 'B':] = np.nan
>>> df.loc[8, :'B'] = np.nan
>>> df.loc[1:6, 'C'] = np.nan
>>> df
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
3  7.0  NaN  NaN  NaN 
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
8  NaN  NaN  9.0  5.0
9  7.0  7.0  3.0  1.0
```
As we can see, index 3 (`axis=0`) has 75% missing values and column C (`axis=1`) has 60% missing values. With the `percentage` argument we can specify what the threshold is, but also consider row wise or column wise:

Example consider **per row**:
```python
>>> df.dropna(perc=.4, axis=0)
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
9  7.0  7.0  3.0  1.0
```
Above we can see row 3 and 8 got dropped because these had ` > 40%` missing values.
```python
>>> df.dropna(perc=.4, axis=1)
     A    B    D
0  7.0  2.0  7.0
1  3.0  2.0  9.0
2  2.0  2.0  2.0
3  7.0  NaN  NaN
4  1.0  9.0  4.0
5  9.0  9.0  1.0
6  2.0  2.0  6.0
7  9.0  3.0  6.0
8  NaN  NaN  5.0
9  7.0  7.0  1.0
```
Same command but with `axis=1`, so we can consider percentage threshold **per column**, and we see that column C got dropped because it had 60% missing values





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: add percentage threshold to DataFrame.dropna #35299

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

ENH: add percentage threshold to DataFrame.dropna #35299

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions