Description
It's a bit of an open-ended question. In my understanding, up/down-sampling the input data depending on the target class is equivalent to training on a dataset with sample selection bias. The possible impact of the latter on ML models is discussed e.g. by Zadrozny 2004.
In the use case of imbalanced-learn, I gather this is not an issue because the sample selection happens only depending on the target variable `y`, not on any of the features in `X` (which corresponds to case 2 on page 2 of the above-linked paper)?
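For concreteness, here is a minimal sketch of what I mean by selection depending only on `y` (synthetic data; `RandomUnderSampler` is just one example of such a sampler):

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# The undersampler decides which rows to keep by looking only at y,
# never at the values in X -- i.e. P(selected | X, y) = P(selected | y).
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)

print(np.bincount(y), "->", np.bincount(y_res))
```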
An orthogonal question: assume we have a dataset with sample selection bias based on some feature in `X` (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of `X` does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?