Description
It's a bit of an open-ended question. In my understanding, up/down-sampling the input data depending on the target class is equivalent to training on a dataset with sample selection bias. The possible impact of the latter on ML models is discussed e.g. by Zadrozny 2004.
In the use case of imbalanced-learn, I gather this is not an issue because the sample selection happens only depending on the target variable `y`, not on any of the features in `X` (which corresponds to case 2 on page 2 of the above-linked paper)?
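For concreteness, here is a minimal sketch of what I mean by selection depending only on `y` (synthetic data; `RandomUnderSampler` is just one example of such a sampler):

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# The undersampler decides which rows to keep by looking only at y,
# never at the values in X -- i.e. P(selected | X, y) = P(selected | y).
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)

print(np.bincount(y), "->", np.bincount(y_res))
```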
An orthogonal question: assume we have a dataset with sample selection bias based on some feature in `X` (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of `X` does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?