Sample selection bias and up/down-sampling #540

Open
@rth

Description

It's a bit of an open-ended question. In my understanding, up-/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed, e.g., by Zadrozny (2004).

In the use case of imbalanced-learn, I gather this is not an issue because the sample selection happens only depending on the target variable y, not on any of the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?
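To make the "case 2" point concrete, here is a minimal sketch (plain NumPy, not the actual imbalanced-learn implementation) of random under-sampling, where the decision to keep or drop a sample is a function of y alone and never looks at the feature values in X:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 majority (y=0) and 10 minority (y=1) samples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# Random under-sampling: the selection rule depends only on y
# (case 2 in Zadrozny 2004), never on the features in X.
minority_count = np.sum(y == 1)
keep_majority = rng.choice(np.flatnonzero(y == 0),
                           size=minority_count, replace=False)
keep = np.concatenate([keep_majority, np.flatnonzero(y == 1)])

X_res, y_res = X[keep], y[keep]
print(y_res.mean())  # balanced classes: 0.5
```

Because the selection probability is constant within each class, only the class prior P(y) is distorted, while P(X | y) is left untouched.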

An orthogonal question: assume we do have a dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in scope for this project?
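For reference, the classical correction for this "case 3" setting is importance weighting: each sample is weighted by the density ratio p_target(x) / p_source(x) of the biased feature, and the weights are passed as `sample_weight` to the estimator (or used as resampling probabilities). A hypothetical sketch, assuming we know the real-world distribution of the biased column bin-wise (the `age` column, bin edges, and target proportions below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biased column: the observed 'age' distribution differs
# from the real-world one we want to compensate towards.
ages = rng.integers(20, 70, size=1000)

# Bin the column and compare observed (source) vs. assumed target proportions.
bins = np.array([20, 40, 60, 70])
source_counts, _ = np.histogram(ages, bins=bins)
source_probs = source_counts / source_counts.sum()
target_probs = np.array([0.5, 0.3, 0.2])  # assumed real-world proportions

# Importance weight per sample: p_target(bin) / p_source(bin).
bin_idx = np.digitize(ages, bins[1:-1])
weights = target_probs[bin_idx] / source_probs[bin_idx]

# These weights could be passed as `sample_weight` to most scikit-learn
# estimators, or used as probabilities for weighted resampling.
print(weights[:5])
```

With bin-wise weights like these, the weighted empirical proportions of the column match the target proportions exactly; with a continuous density-ratio estimator the match is only approximate.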
