
[ENH] Switch estimator operator logical checks to be interface based rather than inheritance based #856

Closed
@NV-jpt

Description


General Introduction and Relevance

Currently, many imbalanced-learn samplers require that an estimator explicitly inherit from scikit-learn mixin classes in order to determine what kind of estimator an object is; this check is often performed by check_neighbors_object().

This inheritance-based programming model provides consistency, but it also limits flexibility. Switching these checks to duck typing rather than inheritance would still preserve consistency while allowing users to combine imbalanced-learn with other scikit-learn API-compliant libraries that cannot directly subclass scikit-learn, such as cuML. Using cuML with imbalanced-learn can be significantly faster, as shown in the examples below, which achieve roughly 180x and 680x speedups with the example configuration.
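To illustrate the limitation, here is a minimal sketch of the inheritance-based pattern described above; all class and function names are stand-ins for illustration, not the actual imbalanced-learn or scikit-learn source:

```python
class KNeighborsMixin:
    """Stand-in for scikit-learn's KNeighborsMixin."""


class SklearnStyleKNN(KNeighborsMixin):
    """Estimator that subclasses the mixin, as scikit-learn's estimators do."""

    def kneighbors(self, X=None, n_neighbors=None):
        pass


class CumlStyleKNN:
    """Same public API, but no scikit-learn base class (as with cuML)."""

    def kneighbors(self, X=None, n_neighbors=None):
        pass


def check_neighbors_object_by_inheritance(estimator):
    # The isinstance() check rejects any API-compliant estimator that
    # does not inherit from the mixin, even though it would work fine.
    if not isinstance(estimator, KNeighborsMixin):
        raise ValueError("estimator must subclass KNeighborsMixin")
    return estimator


check_neighbors_object_by_inheritance(SklearnStyleKNN())  # accepted
try:
    check_neighbors_object_by_inheritance(CumlStyleKNN())
except ValueError:
    pass  # rejected despite exposing the same API
```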

Additional Context

A key limitation is that users cannot cleanly integrate estimators from libraries that do not directly depend on scikit-learn but do enforce the same API contract.

A duck-typing-based programming model would make imbalanced-learn more flexible; duck typing in PyData is becoming increasingly powerful as more libraries standardize on NumPy- and scikit-learn-like API contracts rather than inheritance (Dask, xarray, Pangeo, CuPy, RAPIDS, and more).

The TPOT community recently adopted duck typing to allow GPU-accelerated cuML estimators and has seen great results: https://github.com/EpistasisLab/tpot/blob/master/tutorials/Higgs_Boson.ipynb

Proposal

In check_neighbors_object(), replace the mixin class check with a check for the key attributes that determine whether an object is a certain type of estimator:

Instead of checking for KNeighborsMixin, check for the key attributes that make an object KNeighborsMixin-like:

  • kneighbors()
  • kneighbors_graph()
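The check above could be sketched as follows; the function name and the stand-in estimator class are assumptions for illustration, not the actual imbalanced-learn implementation:

```python
def is_kneighbors_like(estimator):
    """Duck-typing check: accept any object exposing the KNeighborsMixin API."""
    return all(
        callable(getattr(estimator, attr, None))
        for attr in ("kneighbors", "kneighbors_graph")
    )


class ApiCompliantKNN:
    """Stand-in for an estimator (e.g. from cuML) that follows the
    scikit-learn neighbors API without subclassing scikit-learn."""

    def kneighbors(self, X=None, n_neighbors=None):
        pass

    def kneighbors_graph(self, X=None, n_neighbors=None):
        pass


print(is_kneighbors_like(ApiCompliantKNN()))  # True
print(is_kneighbors_like(object()))           # False
```

Because the check only inspects the object's attributes, any estimator that honors the API contract passes, regardless of its class hierarchy.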

If there is potential interest, I'd be happy to open up a pull request for further discussion.

Potential Impact

Adopting duck typing for these checks would immediately allow using cuML (a GPU-accelerated machine learning library with a scikit-learn-like API) with imbalanced-learn. It would also open the door to other libraries, current and future, without requiring imbalanced-learn to have any knowledge of or involvement with them.

As a concrete example of the benefit, using cuML estimators on the GPU instead of scikit-learn can provide a large speedup, and faster estimators can make a significant difference in total runtime. The benchmark graphs below provide a couple of small examples (reproducing gist linked at the bottom) that correspond to 180x and 680x speedups using cuML on a GPU versus the default scikit-learn on CPU cores.

This proposed change would affect a number of samplers, allowing them to integrate estimators that do not directly subclass scikit-learn. This list includes ADASYN, SMOTE, BorderlineSMOTE, SMOTENC, EditedNearestNeighbours, and RepeatedEditedNearestNeighbours.

Other samplers, such as CondensedNearestNeighbour, KMeansSMOTE, SMOTEN, and SMOTEENN, incorporate additional steps in their _validate_estimator() methods and will require further modifications to integrate seamlessly with estimators that do not directly subclass scikit-learn.

[Benchmark graphs for the ADASYN and SMOTE examples]

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698 (2.2 GHz, 16 cores) and an NVIDIA V100 32 GB GPU

Benchmarking Code:
https://gist.github.com/NV-jpt/276be3fe57b0ca384dbdabeba4a7e643
