SMOTENC MemoryError #752

Closed
@MokaddemMouna

Hi,
I have an imbalanced dataset that contains both continuous and categorical features, and I am trying to use SMOTENC to oversample the minority class. I pass SMOTENC the raw categorical features (strings). When I run it on a tiny subset of my original dataset (about 188 samples), it works fine and generates new samples with raw categorical features. But when I run it on the full dataset (~3M samples), I get the error below.
The shape (42507, 72255) in the error suggests that the algorithm is one-hot encoding my raw categorical features under the hood. I don't understand this, because the original SMOTE paper handles categorical features using the median of the standard deviations of the continuous features, so categorical features should not need to be encoded before being passed to SMOTENC. While debugging, I found that when std = 0, some calculation is done with the one-hot encoding to include it in the distance, as far as I understand. The line below then raises the error when it tries to put together the minority-class samples and their corresponding neighbors:

# convert to dense array since scipy.sparse doesn't handle 3D
nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
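
As a rough illustration of why the array could end up so wide, here is a minimal sketch that counts the columns a dataset would have after one-hot encoding. It assumes that each categorical feature is expanded into one column per distinct value, which is what the shape in the error suggests. The helper `one_hot_width` and the toy rows are hypothetical and are not part of imbalanced-learn.

```python
def one_hot_width(rows, categorical_idx):
    """Number of columns after one-hot encoding the categorical features.

    Hypothetical helper: continuous features keep one column each, and
    every categorical feature contributes one column per distinct value.
    """
    n_features = len(rows[0])
    n_continuous = n_features - len(categorical_idx)
    n_categories = sum(len({row[i] for row in rows}) for i in categorical_idx)
    return n_continuous + n_categories


# Toy data: two categorical features (colour, size) and one continuous one.
rows = [
    ("red", "S", 1.0),
    ("blue", "M", 2.5),
    ("green", "S", 0.3),
    ("red", "L", 4.1),
]

width = one_hot_width(rows, categorical_idx=[0, 1])
print(width)  # 1 continuous + 3 colours + 3 sizes = 7
```

With high-cardinality string features, the same count grows with the number of distinct values, so a width in the tens of thousands is plausible under this assumption.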

Here's my code and the error.

from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

# x_tot, y_tot: full feature matrix and labels (~3M samples)
oversample = SMOTENC(categorical_features=[0, 1, 2, 3, 4, 11, 12],
                     k_neighbors=5,
                     sampling_strategy={1: 60000},
                     n_jobs=8)
undersample = RandomUnderSampler(sampling_strategy={0: 120000})
x_train, y_train = oversample.fit_resample(x_tot, y_tot)
x_train, y_train = undersample.fit_resample(x_train, y_train)

File "/home/manou/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 1189, in _process_toarray_args
   return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 22.9 GiB for an array with shape (42507, 72255) and data type float64
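
The 22.9 GiB figure in the error can be checked by hand: a dense float64 array stores 8 bytes per element regardless of how many entries are zero. A quick sketch of the arithmetic, using only the shape and dtype reported in the message:

```python
# Shape and dtype taken from the MemoryError message above.
n_rows, n_cols = 42507, 72255
bytes_per_float64 = 8

dense_bytes = n_rows * n_cols * bytes_per_float64
dense_gib = dense_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{dense_gib:.1f} GiB")  # 22.9 GiB
```

So the failure comes from materializing the whole sparse matrix as a dense array in one allocation, not from the number of non-zero entries it contains.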
