down_sampling method is choosing duplicate rows of majority class when not needed #287

Closed
@Dwtliao

Duplicated Majority Class rows in RandomUnderSampler.fit_sample

I am using the under-sampling function as in the example below:


rus = RandomUnderSampler(ratio = 0.3, random_state=0)

x_rus, y_rus = rus.fit_sample(x_train, y_train)

I found that majority-class rows were being duplicated even though there was plenty of data to choose from. Only 10% of my data belongs to the minority class, and with ratio = 0.3 there are plenty of majority-class rows available, so why would RandomUnderSampler duplicate rows in the majority class? I only found this issue because I attached a row_id to each row before passing the data to the down-sampler; when I examined my classifier's training results and sorted the rows by row_id, I saw the duplicates.
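The row_id check described above can be sketched with the standard library alone. This is only an illustration: the helper name `duplicated_row_ids` is hypothetical, and the list of ids is copied from the first column of the resampled output reported in this issue.

```python
from collections import Counter


def duplicated_row_ids(row_ids):
    """Return {row_id: count} for every row_id that occurs more than once."""
    counts = Counter(row_ids)
    return {rid: n for rid, n in counts.items() if n > 1}


# First column (row_id) of the resampled array reported in this issue.
sampled_ids = [0, 1, 14, 17, 2, 5, 5, 9]
print(duplicated_row_ids(sampled_ids))  # {5: 2}
```

With a NumPy array, the first column could be passed as `x_rus[:, 0].tolist()`.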

Steps/Code to Reproduce

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Column 0 (a) is a unique row_id; column 1 (b) is a feature value.
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
b = np.array([100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 100, 99, 98, 97, 96, 95, 94, 93, 92, 91])
a = a.reshape(20, 1)
b = b.reshape(20, 1)
x_ds = np.concatenate((a, b), axis=1)

# Labels: 2 minority rows (class 1) and 18 majority rows (class 0).
y_ds = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_ds = y_ds.reshape(20, 1)

rus = RandomUnderSampler(ratio=0.2, random_state=0)
x_rus, y_rus = rus.fit_sample(x_ds, y_ds)
print(x_rus)
print(y_rus)

Expected Results

row_id is the 1st column, so rows 6 and 7 of the output below should not both have row_id = 5. Using the same random_state=0 every time, I can reproduce this error.
Here are the contents of the x_ds array being down-sampled:
array([[  0, 100],
       [  1,  99],
       [  2,  98],
       [  3,  97],
       [  4,  96],
       [  5,  95],
       [  6,  94],
       [  7,  93],
       [  8,  92],
       [  9,  91],
       [ 10, 100],
       [ 11,  99],
       [ 12,  98],
       [ 13,  97],
       [ 14,  96],
       [ 15,  95],
       [ 16,  94],
       [ 17,  93],
       [ 18,  92],
       [ 19,  91]])

Actual Results

[[  0 100]
 [  1  99]
 [ 14  96]
 [ 17  93]
 [  2  98]
 [  5  95]
 [  5  95]
 [  9  91]]
[1 1 0 0 0 0 0 0]
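A repeated row_id like the one above is what drawing indices with replacement can produce. The sketch below contrasts the two ways of drawing using only Python's `random` module. It is an assumption-laden illustration and not the library's implementation: `majority_ids` and `k` are taken from the example in this issue (row_ids 2 to 19 are class 0, and 6 majority rows appear in the output).

```python
import random

# Assumed from the example: row_ids 2..19 belong to the majority class.
majority_ids = list(range(2, 20))
k = 6  # number of majority rows kept in the reported output

rng = random.Random(0)

# Drawing with replacement: the same row_id may be picked more than once.
with_replacement = rng.choices(majority_ids, k=k)

# Drawing without replacement: every picked row_id is distinct.
without_replacement = rng.sample(majority_ids, k)

print(sorted(with_replacement))
print(sorted(without_replacement))
```

Because 18 majority rows are available and only 6 are needed, drawing without replacement is always possible here, which matches the expectation stated in this report.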

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.12 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')
('Imbalanced-Learn', '0.2.1')
