Description
Duplicated Majority Class rows in RandomUnderSampler.fit_sample
Using the under-sampling function as in the example below:
rus = RandomUnderSampler(ratio = 0.3, random_state=0)
x_rus, y_rus = rus.fit_sample(x_train, y_train)
I found that majority-class rows were being duplicated even though there was plenty of data to choose from. Only 10% of my data belongs to the minority class, so with ratio = 0.3 there are more than enough majority-class rows available. Why would RandomUnderSampler duplicate rows in the majority class? I only found this because I attached a row_id to each row before passing the data to the under-sampler; when I examined my classifier training results and sorted the rows by row_id, I saw the duplicates.
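For reference, tagging each row with an identifier before resampling can be sketched roughly as below. The helper name `with_row_ids` is hypothetical (it is not part of imbalanced-learn), the feature values are made up for illustration, and the sketch assumes a 2-D numeric feature array.

```python
import numpy as np

def with_row_ids(x):
    """Prepend a 0-based row_id column to a 2-D feature array (hypothetical helper)."""
    row_ids = np.arange(x.shape[0]).reshape(-1, 1)
    return np.concatenate((row_ids, x), axis=1)

# Toy features, made up for illustration only
features = np.array([[100], [99], [98]])
tagged = with_row_ids(features)
print(tagged)
```

Because the row_id travels with the row through the sampler, any row that is selected twice shows up as a repeated value in the first column.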
Steps/Code to Reproduce
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Column 0 (a) is the row_id, column 1 (b) is a feature value
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]).reshape(20, 1)
b = np.array([100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 100, 99, 98, 97, 96, 95, 94, 93, 92, 91]).reshape(20, 1)
x_ds = np.concatenate((a, b), axis=1)

# 2 minority samples (label 1) and 18 majority samples (label 0)
y_ds = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(20, 1)

rus = RandomUnderSampler(ratio=0.2, random_state=0)
x_rus, y_rus = rus.fit_sample(x_ds, y_ds)
print(x_rus)
print(y_rus)
Expected Results
The row_id is the first column, so rows 6 and 7 of the output below should not both have row_id = 5. Using the same random_state=0, I can reproduce this error every time.
Here are the contents of the x_ds array being under-sampled:
array([[ 0, 100],
[ 1, 99],
[ 2, 98],
[ 3, 97],
[ 4, 96],
[ 5, 95],
[ 6, 94],
[ 7, 93],
[ 8, 92],
[ 9, 91],
[ 10, 100],
[ 11, 99],
[ 12, 98],
[ 13, 97],
[ 14, 96],
[ 15, 95],
[ 16, 94],
[ 17, 93],
[ 18, 92],
[ 19, 91]])
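Put differently, what I expected amounts to drawing the majority-class rows without replacement. A minimal sketch of that behavior in plain NumPy follows. It is not imbalanced-learn's implementation, the helper name `undersample_without_replacement` is hypothetical, and the target of 6 majority rows is an assumption taken from the size of the output shown under Actual Results.

```python
import numpy as np

def undersample_without_replacement(x, y, majority_label, n_majority, seed=0):
    """Keep all non-majority rows and draw n_majority distinct majority rows."""
    y = np.ravel(y)
    rng = np.random.RandomState(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # replace=False guarantees that no majority row is picked twice
    chosen = rng.choice(majority_idx, size=n_majority, replace=False)
    keep = np.concatenate((other_idx, chosen))
    return x[keep], y[keep]

# Same data as in the reproduction: column 0 is the row_id
row_ids = np.arange(20).reshape(20, 1)
values = np.tile(np.arange(100, 90, -1), 2).reshape(20, 1)
x_ds = np.concatenate((row_ids, values), axis=1)
y_ds = np.array([1, 1] + [0] * 18)

x_sub, y_sub = undersample_without_replacement(
    x_ds, y_ds, majority_label=0, n_majority=6
)
print(x_sub)
print(y_sub)
```

With 18 majority rows available and only 6 requested, every row_id in the result is distinct, which is the behavior I would expect from an under-sampler.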
Actual Results
[[ 0 100]
[ 1 99]
[ 14 96]
[ 17 93]
[ 2 98]
[ 5 95]
[ 5 95]
[ 9 91]]
[1 1 0 0 0 0 0 0]
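The duplicate can also be confirmed programmatically rather than by eye. A small check with `np.unique`, with the printed output above copied in by hand:

```python
import numpy as np

# x_rus as printed above; column 0 is the row_id
x_rus = np.array([[0, 100], [1, 99], [14, 96], [17, 93],
                  [2, 98], [5, 95], [5, 95], [9, 91]])

# Count how often each row_id occurs and keep those seen more than once
row_ids, counts = np.unique(x_rus[:, 0], return_counts=True)
duplicated = row_ids[counts > 1]
print(duplicated)  # → [5]
```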
Versions
Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.12 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')
('Imbalanced-Learn', '0.2.1')