Closed
Description
`train_test_apart_stratify()` produces different results for the same input data, even when setting `random_state=0`.
To reproduce this, I've adapted the example from the function's docstring to contain only strings (i.e., the values for `a` are now `str` instead of `int`). Run this several times to see different results.
import pandas
from pandas_streaming.df import train_test_apart_stratify
df = pandas.DataFrame([dict(a="1", b="e"),
                       dict(a="1", b="f"),
                       dict(a="2", b="e"),
                       dict(a="2", b="f")])
train, test = train_test_apart_stratify(
    df, group="a", stratify="b", test_size=0.5, random_state=0)
print(train)
print('-----------')
print(test)
The cause seems to be that the sets created in connex_split.py#L530 are then iterated over in connex_split.py#L543, but a set is an unordered collection. Replacing `ids[k]` with `sorted(ids[k])` on L543 seems to fix this.
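For anyone who wants to see the underlying effect in isolation, here is a minimal sketch that does not use pandas_streaming at all. It relies on general CPython behaviour: `str` hashes are randomized per interpreter process (controlled by `PYTHONHASHSEED`), so the iteration order of a set of strings can change from run to run, which would also explain why the original docstring example with `int` values does not show the problem. The names `CHILD` and `iteration_orders` are made up for this illustration, and the set contents are arbitrary.

```python
import os
import subprocess
import sys

# Child program (hypothetical, for illustration only): print a set of
# strings in raw iteration order, then in sorted order.
CHILD = (
    'ids = {"1", "2", "3", "4", "5", "6"}; '
    'print(",".join(ids)); '
    'print(",".join(sorted(ids)))'
)


def iteration_orders(seed):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.split()
    return out[0], out[1]  # (raw set order, sorted order)


# Each seed stands in for one separate run of the reproduction script.
results = [iteration_orders(seed) for seed in range(1, 6)]
raw_orders = {raw for raw, _ in results}
sorted_orders = {srt for _, srt in results}

print("distinct raw orders:   ", len(raw_orders))
print("distinct sorted orders:", len(sorted_orders))
```

The raw orders will typically differ between seeds, while the sorted order is always the same. That is why a seeded random generator alone is not enough here: if the sequence it shuffles or samples from starts in a different order on each run, the result differs too. Wrapping the set in `sorted()` removes that source of variation.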