Reproducibility in train_test_apart_stratify() #41

Closed
@stephengmatthews

Description

train_test_apart_stratify() produces different results for the same input data, even when setting random_state=0.

To reproduce this, I've adapted the example from the function's docstring to contain only strings (i.e., the values for a are now str instead of int). Run this several times to see different results.

import pandas
from pandas_streaming.df import train_test_apart_stratify

df = pandas.DataFrame([dict(a="1", b="e"),
                       dict(a="1", b="f"),
                       dict(a="2", b="e"),
                       dict(a="2", b="f")])

train, test = train_test_apart_stratify(
    df, group="a", stratify="b", test_size=0.5, random_state=0)
print(train)
print('-----------')
print(test)

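The cause seems to be that the sets created in connex_split.py#L530 are then iterated over in connex_split.py#L543, but a set is an unordered collection, so the iteration order is not guaranteed to be the same from one run to the next. This would also explain why the original docstring example with int values looked stable: Python randomizes str hashes per process by default (unless PYTHONHASHSEED is fixed), whereas small int hashes are not randomized. Replacing ids[k] with sorted(ids[k]) on L543 seems to fix this.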
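To illustrate the suspected mechanism independently of pandas_streaming, here is a minimal sketch. It starts several fresh Python processes with different PYTHONHASHSEED values and compares the iteration order of a set of strings with and without sorted(). The variable name ids only mirrors the one mentioned above, and the set contents and the helper order_under_seed are made up for the illustration; none of this is the library's actual code.

```python
import os
import subprocess
import sys

# Code run in each child process. The set contents are illustrative only;
# they stand in for the str group values stored in ids[k].
CHILD_CODE = """
import sys
ids = {"1", "2", "3", "4", "5", "6", "7", "8"}
items = sorted(ids) if sys.argv[1] == "sorted" else list(ids)
print(",".join(items))
"""


def order_under_seed(seed, mode):
    """Return the iteration order seen by a fresh interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, mode],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Collect the distinct orders observed across a few hash seeds.
raw_orders = {order_under_seed(seed, "raw") for seed in range(5)}
sorted_orders = {order_under_seed(seed, "sorted") for seed in range(5)}

print(len(raw_orders), "distinct order(s) when iterating the set directly")
print(len(sorted_orders), "distinct order(s) when iterating sorted(ids)")
```

The direct iteration will typically show more than one order across seeds, while the sorted variant always yields a single order. That is consistent with sorted(ids[k]) making the split reproducible for a given random_state.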
