Reproducibility in train_test_apart_stratify()

`train_test_apart_stratify()` produces different results for the same input data, even when setting `random_state=0`.

To reproduce this, I've adapted the example from the function's docstring so that it contains only strings (i.e., the values of `a` are now `str` instead of `int`). Run the script several times, each in a fresh Python process, to see different results.

```python
import pandas
from pandas_streaming.df import train_test_apart_stratify

df = pandas.DataFrame([dict(a="1", b="e"),
                       dict(a="1", b="f"),
                       dict(a="2", b="e"),
                       dict(a="2", b="f")])

# random_state is fixed, yet the split still changes between runs
train, test = train_test_apart_stratify(
    df, group="a", stratify="b", test_size=0.5, random_state=0)
print(train)
print('-----------')
print(test)
```

The cause seems to be that the sets created in [connex_split.py#L530](https://github.com/sdpython/pandas-streaming/blob/9753f32b958c7675eab3356412bae944bbe101b9/pandas_streaming/df/connex_split.py#L530) are iterated over in [connex_split.py#L543](https://github.com/sdpython/pandas-streaming/blob/9753f32b958c7675eab3356412bae944bbe101b9/pandas_streaming/df/connex_split.py#L543), but a set is an [unordered collection](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset), so the iteration order of string elements can differ from one run to the next. Replacing `ids[k]` with `sorted(ids[k])` on L543 seems to fix this.
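To illustrate the suspected mechanism independently of pandas-streaming, here is a minimal sketch. It assumes the variation comes from Python's per-process string hash randomization (controlled by the `PYTHONHASHSEED` environment variable); the strings `'e'` to `'h'` and the helper `run_with_seed` are made up for this illustration and are not taken from the library.

```python
import os
import subprocess
import sys

# Illustrative snippets only: a set of strings, iterated raw vs. sorted.
RAW_SNIPPET = "print(list({'e', 'f', 'g', 'h'}))"
SORTED_SNIPPET = "print(sorted({'e', 'f', 'g', 'h'}))"


def run_with_seed(code, seed):
    """Run `code` in a fresh interpreter with a fixed string hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Collect the distinct orderings observed across several hash seeds.
raw_orders = {run_with_seed(RAW_SNIPPET, seed) for seed in range(1, 9)}
sorted_orders = {run_with_seed(SORTED_SNIPPET, seed) for seed in range(1, 9)}

# The raw iteration order may vary with the seed; the sorted one cannot.
print("distinct raw orders:", len(raw_orders))
print("distinct sorted orders:", len(sorted_orders))
```

If this is indeed the cause, it would also explain why the original docstring example with `int` values does not show the problem as readily: integer hashes are not randomized between processes, whereas `str` hashes are.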
