Skip to content

Index Constructors inferring output from data #17246

Open
@TomAugspurger

Description

@TomAugspurger

Two proposals:

Consolidate all inference to the Index constructor

Passing dtype=object disables inference

Index(..., dtype=object) disable all inference. So Index([1, 2], dtype=object) will give you an Index instead of Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of MultiIndex, etc.

(original post follows)


Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient, but sometime difficult to reason able behavior. e.g. hash_tuples currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.

Do we want to make our Index constructors more predictable? For reference, here are some examples:

>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
           labels=[[0, 1], [0, 1]])

>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')

# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')

# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)

# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)

# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')

# 5.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')

Of these, I think the first (Index -> MultiIndex if you have tuples) and the last (MultiIndex -> Index if you're tuples are all length 1) are undesirable. The Index -> MultiIndex one has the tupleize_cols keyword to control this behavior. In #17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that [1, 2, 3] magically returning an Int64Index is ok, but [(1, 2), (3, 4)] returning a MI isn't (maybe the difference between a MI and Index is larger than the difference between an Int64Index and Index?). I believe that in either the RangeIndex or IntervalIndex someone (@shoyer?) had objections to overloading the Index constructor to return the specialized type.

So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.

cc @jreback, @jorisvandenbossche, @shoyer

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions