Description
Two proposals:
Consolidate all inference to the Index
constructor
- Retain
Index(...)
inferring the best container for the data passed - Remove
MultiIndex(data)
returning anIndex
when data is a list of length-1 tuples (xref API: Have MultiIndex consturctors always return a MI #17236)
Passing dtype=object
disables inference
Index(..., dtype=object)
disable all inference. So Index([1, 2], dtype=object)
will give you an Index
instead of Int64Index
, and Index([(1, 'a'), (2, 'b')], dtype=object)
an Index
instead of MultiIndex
, etc.
(original post follows)
Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient, but sometime difficult to reason able behavior. e.g. hash_tuples
currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.
Do we want to make our Index
constructors more predictable? For reference, here are some examples:
>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
labels=[[0, 1], [0, 1]])
>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')
# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')
# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)
# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
closed='right',
dtype='interval[int64]')
# 5.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')
Of these, I think the first (Index -> MultiIndex
if you have tuples) and the last (MultiIndex -> Index
if you're tuples are all length 1) are undesirable. The Index -> MultiIndex
one has the tupleize_cols
keyword to control this behavior. In #17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that [1, 2, 3]
magically returning an Int64Index is ok, but [(1, 2), (3, 4)]
returning a MI
isn't (maybe the difference between a MI and Index is larger than the difference between an Int64Index and Index?). I believe that in either the RangeIndex
or IntervalIndex
someone (@shoyer?) had objections to overloading the Index
constructor to return the specialized type.
So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.