Index Constructors inferring output from data

Two proposals:

## Consolidate all inference to the `Index` constructor

- Retain `Index(...)` inferring the best container for the data passed
- Remove `MultiIndex(data)` returning an `Index` when data is a list of length-1 tuples (xref https://github.com/pandas-dev/pandas/pull/17236)

## Passing `dtype=object` disables inference

`Index(..., dtype=object)` disable all inference. So `Index([1, 2], dtype=object)` will give you an `Index` instead of `Int64Index`, and `Index([(1, 'a'), (2, 'b')], dtype=object)` an `Index` instead of `MultiIndex`, etc.

(original post follows)

---


Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient, but sometime difficult to reason able behavior. e.g. `hash_tuples` currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.

Do we want to make our `Index` constructors more predictable? For reference, here are some examples:

```python
>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
           labels=[[0, 1], [0, 1]])

>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')

# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')

# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)

# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)

# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')

# 5.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')
```

Of these, I think the first (`Index -> MultiIndex` if you have tuples) and the last (`MultiIndex -> Index` if you're tuples are all length 1) are undesirable. The `Index -> MultiIndex` one has the `tupleize_cols` keyword to control this behavior. In https://github.com/pandas-dev/pandas/pull/17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that `[1, 2, 3]` magically returning an Int64Index is ok, but `[(1, 2), (3, 4)]` returning a `MI` isn't (maybe the difference between a MI and Index is larger than the difference between an Int64Index and Index?). I believe that in either the `RangeIndex` or `IntervalIndex` someone (@shoyer?) had objections to overloading the `Index` constructor to return the specialized type.

So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.

cc @jreback, @jorisvandenbossche, @shoyer 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Index Constructors inferring output from data #17246

Consolidate all inference to the `Index` constructor

Passing `dtype=object` disables inference

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Index Constructors inferring output from data #17246

Description

Consolidate all inference to the Index constructor

Passing dtype=object disables inference

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Consolidate all inference to the `Index` constructor

Passing `dtype=object` disables inference