CLN: revisit & simplify core data structures

This is an attempt to simplify and streamline the internal API, an idea that has been brewing in my head for quite a while. It means a significant overhaul and may take time, but it may prove worthwhile. I'm putting it here for discussion ahead of time to make sure the effort isn't wasted by going in the wrong direction.
### Idea

The idea is simple: an "augmented take" operation (with -1 taking from nowhere and creating a new column) is enough to express any reindexing/merging/joining that may happen at the Index level. So the lower levels of the API that do the heavy lifting can be relieved of the burden of operating on labels and keeping them in sync. This will make them more self-contained, with the following benefits:
- simplify implementation
- regularize data-handling operations, making them more dependable: no more "oh no, I have a duplicate/timestamp/period/multiindex/etc. label in the index, all slicing operations are now 10x slower". This has happened to me more times than I'm proud of.
- simplify contributions from new developers: no more tracking down all the code paths up to the public API to fix one small check.
- make code more test- and benchmark-friendly: no more exponential growth in the number of tests/benchmarks for each new feature
- declaring an API will simplify developing and maintaining "unconventional" storages (sparse, categorical, compressed, etc.)

There's a [three-year-old ticket](https://github.com/pydata/pandas/issues/162) that mentions a similar (if not the same) idea. As mentioned there, this _may_ break pickles and other deserialization, and thus it will require a separate legacy deserialization compatibility layer.

[Another ticket](https://github.com/pydata/pandas/issues/163) mentions moving Block & BlockManager internals to the Cython level; dropping the axes dependency will definitely facilitate that.
### Goals

The end goal is to have internals layered as follows:
- **Block**: a _proper_ homogeneous ndarray
  - think numpy.ndarray with all necessary fixes/workarounds
  - datatype inference
  - support for custom pandas datatypes
  - typical (slice, concatenate) and pandas-specific (take-with-insert) operations
- **RemappableBlock**: homogeneous ndarray that supports "remapping" one of its axes
  - think _Block_ + _ref_locs_
  - _ref_locs_ should be Int64Index (platform-int-index?, also [RangeIndex](https://github.com/pydata/pandas/issues/939) will help)
  - _Block_ instances may be shared between _RemappableBlocks_
  - optimizations for no-remapping mode of work (think _SingleBlockManager_)
  - FIXME: better name?
- **BlockManager**: a _proper_ heterogeneous ndarray
  - external interface is similar to that of **Block**
  - can share _RemappableBlocks_
- **NDFrame**: labeled heterogeneous ndarray
  - more or less equivalent to current _NDFrames_
  - merging/joining/reindexing only appears at this level
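
The two lowest layers could be sketched roughly as follows. This is only an illustration under stated assumptions: the class bodies, the `take` and `get` method names, and the interpretation of `ref_locs` as "container position of each block row" are hypothetical and not taken from the pandas code base.

```python
import numpy as np


class Block:
    """Homogeneous storage that knows nothing about labels (hypothetical)."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def take(self, indexer, fill_value=np.nan):
        # Take-with-insert: -1 in the indexer produces a new, filled row.
        indexer = np.asarray(indexer, dtype=np.intp)
        missing = indexer == -1
        if not missing.any():
            return Block(self.values[indexer])
        dtype = np.result_type(self.values.dtype, np.asarray(fill_value).dtype)
        out = np.empty((len(indexer),) + self.values.shape[1:], dtype=dtype)
        out[~missing] = self.values[indexer[~missing]]
        out[missing] = fill_value
        return Block(out)


class RemappableBlock:
    """A Block plus ref_locs, the container position of each block row."""

    def __init__(self, block, ref_locs):
        self.block = block  # may be shared between RemappableBlocks
        self.ref_locs = np.asarray(ref_locs, dtype=np.intp)
        if len(self.ref_locs) != len(block.values):
            raise ValueError("ref_locs must have one entry per block row")

    def get(self, loc):
        # Translate a container-level position into a block row.
        (rows,) = np.nonzero(self.ref_locs == loc)
        if len(rows) == 0:
            raise KeyError(loc)
        return self.block.values[rows[0]]


# One Block shared by two remappings that place its rows differently.
shared = Block([[1.0, 2.0], [3.0, 4.0]])
first = RemappableBlock(shared, ref_locs=[0, 2])
second = RemappableBlock(shared, ref_locs=[2, 0])
padded = shared.take([1, -1])
```

Neither class touches labels; everything is expressed in integer positions, which is the property the layering above is after.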
### Deliverables
- Stage 1 **DONE**
  - make ref_locs primary source of information (leaving items/ref_items in place to back it up and avoid breakage)
  - port merging/joining internals to a loc-based implementation (there are quite a few hacks at the moment that make this non-trivial)
  - drop Block `items` & `ref_items` fields (ensure IO backward compatibility!)
  - fix performance issues & integrate with mainline
- Stage 2: TBA
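
As a rough illustration of what "loc-based" joining can mean, the sketch below reduces an outer join of two label sets to a pair of positional indexers, with -1 marking labels absent from one side. The function name `outer_join_indexers` is hypothetical, and unique labels are assumed.

```python
import numpy as np


def outer_join_indexers(left_labels, right_labels):
    """Return joined labels plus one positional indexer per side (-1 = absent).

    Hypothetical helper; assumes labels are unique within each side.
    """
    left_pos = {label: i for i, label in enumerate(left_labels)}
    right_pos = {label: i for i, label in enumerate(right_labels)}
    # Preserve first-seen order while dropping repeated labels.
    joined = list(dict.fromkeys(list(left_labels) + list(right_labels)))
    left_idx = np.array([left_pos.get(x, -1) for x in joined], dtype=np.intp)
    right_idx = np.array([right_pos.get(x, -1) for x in joined], dtype=np.intp)
    return joined, left_idx, right_idx


joined, left_idx, right_idx = outer_join_indexers(["a", "b"], ["b", "c"])

# Applying an indexer to data: the -1 entries index the last element first,
# then np.where overwrites those slots with the fill value.
left_vals = np.array([10.0, 20.0])
left_taken = np.where(left_idx == -1, np.nan, left_vals[left_idx])
```

After this step the data layer only receives `left_idx` and `right_idx`; the labels themselves stay at the top level.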

