Description
This is an attempt to simplify/streamline the internal API that has been brewing in my head for quite a while. It means a significant overhaul and may take time, but it may prove worthwhile. I'm putting it here for discussion ahead of time to make sure the effort isn't wasted on going in the wrong direction.
Idea
The idea is simple: an "augmented take" operation (one where -1 takes from nowhere and creates a new column) is enough to express any reindexing/merging/joining that may happen at the Index level. So the lower levels of the API that do the heavy lifting can be relieved of the burden of operating on labels and keeping them in sync. This will make them more self-contained, with the following benefits:
- simplify implementation
- regularize data-handling operations, making them more dependable: no more "oh no, I have a duplicate/timestamp/period/multiindex/etc. label in the index, all slicing operations are now 10x slower" (this has happened to me more times than I'm proud of)
- simplify contributions from new developers, no more tracking down all the code paths up to the public API to fix one small check.
- make code more test- and benchmark-friendly, no more exponential growth in number of tests/benchmarks for each new feature
- declaring an API will simplify developing and maintaining "unconventional" storages (sparse, categorical, compressed, etc.)
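To make the "augmented take" idea concrete, here is a minimal NumPy sketch. The helper names `augmented_take` and `label_indexer` are hypothetical and not existing pandas functions; the sketch also assumes unique labels and a NaN fill value. For comparison, `Index.get_indexer` in pandas already follows the same convention of returning -1 for labels it cannot find.

```python
import numpy as np


def label_indexer(old_labels, new_labels):
    """Translate labels to positions; -1 marks labels absent from old_labels.

    Hypothetical helper, assumes old_labels are unique.
    """
    positions = {label: i for i, label in enumerate(old_labels)}
    return np.array([positions.get(label, -1) for label in new_labels],
                    dtype=np.intp)


def augmented_take(values, indexer, fill_value=np.nan):
    """Take columns of a 2-D array by position; -1 creates a new filled column.

    Hypothetical helper: it only sees integer positions, never labels.
    """
    values = np.asarray(values)
    indexer = np.asarray(indexer, dtype=np.intp)
    # The result dtype must be able to hold fill_value (ints become floats for NaN).
    dtype = np.result_type(values.dtype, np.asarray(fill_value).dtype)
    out = np.empty((values.shape[0], len(indexer)), dtype=dtype)
    missing = indexer == -1
    out[:, ~missing] = values[:, indexer[~missing]]
    out[:, missing] = fill_value
    return out


data = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Reindex columns ["a", "b", "c"] to ["c", "x", "a"]; "x" does not exist yet.
indexer = label_indexer(["a", "b", "c"], ["c", "x", "a"])
result = augmented_take(data, indexer)
```

Only `label_indexer` touches labels; everything below it works on plain integer positions, which is the separation the proposal is after.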
There's a three-year-old ticket that mentions a similar (if not the same) idea. As mentioned there, this may break pickles and other deserialization, so it will require a separate legacy deserialization compatibility layer.
Another ticket mentions moving Block & BlockManager internals to the Cython level; dropping the axes dependency will definitely facilitate that.
Goals
The end goal is to have internals layered as follows:
- Block: a proper homogeneous ndarray
  - think numpy.ndarray with all necessary fixes/workarounds
  - datatype inference
  - support for custom pandas datatypes
  - typical (slice, concatenate) and pandas-specific (take-with-insert) operations
- RemappableBlock: homogeneous ndarray that supports "remapping" one of its axes
  - think Block + ref_locs
  - ref_locs should be Int64Index (platform-int-index?, also RangeIndex will help)
  - Block instances may be shared between RemappableBlocks
  - optimizations for no-remapping mode of work (think SingleBlockManager)
  - FIXME: better name?
- BlockManager: a proper heterogeneous ndarray
  - external interface is similar to that of Block
  - can share RemappableBlocks
- NDFrame: labeled heterogeneous ndarray
  - more or less equivalent to current NDFrames
  - merging/joining/reindexing only appears at this level
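The layering above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pandas internals: the class bodies are hypothetical, each block row is assumed to hold one item, and `ref_locs` is a plain integer array rather than an Int64Index.

```python
import numpy as np


class Block:
    """Homogeneous 2-D storage that knows nothing about labels (hypothetical)."""

    def __init__(self, values):
        self.values = np.asarray(values)


class RemappableBlock:
    """A Block plus ref_locs: the position of each block row in its container."""

    def __init__(self, block, ref_locs):
        ref_locs = np.asarray(ref_locs, dtype=np.intp)
        if len(ref_locs) != block.values.shape[0]:
            raise ValueError("ref_locs must have one entry per block row")
        self.block = block
        self.ref_locs = ref_locs

    def remap(self, new_ref_locs):
        """Return a new placement of the same data; the Block itself is shared."""
        return RemappableBlock(self.block, new_ref_locs)


class BlockManager:
    """Heterogeneous container addressed purely by integer position."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def get(self, loc):
        """Return the item stored at container position loc."""
        for rblock in self.blocks:
            hits = np.flatnonzero(rblock.ref_locs == loc)
            if len(hits):
                return rblock.block.values[hits[0]]
        raise KeyError(loc)


floats = Block(np.array([[1.0, 2.0], [3.0, 4.0]]))  # two float items
ints = Block(np.array([[10, 20]]))                   # one integer item

# The float items sit at positions 0 and 2, the integer item at position 1.
mgr = BlockManager([RemappableBlock(floats, [0, 2]),
                    RemappableBlock(ints, [1])])

# Swapping the two float items only changes ref_locs; no data is copied.
swapped = mgr.blocks[0].remap([2, 0])
```

Labels never appear in this sketch; in the proposed design they would live only in the NDFrame layer, which would translate them into the positions used here.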
Deliverables
- Stage 1 DONE
  - make ref_locs primary source of information (leaving items/ref_items in place to back it up and avoid breakage)
  - port merging/joining internals to loc-based implementation (there's quite a number of hacks a.t.m. that make this non-trivial)
  - drop Block items & ref_items fields (ensure io backward compatibility!)
  - fix performance issues & integrate with mainline
- Stage 2: TBA
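As an illustration of what a loc-based join might look like, the sketch below reduces an outer join to two indexers, one per side, with -1 marking rows that a side does not have. The function names are hypothetical, labels are assumed unique, and the data is assumed to be float so that NaN can serve as the fill value. pandas exposes a comparable result through `Index.join(..., return_indexers=True)`.

```python
import numpy as np


def outer_join_indexers(left_labels, right_labels):
    """Return joined labels plus one indexer per side (-1 means no such row).

    Hypothetical helper, assumes labels are unique within each side.
    """
    seen = set(left_labels)
    joined = list(left_labels) + [x for x in right_labels if x not in seen]
    left_pos = {label: i for i, label in enumerate(left_labels)}
    right_pos = {label: i for i, label in enumerate(right_labels)}
    left_idx = np.array([left_pos.get(x, -1) for x in joined], dtype=np.intp)
    right_idx = np.array([right_pos.get(x, -1) for x in joined], dtype=np.intp)
    return joined, left_idx, right_idx


def take_rows(values, indexer, fill_value=np.nan):
    """Position-only take along axis 0; -1 yields a row filled with fill_value."""
    values = np.asarray(values, dtype=float)
    out = np.full((len(indexer),) + values.shape[1:], fill_value, dtype=float)
    present = indexer != -1
    out[present] = values[indexer[present]]
    return out


left_values = np.array([1.0, 2.0, 3.0])   # labelled "a", "b", "c"
right_values = np.array([20.0, 40.0])     # labelled "b", "d"

joined, left_idx, right_idx = outer_join_indexers(["a", "b", "c"], ["b", "d"])
merged = np.column_stack([take_rows(left_values, left_idx),
                          take_rows(right_values, right_idx)])
```

The label logic is confined to `outer_join_indexers`; `take_rows` never sees a label, so it could be tested and benchmarked on integer inputs alone.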