CLN: revisit & simplify core data structures #6744

Closed
@immerrr

Description

This is an attempt to simplify/streamline the internal API that has been brewing inside my head for quite a while. It does mean a significant overhaul and may take time, but it may prove worth the while. I'm putting it here for discussion ahead of time to make sure the effort isn't wasted on going in the wrong direction.

Idea

The idea is simple: an "augmented take" operation, in which an index of -1 takes from nowhere and creates a new column, is enough to express any reindexing/merging/joining that may happen at the Index level. So, the lower levels of the API that do the heavy lifting may be relieved of the burden of operating on labels and keeping them in sync. This will make them more self-contained, with the following benefits:

  • simplify implementation
  • regularize data-handling operations, making them more dependable: no more "oh no, I have a duplicate/timestamp/period/multiindex/etc. label in the index, all slicing operations are now 10x slower". This has happened to me more times than I'm proud of.
  • simplify contributions from new developers: no more tracking down all the code paths up to the public API to fix one small check
  • make code more test- and benchmark-friendly: no more exponential growth in the number of tests/benchmarks for each new feature
  • simplify developing and maintaining "unconventional" storages (sparse, categorical, compressed, etc.) by declaring an API
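To make the "augmented take" idea concrete, here is a minimal sketch in plain NumPy. The helper names `get_indexer` and `augmented_take` are hypothetical and not part of the pandas internals; the sketch also assumes unique labels, float data, and that the remapped axis is axis 1. It only illustrates how a reindex can be split into a label step that produces positions and a position step that never sees labels.

```python
import numpy as np


def get_indexer(old_labels, new_labels):
    """Position of each new label in old_labels, or -1 if it is absent."""
    # Hypothetical helper; assumes old_labels are unique.
    positions = {label: i for i, label in enumerate(old_labels)}
    return np.array([positions.get(label, -1) for label in new_labels], dtype=np.intp)


def augmented_take(values, indexer, fill_value=np.nan):
    """Take columns of a 2-D array; -1 yields a new column of fill_value."""
    values = np.asarray(values, dtype=float)
    indexer = np.asarray(indexer, dtype=np.intp)
    # ndarray.take treats -1 as "last column", so take first and then
    # overwrite the columns that were requested with -1.
    out = values.take(indexer, axis=1)
    out[:, indexer == -1] = fill_value
    return out


old_labels = ["a", "b", "c"]
values = [[1, 2, 3], [4, 5, 6]]

# Label level: translate a reindex request into positions once.
indexer = get_indexer(old_labels, ["c", "x", "a"])  # positions 2, -1, 0

# Position level: no labels are involved from here on.
result = augmented_take(values, indexer)
```

Here `result` holds the old columns `c` and `a` in their new places and a NaN-filled column for the unknown label `x`. A join or merge would differ only in how the indexer is computed, not in what the lower level has to do.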

There's a three-year-old ticket that mentions a similar (if not the same) idea. As mentioned there, this may break pickles and other deserialization, so it will require a separate compatibility layer for deserializing legacy data.

Another ticket mentions moving Block & BlockManager internals to the Cython level; dropping the axes dependency will definitely facilitate that.

Goals

The end goal is to have internals layered as follows:

  • Block: a proper homogeneous ndarray
    • think numpy.ndarray with all necessary fixes/workarounds
    • datatype inference
    • support for custom pandas datatypes
    • typical (slice, concatenate) and pandas-specific (take-with-insert) operations
  • RemappableBlock: homogeneous ndarray that supports "remapping" one of its axes
    • think Block + ref_locs
    • ref_locs should be Int64Index (platform-int-index?, also RangeIndex will help)
    • Block instances may be shared between RemappableBlocks
    • optimizations for no-remapping mode of work (think SingleBlockManager)
    • FIXME: better name?
  • BlockManager: a proper heterogeneous ndarray
    • external interface is similar to that of Block
    • can share RemappableBlocks
  • NDFrame: labeled heterogeneous ndarray
    • more or less equivalent to current NDFrames
    • merging/joining/reindexing only appears at this level
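The layering above could be sketched roughly as follows. This is an illustration only: the class bodies, the `iget` method, and the choice of axis 0 as the remapped axis are assumptions made for the example and do not reflect the actual or proposed pandas implementation. The NDFrame layer, which would own the labels, is left out.

```python
import numpy as np


class Block:
    """Homogeneous 2-D array; each row is one item (assumed layout)."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def take(self, indexer, fill_value=np.nan):
        """Take-with-insert along axis 0; -1 inserts a row of fill_value."""
        indexer = np.asarray(indexer, dtype=np.intp)
        # Cast to float so that NaN fits; a real Block would pick a dtype.
        out = self.values.take(indexer, axis=0).astype(float)
        out[indexer == -1] = fill_value
        return Block(out)


class RemappableBlock:
    """A Block plus ref_locs: where each of its rows sits in the container."""

    def __init__(self, block, ref_locs):
        self.block = block
        self.ref_locs = np.asarray(ref_locs, dtype=np.intp)


class BlockManager:
    """Heterogeneous container addressed purely by integer location."""

    def __init__(self, rblocks):
        self.rblocks = list(rblocks)

    def iget(self, loc):
        """Return the row stored at container position loc."""
        for rb in self.rblocks:
            hits = np.flatnonzero(rb.ref_locs == loc)
            if len(hits):
                return rb.block.values[hits[0]]
        raise IndexError(loc)


floats = Block([[1.0, 2.0], [3.0, 4.0]])
ints = Block([[10, 20]])

# The float rows occupy positions 0 and 2, the int row position 1.
mgr = BlockManager([RemappableBlock(floats, [0, 2]), RemappableBlock(ints, [1])])

# The same Block can back a second RemappableBlock with another mapping.
swapped = RemappableBlock(floats, [1, 0])

# Take-with-insert at the Block level: row 1, then a new all-NaN row.
taken = floats.take([1, -1])
```

Nothing below the NDFrame level refers to labels in this sketch; each layer only adds integer bookkeeping on top of the one beneath it.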

Deliverables

  • Stage 1: DONE
    • make ref_locs primary source of information (leaving items/ref_items in place to back it up and avoid breakage)
    • port merging/joining internals to a loc-based implementation (there are quite a number of hacks at the moment that make this non-trivial)
    • drop Block items & ref_items fields (ensure io backward compatibility!)
    • fix performance issues & integrate with mainline
  • Stage 2: TBA

Metadata

Labels: Internals (related to non-user accessible pandas implementation)
