Description
There was a question on the sync call today about defining "what is a data frame?". People may have different perspectives, but I wanted to offer mine:
A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets in a general-purpose programming language (a dataset, in turn, is a collection of columns, each with its own logical data type). The interface often exposes imperative, composable constructs, so a single operation may span multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.
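To make the imperative vs. declarative contrast concrete, here is a rough sketch using only the Python standard library. `ToyFrame` and its `filter` / `group_sum` methods are hypothetical names invented for illustration, not the API of pandas or any other library; the SQL side uses the standard `sqlite3` module.

```python
import sqlite3

# Hypothetical toy data: a "dataset" as named columns, each with one logical type.
columns = {
    "city": ["Oslo", "Lima", "Oslo", "Lima"],  # string column
    "sales": [10, 20, 30, 5],                  # integer column
}


class ToyFrame:
    """Hypothetical minimal data frame; not any real library's API."""

    def __init__(self, columns):
        self.columns = columns

    def filter(self, name, predicate):
        # Keep the row positions where the predicate holds for column `name`.
        keep = [i for i, v in enumerate(self.columns[name]) if predicate(v)]
        return ToyFrame({k: [col[i] for i in keep] for k, col in self.columns.items()})

    def group_sum(self, key, value):
        # Sum the `value` column within each distinct value of the `key` column.
        totals = {}
        for k, v in zip(self.columns[key], self.columns[value]):
            totals[k] = totals.get(k, 0) + v
        return totals


# Imperative, composable style: several statements in the host language.
frame = ToyFrame(columns)
large = frame.filter("sales", lambda v: v >= 10)
frame_result = large.group_sum("city", "sales")

# Declarative style: one SQL statement that describes the desired result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (city TEXT, sales INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", zip(columns["city"], columns["sales"]))
sql_result = dict(
    conn.execute("SELECT city, SUM(sales) FROM t WHERE sales >= 10 GROUP BY city")
)
conn.close()
```

Both paths compute the same per-city totals. The difference is in how the request is expressed: intermediate values such as `large` are ordinary objects in the host language that can be inspected or reused, whereas the SQL version states the whole result in one expression.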
The following, IMHO, are implementation-specific concerns that should not be part of the definition; any given "data frame library" may handle them differently:
- Presumptions of data representation (for example, most data frame libraries in Python have a bespoke representation built on lower-level data containers). This includes type-specific questions such as "how is a timestamp represented?" or "how is categorical data represented?", since the answers are implementation-dependent. Also, the fact that the programming interface has "columns" does not guarantee that the underlying data representation is columnar.
- Presumptions of data locality (in-memory, distributed in-memory, out-of-core, remote)
- Presumptions of execution semantics (eager vs. deferred)
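
As a rough illustration of the points above, the sketch below puts one interface in front of two toy backends that differ in representation (rows vs. columns) and in execution semantics (eager vs. deferred). Every name here (`Frame`, `EagerRowFrame`, `DeferredColumnFrame`, `select`, `to_rows`) is hypothetical and chosen only for this example; data locality is not modeled.

```python
from abc import ABC, abstractmethod


class Frame(ABC):
    """Hypothetical interface: names the operations, not the storage or timing."""

    @abstractmethod
    def select(self, *names): ...

    @abstractmethod
    def to_rows(self): ...


class EagerRowFrame(Frame):
    """Row-oriented storage, eager execution."""

    def __init__(self, rows):
        self._rows = rows  # list of dicts, one dict per row

    def select(self, *names):
        # The projection is computed immediately.
        return EagerRowFrame([{n: r[n] for n in names} for r in self._rows])

    def to_rows(self):
        return list(self._rows)


class DeferredColumnFrame(Frame):
    """Columnar storage, deferred execution."""

    def __init__(self, columns, selected=None):
        self._columns = columns    # dict mapping column name to list of values
        self._selected = selected  # projection is recorded here, not yet applied

    def select(self, *names):
        # No data is touched; only the requested projection is remembered.
        return DeferredColumnFrame(self._columns, names)

    def to_rows(self):
        # The recorded projection is applied only when results are requested.
        names = self._selected if self._selected is not None else tuple(self._columns)
        cols = [self._columns[n] for n in names]
        return [dict(zip(names, vals)) for vals in zip(*cols)]


def names_only(frame):
    """User code written against the interface only."""
    return frame.select("name").to_rows()


eager = EagerRowFrame([{"name": "a", "x": 1}, {"name": "b", "x": 2}])
deferred = DeferredColumnFrame({"name": ["a", "b"], "x": [1, 2]})
```

`names_only(eager)` and `names_only(deferred)` return the same rows, even though one backend stores rows and works immediately while the other stores columns and postpones work until `to_rows()` is called. User code like `names_only` cannot tell the difference, which is the sense in which these concerns stay out of the definition.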
Hopefully one objective of this group will be to define a standardized programming interface that avoids commingling implementation-specific details into the interface.
That said, there may be people who want to create "RPandas" (see RPython) -- i.e., to make it possible to substitute new objects into existing code that uses pandas. If that's what some people want, we will need to clarify that up front.