Split from the discussions in #2.
To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API that we'll intentionally deviate from. To the extent possible, we should limit this discussion to issues with the API rather than the implementation.
- `pandas.DataFrame` can't implement `collections.abc.Mapping` because `.values` is a property returning an array, rather than a method. (added by @TomAugspurger)
- The "groupby-apply" pattern of passing opaque functions for non-trivial aggregations that are otherwise able to be expressed easily in e.g. SQL (consider: any aggregation expression that involves more than one column) (added by @wesm)
- Indexing in pandas accepts a variety of different inputs, each with its own semantics, e.g. passing a function to `__getitem__` or `loc`/`iloc`. The difference between `df[["a","b","c"]]`, `df[slice(5)]`, and `df[lambda idx: idx % 5 == 0]` is not clear to new users. (added by @devin-petersohn)
- pandas allows the dot operator (`__getattr__`) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
- Duplicate APIs:
  - Simple aliases: e.g. `isna` and `isnull`, `multiply` and `mul`, etc. (added by @devin-petersohn)
  - More complex duplication: e.g. `query("a > b")` and `df[df["a"] > df["b"]]` (added by @devin-petersohn)
  - Indexing: there are 7 or 8 ways to get one or more columns in pandas, e.g. `__getitem__`, `__getattr__`, `loc`, `iloc`, `apply`, `drop` (added by @devin-petersohn)
  - `merge` and `join` call each other and are confusing for new users (added by @devin-petersohn)
- Having a separate object to represent a one-column dataframe (i.e. `Series`). This creates the complexity of having to reimplement most dataframe functionality, and it does not provide a consistent way of applying operations to N columns (including 1). See: Separate object for a dataframe colum? (is Series needed?) #6 (@datapythonista)
- "missing" APIs / extension points:
  - These are APIs or extension points that pandas and/or numpy lack, and whose absence has, for one reason or another, led libraries that need to consume pandas objects (e.g. `DataFrame`, `Series`) to hard-code support for these types. This makes pandas work well with these libraries but means it's not easy (or even possible) for other `DataFrame` implementations to be supported. Lack of interop support between alternative `DataFrame` implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).
  - Introspection API for autocomplete / "IntelliSense" APIs.
    - In riptide we've implemented a hook + protocol and implemented it on our dataframe class `Dataset`. This provides more detailed data than a "static" tool like Jedi can return; compared to `dir`, our protocol allows our `Dataset` class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
    - Our protocol also allows us to provide richer metadata for data columns, for example the `dtype` or array subclass name; for Categoricals, we can provide the number of labels/categories.
    - The features mentioned above could alternatively be implemented through some property(ies) on the standardized `DataFrame` and/or `Array` APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary).
...
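
As a rough illustration of the introspection idea above, here is a minimal Python sketch of what such a hook could look like. This is not riptide's actual protocol: the method name `__column_completions__`, the `ColumnInfo` record, the `completions` helper, and the toy `Dataset` class are all hypothetical names invented for this example, and the "all strings means categorical" rule is only a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Set, runtime_checkable


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical per-column metadata an autocomplete UI could display."""
    dtype: str                          # e.g. "float"
    array_type: str                     # name of the array (sub)class
    n_categories: Optional[int] = None  # only set for categorical columns


@runtime_checkable
class SupportsColumnCompletions(Protocol):
    """Hypothetical hook: the object itself decides what a completer shows."""

    def __column_completions__(self) -> Dict[str, ColumnInfo]: ...


class Dataset:
    """Toy stand-in for a dataframe class; NOT riptide's real Dataset."""

    def __init__(self, columns: Dict[str, List[object]],
                 hidden: Optional[Set[str]] = None):
        self._columns = columns
        self._hidden = hidden or set()

    def __column_completions__(self) -> Dict[str, ColumnInfo]:
        out: Dict[str, ColumnInfo] = {}
        for name, data in self._columns.items():
            if name in self._hidden:
                continue  # the class, not the tool, controls what is listed
            # Placeholder rule (assumption): all-string columns are "categorical".
            if data and all(isinstance(v, str) for v in data):
                out[name] = ColumnInfo("str", "Categorical", len(set(data)))
            else:
                dtype = type(data[0]).__name__ if data else "object"
                out[name] = ColumnInfo(dtype, "list")
        return out


def completions(obj: object) -> Dict[str, ColumnInfo]:
    """What a completer might do: use the hook if present, else report nothing."""
    if isinstance(obj, SupportsColumnCompletions):
        return obj.__column_completions__()
    return {}


ds = Dataset(
    {"price": [1.5, 2.0, 2.5], "side": ["buy", "sell", "buy"], "_tmp": [0, 0, 0]},
    hidden={"_tmp"},
)
info = completions(ds)
```

The alternative mentioned in the last bullet would replace the single dictionary-returning method with plain properties on the standardized objects; the sketch only shows the protocol-style variant.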
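
Two of the API problems listed above, `.values` getting in the way of `collections.abc.Mapping` and dot access colliding with method names, can be sketched without pandas. The `ToyFrame` class below is a hypothetical stand-in written for this illustration; it mimics the relevant design choices but is not pandas code.

```python
from collections.abc import Mapping


class ToyFrame:
    """Hypothetical stand-in mimicking two pandas.DataFrame design choices."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)

    @property
    def values(self):
        # Like pandas: a property returning the data, not a method.
        return [list(row) for row in zip(*self._columns.values())]

    def count(self):
        # A regular method whose name may clash with a column name.
        return {name: len(col) for name, col in self._columns.items()}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes such as `count` always win over columns.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


frame = ToyFrame({"a": [1, 2], "count": [10, 20]})

# The Mapping ABC defines values() as a method; ToyFrame exposes a property.
mapping_values_is_method = callable(Mapping.values)
frame_values_is_property = isinstance(ToyFrame.values, property)

# Generic mapping-style code calls .values(); here that raises TypeError
# because the property has already returned a list.
try:
    frame.values()
    mapping_style_call_works = True
except TypeError:
    mapping_style_call_works = False

attr_a = frame.a             # falls through to __getattr__: column "a"
attr_count = frame.count     # bound method, NOT the "count" column
item_count = frame["count"]  # item access is unambiguous
```

In this sketch the column named `count` is reachable only through item access, which is the kind of ambiguity the dot-operator bullet describes.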