Avoiding the "pandas trap"

Split from the discussions in https://github.com/pydata-apis/dataframe-api/issues/2.

To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API that we'll intentionally deviate from. To the extent possible, we should limit this discussion to issues with the API rather than the implementation.

---

- pandas.DataFrame can't implement [`collections.abc.Mapping`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping) because `.values` is a property returning an array, rather than a method. (added by @TomAugspurger)
- The "groupby-apply" pattern: non-trivial aggregations are passed as opaque functions even when they could be expressed easily in e.g. SQL (consider any aggregation expression that involves more than one column). (added by @wesm)
- Indexing in pandas accepts a variety of different inputs, each with its own semantics, e.g. passing a function to `__getitem__` or `loc`/`iloc`. The difference between `df[["a","b","c"]]`, `df[slice(5)]`, and `df[lambda idx: idx % 5 == 0]` is not obvious to new users. (added by @devin-petersohn)
- pandas allows the dot operator (`__getattr__`) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
- Duplicate APIs:
	- Simple aliases: e.g. `isna` and `isnull`, `multiply` and `mul`, etc. (added by @devin-petersohn)
	- More complex duplication: e.g. `query("a > b")` and `df[df["a"] > df["b"]]` (added by @devin-petersohn)
	- Indexing: there are 7 or 8 ways to get one or more columns in pandas, e.g. `__getitem__`, `__getattr__`, `loc`, `iloc`, `apply`, `drop` (added by @devin-petersohn)
	- `merge` and `join` call each other and are confusing for new users (added by @devin-petersohn)
- Having a separate object to represent a one-column dataframe (i.e. `Series`). This creates the complexity of reimplementing most dataframe functionality, and it provides no consistent way of applying operations to N columns (including 1). #6 @datapythonista
- "missing" APIs / extension points:
	- These are APIs or extension points that pandas and/or numpy *lack*, and whose absence has -- for one reason or another -- led libraries that need to consume pandas objects (e.g. ``DataFrame``, ``Series``) to hard-code support for these types. This makes pandas work well with these libraries but means it's not easy (or even possible) for other ``DataFrame`` implementations to be supported. The lack of interop between alternative ``DataFrame`` implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).
	- Introspection API for autocomplete / "IntelliSense" APIs.
		- In _riptide_ we've implemented a hook + protocol and implemented it on our dataframe class ``Dataset``. This provides more-detailed data compared to what a "static" tool like Jedi can return; compared to ``dir``, our protocol allows our ``Dataset`` class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
		- Our protocol also allows us to provide richer metadata for data columns, for example the ``dtype`` or array subclass name; for Categoricals, we can provide the number of labels/categories.
		- The features mentioned above could alternatively be implemented through some property(ies) on the standardized ``DataFrame`` and/or ``Array`` APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary).
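
To make the introspection idea above a bit more concrete, here is a rough sketch of what such a hook could look like. Everything in it is an assumption for illustration: the method name `_introspect_columns_`, the `ColumnInfo` record, and `ToyFrame` are made up here and are not riptide's actual protocol or part of any existing library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ColumnInfo:
    """Metadata a frontend could show next to a completion (hypothetical)."""
    name: str
    dtype: str
    n_categories: Optional[int] = None  # only set for categorical columns


@runtime_checkable
class SupportsColumnIntrospection(Protocol):
    """Hypothetical hook; the method name is an assumption, not a standard."""

    def _introspect_columns_(self) -> list[ColumnInfo]: ...


class ToyFrame:
    """Stand-in dataframe that stores its columns as plain Python lists."""

    def __init__(self, columns: dict[str, list]):
        self._columns = columns

    def _introspect_columns_(self) -> list[ColumnInfo]:
        infos = []
        for name, data in self._columns.items():
            if data and all(isinstance(v, str) for v in data):
                # For the demo, treat all-string columns as categorical.
                infos.append(ColumnInfo(name, "category", len(set(data))))
            else:
                dtype = type(data[0]).__name__ if data else "object"
                infos.append(ColumnInfo(name, dtype))
        return infos


def completions(obj: object, prefix: str = "") -> list[ColumnInfo]:
    """What an autocomplete frontend might call instead of dir()."""
    if not isinstance(obj, SupportsColumnIntrospection):
        return []  # object does not opt in to the hook
    return [c for c in obj._introspect_columns_() if c.name.startswith(prefix)]


frame = ToyFrame({"price": [1.5, 2.0, 2.5], "side": ["buy", "sell", "buy"]})
suggestions = completions(frame, prefix="s")
print(suggestions)
```

The point of the sketch is only that the dataframe, not the tool, decides which names are offered and what metadata accompanies them; whether this should be a method like this or plain properties on the standardized objects is exactly the open question in the last sub-item.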
...
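
As an illustration of the first item in the list (the `Mapping` conflict), here is a minimal sketch of a column container where `values` is a method, so `collections.abc.Mapping` can be implemented directly. `MappingFrame` is a hypothetical toy class written for this example, not pandas and not a proposed API; it also assumes columns are only reachable through `[]`, which sidesteps the `__getattr__` name clashes mentioned above.

```python
from collections.abc import Mapping


class MappingFrame(Mapping):
    """Hypothetical column container, keyed by column name."""

    def __init__(self, columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = {name: list(col) for name, col in columns.items()}

    # The three abstract methods that Mapping requires.
    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        # Number of columns, as for a dict. Note that this differs from
        # pandas, where len(df) is the number of rows.
        return len(self._columns)


# A column literally named "values" is fine, because columns are not
# exposed as attributes in this sketch.
frame = MappingFrame({"a": [1, 2, 3], "values": [4, 5, 6]})

# keys(), values(), items(), get() and `in` come from the Mapping mixin.
print(list(frame.keys()))
print(callable(frame.values))
print(frame["values"])
print(dict(frame))
```

One trade-off this makes visible: following `Mapping` means `len()` counts columns, which is itself a deviation from pandas that would need to be decided deliberately.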

Avoiding the "pandas trap" #4
