API: Implement pipe protocol, a method for extensible method chaining

Today @shoyer and I were talking about a new "protocol" that will let us sidestep the whole macro / method chaining issue. The basic idea is that pandas objects define a `pipe` method (ideally other libraries will implement this too, assuming it proves useful).

Based on the discussions below, we're leaning towards a method like

``` python
def pipe(self, func, *args, **kwargs):
    pipe_func = getattr(func, '__pipe_func__', func)
    return pipe_func(self, *args, **kwargs)
```

That's it. This lets you write code like:

``` python
import seaborn as sns

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda df: df.sepal_length / df.sepal_width)
    .pipe(lambda df: sns.violinplot(x='species', y='sepal_ratio', data=df))
)
```

seaborn didn't have to do anything! If the DataFrame is the first argument to the function, things are even simpler:

``` python
def f(x):
    return x / 2

df['sepal_length'].pipe(f)
```

Users or libraries can work around the need for the (somewhat ugly) `lambda _:` by using the `__pipe_func__` attribute of the function being `pipe`d in. This is where a protocol (de facto or official) would be useful, since libraries that know nothing else about each other can rely on it. As an example, consider seaborn's violin plot, which expects a DataFrame as its fourth argument, `data`. Seaborn can define a simple decorator to attach a `__pipe_func__` attribute, allowing _it_ to define how it expects to be `pipe`d to.

``` python
def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the Seaborn methods as pipeable

@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...   
```

And users write

``` python
(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda x: x.sepal_length / x.sepal_width)
    .pipe(sns.violinplot, x='species', y='sepal_ratio')
)
```
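
To make the dispatch concrete without depending on pandas or seaborn, here is a minimal, self-contained sketch. `Table`, `head`, and `describe` are hypothetical stand-ins invented for illustration (they are not part of any library); only `pipe` and `pipeable` mirror the snippets above.

```python
class Table:
    """Hypothetical stand-in for a DataFrame; it only stores rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def pipe(self, func, *args, **kwargs):
        # Same dispatch as the proposed method: use the function's own
        # __pipe_func__ if it has one, otherwise call the function directly.
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


def pipeable(func):
    """Attach a __pipe_func__ that passes the piped object as `data=`."""
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func


@pipeable
def describe(x=None, y=None, data=None):
    # Hypothetical plotting-style function: `data` is not the first argument.
    return "{} vs {} over {} rows".format(x, y, len(data.rows))


def head(table, n):
    # Plain function taking the piped object first; no decorator needed.
    return Table(table.rows[:n])


summary = Table([1, 2, 3, 4]).pipe(head, 2).pipe(describe, x='a', y='b')
print(summary)  # a vs b over 2 rows
```

Note that `describe` remains callable the ordinary way (`describe(x='a', y='b', data=table)`); the decorator only adds an attribute and returns the original function.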

### Why?

Heavily nested function calls are bad. They're hard to read, and can easily introduce bugs. Consider:

``` python
# f, g, and h are functions that take and return a DataFrame
result = f(g(h(df), arg1=1), arg2=2, arg3=3)
```

For pandas, the approach has been to add `f`, `g`, and `h` as methods to (say) `DataFrame`

``` python
(df.h()
   .g(arg1=1)
   .f(arg2=2, arg3=3)
)
```

The code is certainly cleaner. It reads and flows top to bottom instead of inside-out, and the function arguments sit next to the function calls. But there's a hidden cost: DataFrame has something like 200+ methods, which is crazy. This approach is also less flexible for users, since it's hard to get their own functions into pipelines (short of monkey-patching). With `.pipe`, we can write

``` python
(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)
```

The other way around the nested calls is using temporary variables:

``` python
r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)
```

This is better than the nested version, but not as good as the `.pipe` solution.
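
As a quick sanity check that the three styles compute the same thing, here is a small sketch. `Frame` is a hypothetical minimal container carrying the proposed `pipe` method, and the bodies of `f`, `g`, and `h` are made up purely for illustration.

```python
class Frame:
    """Hypothetical minimal container with the proposed pipe method."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


# Illustrative functions that each take and return a Frame.
def h(frame):
    return Frame(v for v in frame.values if v > 0)


def g(frame, arg1):
    return Frame(v + arg1 for v in frame.values)


def f(frame, arg2, arg3):
    return Frame(v * arg2 - arg3 for v in frame.values)


df = Frame([-1, 2, 5])

# Nested calls: read inside-out.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# Pipe chain: read top to bottom, arguments next to their function.
piped = (df.pipe(h)
           .pipe(g, arg1=1)
           .pipe(f, arg2=2, arg3=3)
)

# Temporary variables: same order as the pipe chain, but with extra names.
r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)

print(nested.values == piped.values == r3.values)  # True
```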

---

A relevant thread on python-ideas, started by @mrocklin: https://mail.python.org/pipermail/python-ideas/2015-March/032745.html

This doesn't achieve everything macros could. We still can't do things like `df.plot(x=x_col, y=y_col)` where `x_col` and `y_col` are captured by `df`'s namespace. But it may be good enough.

Going to cc a bunch of people here, who've had interest in the past.

@shoyer 
@mrocklin
@datnamer 
@dalejung

