
API: Implement pipe protocol, a method for extensible method chaining #10129

Closed
@TomAugspurger


Today @shoyer and I were talking about a new "protocol" that will let us sidestep the whole macro / method chaining issue. The basic idea is that pandas objects define a pipe method (ideally other libraries will implement this too, assuming it proves useful).

Based on the discussions below, we're leaning towards a method like

def pipe(self, func, *args, **kwargs):
    pipe_func = getattr(func, '__pipe_func__', func)
    return pipe_func(self, *args, **kwargs)

That's it. This lets you write code like:

import seaborn as sns

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda df: df.sepal_length / df.sepal_width)
    .pipe(lambda df: sns.violinplot(x='species', y='sepal_ratio', data=df))
)

seaborn didn't have to do anything! If the DataFrame is the first argument to the function, things are even simpler:

def f(x):
    return x / 2

df['sepal_length'].pipe(f)
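To try the proposed method without pandas, here is a minimal sketch. Box is a hypothetical stand-in for a pandas object, and halve and scale are made-up functions; only the body of pipe comes from the proposal above.

```python
class Box:
    """Hypothetical stand-in for a pandas object; it only wraps a list."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        # Same body as the proposed method: use func.__pipe_func__ if the
        # function defines one, otherwise call the function itself.
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


def halve(box):
    # Takes the piped object as its first argument, so no lambda is needed.
    return Box(v / 2 for v in box.values)


def scale(box, factor=1):
    # Extra keyword arguments are forwarded by pipe.
    return Box(v * factor for v in box.values)


result = Box([2, 4, 6]).pipe(halve).pipe(scale, factor=10)
print(result.values)
```

Neither halve nor scale defines __pipe_func__, so getattr falls back to the function itself and the piped object is passed as the first positional argument.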

Users or libraries can work around the need for the (somewhat ugly) lambda by using the __pipe_func__ attribute of the function being piped in. This is where a protocol (de facto or official) would be useful, since libraries that know nothing else about each other can rely on it. As an example, consider seaborn's violinplot, which expects a DataFrame as its fourth argument, data. Seaborn can define a simple decorator that attaches a __pipe_func__ attribute, letting each function declare how it expects to be piped to.

def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the seaborn functions as pipeable

@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...   

And users write

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio = lambda x: x.sepal_length / x.sepal_width)
    .pipe(sns.violinplot, x='species', y='sepal_ratio')
)
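The decorator approach can be checked end to end without seaborn. In this sketch, describe is a hypothetical function that mimics the violinplot signature (data as the fourth, keyword argument), and pipe is written as a free function so that a plain list can stand in for the DataFrame; both are assumptions for illustration only.

```python
def pipeable(func):
    # Attach a wrapper that routes the piped object to the data keyword.
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func


@pipeable
def describe(x=None, y=None, hue=None, data=None):
    # Hypothetical plotting-style function; it just reports what it received.
    return {'x': x, 'y': y, 'rows': len(data)}


def pipe(obj, func, *args, **kwargs):
    # Free-function version of the proposed method, for objects without .pipe.
    pipe_func = getattr(func, '__pipe_func__', func)
    return pipe_func(obj, *args, **kwargs)


rows = [
    {'species': 'setosa', 'sepal_ratio': 1.5},
    {'species': 'virginica', 'sepal_ratio': 2.1},
]

# Piped call: no lambda needed, rows is routed to data= by __pipe_func__.
summary = pipe(rows, describe, x='species', y='sepal_ratio')

# Direct call still works, because the decorator returns func unchanged.
direct = describe(x='species', y='sepal_ratio', data=rows)
print(summary == direct)
```

Because pipeable returns the original function and only adds an attribute, existing callers are unaffected; only callers that go through pipe see the rerouted argument.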

Why?

Heavily nested function calls are bad. They're hard to read, and can easily introduce bugs. Consider:

# f, g, and h are functions that take and return a DataFrame
result = f(g(h(df), arg1=1), arg2=2, arg3=3)

For pandas, the approach has been to add f, g, and h as methods on (say) DataFrame:

(df.h()
   .g(arg1=1)
   .f(arg2=2, arg3=3)
)

The code is certainly cleaner. It reads and flows top to bottom instead of inside-out. The function arguments are next to the function calls. But there's a hidden cost. DataFrame has something like 200+ methods, which is crazy. It's also less flexible for users, since it's hard to get their own functions into pipelines (short of monkey-patching). With .pipe, we can write

(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)

The other way around the nested calls is using temporary variables:

r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)

This is better than nesting, but not as good as the .pipe solution.
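As a sanity check that the three styles compute the same thing, here is a small sketch. Piped is a hypothetical list subclass carrying the proposed pipe method, and the bodies of f, g, and h are invented purely so there is something to compare.

```python
class Piped(list):
    """Hypothetical list subclass with the proposed pipe method."""

    def pipe(self, func, *args, **kwargs):
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


# Made-up f, g, h: each takes a Piped and returns a Piped.
def h(d):
    return Piped(v for v in d if v > 0)


def g(d, arg1=0):
    return Piped(v + arg1 for v in d)


def f(d, arg2=1, arg3=0):
    return Piped(v * arg2 - arg3 for v in d)


df = Piped([-1, 2, 5])

# Nested calls: read inside-out.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# Temporary variables: read top to bottom, but leave r1 and r2 behind.
r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)

# Pipe: read top to bottom, with no intermediate names.
piped = (df.pipe(h)
           .pipe(g, arg1=1)
           .pipe(f, arg2=2, arg3=3)
)

print(nested == r3 == piped)
```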


A relevant thread on python-ideas, started by @mrocklin: https://mail.python.org/pipermail/python-ideas/2015-March/032745.html

This doesn't achieve everything macros could. We still can't do things like df.plot(x=x_col, y=y_col) where x_col and y_col are captured by df's namespace. But it may be good enough.

Going to cc a bunch of people here, who've had interest in the past.

@shoyer
@mrocklin
@datnamer
@dalejung
