Description
Today @shoyer and I were talking about a new "protocol" that will let us sidestep the whole macro / method chaining issue. The basic idea is that pandas objects define a pipe method (ideally other libraries will implement this too, assuming it proves useful).
Based on the discussions below, we're leaning towards a method like
def pipe(self, func, *args, **kwargs):
    pipe_func = getattr(func, '__pipe_func__', func)
    return pipe_func(self, *args, **kwargs)
That's it. This lets you write code like:
import seaborn as sns

(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio=lambda df: df.sepal_length / df.sepal_width)
    .pipe(lambda df: sns.violinplot(x='species', y='sepal_ratio', data=df))
)
seaborn didn't have to do anything! If the DataFrame is the first argument to the function, things are even simpler:
def f(x):
    return x / 2

df['sepal_length'].pipe(f)
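To make the mechanics concrete, here's a minimal sketch of the proposed method that runs without pandas. Box, halve, and add are hypothetical names made up for illustration; only the body of pipe comes from the proposal above.

```python
class Box:
    """Toy container standing in for a pandas object (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        # Same body as the proposed method: prefer the function's
        # __pipe_func__ attribute if it has one, else call it directly.
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


def halve(box):
    # Takes the piped object as its first argument, so no lambda is needed.
    return Box(v / 2 for v in box.values)


def add(box, amount=0):
    # Extra keyword arguments are forwarded by pipe unchanged.
    return Box(v + amount for v in box.values)


result = Box([2, 4, 6]).pipe(halve).pipe(add, amount=1)
print(result.values)  # [2.0, 3.0, 4.0]
```

Neither function defines __pipe_func__, so getattr falls back to the function itself and the piped object is passed as the first positional argument.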
Users or libraries can work around the need for the (somewhat ugly) lambda _: by using the __pipe_func__ attribute of the function being piped in. This is where a protocol (de facto or official) would be useful, since libraries that know nothing else about each other can rely on it. As an example, consider seaborn's violin plot, which expects a DataFrame as its fourth argument, data. Seaborn can define a simple decorator to attach a __pipe_func__ attribute, allowing it to define how it expects to be piped to.
def pipeable(func):
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func

# now we decorate all the Seaborn methods as pipeable
@pipeable
def violinplot(x=None, y=None, hue=None, data=None, ...):
    # ...
And users write
(sns.load_dataset('iris')
    .query("sepal_length > 5")
    .assign(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
    .pipe(sns.violinplot, x='species', y='sepal_ratio')
)
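Here's a rough end-to-end sketch of the __pipe_func__ protocol that runs without pandas or seaborn. Table and describe_plot are hypothetical stand-ins (describe_plot returns a string instead of drawing anything); pipe and pipeable are copied from the snippets above.

```python
class Table:
    """Toy stand-in for a DataFrame (hypothetical, not a pandas type)."""

    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


def pipeable(func):
    # Attach a wrapper that routes the piped object to the `data` keyword.
    def pipe_func(data, *args, **kwargs):
        return func(*args, data=data, **kwargs)
    func.__pipe_func__ = pipe_func
    return func


@pipeable
def describe_plot(x=None, y=None, hue=None, data=None):
    # Hypothetical plotting function with `data` as its fourth argument.
    return f"plot {y} by {x} over {len(data.rows)} rows"


table = Table([
    {'species': 'setosa', 'sepal_ratio': 1.5},
    {'species': 'virginica', 'sepal_ratio': 2.1},
])

piped = table.pipe(describe_plot, x='species', y='sepal_ratio')
direct = describe_plot(x='species', y='sepal_ratio', data=table)
print(piped)  # plot sepal_ratio by species over 2 rows
```

Because the decorator returns the original function with only an attribute added, calling describe_plot directly still works as before; the attribute is only consulted by pipe.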
Why?
Heavily nested function calls are bad. They're hard to read, and can easily introduce bugs. Consider:
# f, g, and h are functions that take and return a DataFrame
result = f(g(h(df), arg1=1), arg2=2, arg3=3)
For pandas, the approach has been to add f, g, and h as methods to (say) DataFrame:
(df.h()
   .g(arg1=1)
   .f(arg2=2, arg3=3)
)
The code is certainly cleaner. It reads and flows top to bottom instead of inside-out. The function arguments are next to the function calls. But there's a hidden cost. DataFrame has something like 200+ methods, which is crazy. Adding methods is also less flexible for users, since it's hard to get their own functions into pipelines (short of monkey-patching). With .pipe, we can write
(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)
The other way around the nested calls is using temporary variables:
r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)
That is better than nesting, but not as good as the .pipe solution.
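As a sanity check that the three styles are interchangeable, here's a small sketch using a toy Frame class in place of DataFrame. The bodies of h, g, and f are made up for illustration; only their call signatures follow the examples above.

```python
class Frame:
    """Toy stand-in for a DataFrame (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)

    def pipe(self, func, *args, **kwargs):
        pipe_func = getattr(func, '__pipe_func__', func)
        return pipe_func(self, *args, **kwargs)


def h(frame):
    return Frame(v + 1 for v in frame.values)


def g(frame, arg1=0):
    return Frame(v + arg1 for v in frame.values)


def f(frame, arg2=1, arg3=0):
    return Frame(v * arg2 + arg3 for v in frame.values)


df = Frame([1, 2, 3])

# Nested calls: read inside-out, arguments far from their functions.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# Temporary variables: top to bottom, but with throwaway names.
r1 = h(df)
r2 = g(r1, arg1=1)
r3 = f(r2, arg2=2, arg3=3)

# .pipe: top to bottom, no temporaries, no new methods on Frame.
piped = (df.pipe(h)
           .pipe(g, arg1=1)
           .pipe(f, arg2=2, arg3=3))

print(nested.values, r3.values, piped.values)  # [9, 11, 13] three times
```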
A relevant thread on python-ideas, started by @mrocklin: https://mail.python.org/pipermail/python-ideas/2015-March/032745.html
This doesn't achieve everything macros could. We still can't do things like df.plot(x=x_col, y=y_col) where x_col and y_col are captured by df's namespace. But it may be good enough.
Going to cc a bunch of people here, who've had interest in the past.