Commit 2e1b427

DOC: Restructure and expand UDF page
1 parent c75171a commit 2e1b427

1 file changed

+125
-47
lines changed

doc/source/user_guide/user_defined_functions.rst

Lines changed: 125 additions & 47 deletions
@@ -96,15 +96,15 @@ User-Defined Functions can be applied across various pandas methods:
9696
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
9797
| :meth:`apply` (axis=1) | Row (Series) | Row (Series) | Apply a function to each row |
9898
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
99-
| :meth:`agg` | Series/DataFrame | Scalar or Series | Aggregate and summarizes values, e.g., sum or custom reducer |
99+
| :meth:`pipe` | Series or DataFrame | Series or DataFrame | Chain functions together to apply to Series or DataFrame |
100100
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
101-
| :meth:`transform` (axis=0) | Column (Series) | Column(Series) | Same as :meth:`apply` with (axis=0), but it raises an exception if the function changes the shape of the data |
101+
| :meth:`filter` | Series or DataFrame | Boolean | Only accepts UDFs in group by. Function is called for each group, and the group is removed from the result if the function returns ``False`` |
102102
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
103-
| :meth:`transform` (axis=1) | Row (Series) | Row (Series) | Same as :meth:`apply` with (axis=1), but it raises an exception if the function changes the shape of the data |
103+
| :meth:`agg` | Series or DataFrame | Scalar or Series | Aggregates and summarizes values, e.g., sum or custom reducer |
104104
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
105-
| :meth:`filter` | Series or DataFrame | Boolean | Only accepts UDFs in group by. Function is called for each group, and the group is removed from the result if the function returns ``False`` |
105+
| :meth:`transform` (axis=0) | Column (Series) | Column (Series) | Same as :meth:`apply` with (axis=0), but it raises an exception if the function changes the shape of the data |
106106
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
107-
| :meth:`pipe` | Series/DataFrame | Series/DataFrame | Chain functions together to apply to Series or Dataframe |
107+
| :meth:`transform` (axis=1) | Row (Series) | Row (Series) | Same as :meth:`apply` with (axis=1), but it raises an exception if the function changes the shape of the data |
108108
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
109109

110110
When applying UDFs in pandas, it is essential to select the appropriate method based
@@ -118,53 +118,108 @@ decisions, ensuring more efficient and maintainable code.
118118
and :ref:`ewm()<window>` for details.
119119

120120

121-
:meth:`DataFrame.apply`
122-
~~~~~~~~~~~~~~~~~~~~~~~
121+
:meth:`Series.map` and :meth:`DataFrame.map`
122+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
123123

124-
The :meth:`apply` method allows you to apply UDFs along either rows or columns. While flexible,
125-
it is slower than vectorized operations and should be used only when you need operations
126-
that cannot be achieved with built-in pandas functions.
124+
The :meth:`map` method is used specifically to apply element-wise UDFs. This means the function
125+
will be called for each element in the ``Series`` or ``DataFrame``, with the individual value or
126+
the cell as the function argument.
127127

128-
When to use: :meth:`apply` is suitable when no alternative vectorized method or UDF method is available,
129-
but consider optimizing performance with vectorized operations wherever possible.
128+
.. ipython:: python
130129
131-
:meth:`DataFrame.agg`
132-
~~~~~~~~~~~~~~~~~~~~~
130+
    temperature_celsius = pd.DataFrame({
131+
        "NYC": [14, 21, 23],
132+
        "Los Angeles": [22, 28, 31],
133+
    })
133134
134-
If you need to aggregate data, :meth:`agg` is a better choice than apply because it is
135-
specifically designed for aggregation operations.
135+
    def to_fahrenheit(value):
136+
        return value * (9 / 5) + 32
136137
137-
When to use: Use :meth:`agg` for performing custom aggregations, where the operation returns
138-
a scalar value on each input.
138+
    temperature_celsius.map(to_fahrenheit)
139139
140-
:meth:`DataFrame.transform`
141-
~~~~~~~~~~~~~~~~~~~~~~~~~~~
140+
In this example, the function ``to_fahrenheit`` will be called 6 times, once for each value
141+
in the ``DataFrame``. The result of each call will be returned in the corresponding cell
142+
of the resulting ``DataFrame``.
142143

143-
The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
144-
It is generally faster than apply because it can take advantage of pandas' internal optimizations.
144+
In general, ``map`` will be slow, as it does not make use of vectorization. Instead, it requires a
145+
Python function call for each value, which slows things down significantly when
146+
working with medium or large data.
145147

146-
When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.
148+
When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series.
147149

148-
.. code-block:: python
150+
:meth:`Series.apply` and :meth:`DataFrame.apply`
151+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
149152

150-
from sklearn.linear_model import LinearRegression
153+
The :meth:`apply` method allows you to apply UDFs to a whole column or row. This is different
154+
from :meth:`map` in that the function will be called for each column (or row), not for each individual value.
151155

152-
df = pd.DataFrame({
153-
'group': ['A', 'A', 'A', 'B', 'B', 'B'],
154-
'x': [1, 2, 3, 1, 2, 3],
155-
'y': [2, 4, 6, 1, 2, 1.5]
156-
}).set_index("x")
156+
.. ipython:: python
157157
158-
# Function to fit a model to each group
159-
def fit_model(group):
160-
x = group.index.to_frame()
161-
y = group
162-
model = LinearRegression()
163-
model.fit(x, y)
164-
pred = model.predict(x)
165-
return pred
158+
    temperature_celsius = pd.DataFrame({
159+
        "NYC": [14, 21, 23],
160+
        "Los Angeles": [22, 28, 31],
161+
    })
166162
167-
result = df.groupby('group').transform(fit_model)
163+
    def to_fahrenheit(column):
164+
        return column * (9 / 5) + 32
165+
166+
    temperature_celsius.apply(to_fahrenheit)
167+
168+
In the example, ``to_fahrenheit`` will be called only twice, as opposed to the 6 times with :meth:`map`.
169+
This will be faster than using :meth:`map`, since the operations for each column are vectorized, and the
170+
overhead of iterating over data in Python and calling Python functions is significantly reduced.
171+
172+
In some cases, the function requires the whole column (or row) to compute the result. In those cases
173+
:meth:`apply` is needed, since with :meth:`map` the function can only access one element at a time.
174+
175+
.. ipython:: python
176+
177+
    temperature = pd.DataFrame({
178+
        "NYC": [14, 21, 23],
179+
        "Los Angeles": [22, 28, 31],
180+
    })
181+
182+
    def normalize(column):
183+
        return column / column.mean()
184+
185+
    temperature.apply(normalize)
186+
187+
In the example, the ``normalize`` function needs to compute the mean of the whole column in order
188+
to divide each element by it. We therefore cannot call the function for each element; instead, the
189+
function needs to receive the whole column.
190+
191+
:meth:`apply` can also execute the function by row, by specifying ``axis=1``.
192+
193+
.. ipython:: python
194+
195+
    temperature = pd.DataFrame({
196+
        "NYC": [14, 21, 23],
197+
        "Los Angeles": [22, 28, 31],
198+
    })
199+
200+
    def hotter(row):
201+
        return row["Los Angeles"] - row["NYC"]
202+
203+
    temperature.apply(hotter, axis=1)
204+
205+
In the example, the function ``hotter`` will be called 3 times, once for each row. Each
206+
call will receive the whole row as the argument, allowing computations that require more than
207+
one value in the row.
208+
209+
``apply`` is also available as :meth:`SeriesGroupBy.apply`, :meth:`DataFrameGroupBy.apply`,
210+
:meth:`Rolling.apply`, :meth:`Expanding.apply` and :meth:`Resampler.apply`. You can read more
211+
about ``apply`` in groupby operations in :ref:`groupby.apply`.
212+
213+
When to use: :meth:`apply` is suitable when no alternative vectorized method or UDF method is available,
214+
but consider optimizing performance with vectorized operations wherever possible.
215+
216+
:meth:`DataFrame.pipe`
217+
~~~~~~~~~~~~~~~~~~~~~~
218+
219+
The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline.
220+
It is a helpful tool for organizing complex data processing workflows.
221+
222+
When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable.
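A pipeline of this kind might look like the following sketch. The helper functions ``to_fahrenheit`` and ``add_difference`` are hypothetical names chosen for illustration; each one takes a ``DataFrame`` as its first argument and returns a new ``DataFrame``, which is what allows the calls to be chained.

```python
import pandas as pd

temperature_celsius = pd.DataFrame({
    "NYC": [14, 21, 23],
    "Los Angeles": [22, 28, 31],
})


def to_fahrenheit(df):
    # Hypothetical step: convert every column from Celsius to Fahrenheit
    return df * (9 / 5) + 32


def add_difference(df, left, right):
    # Hypothetical step: add a column with the difference between two columns
    return df.assign(difference=df[left] - df[right])


# Each step receives the result of the previous one as its first argument;
# extra keyword arguments are forwarded to the function by pipe
result = (
    temperature_celsius
    .pipe(to_fahrenheit)
    .pipe(add_difference, left="Los Angeles", right="NYC")
)

print(result)
```

Written without ``pipe``, the same computation would be the nested call ``add_difference(to_fahrenheit(temperature_celsius), ...)``, which reads inside-out and becomes harder to follow as steps are added.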
168223

169224
:meth:`DataFrame.filter`
170225
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -199,20 +254,43 @@ When to use: Use :meth:`filter` when you want to use a UDF to create a subset of
199254
Since filter does not directly accept a UDF, you have to apply the UDF indirectly,
200255
for example, by using list comprehensions.
201256

202-
:meth:`DataFrame.map`
257+
:meth:`DataFrame.agg`
203258
~~~~~~~~~~~~~~~~~~~~~
204259

205-
The :meth:`map` method is used specifically to apply element-wise UDFs.
260+
If you need to aggregate data, :meth:`agg` is a better choice than :meth:`apply` because it is
261+
specifically designed for aggregation operations.
206262

207-
When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series.
263+
When to use: Use :meth:`agg` for performing custom aggregations, where the operation returns
264+
a scalar value on each input.
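As an illustration, the sketch below uses a custom reducer with :meth:`agg`. The function name ``value_range`` is a hypothetical example; it receives a whole column and returns a single scalar for it.

```python
import pandas as pd

temperature = pd.DataFrame({
    "NYC": [14, 21, 23],
    "Los Angeles": [22, 28, 31],
})


def value_range(column):
    # Custom reducer: collapses a whole column into one scalar
    return column.max() - column.min()


# The UDF is called once per column; the result has one value per column
result = temperature.agg(value_range)

print(result)
```

The result is a ``Series`` indexed by the original column names, with one aggregated value per column.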
208265

209-
:meth:`DataFrame.pipe`
210-
~~~~~~~~~~~~~~~~~~~~~~
266+
:meth:`DataFrame.transform`
267+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
211268

212-
The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline.
213-
It is a helpful tool for organizing complex data processing workflows.
269+
The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
270+
It is generally faster than :meth:`apply` because it can take advantage of pandas' internal optimizations.
214271

215-
When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable.
272+
When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.
273+
274+
.. code-block:: python
275+
276+
    from sklearn.linear_model import LinearRegression
277+
278+
    df = pd.DataFrame({
279+
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
280+
        'x': [1, 2, 3, 1, 2, 3],
281+
        'y': [2, 4, 6, 1, 2, 1.5]
282+
    }).set_index("x")
283+
284+
    # Function to fit a model to each group
285+
    def fit_model(group):
286+
        x = group.index.to_frame()
287+
        y = group
288+
        model = LinearRegression()
289+
        model.fit(x, y)
290+
        pred = model.predict(x)
291+
        return pred
292+
293+
    result = df.groupby('group').transform(fit_model)
216294
217295
218296
Performance
