Commit 8d6f0e0

Merge branch 'main' into issue-shareMemory
2 parents dfdf85c + ae8ea3e commit 8d6f0e0

12 files changed

+458
-25
lines changed

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -78,6 +78,7 @@ Guides
7878
boolean
7979
visualization
8080
style
81+
user_defined_functions
8182
groupby
8283
window
8384
timeseries
Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@
1+
.. _user_defined_functions:
2+
3+
{{ header }}
4+
5+
*****************************
6+
User-Defined Functions (UDFs)
7+
*****************************
8+
9+
In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s
10+
functionality by allowing users to apply custom computations to their data. While
11+
pandas comes with a set of built-in functions for data manipulation, UDFs offer
12+
flexibility when built-in methods are not sufficient. These functions can be
13+
applied at different levels: element-wise, row-wise, column-wise, or group-wise,
14+
and behave differently, depending on the method used.
15+
16+
Here’s a simple example to illustrate a UDF applied to a Series:
17+
18+
.. ipython:: python
19+
20+
s = pd.Series([1, 2, 3])
21+
22+
# Simple UDF that adds 1 to a value
23+
def add_one(x):
24+
return x + 1
25+
26+
# Apply the function element-wise using .map
27+
s.map(add_one)
28+
29+
You can also apply UDFs to an entire DataFrame. For example:
30+
31+
.. ipython:: python
32+
33+
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
34+
35+
# UDF that takes a row and returns the sum of columns A and B
36+
def sum_row(row):
37+
return row["A"] + row["B"]
38+
39+
# Apply the function row-wise (axis=1 means apply across columns per row)
40+
df.apply(sum_row, axis=1)
41+
42+
43+
Why Not To Use User-Defined Functions
44+
-------------------------------------
45+
46+
While UDFs provide flexibility, they come with significant drawbacks, primarily
47+
related to performance and behavior. When using UDFs, pandas must perform inference
48+
on the result, and that inference could be incorrect. Furthermore, unlike vectorized operations,
49+
UDFs are slower because pandas can't optimize their computations, leading to
50+
inefficient processing.
51+
52+
.. note::
53+
In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.
54+
55+
Despite their drawbacks, UDFs can be helpful when:
56+
57+
* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
58+
built-in methods cannot handle.
59+
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
60+
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.
61+
62+
For example:
63+
64+
.. code-block:: python
65+
66+
from sklearn.linear_model import LinearRegression
67+
68+
# Sample data
69+
df = pd.DataFrame({
70+
'group': ['A', 'A', 'A', 'B', 'B', 'B'],
71+
'x': [1, 2, 3, 1, 2, 3],
72+
'y': [2, 4, 6, 1, 2, 1.5]
73+
})
74+
75+
# Function to fit a model to each group
76+
def fit_model(group):
77+
model = LinearRegression()
78+
model.fit(group[['x']], group['y'])
79+
group['y_pred'] = model.predict(group[['x']])
80+
return group
81+
82+
result = df.groupby('group').apply(fit_model)
83+
84+
85+
Methods that support User-Defined Functions
86+
-------------------------------------------
87+
88+
User-Defined Functions can be applied across various pandas methods:
89+
90+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
91+
| Method | Function Input | Function Output | Description |
92+
+============================+========================+==========================+==============================================================================================================================================+
93+
| :meth:`map` | Scalar | Scalar | Apply a function to each element |
94+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
95+
| :meth:`apply` (axis=0) | Column (Series) | Column (Series) | Apply a function to each column |
96+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
97+
| :meth:`apply` (axis=1) | Row (Series) | Row (Series) | Apply a function to each row |
98+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
99+
| :meth:`agg`                | Series/DataFrame       | Scalar or Series         | Aggregate and summarize values, e.g., sum or custom reducer                                                                                  |
100+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
101+
| :meth:`transform` (axis=0) | Column (Series)        | Column (Series)          | Same as :meth:`apply` with (axis=0), but it raises an exception if the function changes the shape of the data                                |
102+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
103+
| :meth:`transform` (axis=1) | Row (Series) | Row (Series) | Same as :meth:`apply` with (axis=1), but it raises an exception if the function changes the shape of the data |
104+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
105+
| :meth:`filter` | Series or DataFrame | Boolean | Only accepts UDFs in group by. Function is called for each group, and the group is removed from the result if the function returns ``False`` |
106+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
107+
| :meth:`pipe`               | Series/DataFrame       | Series/DataFrame         | Chain functions together to apply to Series or DataFrame                                                                                     |
108+
+----------------------------+------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
109+
110+
When applying UDFs in pandas, it is essential to select the appropriate method based
111+
on your specific task. Each method has its strengths and is designed for different use
112+
cases. Understanding the purpose and behavior of each method will help you make informed
113+
decisions, ensuring more efficient and maintainable code.
114+
115+
.. note::
116+
Some of these methods can also be applied to groupby, resample, and various window objects.
117+
See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`,
118+
and :ref:`ewm()<window>` for details.
119+
120+
121+
:meth:`DataFrame.apply`
122+
~~~~~~~~~~~~~~~~~~~~~~~
123+
124+
The :meth:`apply` method allows you to apply UDFs along either rows or columns. While flexible,
125+
it is slower than vectorized operations and should be used only when you need operations
126+
that cannot be achieved with built-in pandas functions.
127+
128+
When to use: :meth:`apply` is suitable when no alternative vectorized method or UDF method is available,
129+
but consider optimizing performance with vectorized operations wherever possible.
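As a small sketch of the trade-off described above (the column names ``price`` and ``qty`` are made up for illustration), a row-wise UDF passed to :meth:`apply` and the equivalent vectorized expression give the same values:

```python
import pandas as pd

# Hypothetical data: column names are illustrative only
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Row-wise UDF: receives each row as a Series
def line_total(row):
    return row["price"] * row["qty"]

# Calls line_total once per row (slow on large frames)
via_apply = df.apply(line_total, axis=1)

# Same computation on whole columns at once
vectorized = df["price"] * df["qty"]
```

When a vectorized form like the last line exists, it is usually the better choice; the UDF version is worth keeping only when the logic cannot be expressed on whole columns.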
130+
131+
:meth:`DataFrame.agg`
132+
~~~~~~~~~~~~~~~~~~~~~
133+
134+
If you need to aggregate data, :meth:`agg` is a better choice than apply because it is
135+
specifically designed for aggregation operations.
136+
137+
When to use: Use :meth:`agg` for performing custom aggregations, where the operation returns
138+
a scalar value on each input.
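For instance, a custom reducer that returns one scalar per column might look like the following sketch (``value_range`` is a hypothetical helper, not a pandas function):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# Custom aggregation: reduces a column (Series) to a single scalar
def value_range(col):
    return col.max() - col.min()

# One scalar per column, returned as a Series indexed by column name
ranges = df.agg(value_range)
```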
139+
140+
:meth:`DataFrame.transform`
141+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
142+
143+
The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
144+
It is generally faster than apply because it can take advantage of pandas' internal optimizations.
145+
146+
When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.
147+
148+
.. code-block:: python
149+
150+
from sklearn.linear_model import LinearRegression
151+
152+
df = pd.DataFrame({
153+
'group': ['A', 'A', 'A', 'B', 'B', 'B'],
154+
'x': [1, 2, 3, 1, 2, 3],
155+
'y': [2, 4, 6, 1, 2, 1.5]
156+
}).set_index("x")
157+
158+
# Function to fit a model to each group
159+
def fit_model(group):
160+
x = group.index.to_frame()
161+
y = group
162+
model = LinearRegression()
163+
model.fit(x, y)
164+
pred = model.predict(x)
165+
return pred
166+
167+
result = df.groupby('group').transform(fit_model)
168+
169+
:meth:`DataFrame.filter`
170+
~~~~~~~~~~~~~~~~~~~~~~~~
171+
172+
The :meth:`filter` method is used to select subsets of the DataFrame’s
173+
columns or rows. It is useful when you want to extract specific columns or rows that
174+
match particular conditions.
175+
176+
When to use: Use :meth:`filter` when you want to use a UDF to create a subset of a DataFrame or Series.
177+
178+
.. note::
179+
:meth:`DataFrame.filter` does not accept UDFs, but can accept
180+
list comprehensions that have UDFs applied to them.
181+
182+
.. ipython:: python
183+
184+
# Sample DataFrame
185+
df = pd.DataFrame({
186+
'AA': [1, 2, 3],
187+
'BB': [4, 5, 6],
188+
'C': [7, 8, 9],
189+
'D': [10, 11, 12]
190+
})
191+
192+
# Function that selects columns whose name is longer than 1 character
193+
def is_long_name(column_name):
194+
return len(column_name) > 1
195+
196+
df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
197+
print(df_filtered)
198+
199+
Since filter does not directly accept a UDF, you have to apply the UDF indirectly,
200+
for example, by using list comprehensions.
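By contrast, the groupby version of ``filter`` listed in the table above does take a UDF directly. A minimal sketch, with made-up data and a hypothetical predicate ``has_large_mean``:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "value": [1, 2, 10, 20, 30],
})

# Called once per group; the group is dropped when this returns False
def has_large_mean(group):
    return group["value"].mean() > 5

# Keeps only the rows belonging to groups that pass the predicate
kept = df.groupby("group").filter(has_large_mean)
```

Here group ``A`` (mean 1.5) is removed and all rows of group ``B`` (mean 20) are kept with their original index labels.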
201+
202+
:meth:`DataFrame.map`
203+
~~~~~~~~~~~~~~~~~~~~~
204+
205+
The :meth:`map` method is used specifically to apply element-wise UDFs.
206+
207+
When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series.
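A short sketch of an element-wise UDF on a whole DataFrame follows. It assumes pandas 2.1 or later for :meth:`DataFrame.map`; on older versions the same method is named ``applymap``, so the example falls back to it. ``describe_parity`` is a hypothetical function:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Element-wise UDF: receives one scalar, returns one scalar
def describe_parity(value):
    return "even" if value % 2 == 0 else "odd"

# DataFrame.map was added in pandas 2.1; older versions call it applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
labels = elementwise(describe_parity)
```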
208+
209+
:meth:`DataFrame.pipe`
210+
~~~~~~~~~~~~~~~~~~~~~~
211+
212+
The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline.
213+
It is a helpful tool for organizing complex data processing workflows.
214+
215+
When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable.
216+
217+
218+
Performance
219+
-----------
220+
221+
While UDFs provide flexibility, their use is generally discouraged as they can introduce
222+
performance issues, especially when written in pure Python. To improve efficiency,
223+
consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs
224+
for common operations.
225+
226+
.. note::
227+
If performance is critical, explore **vectorized operations** before resorting
228+
to UDFs.
229+
230+
Vectorized Operations
231+
~~~~~~~~~~~~~~~~~~~~~
232+
233+
Below is a comparison of using UDFs versus using Vectorized Operations:
234+
235+
.. code-block:: python
236+
237+
# User-defined function
238+
def calc_ratio(row):
239+
return 100 * (row["one"] / row["two"])
240+
241+
df["new_col"] = df.apply(calc_ratio, axis=1)
242+
243+
# Vectorized Operation
244+
df["new_col2"] = 100 * (df["one"] / df["two"])
245+
246+
Measuring how long each operation takes:
247+
248+
.. code-block:: text
249+
250+
User-defined function: 5.6435 secs
251+
Vectorized: 0.0043 secs
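One way to take such a measurement is with the standard-library ``timeit`` module. The sketch below uses an assumed frame of 10,000 random rows, which is not the data behind the figures above, so the absolute numbers will differ by machine and data size:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 10,000 rows of random floats (size chosen arbitrarily)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "one": rng.random(10_000),
    "two": rng.random(10_000) + 1.0,  # shifted to avoid division by ~0
})

def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

# Time a single run of each approach
udf_secs = timeit.timeit(lambda: df.apply(calc_ratio, axis=1), number=1)
vec_secs = timeit.timeit(lambda: 100 * (df["one"] / df["two"]), number=1)

print(f"User-defined function: {udf_secs:.4f} secs")
print(f"Vectorized: {vec_secs:.4f} secs")
```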
252+
253+
Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply`
254+
with UDFs because they leverage highly optimized C functions
255+
via ``NumPy`` to process entire arrays at once. This approach avoids the overhead of looping
256+
through rows in Python and making separate function calls for each row, which is slow and
257+
inefficient. Additionally, ``NumPy`` arrays benefit from memory efficiency and CPU-level
258+
optimizations, making vectorized operations the preferred choice whenever possible.
259+
260+
261+
Improving Performance with UDFs
262+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
263+
264+
In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks.
265+
One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical
266+
Python code by compiling Python functions to optimized machine code at runtime.
267+
268+
By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations,
269+
especially for computationally heavy tasks.
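A minimal sketch of this idea is shown below. It uses ``numba.njit`` (shorthand for ``jit`` in nopython mode) on a function that works on NumPy arrays rather than on pandas objects. The fallback decorator is an assumption added only so the example still runs where Numba is not installed; ``ratio_loop`` is a hypothetical function:

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: run as plain Python if Numba is missing
    def njit(func):
        return func

@njit
def ratio_loop(one, two):
    # Explicit loop over NumPy arrays; Numba compiles this to machine code
    out = np.empty(one.shape[0])
    for i in range(one.shape[0]):
        out[i] = 100.0 * one[i] / two[i]
    return out

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 8.0]})

# Pass the underlying arrays, not the Series, to the compiled function
df["ratio"] = ratio_loop(df["one"].to_numpy(), df["two"].to_numpy())
```

The first call includes compilation time, so any benefit shows up on repeated calls or large arrays.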
270+
271+
.. note::
272+
You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_
273+
for a more detailed guide to using **Numba**.
274+
275+
Using :meth:`DataFrame.pipe` for Composable Logic
276+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
277+
278+
Another useful pattern for improving readability and composability, especially when mixing
279+
vectorized logic with UDFs, is to use the :meth:`DataFrame.pipe` method.
280+
281+
:meth:`DataFrame.pipe` doesn't improve performance directly, but it enables cleaner
282+
method chaining by passing the entire object into a function. This is especially helpful
283+
when chaining custom transformations:
284+
285+
.. code-block:: python
286+
287+
def add_ratio_column(df):
288+
df["ratio"] = 100 * (df["one"] / df["two"])
289+
return df
290+
291+
df = (
292+
df
293+
.query("one > 0")
294+
.pipe(add_ratio_column)
295+
.dropna()
296+
)
297+
298+
This is functionally equivalent to calling ``add_ratio_column(df)``, but keeps your code
299+
clean and composable. The function you pass to :meth:`DataFrame.pipe` can use vectorized operations,
300+
row-wise UDFs, or any other logic; :meth:`DataFrame.pipe` is agnostic.
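:meth:`DataFrame.pipe` also forwards extra positional and keyword arguments to the function. The sketch below is a variation on the example above: it adds a hypothetical ``scale`` parameter and uses ``assign`` so the input frame is not modified in place:

```python
import pandas as pd

# Returns a new frame instead of mutating its input
def add_ratio_column(df, scale=100):
    return df.assign(ratio=scale * (df["one"] / df["two"]))

df = pd.DataFrame({"one": [1.0, 0.0, 3.0], "two": [2.0, 4.0, 6.0]})

result = (
    df
    .query("one > 0")
    .pipe(add_ratio_column, scale=1)  # extra keyword is passed through
)
```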
301+
302+
.. note::
303+
While :meth:`DataFrame.pipe` does not improve performance on its own,
304+
it promotes clean, modular design and allows both vectorized and UDF-based logic
305+
to be composed in method chains.

doc/source/whatsnew/v2.3.0.rst

Lines changed: 13 additions & 3 deletions
@@ -50,10 +50,20 @@ Notable bug fixes
5050

5151
These are bug fixes that might have notable behavior changes.
5252

53-
.. _whatsnew_230.notable_bug_fixes.notable_bug_fix1:
53+
.. _whatsnew_230.notable_bug_fixes.string_comparisons:
5454

55-
notable_bug_fix1
56-
^^^^^^^^^^^^^^^^
55+
Comparisons between different string dtypes
56+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
57+
58+
In previous versions, comparing Series of different string dtypes (e.g. ``pd.StringDtype("pyarrow", na_value=pd.NA)`` against ``pd.StringDtype("python", na_value=np.nan)``) would produce an inconsistent result dtype or incorrectly raise. pandas will now use the hierarchy
59+
60+
object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
61+
62+
in determining the result dtype when different string dtypes are compared. Some examples:
63+
64+
- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.
65+
- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("pyarrow", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
66+
- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("python", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
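The ordering can be illustrated with a small pure-Python sketch. The names ``HIERARCHY`` and ``winning_dtype`` are hypothetical and are not part of pandas; the code only restates the priority list above and does not perform a real comparison:

```python
# Priority order copied from the hierarchy above, lowest first
HIERARCHY = [
    "object",
    "(python, NaN)",
    "(pyarrow, NaN)",
    "(python, NA)",
    "(pyarrow, NA)",
]

def winning_dtype(left, right):
    # The operand that sits higher in the hierarchy decides the result dtype
    return max(left, right, key=HIERARCHY.index)

print(winning_dtype("(pyarrow, NaN)", "(python, NA)"))
```

In this illustration ``(python, NA)`` outranks ``(pyarrow, NaN)``, which matches the second bullet: the result follows the NA-semantics operand.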
5767

5868
.. _whatsnew_230.api_changes:
5969

doc/source/whatsnew/v3.0.0.rst

Lines changed: 1 addition & 0 deletions
@@ -904,6 +904,7 @@ Other
904904
- Bug in ``Series.list`` methods not preserving the original name. (:issue:`60522`)
905905
- Bug in printing a :class:`DataFrame` with a :class:`DataFrame` stored in :attr:`DataFrame.attrs` raised a ``ValueError`` (:issue:`60455`)
906906
- Bug in printing a :class:`Series` with a :class:`DataFrame` stored in :attr:`Series.attrs` raised a ``ValueError`` (:issue:`60568`)
907+
- Fixed bug where the :class:`DataFrame` constructor misclassified array-like objects with a ``.name`` attribute as :class:`Series` or :class:`Index` (:issue:`61443`)
907908
- Fixed regression in :meth:`DataFrame.from_records` not initializing subclasses properly (:issue:`57008`)
908909

909910
.. ***DO NOT USE THIS SECTION***

pandas/core/arrays/arrow/array.py

Lines changed: 1 addition & 4 deletions
@@ -33,7 +33,6 @@
3333
infer_dtype_from_scalar,
3434
)
3535
from pandas.core.dtypes.common import (
36-
CategoricalDtype,
3736
is_array_like,
3837
is_bool_dtype,
3938
is_float_dtype,
@@ -730,9 +729,7 @@ def __setstate__(self, state) -> None:
730729

731730
def _cmp_method(self, other, op) -> ArrowExtensionArray:
732731
pc_func = ARROW_CMP_FUNCS[op.__name__]
733-
if isinstance(
734-
other, (ArrowExtensionArray, np.ndarray, list, BaseMaskedArray)
735-
) or isinstance(getattr(other, "dtype", None), CategoricalDtype):
732+
if isinstance(other, (ExtensionArray, np.ndarray, list)):
736733
try:
737734
result = pc_func(self._pa_array, self._box_pa(other))
738735
except pa.ArrowNotImplementedError:

pandas/core/arrays/string_.py

Lines changed: 24 additions & 1 deletion
@@ -1015,7 +1015,30 @@ def searchsorted(
10151015
return super().searchsorted(value=value, side=side, sorter=sorter)
10161016

10171017
def _cmp_method(self, other, op):
1018-
from pandas.arrays import BooleanArray
1018+
from pandas.arrays import (
1019+
ArrowExtensionArray,
1020+
BooleanArray,
1021+
)
1022+
1023+
if (
1024+
isinstance(other, BaseStringArray)
1025+
and self.dtype.na_value is not libmissing.NA
1026+
and other.dtype.na_value is libmissing.NA
1027+
):
1028+
# NA has priority over NaN semantics
1029+
return NotImplemented
1030+
1031+
if isinstance(other, ArrowExtensionArray):
1032+
if isinstance(other, BaseStringArray):
1033+
# pyarrow storage has priority over python storage
1034+
# (except if we have NA semantics and other not)
1035+
if not (
1036+
self.dtype.na_value is libmissing.NA
1037+
and other.dtype.na_value is not libmissing.NA
1038+
):
1039+
return NotImplemented
1040+
else:
1041+
return NotImplemented
10191042

10201043
if isinstance(other, StringArray):
10211044
other = other._ndarray
