From 6857e0a20d2001cf6585bb312b7c5667bc690f39 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler Date: Sat, 30 Sep 2023 00:13:41 +0100 Subject: [PATCH 1/5] DOC: Adjust user guide for CoW docs --- doc/source/user_guide/copy_on_write.rst | 128 ++++++++++++++---------- 1 file changed, 73 insertions(+), 55 deletions(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 59bdb1926895f..88acd0bd43b49 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -7,8 +7,8 @@ Copy-on-Write (CoW) ******************* Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the -optimizations that become possible through CoW are implemented and supported. A complete list -can be found at :ref:`Copy-on-Write optimizations `. +optimizations that become possible through CoW are implemented and supported. All possible +optimizations are supported starting from pandas 2.1. We expect that CoW will be enabled by default in version 3.0. @@ -154,6 +154,75 @@ With copy on write this can be done by using ``loc``. df.loc[df["bar"] > 5, "foo"] = 100 +Read-only NumPy arrays +---------------------- + +Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array +shares data with the initial DataFrame: + +The array is a copy if the initial DataFrame consists of more than one array: + + +.. ipython:: python + + df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]}) + df.to_numpy() + +The array shares data with the DataFrame if the DataFrame consists of only one NumPy array: + +.. ipython:: python + + df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}) + df.to_numpy() + +This array is read-only, which means that it can't be modified inplace: + +.. ipython:: python + :okexcept: + + arr = df.to_numpy() + arr[0, 0] = 100 + +There are two potential solution to this: + +- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array. +- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so + it should be used with caution. + +.. ipython:: python + + arr = df.to_numpy() + arr.flags.writeable = True + arr[0, 0] = 100 + arr + +Patterns to avoid +----------------- + +No defensiv copy will be performed if two objects share the same data while +you are modifying one object inplace. + +.. ipython:: python + + df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + df2 = df.reset_index() + df2.iloc[0, 0] = 100 + +This creates two objects that share data and thus the setitem operation will trigger a +copy. This is not necessary if you the initial object ``df`` isn't needed anymore. +Simply reassigning to the same variable will invalidate the reference that is +held by the object. + +.. ipython:: python + + df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + df = df.reset_index() + df.iloc[0, 0] = 100 + +No copy is necessary in this example. +Creating multiple references in the same method keeps unnecessary references alive +and thus will hurt performance with Copy-on-Write. + .. _copy_on_write.optimizations: Copy-on-Write optimizations @@ -161,59 +230,8 @@ Copy-on-Write optimizations A new lazy copy mechanism that defers the copy until the object in question is modified and only if this object shares data with another object. This mechanism was added to -following methods: - - - :meth:`DataFrame.reset_index` / :meth:`Series.reset_index` - - :meth:`DataFrame.set_index` - - :meth:`DataFrame.set_axis` / :meth:`Series.set_axis` - - :meth:`DataFrame.set_flags` / :meth:`Series.set_flags` - - :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis` - - :meth:`DataFrame.reindex` / :meth:`Series.reindex` - - :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` - - :meth:`DataFrame.assign` - - :meth:`DataFrame.drop` - - :meth:`DataFrame.dropna` / :meth:`Series.dropna` - - :meth:`DataFrame.select_dtypes` - - :meth:`DataFrame.align` / :meth:`Series.align` - - :meth:`Series.to_frame` - - :meth:`DataFrame.rename` / :meth:`Series.rename` - - :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix` - - :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix` - - :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates` - - :meth:`DataFrame.droplevel` / :meth:`Series.droplevel` - - :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels` - - :meth:`DataFrame.between_time` / :meth:`Series.between_time` - - :meth:`DataFrame.filter` / :meth:`Series.filter` - - :meth:`DataFrame.head` / :meth:`Series.head` - - :meth:`DataFrame.tail` / :meth:`Series.tail` - - :meth:`DataFrame.isetitem` - - :meth:`DataFrame.pipe` / :meth:`Series.pipe` - - :meth:`DataFrame.pop` / :meth:`Series.pop` - - :meth:`DataFrame.replace` / :meth:`Series.replace` - - :meth:`DataFrame.shift` / :meth:`Series.shift` - - :meth:`DataFrame.sort_index` / :meth:`Series.sort_index` - - :meth:`DataFrame.sort_values` / :meth:`Series.sort_values` - - :meth:`DataFrame.squeeze` / :meth:`Series.squeeze` - - :meth:`DataFrame.swapaxes` - - :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel` - - :meth:`DataFrame.take` / :meth:`Series.take` - - :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp` - - :meth:`DataFrame.to_period` / :meth:`Series.to_period` - - :meth:`DataFrame.truncate` - - :meth:`DataFrame.iterrows` - - :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize` - - :meth:`DataFrame.fillna` / :meth:`Series.fillna` - - :meth:`DataFrame.interpolate` / :meth:`Series.interpolate` - - :meth:`DataFrame.ffill` / :meth:`Series.ffill` - - :meth:`DataFrame.bfill` / :meth:`Series.bfill` - - :meth:`DataFrame.where` / :meth:`Series.where` - - :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects` - - :meth:`DataFrame.astype` / :meth:`Series.astype` - - :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes` - - :meth:`DataFrame.join` - - :meth:`DataFrame.eval` - - :func:`concat` - - :func:`merge` +methods that don't require a copy of the underlying data. Popular examples are :meth:`DataFrame.drop` for ``axis=1`` +and :meth:`DataFrame.rename`. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution. From 70c8846d286419be887201da94de86f18cd8ab50 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 3 Oct 2023 11:49:28 +0200 Subject: [PATCH 2/5] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 88acd0bd43b49..4f5bac15db8f8 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -199,7 +199,7 @@ There are two potential solution to this: Patterns to avoid ----------------- -No defensiv copy will be performed if two objects share the same data while +No defensive copy will be performed if two objects share the same data while you are modifying one object inplace. .. ipython:: python From 1b5396dc06fb98488750830828c076333573122a Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 3 Oct 2023 11:49:41 +0200 Subject: [PATCH 3/5] Update doc/source/user_guide/copy_on_write.rst Co-authored-by: Joris Van den Bossche --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 4f5bac15db8f8..1bf4e35c89cfd 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -209,7 +209,7 @@ you are modifying one object inplace. df2.iloc[0, 0] = 100 This creates two objects that share data and thus the setitem operation will trigger a -copy. This is not necessary if you the initial object ``df`` isn't needed anymore. +copy. This is not necessary if the initial object ``df`` isn't needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object. From 521a4989c5362481db3a7156b02986fd9bbf7145 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 3 Oct 2023 11:50:13 +0200 Subject: [PATCH 4/5] Update copy_on_write.rst --- doc/source/user_guide/copy_on_write.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 1bf4e35c89cfd..94545015b9492 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -220,7 +220,7 @@ held by the object. df.iloc[0, 0] = 100 No copy is necessary in this example. -Creating multiple references in the same method keeps unnecessary references alive +Creating multiple references keeps unnecessary references alive and thus will hurt performance with Copy-on-Write. .. _copy_on_write.optimizations: From cf9307329a6b40c320bed622c268738f3bc63ad6 Mon Sep 17 00:00:00 2001 From: Patrick Hoefler <61934744+phofl@users.noreply.github.com> Date: Tue, 3 Oct 2023 11:51:22 +0200 Subject: [PATCH 5/5] Update copy_on_write.rst --- doc/source/user_guide/copy_on_write.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst index 94545015b9492..d0c57b56585db 100644 --- a/doc/source/user_guide/copy_on_write.rst +++ b/doc/source/user_guide/copy_on_write.rst @@ -183,6 +183,8 @@ This array is read-only, which means that it can't be modified inplace: arr = df.to_numpy() arr[0, 0] = 100 +The same holds true for a Series, since a Series always consists of a single array. + There are two potential solution to this: - Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.