diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst index 9d645c185d3b5..e0e752099b77a 100644 --- a/doc/source/user_guide/missing_data.rst +++ b/doc/source/user_guide/missing_data.rst @@ -6,350 +6,462 @@ Working with missing data ************************* -In this section, we will discuss missing (also referred to as NA) values in -pandas. +Values considered "missing" +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. note:: +pandas uses different sentinel values to represent a missing value (also referred to as NA) +depending on the data type. - The choice of using ``NaN`` internally to denote missing data was largely - for simplicity and performance reasons. - Starting from pandas 1.0, some optional data types start experimenting - with a native ``NA`` scalar using a mask-based approach. See - :ref:`here ` for more. +``numpy.nan`` for NumPy data types. The disadvantage of using NumPy data types +is that the original data type will be coerced to ``np.float64`` or ``object``. -See the :ref:`cookbook` for some advanced strategies. +.. ipython:: python -Values considered "missing" -~~~~~~~~~~~~~~~~~~~~~~~~~~~ + pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2]) + pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2]) + +:class:`NaT` for NumPy ``np.datetime64``, ``np.timedelta64``, and :class:`PeriodDtype`. For typing applications, +use :class:`api.types.NaTType`. -As data comes in many shapes and forms, pandas aims to be flexible with regard -to handling missing data. While ``NaN`` is the default missing value marker for -reasons of computational speed and convenience, we need to be able to easily -detect this value with data of different types: floating point, integer, -boolean, and general object. In many cases, however, the Python ``None`` will -arise and we wish to also consider that "missing" or "not available" or "NA". +.. 
ipython:: python + + pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2]) + pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2]) + pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2]) -.. _missing.isna: +:class:`NA` for :class:`StringDtype`, :class:`Int64Dtype` (and other bit widths), +:class:`Float64Dtype` (and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`. +These types will maintain the original data type of the data. +For typing applications, use :class:`api.types.NAType`. .. ipython:: python - df = pd.DataFrame( - np.random.randn(5, 3), - index=["a", "c", "e", "f", "h"], - columns=["one", "two", "three"], - ) - df["four"] = "bar" - df["five"] = df["one"] > 0 - df - df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"]) - df2 + pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2]) + pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2]) -To make detecting missing values easier (and across different array dtypes), -pandas provides the :func:`isna` and -:func:`notna` functions, which are also methods on -Series and DataFrame objects: +To detect these missing values, use the :func:`isna` or :func:`notna` methods. .. ipython:: python - df2["one"] - pd.isna(df2["one"]) - df2["four"].notna() - df2.isna() + ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT]) + ser + pd.isna(ser) + + +.. note:: + + :func:`isna` or :func:`notna` will also consider ``None`` a missing value. + + .. ipython:: python + + ser = pd.Series([1, None], dtype=object) + ser + pd.isna(ser) .. warning:: - One has to be mindful that in Python (and NumPy), the ``nan's`` don't compare equal, but ``None's`` **do**. - Note that pandas/NumPy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``. + Equality comparisons between ``np.nan``, :class:`NaT`, and :class:`NA` + do not act like ``None`` .. 
ipython:: python None == None # noqa: E711 np.nan == np.nan + pd.NaT == pd.NaT + pd.NA == pd.NA - So as compared to above, a scalar equality comparison versus a ``None/np.nan`` doesn't provide useful information. + Therefore, an equality comparison between a :class:`DataFrame` or :class:`Series` + with one of these missing values does not provide the same information as + :func:`isna` or :func:`notna`. .. ipython:: python - df2["one"] == np.nan + ser = pd.Series([True, None], dtype="boolean[pyarrow]") + ser == pd.NA + pd.isna(ser) + -Integer dtypes and missing data -------------------------------- +.. _missing_data.NA: + +:class:`NA` semantics +~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: -Because ``NaN`` is a float, a column of integers with even one missing values -is cast to floating-point dtype (see :ref:`gotchas.intna` for more). pandas -provides a nullable integer array, which can be used by explicitly requesting -the dtype: + Experimental: the behaviour of :class:`NA` can still change without warning. + +Starting from pandas 1.0, an experimental :class:`NA` value (singleton) is +available to represent scalar missing values. The goal of :class:`NA` is to provide a +"missing" indicator that can be used consistently across data types +(instead of ``np.nan``, ``None`` or ``pd.NaT`` depending on the data type). + +For example, when having missing values in a :class:`Series` with the nullable integer +dtype, it will use :class:`NA`: .. ipython:: python - pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype()) + s = pd.Series([1, 2, None], dtype="Int64") + s + s[2] + s[2] is pd.NA -Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be -used. +Currently, pandas does not yet use those data types using :class:`NA` by default +in a :class:`DataFrame` or :class:`Series`, so you need to specify +the dtype explicitly. An easy way to convert to those dtypes is explained in the +:ref:`conversion section `. -See :ref:`integer_na` for more. 
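As a rough illustration of why the nullable dtypes have to be requested explicitly, the sketch below contrasts a NumPy-backed integer ``Series`` with an ``Int64`` one after a reindex introduces a missing label. It is not part of the patch; the variable names ``numpy_backed`` and ``nullable`` are made up for this example.

```python
import numpy as np
import pandas as pd

# NumPy int64 cannot hold a missing value, so the reindex upcasts to float64
# and fills the new label with np.nan.
numpy_backed = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])

# The nullable Int64 dtype keeps the integer dtype and marks the gap with pd.NA.
nullable = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

print(numpy_backed.dtype)
print(nullable.dtype)
print(nullable.loc[2] is pd.NA)

# isna treats both sentinels as missing.
print(pd.isna(numpy_backed).tolist())
print(pd.isna(nullable).tolist())
```

Both objects report the same missing position through ``isna``, but only the ``Int64`` version preserves the integer dtype.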
+Propagation in arithmetic and comparison operations +--------------------------------------------------- -Datetimes ---------- -.. note:: - If you are adding type checking to your application, you may need access to ``NaTType`` and ``NAType``. +In general, missing values *propagate* in operations involving :class:`NA`. When +one of the operands is unknown, the outcome of the operation is also unknown. -For datetime64[ns] types, ``NaT`` represents missing values. This is a pseudo-native -sentinel value that can be represented by NumPy in a singular dtype (datetime64[ns]). -pandas objects provide compatibility between ``NaT`` and ``NaN``. +For example, :class:`NA` propagates in arithmetic operations, similarly to +``np.nan``: .. ipython:: python - df2 = df.copy() - df2["timestamp"] = pd.Timestamp("20120101") - df2 - df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan - df2 - df2.dtypes.value_counts() + pd.NA + 1 + "a" * pd.NA -.. _missing.inserting: +There are a few special cases when the result is known, even when one of the +operands is ``NA``. -Inserting missing data -~~~~~~~~~~~~~~~~~~~~~~ +.. ipython:: python -You can insert missing values by simply assigning to containers. The -actual missing value used will be chosen based on the dtype. + pd.NA ** 0 + 1 ** pd.NA -For example, numeric containers will always use ``NaN`` regardless of -the missing value type chosen: +In equality and comparison operations, :class:`NA` also propagates. This deviates +from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always +return ``False``. .. ipython:: python - s = pd.Series([1., 2., 3.]) - s.loc[0] = None - s - -Likewise, datetime containers will always use ``NaT``. + pd.NA == 1 + pd.NA == pd.NA + pd.NA < 2.5 -For object containers, pandas will use the value given: +To check if a value is equal to :class:`NA`, use :func:`isna` .. ipython:: python - s = pd.Series(["a", "b", "c"]) - s.loc[0] = None - s.loc[1] = np.nan - s + pd.isna(pd.NA) -.. 
_missing_data.calculations: -Calculations with missing data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. note:: + + An exception on this basic propagation rule are *reductions* (such as the + mean or the minimum), where pandas defaults to skipping missing values. See the + :ref:`calculation section ` for more. + +Logical operations +------------------ + +For logical operations, :class:`NA` follows the rules of the +`three-valued logic `__ (or +*Kleene logic*, similarly to R, SQL and Julia). This logic means to only +propagate missing values when it is logically required. -Missing values propagate naturally through arithmetic operations between pandas -objects. +For example, for the logical "or" operation (``|``), if one of the operands +is ``True``, we already know the result will be ``True``, regardless of the +other value (so regardless the missing value would be ``True`` or ``False``). +In this case, :class:`NA` does not propagate: .. ipython:: python - df = df2.loc[:, ["one", "two", "three"]] - a = df2.loc[df2.index[:5], ["one", "two"]].ffill() - b = df2.loc[df2.index[:5], ["one", "two", "three"]] - a - b - a + b + True | False + True | pd.NA + pd.NA | True -The descriptive statistics and computational methods discussed in the -:ref:`data structure overview ` (and listed :ref:`here -` and :ref:`here `) are all written to -account for missing data. For example: +On the other hand, if one of the operands is ``False``, the result depends +on the value of the other operand. Therefore, in this case :class:`NA` +propagates: -* When summing data, NA (missing) values will be treated as zero. -* If the data are all NA, the result will be 0. -* Cumulative methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod` ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use ``skipna=False``. +.. 
ipython:: python + + False | True + False | False + False | pd.NA + +The behaviour of the logical "and" operation (``&``) can be derived using +similar logic (where now :class:`NA` will not propagate if one of the operands +is already ``False``): .. ipython:: python - df - df["one"].sum() - df.mean(1) - df.cumsum() - df.cumsum(skipna=False) + False & True + False & False + False & pd.NA +.. ipython:: python -.. _missing_data.numeric_sum: + True & True + True & False + True & pd.NA -Sum/prod of empties/nans -~~~~~~~~~~~~~~~~~~~~~~~~ -The sum of an empty or all-NA Series or column of a DataFrame is 0. +``NA`` in a boolean context +--------------------------- + +Since the actual value of an NA is unknown, it is ambiguous to convert NA +to a boolean value. .. ipython:: python + :okexcept: - pd.Series([np.nan]).sum() + bool(pd.NA) - pd.Series([], dtype="float64").sum() +This also means that :class:`NA` cannot be used in a context where it is +evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can +potentially be :class:`NA`. In such cases, :func:`isna` can be used to check +for :class:`NA` or ``condition`` being :class:`NA` can be avoided, for example by +filling missing values beforehand. + +A similar situation occurs when using :class:`Series` or :class:`DataFrame` objects in ``if`` +statements, see :ref:`gotchas.truth`. + +NumPy ufuncs +------------ -The product of an empty or all-NA Series or column of a DataFrame is 1. +:attr:`pandas.NA` implements NumPy's ``__array_ufunc__`` protocol. Most ufuncs +work with ``NA``, and generally return ``NA``: .. ipython:: python - pd.Series([np.nan]).prod() + np.log(pd.NA) + np.add(pd.NA, 1) - pd.Series([], dtype="float64").prod() +.. warning:: + Currently, ufuncs involving an ndarray and ``NA`` will return an + object-dtype filled with NA values. -NA values in GroupBy -~~~~~~~~~~~~~~~~~~~~ + .. 
ipython:: python + + a = np.array([1, 2, 3]) + np.greater(a, pd.NA) + + The return type here may change to return a different array type + in the future. + +See :ref:`dsintro.numpy_interop` for more on ufuncs. -NA groups in GroupBy are automatically excluded. This behavior is consistent -with R, for example: +.. _missing_data.NA.conversion: + +Conversion +^^^^^^^^^^ + +If you have a :class:`DataFrame` or :class:`Series` using ``np.nan``, +:meth:`Series.convert_dtypes` and :meth:`DataFrame.convert_dtypes` +can convert the data to use the data types that use :class:`NA` +such as :class:`Int64Dtype` or :class:`ArrowDtype`. This is especially helpful after reading +in data sets from IO methods where data types were inferred. + +In this example, the dtypes of both columns are changed to data types +that use :class:`NA`. .. ipython:: python - df - df.groupby("one").mean() + import io + data = io.StringIO("a,b\n,True\n2,") + df = pd.read_csv(data) + df.dtypes + df_conv = df.convert_dtypes() + df_conv + df_conv.dtypes -See the groupby section :ref:`here ` for more information. +.. _missing.inserting: -Cleaning / filling missing data --------------------------------- +Inserting missing data +~~~~~~~~~~~~~~~~~~~~~~ -pandas objects are equipped with various data manipulation methods for dealing -with missing data. +You can insert missing values by simply assigning to a :class:`Series` or :class:`DataFrame`. +The missing value sentinel used will be chosen based on the dtype. -.. _missing_data.fillna: +.. ipython:: python + + ser = pd.Series([1., 2., 3.]) + ser.loc[0] = None + ser + + ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")]) + ser.iloc[0] = np.nan + ser + + ser = pd.Series([True, False], dtype="boolean[pyarrow]") + ser.iloc[0] = None + ser + +For ``object`` types, pandas will use the value given: + +.. 
ipython:: python -Filling missing values: fillna + s = pd.Series(["a", "b", "c"], dtype=object) + s.loc[0] = None + s.loc[1] = np.nan + s + +.. _missing_data.calculations: + +Calculations with missing data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -:meth:`~DataFrame.fillna` can "fill in" NA values with non-NA data in a couple -of ways, which we illustrate: +Missing values propagate through arithmetic operations between pandas objects. + +.. ipython:: python + + ser1 = pd.Series([np.nan, np.nan, 2, 3]) + ser2 = pd.Series([np.nan, 1, np.nan, 4]) + ser1 + ser2 + ser1 + ser2 + +The descriptive statistics and computational methods discussed in the +:ref:`data structure overview ` (and listed :ref:`here +` and :ref:`here `) all +account for missing data. + +When summing data, NA values or empty data will be treated as zero. -**Replace NA with a scalar value** +.. ipython:: python + + pd.Series([np.nan]).sum() + pd.Series([], dtype="float64").sum() + +When taking the product, NA values or empty data will be treated as 1. .. ipython:: python - df2 - df2.fillna(0) - df2["one"].fillna("missing") + pd.Series([np.nan]).prod() + pd.Series([], dtype="float64").prod() + +Cumulative methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod` +ignore NA values by default, but preserve them in the result. This behavior can be changed +with ``skipna=False``. -**Fill gaps forward or backward** -Using the same filling arguments as :ref:`reindexing `, we -can propagate non-NA values forward or backward: .. ipython:: python - df - df.ffill() + ser = pd.Series([1, np.nan, 3, np.nan]) + ser + ser.cumsum() + ser.cumsum(skipna=False) -.. _missing_data.fillna.limit: +.. 
_missing_data.dropna: -**Limit the amount of filling** +Dropping missing data +~~~~~~~~~~~~~~~~~~~~~ -If we only want consecutive gaps filled up to a certain number of data points, -we can use the ``limit`` keyword: +:meth:`~DataFrame.dropna` drops rows or columns with missing data. .. ipython:: python - df.iloc[2:4, :] = np.nan + df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]]) df - df.ffill(limit=1) + df.dropna() + df.dropna(axis=1) -To remind you, these are the available filling methods: + ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]") + ser.dropna() -.. csv-table:: - :header: "Method", "Action" - :widths: 30, 50 +Filling missing data +~~~~~~~~~~~~~~~~~~~~ - pad / ffill, Fill values forward - bfill / backfill, Fill values backward +.. _missing_data.fillna: -With time series data, using pad/ffill is extremely common so that the "last -known value" is available at every time point. +Filling by value +---------------- -:meth:`~DataFrame.ffill` is equivalent to ``fillna(method='ffill')`` -and :meth:`~DataFrame.bfill` is equivalent to ``fillna(method='bfill')`` +:meth:`~DataFrame.fillna` replaces NA values with non-NA data. -.. _missing_data.PandasObject: +Replace NA with a scalar value -Filling with a PandasObject -~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. ipython:: python -You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series -must match the columns of the frame you wish to fill. The -use case of this is to fill a DataFrame with the mean of that column. + data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")} + df = pd.DataFrame(data) + df + df.fillna(0) + +Fill gaps forward or backward .. ipython:: python - dff = pd.DataFrame(np.random.randn(10, 3), columns=list("ABC")) - dff.iloc[3:5, 0] = np.nan - dff.iloc[4:6, 1] = np.nan - dff.iloc[5:8, 2] = np.nan - dff + df.ffill() + df.bfill() - dff.fillna(dff.mean()) - dff.fillna(dff.mean()["B":"C"]) +.. 
_missing_data.fillna.limit: -Same result as above, but is aligning the 'fill' value which is -a Series in this case. +Limit the number of NA values filled .. ipython:: python - dff.where(pd.notna(dff), dff.mean(), axis="columns") + df.ffill(limit=1) +NA values can be replaced with the corresponding values from a :class:`Series` or :class:`DataFrame` +where the index and columns align between the original object and the filled object. -.. _missing_data.dropna: +.. ipython:: python -Dropping axis labels with missing data: dropna -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC")) + dff.iloc[3:5, 0] = np.nan + dff.iloc[4:6, 1] = np.nan + dff.iloc[5:8, 2] = np.nan + dff + dff.fillna(dff.mean()) -You may wish to simply exclude labels from a data set which refer to missing -data. To do this, use :meth:`~DataFrame.dropna`: +.. note:: -.. ipython:: python + :meth:`DataFrame.where` can also be used to fill NA values, giving the same result as above. - df["two"] = df["two"].fillna(0) - df["three"] = df["three"].fillna(0) - df - df.dropna(axis=0) - df.dropna(axis=1) - df["one"].dropna() + .. ipython:: python + + dff.where(pd.notna(dff), dff.mean(), axis="columns") -An equivalent :meth:`~Series.dropna` is available for Series. -DataFrame.dropna has considerably more options than Series.dropna, which can be -examined :ref:`in the API `. .. _missing_data.interpolate: Interpolation -~~~~~~~~~~~~~ +------------- -Both Series and DataFrame objects have :meth:`~DataFrame.interpolate` -that, by default, performs linear interpolation at missing data points. +:meth:`DataFrame.interpolate` and :meth:`Series.interpolate` fill NA values +using various interpolation methods. .. 
ipython:: python - np.random.seed(123456) - idx = pd.date_range("1/1/2000", periods=100, freq="BM") - ts = pd.Series(np.random.randn(100), index=idx) - ts[1:5] = np.nan - ts[20:30] = np.nan - ts[60:80] = np.nan - ts = ts.cumsum() + df = pd.DataFrame( + { + "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], + "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4], + } + ) + df + df.interpolate() + + idx = pd.date_range("2020-01-01", periods=10, freq="D") + data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64) + ts = pd.Series(data, index=idx) + ts.iloc[[1, 2, 5, 6, 9]] = np.nan ts - ts.count() @savefig series_before_interpolate.png ts.plot() .. ipython:: python ts.interpolate() - ts.interpolate().count() - @savefig series_interpolate.png ts.interpolate().plot() -Index aware interpolation is available via the ``method`` keyword: +Interpolation relative to a :class:`Timestamp` in the :class:`DatetimeIndex` +is available by setting ``method="time"`` .. ipython:: python - ts2 = ts.iloc[[0, 1, 30, 60, 99]] + ts2 = ts.iloc[[0, 1, 3, 7, 9]] ts2 ts2.interpolate() ts2.interpolate(method="time") @@ -360,46 +472,36 @@ For a floating-point index, use ``method='values'``: idx = [0.0, 1.0, 10.0] ser = pd.Series([0.0, np.nan, 10.0], idx) - ser ser.interpolate() ser.interpolate(method="values") -You can also interpolate with a DataFrame: - -.. ipython:: python - - df = pd.DataFrame( - { - "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], - "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4], - } - ) - df - df.interpolate() - -The ``method`` argument gives access to fancier interpolation methods. If you have scipy_ installed, you can pass the name of a 1-d interpolation routine to ``method``. -You'll want to consult the full scipy interpolation documentation_ and reference guide_ for details. -The appropriate interpolation method will depend on the type of data you are working with. +as specified in the scipy interpolation documentation_ and reference guide_. 
+The appropriate interpolation method will depend on the data type. -* If you are dealing with a time series that is growing at an increasing rate, - ``method='quadratic'`` may be appropriate. -* If you have values approximating a cumulative distribution function, - then ``method='pchip'`` should work well. -* To fill missing values with goal of smooth plotting, consider ``method='akima'``. +.. tip:: -.. warning:: + If you are dealing with a time series that is growing at an increasing rate, + use ``method='barycentric'``. - These methods require ``scipy``. + If you have values approximating a cumulative distribution function, + use ``method='pchip'``. -.. ipython:: python + To fill missing values with goal of smooth plotting use ``method='akima'``. - df.interpolate(method="barycentric") - - df.interpolate(method="pchip") + .. ipython:: python - df.interpolate(method="akima") + df = pd.DataFrame( + { + "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], + "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4], + } + ) + df + df.interpolate(method="barycentric") + df.interpolate(method="pchip") + df.interpolate(method="akima") When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation: @@ -407,10 +509,9 @@ the degree or order of the approximation: .. ipython:: python df.interpolate(method="spline", order=2) - df.interpolate(method="polynomial", order=2) -Compare several methods: +Comparing several methods. .. ipython:: python @@ -425,11 +526,7 @@ Compare several methods: @savefig compare_interpolations.png df.plot() -Another use case is interpolation at *new* values. -Suppose you have 100 observations from some distribution. And let's suppose -that you're particularly interested in what's happening around the middle. -You can mix pandas' ``reindex`` and ``interpolate`` methods to interpolate -at the new values. +Interpolating new observations from expanding data with :meth:`Series.reindex`. .. 
ipython:: python @@ -447,21 +544,17 @@ at the new values. .. _missing_data.interp_limits: Interpolation limits --------------------- +^^^^^^^^^^^^^^^^^^^^ -Like other pandas fill methods, :meth:`~DataFrame.interpolate` accepts a ``limit`` keyword -argument. Use this argument to limit the number of consecutive ``NaN`` values -filled since the last valid observation: +:meth:`~DataFrame.interpolate` accepts a ``limit`` keyword +argument to limit the number of consecutive ``NaN`` values +filled since the last valid observation .. ipython:: python ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan]) ser - - # fill all consecutive values in a forward direction ser.interpolate() - - # fill one consecutive value in a forward direction ser.interpolate(limit=1) By default, ``NaN`` values are filled in a ``forward`` direction. Use @@ -469,17 +562,12 @@ By default, ``NaN`` values are filled in a ``forward`` direction. Use .. ipython:: python - # fill one consecutive value backwards ser.interpolate(limit=1, limit_direction="backward") - - # fill one consecutive value in both directions ser.interpolate(limit=1, limit_direction="both") - - # fill all consecutive values in both directions ser.interpolate(limit_direction="both") -By default, ``NaN`` values are filled whether they are inside (surrounded by) -existing valid values, or outside existing valid values. The ``limit_area`` +By default, ``NaN`` values are filled whether they are surrounded by +existing valid values or outside existing valid values. The ``limit_area`` parameter restricts filling to either inside or outside values. .. ipython:: python @@ -495,58 +583,46 @@ parameter restricts filling to either inside or outside values. .. _missing_data.replace: -Replacing generic values -~~~~~~~~~~~~~~~~~~~~~~~~ -Often times we want to replace arbitrary values with other values. 
- -:meth:`~Series.replace` in Series and :meth:`~DataFrame.replace` in DataFrame provides an efficient yet -flexible way to perform such replacements. - -For a Series, you can replace a single value or a list of values by another -value: - -.. ipython:: python - - ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0]) +Replacing values +---------------- - ser.replace(0, 5) - -You can replace a list of values by a list of other values: +:meth:`Series.replace` and :meth:`DataFrame.replace` can be used similarly to +:meth:`Series.fillna` and :meth:`DataFrame.fillna` to replace or insert missing values. .. ipython:: python - ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0]) + df = pd.DataFrame(np.eye(3)) + df + df_missing = df.replace(0, np.nan) + df_missing + df_filled = df_missing.replace(np.nan, 2) + df_filled -You can also specify a mapping dict: +Replacing more than one value is possible by passing a list. .. ipython:: python - ser.replace({0: 10, 1: 100}) + df_filled.replace([1, 44], [2, 28]) -For a DataFrame, you can specify individual values by column: +Replacing using a mapping dict. .. ipython:: python - df = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]}) - - df.replace({"a": 0, "b": 5}, 100) + df_filled.replace({1: 44, 2: 28}) .. _missing_data.replace_expression: -String/regular expression replacement -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Regular expression replacement +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. note:: Python strings prefixed with the ``r`` character such as ``r'hello world'`` - are so-called "raw" strings. They have different semantics regarding - backslashes than strings without this prefix. Backslashes in raw strings - will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. You - should `read about them - `__ - if this is unclear. + are `"raw" strings `_. + They have different semantics regarding backslashes than strings without this prefix. + Backslashes in raw strings are kept as literal backslashes, e.g., ``r'\.' == '\\.'``. 
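A small standard-library sketch may make the raw-string note concrete. It is not part of the patch; it only uses Python's ``re`` module, and the names ``raw_pattern``, ``escaped_pattern`` and ``cleaned`` are illustrative.

```python
import re

# A raw string keeps each backslash as a literal character, so the regex
# can be written without doubling the backslashes.
raw_pattern = r"\s*\.\s*"
escaped_pattern = "\\s*\\.\\s*"

# Both spellings produce the same string, and therefore the same regex.
same_pattern = raw_pattern == escaped_pattern
print(same_pattern)

# The pattern matches a dot together with any surrounding whitespace.
cleaned = re.sub(raw_pattern, "", " . ")
print(repr(cleaned))
```

The same pattern string is what the ``regex=True`` examples in this section pass to ``replace``.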
-Replace the '.' with ``NaN`` (str -> str): +Replace the '.' with ``NaN`` .. ipython:: python @@ -554,59 +630,33 @@ Replace the '.' with ``NaN`` (str -> str): df = pd.DataFrame(d) df.replace(".", np.nan) -Now do it with a regular expression that removes surrounding whitespace -(regex -> regex): +Replace the '.' with ``NaN`` with regular expression that removes surrounding whitespace .. ipython:: python df.replace(r"\s*\.\s*", np.nan, regex=True) -Replace a few different values (list -> list): - -.. ipython:: python - - df.replace(["a", "."], ["b", np.nan]) - -list of regex -> list of regex: +Replace with a list of regexes. .. ipython:: python df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True) -Only search in column ``'b'`` (dict -> dict): - -.. ipython:: python - - df.replace({"b": "."}, {"b": np.nan}) - -Same as the previous example, but use a regular expression for -searching instead (dict of regex -> dict): +Replace with a regex in a mapping dict. .. ipython:: python df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True) -You can pass nested dictionaries of regular expressions that use ``regex=True``: +Pass nested dictionaries of regular expressions that use the ``regex`` keyword. .. ipython:: python df.replace({"b": {"b": r""}}, regex=True) - -Alternatively, you can pass the nested dictionary like so: - -.. ipython:: python - df.replace(regex={"b": {r"\s*\.\s*": np.nan}}) - -You can also use the group of a regular expression match when replacing (dict -of regex -> dict of regex), this works for lists as well. - -.. ipython:: python - df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True) -You can pass a list of regular expressions, of which those that match -will be replaced with a scalar (list of regex -> regex). +Pass a list of regular expressions that will replace matches with a scalar. .. ipython:: python @@ -615,288 +665,12 @@ will be replaced with a scalar (list of regex -> regex). 
All of the regular expression examples can also be passed with the ``to_replace`` argument as the ``regex`` argument. In this case the ``value`` argument must be passed explicitly by name or ``regex`` must be a nested -dictionary. The previous example, in this case, would then be: +dictionary. .. ipython:: python df.replace(regex=[r"\s*\.\s*", r"a|b"], value=np.nan) -This can be convenient if you do not want to pass ``regex=True`` every time you -want to use a regular expression. - .. note:: - Anywhere in the above ``replace`` examples that you see a regular expression - a compiled regular expression is valid as well. - -Numeric replacement -~~~~~~~~~~~~~~~~~~~ - -:meth:`~DataFrame.replace` is similar to :meth:`~DataFrame.fillna`. - -.. ipython:: python - - df = pd.DataFrame(np.random.randn(10, 2)) - df[np.random.rand(df.shape[0]) > 0.5] = 1.5 - df.replace(1.5, np.nan) - -Replacing more than one value is possible by passing a list. - -.. ipython:: python - - df00 = df.iloc[0, 0] - df.replace([1.5, df00], [np.nan, "a"]) - df[1].dtype - -Missing data casting rules and indexing ---------------------------------------- - -While pandas supports storing arrays of integer and boolean type, these types -are not capable of storing missing data. Until we can switch to using a native -NA type in NumPy, we've established some "casting rules". When a reindexing -operation introduces missing data, the Series will be cast according to the -rules introduced in the table below. - -.. csv-table:: - :header: "data type", "Cast to" - :widths: 40, 40 - - integer, float - boolean, object - float, no cast - object, no cast - -For example: - -.. ipython:: python - - s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7]) - s > 0 - (s > 0).dtype - crit = (s > 0).reindex(list(range(8))) - crit - crit.dtype - -Ordinarily NumPy will complain if you try to use an object array (even if it -contains boolean values) instead of a boolean array to get or set values from -an ndarray (e.g. 
selecting values based on some criteria). If a boolean vector -contains NAs, an exception will be generated: - -.. ipython:: python - :okexcept: - - reindexed = s.reindex(list(range(8))).fillna(0) - reindexed[crit] - -However, these can be filled in using :meth:`~DataFrame.fillna` and it will work fine: - -.. ipython:: python - - reindexed[crit.fillna(False)] - reindexed[crit.fillna(True)] - -pandas provides a nullable integer dtype, but you must explicitly request it -when creating the series or column. Notice that we use a capital "I" in -the ``dtype="Int64"``. - -.. ipython:: python - - s = pd.Series([0, 1, np.nan, 3, 4], dtype="Int64") - s - -See :ref:`integer_na` for more. - - -.. _missing_data.NA: - -Experimental ``NA`` scalar to denote missing values -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. warning:: - - Experimental: the behaviour of ``pd.NA`` can still change without warning. - -Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is -available to represent scalar missing values. At this moment, it is used in -the nullable :doc:`integer `, boolean and -:ref:`dedicated string ` data types as the missing value indicator. - -The goal of ``pd.NA`` is provide a "missing" indicator that can be used -consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT`` -depending on the data type). - -For example, when having missing values in a Series with the nullable integer -dtype, it will use ``pd.NA``: - -.. ipython:: python - - s = pd.Series([1, 2, None], dtype="Int64") - s - s[2] - s[2] is pd.NA - -Currently, pandas does not yet use those data types by default (when creating -a DataFrame or Series, or when reading in data), so you need to specify -the dtype explicitly. An easy way to convert to those dtypes is explained -:ref:`here `. 
- -Propagation in arithmetic and comparison operations ---------------------------------------------------- - -In general, missing values *propagate* in operations involving ``pd.NA``. When -one of the operands is unknown, the outcome of the operation is also unknown. - -For example, ``pd.NA`` propagates in arithmetic operations, similarly to -``np.nan``: - -.. ipython:: python - - pd.NA + 1 - "a" * pd.NA - -There are a few special cases when the result is known, even when one of the -operands is ``NA``. - -.. ipython:: python - - pd.NA ** 0 - 1 ** pd.NA - -In equality and comparison operations, ``pd.NA`` also propagates. This deviates -from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always -return ``False``. - -.. ipython:: python - - pd.NA == 1 - pd.NA == pd.NA - pd.NA < 2.5 - -To check if a value is equal to ``pd.NA``, the :func:`isna` function can be -used: - -.. ipython:: python - - pd.isna(pd.NA) - -An exception on this basic propagation rule are *reductions* (such as the -mean or the minimum), where pandas defaults to skipping missing values. See -:ref:`above ` for more. - -Logical operations ------------------- - -For logical operations, ``pd.NA`` follows the rules of the -`three-valued logic `__ (or -*Kleene logic*, similarly to R, SQL and Julia). This logic means to only -propagate missing values when it is logically required. - -For example, for the logical "or" operation (``|``), if one of the operands -is ``True``, we already know the result will be ``True``, regardless of the -other value (so regardless the missing value would be ``True`` or ``False``). -In this case, ``pd.NA`` does not propagate: - -.. ipython:: python - - True | False - True | pd.NA - pd.NA | True - -On the other hand, if one of the operands is ``False``, the result depends -on the value of the other operand. Therefore, in this case ``pd.NA`` -propagates: - -.. 
ipython:: python - - False | True - False | False - False | pd.NA - -The behaviour of the logical "and" operation (``&``) can be derived using -similar logic (where now ``pd.NA`` will not propagate if one of the operands -is already ``False``): - -.. ipython:: python - - False & True - False & False - False & pd.NA - -.. ipython:: python - - True & True - True & False - True & pd.NA - - -``NA`` in a boolean context ---------------------------- - -Since the actual value of an NA is unknown, it is ambiguous to convert NA -to a boolean value. The following raises an error: - -.. ipython:: python - :okexcept: - - bool(pd.NA) - -This also means that ``pd.NA`` cannot be used in a context where it is -evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can -potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check -for ``pd.NA`` or ``condition`` being ``pd.NA`` can be avoided, for example by -filling missing values beforehand. - -A similar situation occurs when using Series or DataFrame objects in ``if`` -statements, see :ref:`gotchas.truth`. - -NumPy ufuncs ------------- - -:attr:`pandas.NA` implements NumPy's ``__array_ufunc__`` protocol. Most ufuncs -work with ``NA``, and generally return ``NA``: - -.. ipython:: python - - np.log(pd.NA) - np.add(pd.NA, 1) - -.. warning:: - - Currently, ufuncs involving an ndarray and ``NA`` will return an - object-dtype filled with NA values. - - .. ipython:: python - - a = np.array([1, 2, 3]) - np.greater(a, pd.NA) - - The return type here may change to return a different array type - in the future. - -See :ref:`dsintro.numpy_interop` for more on ufuncs. - -.. 
_missing_data.NA.conversion: - -Conversion ----------- - -If you have a DataFrame or Series using traditional types that have missing data -represented using ``np.nan``, there are convenience methods -:meth:`~Series.convert_dtypes` in Series and :meth:`~DataFrame.convert_dtypes` -in DataFrame that can convert data to use the newer dtypes for integers, strings and -booleans listed :ref:`here `. This is especially helpful after reading -in data sets when letting the readers such as :meth:`read_csv` and :meth:`read_excel` -infer default dtypes. - -In this example, while the dtypes of all columns are changed, we show the results for -the first 10 columns. - -.. ipython:: python - - bb = pd.read_csv("data/baseball.csv", index_col="id") - bb[bb.columns[:10]].dtypes - -.. ipython:: python - - bbn = bb.convert_dtypes() - bbn[bbn.columns[:10]].dtypes + A regular expression object from ``re.compile`` is a valid input as well. diff --git a/doc/source/whatsnew/v0.21.0.rst b/doc/source/whatsnew/v0.21.0.rst index f4976cc243f77..f8eacd28fa795 100644 --- a/doc/source/whatsnew/v0.21.0.rst +++ b/doc/source/whatsnew/v0.21.0.rst @@ -392,7 +392,7 @@ Sum/prod of all-NaN or empty Series/DataFrames is now consistently NaN The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames no longer depends on whether `bottleneck `__ is installed, and return value of ``sum`` and ``prod`` on an empty Series has changed (:issue:`9422`, :issue:`15507`). -Calling ``sum`` or ``prod`` on an empty or all-``NaN`` ``Series``, or columns of a ``DataFrame``, will result in ``NaN``. See the :ref:`docs `. +Calling ``sum`` or ``prod`` on an empty or all-``NaN`` ``Series``, or columns of a ``DataFrame``, will result in ``NaN``. See the :ref:`docs `. .. 
ipython:: python diff --git a/doc/source/whatsnew/v0.4.x.rst b/doc/source/whatsnew/v0.4.x.rst index 0ed7bb396674e..83f6a6907f33c 100644 --- a/doc/source/whatsnew/v0.4.x.rst +++ b/doc/source/whatsnew/v0.4.x.rst @@ -11,8 +11,7 @@ New features - Added Python 3 support using 2to3 (:issue:`200`) - :ref:`Added ` ``name`` attribute to ``Series``, now prints as part of ``Series.__repr__`` -- :ref:`Added ` instance methods ``isnull`` and ``notnull`` to - Series (:issue:`209`, :issue:`203`) +- :meth:`Series.isnull` and :meth:`Series.notnull` (:issue:`209`, :issue:`203`) - :ref:`Added ` ``Series.align`` method for aligning two series with choice of join method (ENH56_) - :ref:`Added ` method ``get_level_values`` to