diff --git a/doc/source/release.rst b/doc/source/release.rst
index 5fe397a7cbb37..32db2ff5ebb24 100644
--- a/doc/source/release.rst
+++ b/doc/source/release.rst
@@ -42,7 +42,7 @@ pandas 0.23.0

 **Release date**: May 15, 2017

-This is a major release from 0.23.0 and includes a number of API changes, new
+This is a major release from 0.22.0 and includes a number of API changes, new
 features, enhancements, and performance improvements along with a large number
 of bug fixes. We recommend that all users upgrade to this version.

@@ -54,6 +54,7 @@ Highlights include:
 - :ref:`Merging / sorting on a combination of columns and index levels <whatsnew_0230.enhancements.merge_on_columns_and_levels>`.
 - :ref:`Extending Pandas with custom types <whatsnew_023.enhancements.extension>`.
 - :ref:`Excluding unobserved categories from groupby <whatsnew_0230.enhancements.categorical_grouping>`.
+- :ref:`Changes to make output shape of DataFrame.apply consistent `.

 See the :ref:`full whatsnew ` for a list of all the changes.
diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt
index 89dab728d2bd4..3f89de1dc22d8 100644
--- a/doc/source/whatsnew/v0.23.0.txt
+++ b/doc/source/whatsnew/v0.23.0.txt
@@ -8,90 +8,114 @@ deprecations, new features, enhancements, and performance improvements along
 with a large number of bug fixes. We recommend that all users upgrade to this
 version.

+Highlights include:
+
+- :ref:`Round-trippable JSON format with 'table' orient <whatsnew_0230.enhancements.round-trippable_json>`.
+- :ref:`Instantiation from dicts respects order for Python 3.6+ `.
+- :ref:`Dependent column arguments for assign <whatsnew_0230.enhancements.assign_dependent>`.
+- :ref:`Merging / sorting on a combination of columns and index levels <whatsnew_0230.enhancements.merge_on_columns_and_levels>`.
+- :ref:`Extending Pandas with custom types <whatsnew_023.enhancements.extension>`.
+- :ref:`Excluding unobserved categories from groupby <whatsnew_0230.enhancements.categorical_grouping>`.
+- :ref:`Changes to make output shape of DataFrame.apply consistent `.
+
+Check the :ref:`API Changes ` and :ref:`deprecations ` before updating.
+
 .. warning::

    Starting January 1, 2019, pandas feature releases will support Python 3 only.
    See :ref:`install.dropping-27` for more.

+.. contents:: What's new in v0.23.0
+    :local:
+    :backlinks: none
+    :depth: 2
+
 .. _whatsnew_0230.enhancements:

 New features
 ~~~~~~~~~~~~

-.. _whatsnew_0210.enhancements.limit_area:
-
-``DataFrame.interpolate`` has gained the ``limit_area`` kwarg
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. _whatsnew_0230.enhancements.round-trippable_json:

-:meth:`DataFrame.interpolate` has gained a ``limit_area`` parameter to allow further control of which ``NaN`` s are replaced.
-Use ``limit_area='inside'`` to fill only NaNs surrounded by valid values or use ``limit_area='outside'`` to fill only ``NaN`` s
-outside the existing valid values while preserving those inside. (:issue:`16284`) See the :ref:`full documentation here `.
+JSON read/write round-trippable with ``orient='table'``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+A ``DataFrame`` can now be written to and subsequently read back via JSON while preserving metadata through usage of the ``orient='table'`` argument (see :issue:`18912` and :issue:`9146`). Previously, none of the available ``orient`` values guaranteed the preservation of dtypes and index names, amongst other metadata.

 .. ipython:: python

-    ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
-    ser
+    df = pd.DataFrame({'foo': [1, 2, 3, 4],
+                       'bar': ['a', 'b', 'c', 'd'],
+                       'baz': pd.date_range('2018-01-01', freq='d', periods=4),
+                       'qux': pd.Categorical(['a', 'b', 'c', 'c'])
+                       }, index=pd.Index(range(4), name='idx'))
+    df
+    df.dtypes
+    df.to_json('test.json', orient='table')
+    new_df = pd.read_json('test.json', orient='table')
+    new_df
+    new_df.dtypes

-Fill one consecutive inside value in both directions
+Please note that the string `index` is not supported with the round trip format, as it is used by default in ``to_json`` to indicate a missing index name.

 .. ipython:: python
+    :okwarning:

-    ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
+    df.index.name = 'index'

-Fill all consecutive outside values backward
+    df.to_json('test.json', orient='table')
+    new_df = pd.read_json('test.json', orient='table')
+    new_df
+    new_df.dtypes

 .. ipython:: python
+    :suppress:

-    ser.interpolate(limit_direction='backward', limit_area='outside')
+    import os
+    os.remove('test.json')

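Not part of this patch, but as a minimal sketch of where the ``'table'`` orient actually keeps that metadata: the output wraps the data in a Table Schema document, and the ``schema`` / ``fields`` / ``primaryKey`` names used below reflect that format (exact contents may vary slightly between pandas versions).

.. code-block:: python

   import json

   import pandas as pd

   df = pd.DataFrame({'foo': [1, 2], 'bar': ['a', 'b']},
                     index=pd.Index([0, 1], name='idx'))

   # The 'schema' block records each column's type and the index name as the
   # primary key, which read_json(..., orient='table') uses to rebuild both.
   payload = json.loads(df.to_json(orient='table'))
   print(payload['schema']['primaryKey'])                          # ['idx']
   print([field['name'] for field in payload['schema']['fields']])
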
-Fill all consecutive outside values in both directions
-
-.. ipython:: python
-
-    ser.interpolate(limit_direction='both', limit_area='outside')
+.. _whatsnew_0230.enhancements.assign_dependent:

-.. _whatsnew_0210.enhancements.get_dummies_dtype:

-``get_dummies`` now supports ``dtype`` argument
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``.assign()`` accepts dependent arguments
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtype for the new columns. The default remains uint8. (:issue:`18330`)
+:func:`DataFrame.assign` now accepts dependent keyword arguments for Python 3.6 and later (see also `PEP 468
+<https://www.python.org/dev/peps/pep-0468/>`_). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
+:ref:`documentation here ` (:issue:`14207`)

 .. ipython:: python

-    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
-    pd.get_dummies(df, columns=['c']).dtypes
-    pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
-
-
-.. _whatsnew_0230.enhancements.window_raw:
+    df = pd.DataFrame({'A': [1, 2, 3]})
+    df
+    df.assign(B=df.A, C=lambda x: x['A'] + x['B'])

-Rolling/Expanding.apply() accepts a ``raw`` keyword to pass a ``Series`` to the function
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. warning::

-:func:`Series.rolling().apply() `, :func:`DataFrame.rolling().apply() `,
-:func:`Series.expanding().apply() `, and :func:`DataFrame.expanding().apply() ` have gained a ``raw=None`` parameter.
-This is similar to :func:`DataFame.apply`. This parameter, if ``True`` allows one to send a ``np.ndarray`` to the applied function. If ``False`` a ``Series`` will be passed. The
-default is ``None``, which preserves backward compatibility, so this will default to ``True``, sending an ``np.ndarray``.
-In a future version the default will be changed to ``False``, sending a ``Series``. (:issue:`5071`, :issue:`20584`)
+   This may subtly change the behavior of your code when you're
+   using ``.assign()`` to update an existing column. Previously, callables
+   referring to other variables being updated would get the "old" values.

-.. ipython:: python
+   Previous Behavior:

-    s = pd.Series(np.arange(5), np.arange(5) + 1)
-    s
+   .. code-block:: ipython

-Pass a ``Series``:
+      In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

-.. ipython:: python
+      In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
+      Out[3]:
+         A  C
+      0  2 -1
+      1  3 -2
+      2  4 -3

-    s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
+   New Behavior:

-Mimic the original behavior of passing a ndarray:
+   .. ipython:: python

-.. ipython:: python
+      df.assign(A=df.A + 1, C=lambda df: df.A * -1)

-    s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)

 .. _whatsnew_0230.enhancements.merge_on_columns_and_levels:
@@ -151,6 +175,194 @@ resetting indexes. See the :ref:`Sorting by Indexes and Values
     # Sort by 'second' (index) and 'A' (column)
     df_multi.sort_values(by=['second', 'A'])

+
+.. _whatsnew_023.enhancements.extension:
+
+Extending Pandas with Custom Types (Experimental)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Pandas now supports storing array-like objects that aren't necessarily 1-D NumPy
+arrays as columns in a DataFrame or values in a Series. This allows third-party
+libraries to implement extensions to NumPy's types, similar to how pandas
+implemented categoricals, datetimes with timezones, periods, and intervals.
+
+As a demonstration, we'll use cyberpandas_, which provides an ``IPArray`` type
+for storing IP addresses.
+
+.. code-block:: ipython
+
+   In [1]: from cyberpandas import IPArray
+
+   In [2]: values = IPArray([
+      ...:     0,
+      ...:     3232235777,
+      ...:     42540766452641154071740215577757643572
+      ...: ])
+      ...:
+      ...:
+
+``IPArray`` isn't a normal 1-D NumPy array, but because it's a pandas
+:class:`~pandas.api.extensions.ExtensionArray`, it can be stored properly inside pandas' containers.
+
+.. code-block:: ipython
+
+   In [3]: ser = pd.Series(values)
+
+   In [4]: ser
+   Out[4]:
+   0                         0.0.0.0
+   1                     192.168.1.1
+   2    2001:db8:85a3::8a2e:370:7334
+   dtype: ip
+
+Notice that the dtype is ``ip``. The missing value semantics of the underlying
+array are respected:
+
+.. code-block:: ipython
+
+   In [5]: ser.isna()
+   Out[5]:
+   0     True
+   1    False
+   2    False
+   dtype: bool
+
+For more, see the :ref:`extension types `
+documentation. If you build an extension array, publicize it on our
+:ref:`ecosystem page `.
+
+.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/
+
+
+.. _whatsnew_0230.enhancements.categorical_grouping:
+
+New ``observed`` keyword for excluding unobserved categories in ``groupby``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Grouping by a categorical includes the unobserved categories in the output.
+When grouping by multiple categorical columns, this means you get the Cartesian product of all the
+categories, including combinations where there are no observations, which can result in a large
+number of groups. We have added a keyword ``observed`` to control this behavior; it defaults to
+``observed=False`` for backward compatibility. (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`, :issue:`20902`)
+
+.. ipython:: python
+
+    cat1 = pd.Categorical(["a", "a", "b", "b"],
+                          categories=["a", "b", "z"], ordered=True)
+    cat2 = pd.Categorical(["c", "d", "c", "d"],
+                          categories=["c", "d", "y"], ordered=True)
+    df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
+    df['C'] = ['foo', 'bar'] * 2
+    df
+
+To show all values, the previous behavior:
+
+.. ipython:: python
+
+    df.groupby(['A', 'B', 'C'], observed=False).count()
+
+
+To show only observed values:
+
+.. ipython:: python
+
+    df.groupby(['A', 'B', 'C'], observed=True).count()
+
+For pivoting operations, this behavior is *already* controlled by the ``dropna`` keyword:
+
+.. ipython:: python
+
+    cat1 = pd.Categorical(["a", "a", "b", "b"],
+                          categories=["a", "b", "z"], ordered=True)
+    cat2 = pd.Categorical(["c", "d", "c", "d"],
+                          categories=["c", "d", "y"], ordered=True)
+    df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
+    df
+
+.. ipython:: python
+
+    pd.pivot_table(df, values='values', index=['A', 'B'],
+                   dropna=True)
+    pd.pivot_table(df, values='values', index=['A', 'B'],
+                   dropna=False)
+
+
+.. _whatsnew_0230.enhancements.window_raw:
+
+Rolling/Expanding.apply() accepts ``raw=False`` to pass a ``Series`` to the function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`Series.rolling().apply() `, :func:`DataFrame.rolling().apply() `,
+:func:`Series.expanding().apply() `, and :func:`DataFrame.expanding().apply() ` have gained a ``raw=None`` parameter.
+This is similar to :func:`DataFrame.apply`. If ``True``, the applied function receives a ``np.ndarray``; if ``False``, it receives a ``Series``. The
+default is ``None``, which preserves backward compatibility, so this currently defaults to ``True``, sending an ``np.ndarray``.
+In a future version the default will be changed to ``False``, sending a ``Series``. (:issue:`5071`, :issue:`20584`)
+
+.. ipython:: python
+
+    s = pd.Series(np.arange(5), np.arange(5) + 1)
+    s
+
+Pass a ``Series``:
+
+.. ipython:: python
+
+    s.rolling(2, min_periods=1).apply(lambda x: x.iloc[-1], raw=False)
+
+Mimic the original behavior of passing an ndarray:
+
+.. ipython:: python
+
+    s.rolling(2, min_periods=1).apply(lambda x: x[-1], raw=True)
+
+
+.. _whatsnew_0210.enhancements.limit_area:
+
+``DataFrame.interpolate`` has gained the ``limit_area`` kwarg
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.interpolate` has gained a ``limit_area`` parameter to allow further control of which ``NaN`` s are replaced.
+Use ``limit_area='inside'`` to fill only NaNs surrounded by valid values or use ``limit_area='outside'`` to fill only ``NaN`` s
+outside the existing valid values while preserving those inside. (:issue:`16284`) See the :ref:`full documentation here `.
+
+
+.. ipython:: python
+
+    ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
+    ser
+
+Fill one consecutive inside value in both directions
+
+.. ipython:: python
+
+    ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
+
+Fill all consecutive outside values backward
+
+.. ipython:: python
+
+    ser.interpolate(limit_direction='backward', limit_area='outside')
+
+Fill all consecutive outside values in both directions
+
+.. ipython:: python
+
+    ser.interpolate(limit_direction='both', limit_area='outside')
+
+.. _whatsnew_0210.enhancements.get_dummies_dtype:
+
+``get_dummies`` now supports ``dtype`` argument
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtype for the new columns. The default remains ``uint8``. (:issue:`18330`)
+
+.. ipython:: python
+
+    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
+    pd.get_dummies(df, columns=['c']).dtypes
+    pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
+
+
 .. _whatsnew_0230.enhancements.timedelta_mod:

 Timedelta mod method
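The body of the ``Timedelta mod method`` section above is unchanged by this patch and so is not shown here. Purely as a hedged, illustrative sketch of the feature that heading refers to (an example of my own, not text from the PR):

.. code-block:: python

   import pandas as pd

   # Timedelta now supports the modulo operator with another Timedelta:
   # 37 hours modulo 8 hours leaves a remainder of 5 hours.
   print(pd.Timedelta(hours=37) % pd.Timedelta(hours=8))   # 0 days 05:00:00
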
@@ -227,86 +439,6 @@ These bugs were squashed:
 - Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``ascending='False'`` failed to return correct ranks for infinity if ``NaN`` were present (:issue:`19538`)
 - Bug in :func:`DataFrameGroupBy.rank` where ranks were incorrect when both infinity and ``NaN`` were present (:issue:`20561`)
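A small illustration — mine, not part of the patch — of the corrected ranking behaviour described in the first bullet, assuming the defaults ``method='average'`` and ``na_option='keep'``:

.. code-block:: python

   import numpy as np
   import pandas as pd

   s = pd.Series([5.0, np.inf, np.nan, 1.0])
   # With the fix, infinity is ranked ahead of the finite values even though a
   # NaN is present; the NaN itself keeps a NaN rank by default.
   print(s.rank(ascending=False))   # inf -> 1.0, 5.0 -> 2.0, 1.0 -> 3.0, NaN -> NaN
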

-.. _whatsnew_0230.enhancements.round-trippable_json:
-
-JSON read/write round-trippable with ``orient='table'``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-A ``DataFrame`` can now be written to and subsequently read back via JSON while preserving metadata through usage of the ``orient='table'`` argument (see :issue:`18912` and :issue:`9146`). Previously, none of the available ``orient`` values guaranteed the preservation of dtypes and index names, amongst other metadata.
-
-.. ipython:: python
-
-    df = pd.DataFrame({'foo': [1, 2, 3, 4],
-                       'bar': ['a', 'b', 'c', 'd'],
-                       'baz': pd.date_range('2018-01-01', freq='d', periods=4),
-                       'qux': pd.Categorical(['a', 'b', 'c', 'c'])
-                       }, index=pd.Index(range(4), name='idx'))
-    df
-    df.dtypes
-    df.to_json('test.json', orient='table')
-    new_df = pd.read_json('test.json', orient='table')
-    new_df
-    new_df.dtypes
-
-Please note that the string `index` is not supported with the round trip format, as it is used by default in ``write_json`` to indicate a missing index name.
-
-.. ipython:: python
-    :okwarning:
-
-    df.index.name = 'index'
-
-    df.to_json('test.json', orient='table')
-    new_df = pd.read_json('test.json', orient='table')
-    new_df
-    new_df.dtypes
-
-.. ipython:: python
-    :suppress:
-
-    import os
-    os.remove('test.json')
-
-
-.. _whatsnew_0230.enhancements.assign_dependent:
-
-
-``.assign()`` accepts dependent arguments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The :func:`DataFrame.assign` now accepts dependent keyword arguments for python version later than 3.6 (see also `PEP 468
-<https://www.python.org/dev/peps/pep-0468/>`_). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
-:ref:`documentation here ` (:issue:`14207`)
-
-.. ipython:: python
-
-    df = pd.DataFrame({'A': [1, 2, 3]})
-    df
-    df.assign(B=df.A, C=lambda x:x['A']+ x['B'])
-
-.. warning::
-
-   This may subtly change the behavior of your code when you're
-   using ``.assign()`` to update an existing column. Previously, callables
-   referring to other variables being updated would get the "old" values
-
-   Previous Behavior:
-
-   .. code-block:: ipython
-
-      In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
-
-      In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
-      Out[3]:
-         A  C
-      0  2 -1
-      1  3 -2
-      2  4 -3
-
-   New Behavior:
-
-   .. ipython:: python
-
-      df.assign(A=df.A+1, C= lambda df: df.A* -1)

 .. _whatsnew_0230.enhancements.str_cat_align:
@@ -358,116 +490,6 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
     df['A'].dtype
     df['B'].dtype

-.. _whatsnew_023.enhancements.extension:
-
-Extending Pandas with Custom Types (Experimental)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Pandas now supports storing array-like objects that aren't necessarily 1-D NumPy
-arrays as columns in a DataFrame or values in a Series. This allows third-party
-libraries to implement extensions to NumPy's types, similar to how pandas
-implemented categoricals, datetimes with timezones, periods, and intervals.
-
-As a demonstration, we'll use cyberpandas_, which provides an ``IPArray`` type
-for storing ip addresses.
-
-.. code-block:: ipython
-
-   In [1]: from cyberpandas import IPArray
-
-   In [2]: values = IPArray([
-      ...:     0,
-      ...:     3232235777,
-      ...:     42540766452641154071740215577757643572
-      ...: ])
-      ...:
-      ...:
-
-``IPArray`` isn't a normal 1-D NumPy array, but because it's a pandas
-:ref:`~pandas.api.extension.ExtensionArray`, it can be stored properly inside pandas' containers.
-
-.. code-block:: ipython
-
-   In [3]: ser = pd.Series(values)
-
-   In [4]: ser
-   Out[4]:
-   0                         0.0.0.0
-   1                     192.168.1.1
-   2    2001:db8:85a3::8a2e:370:7334
-   dtype: ip
-
-Notice that the dtype is ``ip``. The missing value semantics of the underlying
-array are respected:
-
-.. code-block:: ipython
-
-   In [5]: ser.isna()
-   Out[5]:
-   0     True
-   1    False
-   2    False
-   dtype: bool
-
-For more, see the :ref:`extension types `
-documentation. If you build an extension array, publicize it on our
-:ref:`ecosystem page `.
-
-.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest/
-
-.. _whatsnew_0230.enhancements.categorical_grouping:
-
-Categorical Groupers has gained an observed keyword
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Grouping by a categorical includes the unobserved categories in the output.
-When grouping with multiple groupers, this means you get the cartesian product of all the
-categories, including combinations where there are no observations, which can result in a large
-number of groupers. We have added a keyword ``observed`` to control this behavior, it defaults to
-``observed=False`` for backward-compatiblity. (:issue:`14942`, :issue:`8138`, :issue:`15217`, :issue:`17594`, :issue:`8669`, :issue:`20583`, :issue:`20902`)
-
-
-.. ipython:: python
-
-    cat1 = pd.Categorical(["a", "a", "b", "b"],
-                          categories=["a", "b", "z"], ordered=True)
-    cat2 = pd.Categorical(["c", "d", "c", "d"],
-                          categories=["c", "d", "y"], ordered=True)
-    df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
-    df['C'] = ['foo', 'bar'] * 2
-    df
-
-To show all values, the previous behavior:
-
-.. ipython:: python
-
-    df.groupby(['A', 'B', 'C'], observed=False).count()
-
-
-To show only observed values:
-
-.. ipython:: python
-
-    df.groupby(['A', 'B', 'C'], observed=True).count()
-
-For pivotting operations, this behavior is *already* controlled by the ``dropna`` keyword:
-
-.. ipython:: python
-
-    cat1 = pd.Categorical(["a", "a", "b", "b"],
-                          categories=["a", "b", "z"], ordered=True)
-    cat2 = pd.Categorical(["c", "d", "c", "d"],
-                          categories=["c", "d", "y"], ordered=True)
-    df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
-    df
-
-.. ipython:: python
-
-    pd.pivot_table(df, values='values', index=['A', 'B'],
-                   dropna=True)
-    pd.pivot_table(df, values='values', index=['A', 'B'],
-                   dropna=False)
-

 .. _whatsnew_0230.enhancements.other:
@@ -519,7 +541,7 @@ Other Enhancements
 - :func:`read_html` now reads all ``<tbody>`` elements in a ``<table>``, not just the first. (:issue:`20690`)
 - :meth:`~pandas.core.window.Rolling.quantile` and :meth:`~pandas.core.window.Expanding.quantile` now accept the ``interpolation`` keyword, ``linear`` by default (:issue:`20497`)
 - zip compression is supported via ``compression=zip`` in :func:`DataFrame.to_pickle`, :func:`Series.to_pickle`, :func:`DataFrame.to_csv`, :func:`Series.to_csv`, :func:`DataFrame.to_json`, :func:`Series.to_json`. (:issue:`17778`)
-- :class:`pandas.tseries.api.offsets.WeekOfMonth` constructor now supports ``n=0`` (:issue:`20517`).
+- :class:`~pandas.tseries.offsets.WeekOfMonth` constructor now supports ``n=0`` (:issue:`20517`).
 - :class:`DataFrame` and :class:`Series` now support matrix multiplication (``@``) operator (:issue:`10259`) for Python>=3.5
 - Updated :meth:`DataFrame.to_gbq` and :meth:`pandas.read_gbq` signature and documentation to reflect changes from the Pandas-GBQ library version 0.4.0. Adds intersphinx mapping to Pandas-GBQ