
Commit 785f4a4

Merge branch 'main' into groupby_describe_empty_dataset

Authored by Khor Chean Wei
2 parents: 6a3cc67 + 8647298

85 files changed, +1626 -542 lines. (Large commit; only a subset of the changed files is shown below.)

.github/workflows/posix.yml

Lines changed: 10 additions & 1 deletion

@@ -28,7 +28,7 @@ jobs:
         pattern: ["not single_cpu", "single_cpu"]
         # Don't test pyarrow v2/3: Causes timeouts in read_csv engine
         # even if tests are skipped/xfailed
-        pyarrow_version: ["5", "7"]
+        pyarrow_version: ["5", "6", "7"]
         include:
           - name: "Downstream Compat"
             env_file: actions-38-downstream_compat.yaml
@@ -62,6 +62,15 @@ jobs:
             pattern: "not slow and not network and not single_cpu"
             pandas_testing_mode: "deprecate"
             test_args: "-W error::DeprecationWarning:numpy"
+        exclude:
+          - env_file: actions-39.yaml
+            pyarrow_version: "6"
+          - env_file: actions-39.yaml
+            pyarrow_version: "7"
+          - env_file: actions-310.yaml
+            pyarrow_version: "6"
+          - env_file: actions-310.yaml
+            pyarrow_version: "7"
       fail-fast: false
     name: ${{ matrix.name || format('{0} pyarrow={1} {2}', matrix.env_file, matrix.pyarrow_version, matrix.pattern) }}
     env:
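
Reviewer note: the new exclude block prunes combinations from the expanded job matrix rather than listing every allowed pair. A rough Python sketch of the expansion, assuming three env_file values for illustration (the real workflow defines more axes, e.g. pattern):

    from itertools import product

    # Illustrative axis values only; the workflow's actual env_file list is longer.
    env_files = ["actions-38.yaml", "actions-39.yaml", "actions-310.yaml"]
    pyarrow_versions = ["5", "6", "7"]

    # Pairs removed by the exclude block added above.
    excluded = {
        ("actions-39.yaml", "6"), ("actions-39.yaml", "7"),
        ("actions-310.yaml", "6"), ("actions-310.yaml", "7"),
    }

    jobs = [(env, pa) for env, pa in product(env_files, pyarrow_versions)
            if (env, pa) not in excluded]
    # With these values, pyarrow 6 and 7 jobs run only against actions-38.yaml.
    print(jobs)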

doc/source/user_guide/cookbook.rst

Lines changed: 1 addition & 1 deletion

@@ -511,7 +511,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to

    def replace(g):
        mask = g < 0
-       return g.where(mask, g[~mask].mean())
+       return g.where(~mask, g[~mask].mean())

    gb.transform(replace)
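
Reviewer note: Series.where(cond, other) keeps values where cond is True and substitutes other where it is False. With mask = g < 0, the old code therefore kept the negatives and overwrote everything else; the fixed version replaces only the negatives with the mean of the rest. A quick sanity check with made-up data:

    import pandas as pd

    g = pd.Series([-1, 2, 4])
    mask = g < 0

    # Buggy form: keeps the negatives, overwrites the rest.
    print(g.where(mask, g[~mask].mean()).tolist())   # [-1.0, 3.0, 3.0]

    # Fixed form: replaces only the negatives with the non-negative mean.
    print(g.where(~mask, g[~mask].mean()).tolist())  # [3.0, 2.0, 4.0]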

doc/source/whatsnew/v1.5.0.rst

Lines changed: 111 additions & 4 deletions

@@ -100,6 +100,31 @@ as seen in the following example.
 1 2021-01-02 08:00:00           4
 2 2021-01-02 16:00:00           5

+.. _whatsnew_150.enhancements.tar:
+
+Reading directly from TAR archives
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+I/O methods like :func:`read_csv` or :meth:`DataFrame.to_json` now allow reading and writing
+directly on TAR archives (:issue:`44787`).
+
+.. code-block:: python
+
+    df = pd.read_csv("./movement.tar.gz")
+    # ...
+    df.to_csv("./out.tar.gz")
+
+This supports ``.tar``, ``.tar.gz``, ``.tar.bz2`` and ``.tar.xz`` archives.
+The compression method used is inferred from the filename.
+If the compression method cannot be inferred, use the ``compression`` argument:
+
+.. code-block:: python
+
+    df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"})  # noqa F821
+
+(``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
+
 .. _whatsnew_150.enhancements.other:

 Other enhancements
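
Reviewer note: a hedged end-to-end sketch of the TAR support added above (file names invented; assumes pandas with this change installed):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # tar + gzip compression is inferred from the ".tar.gz" suffix.
    df.to_csv("data.tar.gz", index=False)
    assert pd.read_csv("data.tar.gz").equals(df)

    # A bare file object carries no suffix, so spell the method out.
    with open("data.tar.gz", "rb") as fh:
        df2 = pd.read_csv(fh, compression={"method": "tar", "mode": "r:gz"})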
@@ -120,7 +145,7 @@ Other enhancements
 - :meth:`DataFrame.reset_index` now accepts a ``names`` argument which renames the index names (:issue:`6878`)
 - :meth:`pd.concat` now raises when ``levels`` is given but ``keys`` is None (:issue:`46653`)
 - :meth:`pd.concat` now raises when ``levels`` contains duplicate values (:issue:`46653`)
-- Added ``numeric_only`` argument to :meth:`DataFrame.corr`, :meth:`DataFrame.corrwith`, :meth:`DataFrame.cov`, :meth:`DataFrame.idxmin`, :meth:`DataFrame.idxmax`, :meth:`.GroupBy.idxmin`, :meth:`.GroupBy.idxmax`, :meth:`.GroupBy.var`, :meth:`.GroupBy.std`, :meth:`.GroupBy.sem`, and :meth:`.GroupBy.quantile` (:issue:`46560`)
+- Added ``numeric_only`` argument to :meth:`DataFrame.corr`, :meth:`DataFrame.corrwith`, :meth:`DataFrame.cov`, :meth:`DataFrame.idxmin`, :meth:`DataFrame.idxmax`, :meth:`.DataFrameGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.GroupBy.var`, :meth:`.GroupBy.std`, :meth:`.GroupBy.sem`, and :meth:`.DataFrameGroupBy.quantile` (:issue:`46560`)
 - A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`, :issue:`46725`)
 - Added ``validate`` argument to :meth:`DataFrame.join` (:issue:`46622`)
 - A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`)
@@ -194,10 +219,47 @@ did not have the same index as the input.
     df.groupby('a', dropna=True).transform('ffill')
     df.groupby('a', dropna=True).transform(lambda x: x)

-.. _whatsnew_150.notable_bug_fixes.notable_bug_fix2:
+.. _whatsnew_150.notable_bug_fixes.to_json_incorrectly_localizing_naive_timestamps:

-notable_bug_fix2
-^^^^^^^^^^^^^^^^
+Serializing tz-naive Timestamps with to_json() with ``iso_dates=True``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.to_json`, :meth:`Series.to_json`, and :meth:`Index.to_json`
+would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps
+to UTC. (:issue:`38760`)
+
+Note that this patch does not fix the localization of tz-aware Timestamps to UTC
+upon serialization. (Related issue :issue:`12997`)
+
+*Old Behavior*
+
+.. ipython:: python
+
+    index = pd.date_range(
+        start='2020-12-28 00:00:00',
+        end='2020-12-28 02:00:00',
+        freq='1H',
+    )
+    a = pd.Series(
+        data=range(3),
+        index=index,
+    )
+
+.. code-block:: ipython
+
+    In [4]: a.to_json(date_format='iso')
+    Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'
+
+    In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
+    Out[5]: array([False, False, False])
+
+*New Behavior*
+
+.. ipython:: python
+
+    a.to_json(date_format='iso')
+    # Roundtripping now works
+    pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index

 .. ---------------------------------------------------------------------------
 .. _whatsnew_150.api_breaking:
@@ -426,6 +488,48 @@ As ``group_keys=True`` is the default value of :meth:`DataFrame.groupby` and
 raise a ``FutureWarning``. This can be silenced and the previous behavior
 retained by specifying ``group_keys=False``.

+.. _whatsnew_150.deprecations.numeric_only_default:
+
+``numeric_only`` default value
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Across DataFrame operations such as ``min``, ``sum``, and ``idxmax``, the default
+value of the ``numeric_only`` argument, if it exists at all, was inconsistent.
+Furthermore, operations with the default value ``None`` can lead to surprising
+results. (:issue:`46560`)
+
+.. code-block:: ipython
+
+    In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
+
+    In [2]: # Reading the next line without knowing the contents of df, one would
+            # expect the result to contain the products for both columns a and b.
+            df[["a", "b"]].prod()
+    Out[2]:
+    a    2
+    dtype: int64
+
+To avoid this behavior, specifying the value ``numeric_only=None`` has been
+deprecated, and will be removed in a future version of pandas. In the future,
+all operations with a ``numeric_only`` argument will default to ``False``. Users
+should either call the operation only with columns that can be operated on, or
+specify ``numeric_only=True`` to operate only on Boolean, integer, and float columns.
+
+In order to support the transition to the new behavior, the following methods have
+gained the ``numeric_only`` argument.
+
+- :meth:`DataFrame.corr`
+- :meth:`DataFrame.corrwith`
+- :meth:`DataFrame.cov`
+- :meth:`DataFrame.idxmin`
+- :meth:`DataFrame.idxmax`
+- :meth:`.DataFrameGroupBy.idxmin`
+- :meth:`.DataFrameGroupBy.idxmax`
+- :meth:`.GroupBy.var`
+- :meth:`.GroupBy.std`
+- :meth:`.GroupBy.sem`
+- :meth:`.DataFrameGroupBy.quantile`
+
 .. _whatsnew_150.deprecations.other:

 Other Deprecations
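
Reviewer note: a small sketch of the migration path the section above recommends (illustrative frame; behavior as described in the deprecation note):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    # Explicit opt-in: only Boolean/integer/float columns participate, so
    # the silent dropping of column "b" is no longer a surprise.
    print(df.prod(numeric_only=True))  # a    2

    # Or select only the columns you actually mean to aggregate.
    print(df[["a"]].prod())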
@@ -448,6 +552,7 @@ Other Deprecations
 - Deprecated passing arguments as positional in :meth:`DataFrame.any` and :meth:`Series.any` (:issue:`44802`)
 - Deprecated the ``closed`` argument in :meth:`interval_range` in favor of ``inclusive`` argument; In a future version passing ``closed`` will raise (:issue:`40245`)
 - Deprecated the methods :meth:`DataFrame.mad`, :meth:`Series.mad`, and the corresponding groupby methods (:issue:`11787`)
+- Deprecated positional arguments to :meth:`Index.join` except for ``other``, use keyword-only arguments instead of positional arguments (:issue:`46518`)

 .. ---------------------------------------------------------------------------
 .. _whatsnew_150.performance:
@@ -629,8 +734,10 @@ Groupby/resample/rolling
 - Bug in :meth:`Rolling.var` and :meth:`Rolling.std` would give non-zero result with window of same values (:issue:`42064`)
 - Bug in :meth:`.Rolling.var` would segfault calculating weighted variance when window size was larger than data size (:issue:`46760`)
 - Bug in :meth:`Grouper.__repr__` where ``dropna`` was not included. Now it is (:issue:`46754`)
+- Bug in :meth:`DataFrame.rolling` giving a ``ValueError`` when ``center=True``, ``axis=1`` and ``win_type`` is specified (:issue:`46135`)
 - Bug in :meth:`.DataFrameGroupBy.describe` and :meth:`.SeriesGroupBy.describe` produces inconsistent results for empty datasets (:issue:`41575`)

+
 Reshaping
 ^^^^^^^^^
 - Bug in :func:`concat` between a :class:`Series` with integer dtype and another with :class:`CategoricalDtype` with integer categories and containing ``NaN`` values casting to object dtype instead of ``float64`` (:issue:`45359`)
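
Reviewer note: the describe bullet above is the subject of this merge branch (groupby_describe_empty_dataset). A minimal reproducer of the case it targets, per the changelog entry (exact output aside, the DataFrameGroupBy and SeriesGroupBy results should now be consistent):

    import pandas as pd

    df = pd.DataFrame({"a": [], "b": []}, dtype="float64")

    # describe() on an empty dataset, frame-wise and series-wise.
    print(df.groupby("a").describe())
    print(df.groupby("a")["b"].describe())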

pandas/__init__.py

Lines changed: 13 additions & 13 deletions

@@ -3,33 +3,33 @@
 __docformat__ = "restructuredtext"

 # Let users know if they're missing any of our hard dependencies
-hard_dependencies = ("numpy", "pytz", "dateutil")
-missing_dependencies = []
+_hard_dependencies = ("numpy", "pytz", "dateutil")
+_missing_dependencies = []

-for dependency in hard_dependencies:
+for _dependency in _hard_dependencies:
     try:
-        __import__(dependency)
-    except ImportError as e:
-        missing_dependencies.append(f"{dependency}: {e}")
+        __import__(_dependency)
+    except ImportError as _e:
+        _missing_dependencies.append(f"{_dependency}: {_e}")

-if missing_dependencies:
+if _missing_dependencies:
     raise ImportError(
-        "Unable to import required dependencies:\n" + "\n".join(missing_dependencies)
+        "Unable to import required dependencies:\n" + "\n".join(_missing_dependencies)
     )
-del hard_dependencies, dependency, missing_dependencies
+del _hard_dependencies, _dependency, _missing_dependencies

 # numpy compat
 from pandas.compat import is_numpy_dev as _is_numpy_dev

 try:
     from pandas._libs import hashtable as _hashtable, lib as _lib, tslib as _tslib
-except ImportError as err:  # pragma: no cover
-    module = err.name
+except ImportError as _err:  # pragma: no cover
+    _module = _err.name
     raise ImportError(
-        f"C extension: {module} not built. If you want to import "
+        f"C extension: {_module} not built. If you want to import "
         "pandas from the source directory, you may need to run "
         "'python setup.py build_ext --force' to build the C extensions first."
-    ) from err
+    ) from _err
 else:
     del _tslib, _lib, _hashtable
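
Reviewer note: the renames are mechanical. Underscore-prefixing the import-time helpers keeps them out of pandas' public namespace (and tab completion), including on error paths where the trailing del never runs. A quick check against an installed build:

    import pandas as pd

    # None of the import-time scaffolding should leak as a public attribute.
    leaked = [name for name in ("hard_dependencies", "dependency",
                                "missing_dependencies", "module")
              if hasattr(pd, name)]
    assert leaked == [], leaked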

pandas/_libs/algos.pyi

Lines changed: 1 addition & 0 deletions

@@ -109,6 +109,7 @@ def rank_1d(
     ascending: bool = ...,
     pct: bool = ...,
     na_option=...,
+    mask: npt.NDArray[np.bool_] | None = ...,
 ) -> np.ndarray: ...  # np.ndarray[float64_t, ndim=1]
 def rank_2d(
     in_arr: np.ndarray,  # ndarray[numeric_object_t, ndim=2]

pandas/_libs/algos.pyx

Lines changed: 7 additions & 2 deletions

@@ -889,6 +889,7 @@ def rank_1d(
     bint ascending=True,
     bint pct=False,
     na_option="keep",
+    const uint8_t[:] mask=None,
 ):
     """
     Fast NaN-friendly version of ``scipy.stats.rankdata``.
@@ -918,6 +919,8 @@ def rank_1d(
         * keep: leave NA values where they are
         * top: smallest rank if ascending
         * bottom: smallest rank if descending
+    mask : np.ndarray[bool], optional, default None
+        Specify locations to be treated as NA, e.g. for Categorical.
     """
     cdef:
         TiebreakEnumType tiebreak
@@ -927,7 +930,6 @@ def rank_1d(
         float64_t[::1] out
         ndarray[numeric_object_t, ndim=1] masked_vals
         numeric_object_t[:] masked_vals_memview
-        uint8_t[:] mask
         bint keep_na, nans_rank_highest, check_labels, check_mask
         numeric_object_t nan_fill_val

@@ -956,6 +958,7 @@ def rank_1d(
         or numeric_object_t is object
         or (numeric_object_t is int64_t and is_datetimelike)
     )
+    check_mask = check_mask or mask is not None

     # Copy values into new array in order to fill missing data
     # with mask, without obfuscating location of missing data
@@ -965,7 +968,9 @@ def rank_1d(
     else:
         masked_vals = values.copy()

-    if numeric_object_t is object:
+    if mask is not None:
+        pass
+    elif numeric_object_t is object:
         mask = missing.isnaobj(masked_vals)
     elif numeric_object_t is int64_t and is_datetimelike:
         mask = (masked_vals == NPY_NAT).astype(np.uint8)
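
Reviewer note: rank_1d is internal, but the semantics of the new mask argument can be mimicked with the public API: positions flagged by the mask rank as missing regardless of the value the backing array holds (for example, the sentinel slots behind a masked or Categorical array). An illustrative sketch:

    import numpy as np
    import pandas as pd

    values = np.array([3, 1, 2, 9], dtype="int64")
    # Treat the last slot as NA even though it stores a real integer.
    mask = np.array([False, False, False, True])

    ranks = pd.Series(values).mask(mask).rank()  # na_option="keep" equivalent
    print(ranks.tolist())  # [3.0, 1.0, 2.0, nan]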

pandas/_libs/groupby.pyi

Lines changed: 1 addition & 0 deletions

@@ -128,6 +128,7 @@ def group_rank(
     ascending: bool = ...,
     pct: bool = ...,
     na_option: Literal["keep", "top", "bottom"] = ...,
+    mask: npt.NDArray[np.bool_] | None = ...,
 ) -> None: ...
 def group_max(
     out: np.ndarray,  # groupby_t[:, ::1]

pandas/_libs/groupby.pyx

Lines changed: 12 additions & 3 deletions

@@ -1262,6 +1262,7 @@ def group_rank(
     bint ascending=True,
     bint pct=False,
     str na_option="keep",
+    const uint8_t[:, :] mask=None,
 ) -> None:
     """
     Provides the rank of values within each group.
@@ -1294,6 +1295,7 @@ def group_rank(
         * keep: leave NA values where they are
         * top: smallest rank if ascending
         * bottom: smallest rank if descending
+    mask : np.ndarray[bool] or None, default None

     Notes
     -----
@@ -1302,22 +1304,29 @@ def group_rank(
     cdef:
         Py_ssize_t i, k, N
         ndarray[float64_t, ndim=1] result
+        const uint8_t[:] sub_mask

     N = values.shape[1]

     for k in range(N):
+        if mask is None:
+            sub_mask = None
+        else:
+            sub_mask = mask[:, k]
+
         result = rank_1d(
             values=values[:, k],
             labels=labels,
             is_datetimelike=is_datetimelike,
             ties_method=ties_method,
             ascending=ascending,
             pct=pct,
-            na_option=na_option
+            na_option=na_option,
+            mask=sub_mask,
         )
         for i in range(len(result)):
-            # TODO: why can't we do out[:, k] = result?
-            out[i, k] = result[i]
+            if labels[i] >= 0:
+                out[i, k] = result[i]


 # ----------------------------------------------------------------------
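
Reviewer note: two behavior changes in this hunk. Each column now forwards its own mask slice into rank_1d, and rows whose group label is -1 (NA keys under dropna=True) are no longer written into out, replacing the old unconditional copy. In public-API terms, roughly (illustrative frame; the nullable Int64 column is what supplies a mask):

    import pandas as pd

    df = pd.DataFrame({
        "key": ["a", "a", None, "b"],
        "val": pd.array([10, 20, 30, 40], dtype="Int64"),  # masked dtype
    })

    # The row with a null key gets a missing rank, not a stale value.
    print(df.groupby("key", dropna=True)["val"].rank())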
