Commit 2890f77

Merge remote-tracking branch 'upstream/main' into cow_fillna_fix
2 parents: 73db7c2 + fe0cc48

28 files changed: +347, -100 lines

.github/actions/setup-conda/action.yml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ runs:
     - name: Set Arrow version in ${{ inputs.environment-file }} to ${{ inputs.pyarrow-version }}
       run: |
         grep -q ' - pyarrow' ${{ inputs.environment-file }}
-        sed -i"" -e "s/ - pyarrow<11/ - pyarrow=${{ inputs.pyarrow-version }}/" ${{ inputs.environment-file }}
+        sed -i"" -e "s/ - pyarrow/ - pyarrow=${{ inputs.pyarrow-version }}/" ${{ inputs.environment-file }}
         cat ${{ inputs.environment-file }}
       shell: bash
       if: ${{ inputs.pyarrow-version }}

ci/deps/actions-310.yaml

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ dependencies:
   - psycopg2
   - pymysql
   - pytables
-  - pyarrow<11
+  - pyarrow
   - pyreadstat
   - python-snappy
   - pyxlsb

ci/deps/actions-311.yaml

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ dependencies:
   - psycopg2
   - pymysql
   # - pytables>=3.8.0 # first version that supports 3.11
-  - pyarrow<11
+  - pyarrow
   - pyreadstat
   - python-snappy
   - pyxlsb

ci/deps/actions-38-downstream_compat.yaml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ dependencies:
   - openpyxl
   - odfpy
   - psycopg2
-  - pyarrow<11
+  - pyarrow
   - pymysql
   - pyreadstat
   - pytables

ci/deps/actions-38.yaml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ dependencies:
   - odfpy
   - pandas-gbq
   - psycopg2
-  - pyarrow<11
+  - pyarrow
   - pymysql
   - pyreadstat
   - pytables

ci/deps/actions-39.yaml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ dependencies:
   - pandas-gbq
   - psycopg2
   - pymysql
-  - pyarrow<11
+  - pyarrow
   - pyreadstat
   - pytables
   - python-snappy

ci/deps/circle-38-arm64.yaml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ dependencies:
   - odfpy
   - pandas-gbq
   - psycopg2
-  - pyarrow<11
+  - pyarrow
   - pymysql
   # Not provided on ARM
   #- pyreadstat

doc/source/user_guide/io.rst

Lines changed: 2 additions & 5 deletions
@@ -5498,11 +5498,8 @@ included in Python's standard library by default.
 You can find an overview of supported drivers for each SQL dialect in the
 `SQLAlchemy docs <https://docs.sqlalchemy.org/en/latest/dialects/index.html>`__.

-If SQLAlchemy is not installed, a fallback is only provided for sqlite (and
-for mysql for backwards compatibility, but this is deprecated and will be
-removed in a future version).
-This mode requires a Python database adapter which respect the `Python
-DB-API <https://www.python.org/dev/peps/pep-0249/>`__.
+If SQLAlchemy is not installed, you can use a :class:`sqlite3.Connection` in place of
+a SQLAlchemy engine, connection, or URI string.

 See also some :ref:`cookbook examples <cookbook.sql>` for some advanced strategies.
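
The updated wording describes the sqlite3 fallback; a minimal sketch of what it means in practice (the in-memory database and throwaway table name are chosen here for illustration):

import sqlite3

import pandas as pd

# Without SQLAlchemy installed, a plain sqlite3.Connection works
# anywhere read_sql/to_sql accept an engine, connection, or URI string.
con = sqlite3.connect(":memory:")
pd.DataFrame({"a": [1, 2]}).to_sql("t", con, index=False)
print(pd.read_sql("SELECT * FROM t", con))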

doc/source/whatsnew/v2.0.0.rst

Lines changed: 6 additions & 0 deletions
@@ -244,6 +244,10 @@ Copy-on-Write improvements
   a modification to the data happens) when constructing a Series from an existing
   Series with the default of ``copy=False`` (:issue:`50471`)

+- The :class:`DataFrame` constructor, when constructing a DataFrame from a dictionary
+  of Series objects and specifying ``copy=False``, will now use a lazy copy
+  of those Series objects for the columns of the DataFrame (:issue:`50777`)
+
 - Trying to set values using chained assignment (for example, ``df["a"][1:3] = 0``)
   will now always raise an exception when Copy-on-Write is enabled. In this mode,
   chained assignment can never work because we are always setting into a temporary
@@ -787,7 +791,9 @@ Other API changes
 - The levels of the index of the :class:`Series` returned from ``Series.sparse.from_coo`` now always have dtype ``int32``. Previously they had dtype ``int64`` (:issue:`50926`)
 - :func:`to_datetime` with ``unit`` of either "Y" or "M" will now raise if a sequence contains a non-round ``float`` value, matching the ``Timestamp`` behavior (:issue:`50301`)
 - The methods :meth:`Series.round`, :meth:`DataFrame.__invert__`, :meth:`Series.__invert__`, :meth:`DataFrame.swapaxes`, :meth:`DataFrame.first`, :meth:`DataFrame.last`, :meth:`Series.first`, :meth:`Series.last` and :meth:`DataFrame.align` will now always return new objects (:issue:`51032`)
+- :class:`DataFrameGroupBy` aggregations (e.g. "sum") with object-dtype columns no longer infer non-object dtypes for their results, explicitly call ``result.infer_objects(copy=False)`` on the result to obtain the old behavior (:issue:`51205`)
 - Added :func:`pandas.api.types.is_any_real_numeric_dtype` to check for real numeric dtypes (:issue:`51152`)
+-

 .. ---------------------------------------------------------------------------
 .. _whatsnew_200.deprecations:
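
A short sketch of the lazy-copy behavior the first hunk documents, assuming Copy-on-Write is enabled via the ``mode.copy_on_write`` option:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

s = pd.Series([1, 2, 3], name="a")
df = pd.DataFrame({"a": s}, copy=False)  # lazy copy: no data duplicated yet

df.iloc[0, 0] = 99  # the first write triggers the actual copy
print(s[0])         # 1 -- the parent Series is unchanged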

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ dependencies:
   - odfpy
   - py
   - psycopg2
-  - pyarrow<11
+  - pyarrow
   - pymysql
   - pyreadstat
   - pytables

pandas/core/arrays/arrow/array.py

Lines changed: 12 additions & 0 deletions
@@ -275,6 +275,18 @@ def _from_sequence_of_strings(
             from pandas.core.tools.timedeltas import to_timedelta

             scalars = to_timedelta(strings, errors="raise")
+            if pa_type.unit != "ns":
+                # GH51175: test_from_sequence_of_strings_pa_array
+                # attempt to parse as int64 reflecting pyarrow's
+                # duration to string casting behavior
+                mask = isna(scalars)
+                if not isinstance(strings, (pa.Array, pa.ChunkedArray)):
+                    strings = pa.array(strings, type=pa.string(), from_pandas=True)
+                strings = pc.if_else(mask, None, strings)
+                try:
+                    scalars = strings.cast(pa.int64())
+                except pa.ArrowInvalid:
+                    pass
         elif pa.types.is_time(pa_type):
             from pandas.core.tools.times import to_time
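
The new branch mirrors how pyarrow stringifies non-nanosecond durations (the raw integer count rather than a timedelta-like repr), so the strings are retried as int64. A small illustration, assuming a pyarrow version whose duration-to-string cast behaves this way:

import pyarrow as pa

arr = pa.array([1, 90], type=pa.duration("s"))
print(arr.cast(pa.string()))  # ["1", "90"] -- plain integer counts
# Round-tripping such strings back into duration[s] therefore means
# parsing them as int64 first, which is what the added branch attempts:
print(pa.array(["1", "90"]).cast(pa.int64()).cast(pa.duration("s")))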

pandas/core/generic.py

Lines changed: 5 additions & 3 deletions
@@ -9612,7 +9612,8 @@ def _where(
         # align the cond to same shape as myself
         cond = common.apply_if_callable(cond, self)
         if isinstance(cond, NDFrame):
-            cond, _ = cond.align(self, join="right", broadcast_axis=1, copy=False)
+            # CoW: Make sure reference is not kept alive
+            cond = cond.align(self, join="right", broadcast_axis=1, copy=False)[0]
         else:
             if not hasattr(cond, "shape"):
                 cond = np.asanyarray(cond)
@@ -9648,14 +9649,15 @@ def _where(
             # align with me
             if other.ndim <= self.ndim:

-                _, other = self.align(
+                # CoW: Make sure reference is not kept alive
+                other = self.align(
                     other,
                     join="left",
                     axis=axis,
                     level=level,
                     fill_value=None,
                     copy=False,
-                )
+                )[1]

                 # if we are NOT aligned, raise as we cannot where index
                 if axis is None and not other._indexed_same(self):
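
Both hunks replace tuple unpacking with indexing because ``cond, _ = ...`` leaves the discarded half of the pair bound to ``_``, which under Copy-on-Write counts as a live reference to the aligned object. A sketch of the distinction, with hypothetical names:

def align_pair():
    # stand-in for NDFrame.align(), which returns a 2-tuple
    return "aligned_cond", "aligned_other"

cond, _ = align_pair()  # `_` keeps the second element referenced
cond = align_pair()[0]  # the unused element dies with the temporary
                        # tuple, so no stray reference survives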

pandas/core/groupby/groupby.py

Lines changed: 4 additions & 2 deletions
@@ -1495,6 +1495,9 @@ def _agg_py_fallback(
             # TODO: if we ever get "rank" working, exclude it here.
             res_values = type(values)._from_sequence(res_values, dtype=values.dtype)

+        elif ser.dtype == object:
+            res_values = res_values.astype(object, copy=False)
+
         # If we are DataFrameGroupBy and went through a SeriesGroupByPath
         # then we need to reshape
         # GH#32223 includes case with IntegerArray values, ndarray res_values
@@ -1537,8 +1540,7 @@ def array_func(values: ArrayLike) -> ArrayLike:
         new_mgr = data.grouped_reduce(array_func)
         res = self._wrap_agged_manager(new_mgr)
         out = self._wrap_aggregated_output(res)
-        if data.ndim == 2:
-            # TODO: don't special-case DataFrame vs Series
+        if self.axis == 1:
             out = out.infer_objects(copy=False)
         return out
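
This pairs with the :issue:`51205` whatsnew entry above: object-dtype aggregation results now stay object dtype instead of being inferred. A sketch of the user-visible difference:

import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b"], "val": pd.Series([1, 2, 3], dtype=object)}
)
res = df.groupby("key").sum()
print(res["val"].dtype)  # object -- no longer inferred to int64
print(res.infer_objects(copy=False)["val"].dtype)  # int64, the old behavior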

pandas/core/internals/blocks.py

Lines changed: 42 additions & 8 deletions
@@ -946,7 +946,7 @@ def _unstack(

     # ---------------------------------------------------------------------

-    def setitem(self, indexer, value) -> Block:
+    def setitem(self, indexer, value, using_cow: bool = False) -> Block:
         """
         Attempt self.values[indexer] = value, possibly creating a new array.

@@ -956,6 +956,8 @@ def setitem(self, indexer, value) -> Block:
             The subset of self.values to set
         value : object
             The value being set
+        using_cow: bool, default False
+            Signaling if CoW is used.

         Returns
         -------
@@ -991,10 +993,17 @@ def setitem(self, indexer, value) -> Block:
             # checking lib.is_scalar here fails on
             # test_iloc_setitem_custom_object
             casted = setitem_datetimelike_compat(values, len(vi), casted)
+
+        if using_cow and self.refs.has_reference():
+            values = values.copy()
+            self = self.make_block_same_class(
+                values.T if values.ndim == 2 else values
+            )
+
         values[indexer] = casted
         return self

-    def putmask(self, mask, new) -> list[Block]:
+    def putmask(self, mask, new, using_cow: bool = False) -> list[Block]:
         """
         putmask the data to the block; it is possible that we may create a
         new dtype of block
@@ -1022,11 +1031,21 @@ def putmask(self, mask, new) -> list[Block]:
         new = extract_array(new, extract_numpy=True)

         if noop:
+            if using_cow:
+                return [self.copy(deep=False)]
             return [self]

         try:
             casted = np_can_hold_element(values.dtype, new)
+
+            if using_cow and self.refs.has_reference():
+                # Do this here to avoid copying twice
+                values = values.copy()
+                self = self.make_block_same_class(values)
+
             putmask_without_repeat(values.T, mask, casted)
+            if using_cow:
+                return [self.copy(deep=False)]
             return [self]
         except LossySetitemError:
@@ -1038,7 +1057,7 @@ def putmask(self, mask, new) -> list[Block]:
                 return self.coerce_to_target_dtype(new).putmask(mask, new)
             else:
                 indexer = mask.nonzero()[0]
-                nb = self.setitem(indexer, new[indexer])
+                nb = self.setitem(indexer, new[indexer], using_cow=using_cow)
                 return [nb]

         else:
@@ -1053,7 +1072,7 @@ def putmask(self, mask, new) -> list[Block]:
                 n = new[:, i : i + 1]

                 submask = orig_mask[:, i : i + 1]
-                rbs = nb.putmask(submask, n)
+                rbs = nb.putmask(submask, n, using_cow=using_cow)
                 res_blocks.extend(rbs)
         return res_blocks

@@ -1462,7 +1481,7 @@ class EABackedBlock(Block):

     values: ExtensionArray

-    def setitem(self, indexer, value):
+    def setitem(self, indexer, value, using_cow: bool = False):
         """
         Attempt self.values[indexer] = value, possibly creating a new array.

@@ -1475,6 +1494,8 @@ def setitem(self, indexer, value):
             The subset of self.values to set
         value : object
             The value being set
+        using_cow: bool, default False
+            Signaling if CoW is used.

         Returns
         -------
@@ -1581,7 +1602,7 @@ def where(self, other, cond, _downcast: str | bool = "infer") -> list[Block]:
             nb = self.make_block_same_class(res_values)
             return [nb]

-    def putmask(self, mask, new) -> list[Block]:
+    def putmask(self, mask, new, using_cow: bool = False) -> list[Block]:
         """
         See Block.putmask.__doc__
         """
@@ -1599,8 +1620,16 @@ def putmask(self, mask, new) -> list[Block]:
         mask = self._maybe_squeeze_arg(mask)

         if not mask.any():
+            if using_cow:
+                return [self.copy(deep=False)]
             return [self]

+        if using_cow and self.refs.has_reference():
+            values = values.copy()
+            self = self.make_block_same_class(  # type: ignore[assignment]
+                values.T if values.ndim == 2 else values
+            )
+
         try:
             # Caller is responsible for ensuring matching lengths
             values._putmask(mask, new)
@@ -1649,6 +1678,9 @@ def delete(self, loc) -> list[Block]:
             values = self.values.delete(loc)
             mgr_locs = self._mgr_locs.delete(loc)
             return [type(self)(values, placement=mgr_locs, ndim=self.ndim)]
+        elif self.values.ndim == 1:
+            # We get here through to_stata
+            return []
         return super().delete(loc)

     @cache_readonly
@@ -2230,15 +2262,17 @@ def get_block_type(dtype: DtypeObj):
         return cls


-def new_block_2d(values: ArrayLike, placement: BlockPlacement):
+def new_block_2d(
+    values: ArrayLike, placement: BlockPlacement, refs: BlockValuesRefs | None = None
+):
     # new_block specialized to case with
     #  ndim=2
     #  isinstance(placement, BlockPlacement)
     #  check_ndim/ensure_block_shape already checked
     klass = get_block_type(values.dtype)

     values = maybe_coerce_values(values)
-    return klass(values, ndim=2, placement=placement)
+    return klass(values, ndim=2, placement=placement, refs=refs)


 def new_block(
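
A sketch of the behavior the new ``using_cow`` paths protect, seen from the user side (assuming Copy-on-Write is enabled): a masked write to a DataFrame that shares its values with another object must copy first so the sibling stays untouched.

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
view = df[:]               # shares blocks with df, so refs.has_reference() is True
df[df["a"] > 1] = 0        # putmask path: copies the shared values before writing
print(view["a"].tolist())  # [1, 2, 3] -- the view is unaffected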

pandas/core/internals/construction.py

Lines changed: 16 additions & 6 deletions
@@ -116,7 +116,7 @@ def arrays_to_mgr(
             index = ensure_index(index)

         # don't force copy because getting jammed in an ndarray anyway
-        arrays = _homogenize(arrays, index, dtype)
+        arrays, refs = _homogenize(arrays, index, dtype)
         # _homogenize ensures
         #  - all(len(x) == len(index) for x in arrays)
         #  - all(x.ndim == 1 for x in arrays)
@@ -126,8 +126,10 @@ def arrays_to_mgr(
     else:
         index = ensure_index(index)
         arrays = [extract_array(x, extract_numpy=True) for x in arrays]
+        # with _from_arrays, the passed arrays should never be Series objects
+        refs = [None] * len(arrays)

-        # Reached via DataFrame._from_arrays; we do validation here
+        # Reached via DataFrame._from_arrays; we do minimal validation here
         for arr in arrays:
             if (
                 not isinstance(arr, (np.ndarray, ExtensionArray))
@@ -148,7 +150,7 @@ def arrays_to_mgr(

     if typ == "block":
         return create_block_manager_from_column_arrays(
-            arrays, axes, consolidate=consolidate
+            arrays, axes, consolidate=consolidate, refs=refs
         )
     elif typ == "array":
         return ArrayManager(arrays, [index, columns])
@@ -547,9 +549,13 @@ def _ensure_2d(values: np.ndarray) -> np.ndarray:
     return values


-def _homogenize(data, index: Index, dtype: DtypeObj | None) -> list[ArrayLike]:
+def _homogenize(
+    data, index: Index, dtype: DtypeObj | None
+) -> tuple[list[ArrayLike], list[Any]]:
     oindex = None
     homogenized = []
+    # if the original array-like in `data` is a Series, keep track of this Series' refs
+    refs: list[Any] = []

     for val in data:
         if isinstance(val, ABCSeries):
@@ -559,7 +565,10 @@ def _homogenize(data, index: Index, dtype: DtypeObj | None) -> list[ArrayLike]:
                 # Forces alignment. No need to copy data since we
                 # are putting it into an ndarray later
                 val = val.reindex(index, copy=False)
-
+            if isinstance(val._mgr, SingleBlockManager):
+                refs.append(val._mgr._block.refs)
+            else:
+                refs.append(None)
             val = val._values
         else:
             if isinstance(val, dict):
@@ -578,10 +587,11 @@ def _homogenize(data, index: Index, dtype: DtypeObj | None) -> list[ArrayLike]:

             val = sanitize_array(val, index, dtype=dtype, copy=False)
             com.require_length_match(val, index)
+            refs.append(None)

         homogenized.append(val)

-    return homogenized
+    return homogenized, refs


 def _extract_index(data) -> Index:
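
The ``refs`` collected here are what make ``DataFrame({"a": s}, copy=False)`` a lazy copy: each column block records the source Series' block references, so a later write on either side knows a defensive copy is needed. A hedged peek at the (private, version-dependent) internals:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

s = pd.Series([1.0, 2.0], name="a")
df = pd.DataFrame({"a": s}, copy=False)

# Private API, shown only for illustration; attribute names may change.
print(df._mgr.blocks[0].refs.has_reference())  # True while `s` is alive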
