Commit 0f2eea1

Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
2 parents 7acd4a5 + f15e31f commit 0f2eea1

33 files changed

+336
-113
lines changed

.pre-commit-config.yaml

Lines changed: 12 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -15,15 +15,22 @@ default_stages: [
1515
ci:
1616
autofix_prs: false
1717
repos:
18+
- repo: local
19+
hooks:
20+
# NOTE: we make `black` a local hook because if it's installed from
21+
# PyPI (rather than from source) then it'll run twice as fast thanks to mypyc
22+
- id: black
23+
name: black
24+
description: "Black: The uncompromising Python code formatter"
25+
entry: black
26+
language: python
27+
require_serial: true
28+
types_or: [python, pyi]
29+
additional_dependencies: [black==23.1.0]
1830
- repo: https://github.com/charliermarsh/ruff-pre-commit
1931
rev: v0.0.244
2032
hooks:
2133
- id: ruff
22-
- repo: https://github.com/MarcoGorelli/absolufy-imports
23-
rev: v0.3.1
24-
hooks:
25-
- id: absolufy-imports
26-
files: ^pandas/
2734
- repo: https://github.com/jendrikseipp/vulture
2835
rev: 'v2.7'
2936
hooks:
@@ -116,16 +123,6 @@ repos:
116123
- id: sphinx-lint
117124
- repo: local
118125
hooks:
119-
# NOTE: we make `black` a local hook because if it's installed from
120-
# PyPI (rather than from source) then it'll run twice as fast thanks to mypyc
121-
- id: black
122-
name: black
123-
description: "Black: The uncompromising Python code formatter"
124-
entry: black
125-
language: python
126-
require_serial: true
127-
types_or: [python, pyi]
128-
additional_dependencies: [black==23.1.0]
129126
- id: pyright
130127
# note: assumes python env is setup and activated
131128
name: pyright

ci/code_checks.sh

Lines changed: 0 additions & 3 deletions
@@ -591,14 +591,11 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
591591
pandas.api.types.is_timedelta64_ns_dtype \
592592
pandas.api.types.is_unsigned_integer_dtype \
593593
pandas.core.groupby.DataFrameGroupBy.take \
594-
pandas.core.groupby.SeriesGroupBy.take \
595594
pandas.io.formats.style.Styler.concat \
596595
pandas.io.formats.style.Styler.export \
597596
pandas.io.formats.style.Styler.set_td_classes \
598597
pandas.io.formats.style.Styler.use \
599598
pandas.io.json.build_table_schema \
600-
pandas.merge_ordered \
601-
pandas.option_context \
602599
pandas.plotting.andrews_curves \
603600
pandas.plotting.autocorrelation_plot \
604601
pandas.plotting.lag_plot \

doc/source/development/contributing.rst

Lines changed: 20 additions & 0 deletions
@@ -331,6 +331,26 @@ To automatically fix formatting errors on each commit you make, you can
331331
set up pre-commit yourself. First, create a Python :ref:`environment
332332
<contributing_environment>` and then set up :ref:`pre-commit <contributing.pre-commit>`.
333333

334+
.. _contributing.update-dev:
335+
336+
Updating the development environment
337+
------------------------------------
338+
339+
After updating your branch to merge in main from upstream, you may need to update
340+
your development environment to reflect any changes to the various packages that
341+
are used during development.
342+
343+
If using :ref:`mamba <contributing.mamba>`, do::
344+
345+
mamba deactivate
346+
mamba env update -f environment.yml
347+
mamba activate pandas-dev
348+
349+
If using :ref:`pip <contributing.pip>`, do::
350+
351+
# activate the virtual environment based on your platform
352+
python -m pip install --upgrade -r requirements-dev.txt
353+
334354
Tips for a successful pull request
335355
==================================
336356

doc/source/development/contributing_codebase.rst

Lines changed: 6 additions & 0 deletions
@@ -89,6 +89,12 @@ without needing to have done ``pre-commit install`` beforehand.
8989
you may run into issues if you're using conda. To solve this, you can downgrade
9090
``virtualenv`` to version ``20.0.33``.
9191

92+
.. note::
93+
94+
If you have recently merged in main from the upstream branch, some of the
95+
dependencies used by ``pre-commit`` may have changed. Make sure to
96+
:ref:`update your development environment <contributing.update-dev>`.
97+
9298
Optional dependencies
9399
---------------------
94100

doc/source/development/contributing_environment.rst

Lines changed: 2 additions & 0 deletions
@@ -95,6 +95,8 @@ Option 1: using mamba (recommended)
9595
mamba env create --file environment.yml
9696
mamba activate pandas-dev
9797
98+
.. _contributing.pip:
99+
98100
Option 2: using pip
99101
~~~~~~~~~~~~~~~~~~~
100102

doc/source/user_guide/io.rst

Lines changed: 3 additions & 3 deletions
@@ -294,9 +294,9 @@ date_parser : function, default ``None``
294294
.. deprecated:: 2.0.0
295295
Use ``date_format`` instead, or read in as ``object`` and then apply
296296
:func:`to_datetime` as-needed.
297-
date_format : str, default ``None``
297+
date_format : str or dict of column -> format, default ``None``
298298
If used in conjunction with ``parse_dates``, will parse dates according to this
299-
format. For anything more complex (e.g. different formats for different columns),
299+
format. For anything more complex,
300300
please read in as ``object`` and then apply :func:`to_datetime` as-needed.
301301

302302
.. versionadded:: 2.0.0
@@ -912,7 +912,7 @@ Finally, the parser allows you to specify a custom ``date_format``.
912912
Performance-wise, you should try these methods of parsing dates in order:
913913

914914
1. If you know the format, use ``date_format``, e.g.:
915-
``date_format="%d/%m/%Y"``.
915+
``date_format="%d/%m/%Y"`` or ``date_format={column_name: "%d/%m/%Y"}``.
916916

917917
2. If you have different formats for different columns, or want to pass any extra options (such
918918
as ``utc``) to ``to_datetime``, then you should read in your data as ``object`` dtype, and
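Note on the ``date_format`` change above: a minimal sketch of the dict form, assuming pandas 2.0 or later is installed. The CSV content and the column names ``start`` and ``end`` are made up for illustration and do not come from the commit.

```python
import io

import pandas as pd

# Hypothetical data: two date columns written in different formats.
csv_text = "start,end\n01/02/2023,2023-03-04\n15/06/2023,2023-07-20\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["start", "end"],
    # One format string per column, keyed by column name.
    date_format={"start": "%d/%m/%Y", "end": "%Y-%m-%d"},
)
```

With a single string instead of a dict, the same format would be applied to every column listed in ``parse_dates``.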

doc/source/whatsnew/v2.0.0.rst

Lines changed: 3 additions & 2 deletions
@@ -222,6 +222,7 @@ Copy-on-Write improvements
222222
- :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
223223
- :meth:`DataFrame.to_period` / :meth:`Series.to_period`
224224
- :meth:`DataFrame.truncate`
225+
- :meth:`DataFrame.iterrows`
225226
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
226227
- :meth:`DataFrame.fillna` / :meth:`Series.fillna`
227228
- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
@@ -1107,7 +1108,7 @@ Performance improvements
11071108
- Performance improvement in :meth:`Series.rank` for pyarrow-backed dtypes (:issue:`50264`)
11081109
- Performance improvement in :meth:`Series.searchsorted` for pyarrow-backed dtypes (:issue:`50447`)
11091110
- Performance improvement in :meth:`Series.fillna` for extension array dtypes (:issue:`49722`, :issue:`50078`)
1110-
- Performance improvement in :meth:`Index.join`, :meth:`Index.intersection` and :meth:`Index.union` for masked dtypes when :class:`Index` is monotonic (:issue:`50310`)
1111+
- Performance improvement in :meth:`Index.join`, :meth:`Index.intersection` and :meth:`Index.union` for masked and arrow dtypes when :class:`Index` is monotonic (:issue:`50310`, :issue:`51365`)
11111112
- Performance improvement for :meth:`Series.value_counts` with nullable dtype (:issue:`48338`)
11121113
- Performance improvement for :class:`Series` constructor passing integer numpy array with nullable dtype (:issue:`48338`)
11131114
- Performance improvement for :class:`DatetimeIndex` constructor passing a list (:issue:`48609`)
@@ -1125,7 +1126,7 @@ Performance improvements
11251126
- Performance improvement in :meth:`~arrays.ArrowExtensionArray.factorize` (:issue:`49177`)
11261127
- Performance improvement in :meth:`~arrays.ArrowExtensionArray.__setitem__` (:issue:`50248`, :issue:`50632`)
11271128
- Performance improvement in :class:`~arrays.ArrowExtensionArray` comparison methods when array contains NA (:issue:`50524`)
1128-
- Performance improvement in :meth:`~arrays.ArrowExtensionArray.to_numpy` (:issue:`49973`)
1129+
- Performance improvement in :meth:`~arrays.ArrowExtensionArray.to_numpy` (:issue:`49973`, :issue:`51227`)
11291130
- Performance improvement when parsing strings to :class:`BooleanDtype` (:issue:`50613`)
11301131
- Performance improvement in :meth:`DataFrame.join` when joining on a subset of a :class:`MultiIndex` (:issue:`48611`)
11311132
- Performance improvement for :meth:`MultiIndex.intersection` (:issue:`48604`)

pandas/_config/config.py

Lines changed: 1 addition & 0 deletions
@@ -424,6 +424,7 @@ class option_context(ContextDecorator):
424424
425425
Examples
426426
--------
427+
>>> from pandas import option_context
427428
>>> with option_context('display.max_rows', 10, 'display.max_columns', 5):
428429
... pass
429430
"""
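Note on the docstring change above: the added import makes the example self-contained. A slightly longer sketch of the same usage, assuming pandas is installed, which also checks that the option is restored on exit (the variable names ``before``, ``inside`` and ``after`` are illustrative only):

```python
import pandas as pd
from pandas import option_context

# Remember the current setting so we can compare afterwards.
before = pd.get_option("display.max_rows")

with option_context("display.max_rows", 10, "display.max_columns", 5):
    # Inside the block the temporary values are active.
    inside = pd.get_option("display.max_rows")

# On exit the previous value is restored.
after = pd.get_option("display.max_rows")
```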

pandas/core/arrays/arrow/array.py

Lines changed: 20 additions & 5 deletions
@@ -510,6 +510,18 @@ def __len__(self) -> int:
510510
"""
511511
return len(self._data)
512512

513+
def __contains__(self, key) -> bool:
514+
# https://github.com/pandas-dev/pandas/pull/51307#issuecomment-1426372604
515+
if isna(key) and key is not self.dtype.na_value:
516+
if self.dtype.kind == "f" and lib.is_float(key) and isna(key):
517+
return pc.any(pc.is_nan(self._data)).as_py()
518+
519+
# e.g. date or timestamp types we do not allow None here to match pd.NA
520+
return False
521+
# TODO: maybe complex? object?
522+
523+
return bool(super().__contains__(key))
524+
513525
@property
514526
def _hasna(self) -> bool:
515527
return self._data.null_count > 0
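Note on the ``__contains__`` hunk above: it treats a float NaN key as a real value to search for on float data, and refuses to let other NA-like keys (such as ``None``) match the dtype's own missing-value sentinel. The following is a simplified pure-Python model of that rule, not the pandas implementation; ``NA``, ``contains`` and the ``float_dtype`` flag are hypothetical names introduced for this sketch.

```python
import math

# Hypothetical stand-in for the dtype's missing-value sentinel.
NA = object()


def contains(values, key, *, float_dtype):
    """Simplified model of the membership rule for a list of scalars."""
    if isinstance(key, float) and math.isnan(key):
        # NaN is a value, not "missing": look for an actual NaN,
        # but only when the data holds floats.
        if float_dtype:
            return any(isinstance(v, float) and math.isnan(v) for v in values)
        return False
    if key is None:
        # Other NA-likes do not match the sentinel.
        return False
    if key is NA:
        return any(v is NA for v in values)
    return any(v is not NA and v == key for v in values)
```

In this model ``contains([1.0, NA], float("nan"), float_dtype=True)`` is ``False`` because the list holds a missing value but no NaN.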
@@ -868,12 +880,15 @@ def to_numpy(
868880
na_value = self.dtype.na_value
869881

870882
pa_type = self._data.type
871-
if (
872-
is_object_dtype(dtype)
873-
or pa.types.is_timestamp(pa_type)
874-
or pa.types.is_duration(pa_type)
875-
):
883+
if pa.types.is_temporal(pa_type) and not pa.types.is_date(pa_type):
884+
# temporal types with units and/or timezones currently
885+
# require pandas/python scalars to pass all tests
886+
# TODO: improve performance (this is slow)
876887
result = np.array(list(self), dtype=dtype)
888+
elif is_object_dtype(dtype) and self._hasna:
889+
result = np.empty(len(self), dtype=object)
890+
mask = ~self.isna()
891+
result[mask] = np.asarray(self[mask]._data)
877892
else:
878893
result = np.asarray(self._data, dtype=dtype)
879894
if copy or self._hasna:
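Note on the ``to_numpy`` hunk above: the new ``is_object_dtype(dtype) and self._hasna`` branch allocates an object array and copies in only the valid positions, instead of iterating over every scalar. A NumPy-only sketch of that pattern follows. The helper ``to_object_array`` and the use of ``None`` as the missing marker are assumptions for illustration, and filling the missing slots with ``na_value`` inside the same function is a guess at what happens after the hunk shown.

```python
import numpy as np


def to_object_array(values, na_value=None):
    """Build an object array, converting the valid entries in bulk."""
    # True where a value is present (None marks a missing entry here).
    mask = np.array([v is not None for v in values], dtype=bool)
    result = np.empty(len(values), dtype=object)
    # Convert all valid entries in one step and place them by mask.
    result[mask] = np.asarray([v for v in values if v is not None])
    # Assumed follow-up step: fill the missing slots.
    result[~mask] = na_value
    return result


out = to_object_array([1, None, 3])
```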

pandas/core/arrays/arrow/dtype.py

Lines changed: 44 additions & 2 deletions
@@ -1,9 +1,20 @@
11
from __future__ import annotations
22

3+
from datetime import (
4+
date,
5+
datetime,
6+
time,
7+
timedelta,
8+
)
9+
from decimal import Decimal
310
import re
411

512
import numpy as np
613

14+
from pandas._libs.tslibs import (
15+
Timedelta,
16+
Timestamp,
17+
)
718
from pandas._typing import (
819
TYPE_CHECKING,
920
DtypeObj,
@@ -88,9 +99,40 @@ def __repr__(self) -> str:
8899
@property
89100
def type(self):
90101
"""
91-
Returns pyarrow.DataType.
102+
Returns associated scalar type.
92103
"""
93-
return type(self.pyarrow_dtype)
104+
pa_type = self.pyarrow_dtype
105+
if pa.types.is_integer(pa_type):
106+
return int
107+
elif pa.types.is_floating(pa_type):
108+
return float
109+
elif pa.types.is_string(pa_type):
110+
return str
111+
elif pa.types.is_binary(pa_type):
112+
return bytes
113+
elif pa.types.is_boolean(pa_type):
114+
return bool
115+
elif pa.types.is_duration(pa_type):
116+
if pa_type.unit == "ns":
117+
return Timedelta
118+
else:
119+
return timedelta
120+
elif pa.types.is_timestamp(pa_type):
121+
if pa_type.unit == "ns":
122+
return Timestamp
123+
else:
124+
return datetime
125+
elif pa.types.is_date(pa_type):
126+
return date
127+
elif pa.types.is_time(pa_type):
128+
return time
129+
elif pa.types.is_decimal(pa_type):
130+
return Decimal
131+
elif pa.types.is_null(pa_type):
132+
# TODO: None? pd.NA? pa.null?
133+
return type(pa_type)
134+
else:
135+
raise NotImplementedError(pa_type)
94136

95137
@property
96138
def name(self) -> str: # type: ignore[override]

pandas/core/frame.py

Lines changed: 6 additions & 0 deletions
@@ -1392,8 +1392,14 @@ def iterrows(self) -> Iterable[tuple[Hashable, Series]]:
13921392
"""
13931393
columns = self.columns
13941394
klass = self._constructor_sliced
1395+
using_cow = using_copy_on_write()
13951396
for k, v in zip(self.index, self.values):
13961397
s = klass(v, index=columns, name=k).__finalize__(self)
1398+
if using_cow and self._mgr.is_single_block:
1399+
s._mgr.blocks[0].refs = self._mgr.blocks[0].refs # type: ignore[union-attr] # noqa
1400+
s._mgr.blocks[0].refs.add_reference( # type: ignore[union-attr]
1401+
s._mgr.blocks[0] # type: ignore[arg-type, union-attr]
1402+
)
13971403
yield k, s
13981404

13991405
def itertuples(

pandas/core/generic.py

Lines changed: 1 addition & 1 deletion
@@ -3329,7 +3329,7 @@ def to_latex(
33293329
>>> print(df.to_latex(index=False,
33303330
... formatters={"name": str.upper},
33313331
... float_format="{:.1f}".format,
3332-
... ) # doctest: +SKIP
3332+
... )) # doctest: +SKIP
33333333
\begin{tabular}{lrr}
33343334
\toprule
33353335
name & age & height \\

pandas/core/groupby/generic.py

Lines changed: 8 additions & 8 deletions
@@ -933,7 +933,7 @@ def take(
933933
934934
Examples
935935
--------
936-
>>> df = DataFrame([('falcon', 'bird', 389.0),
936+
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
937937
... ('parrot', 'bird', 24.0),
938938
... ('lion', 'mammal', 80.5),
939939
... ('monkey', 'mammal', np.nan),
@@ -2358,7 +2358,7 @@ def take(
23582358
23592359
Examples
23602360
--------
2361-
>>> df = DataFrame([('falcon', 'bird', 389.0),
2361+
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
23622362
... ('parrot', 'bird', 24.0),
23632363
... ('lion', 'mammal', 80.5),
23642364
... ('monkey', 'mammal', np.nan),
@@ -2387,15 +2387,15 @@ def take(
23872387
2 2 lion mammal 80.5
23882388
1 monkey mammal NaN
23892389
2390-
The order of the specified indices influnces the order in the result.
2390+
The order of the specified indices influences the order in the result.
23912391
Here, the order is swapped from the previous example.
23922392
2393-
>>> gb.take([0, 1])
2393+
>>> gb.take([1, 0])
23942394
name class max_speed
2395-
1 4 falcon bird 389.0
2396-
3 parrot bird 24.0
2397-
2 2 lion mammal 80.5
2398-
1 monkey mammal NaN
2395+
1 3 parrot bird 24.0
2396+
4 falcon bird 389.0
2397+
2 1 monkey mammal NaN
2398+
2 lion mammal 80.5
23992399
24002400
Take elements at indices 1 and 2 along the axis 1 (column selection).
24012401

pandas/core/indexes/base.py

Lines changed: 11 additions & 2 deletions
@@ -144,6 +144,7 @@
144144
validate_putmask,
145145
)
146146
from pandas.core.arrays import (
147+
ArrowExtensionArray,
147148
BaseMaskedArray,
148149
Categorical,
149150
ExtensionArray,
@@ -4850,8 +4851,10 @@ def _can_use_libjoin(self) -> bool:
48504851
if type(self) is Index:
48514852
# excludes EAs, but include masks, we get here with monotonic
48524853
# values only, meaning no NA
4853-
return isinstance(self.dtype, np.dtype) or isinstance(
4854-
self.values, BaseMaskedArray
4854+
return (
4855+
isinstance(self.dtype, np.dtype)
4856+
or isinstance(self.values, BaseMaskedArray)
4857+
or isinstance(self._values, ArrowExtensionArray)
48554858
)
48564859
return not is_interval_dtype(self.dtype)
48574860

@@ -4942,6 +4945,10 @@ def _get_join_target(self) -> ArrayLike:
49424945
if isinstance(self._values, BaseMaskedArray):
49434946
# This is only used if our array is monotonic, so no NAs present
49444947
return self._values._data
4948+
elif isinstance(self._values, ArrowExtensionArray):
4949+
# This is only used if our array is monotonic, so no missing values
4950+
# present
4951+
return self._values.to_numpy()
49454952
return self._get_engine_target()
49464953

49474954
def _from_join_target(self, result: np.ndarray) -> ArrayLike:
@@ -4951,6 +4958,8 @@ def _from_join_target(self, result: np.ndarray) -> ArrayLike:
49514958
"""
49524959
if isinstance(self.values, BaseMaskedArray):
49534960
return type(self.values)(result, np.zeros(result.shape, dtype=np.bool_))
4961+
elif isinstance(self.values, ArrowExtensionArray):
4962+
return type(self.values)._from_sequence(result)
49544963
return result
49554964

49564965
@doc(IndexOpsMixin._memory_usage)
