Commit 913712f

Merge remote-tracking branch 'upstream/main' into cow_fillna
# Conflicts:
#   doc/source/whatsnew/v2.0.0.rst
#   pandas/core/internals/blocks.py
#   pandas/tests/copy_view/test_interp_fillna.py
2 parents f3fce88 + 94f9412 commit 913712f

30 files changed: +1311 -478 lines changed

.github/workflows/ubuntu.yml

Lines changed: 3 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -29,7 +29,7 @@ jobs:
2929
matrix:
3030
env_file: [actions-38.yaml, actions-39.yaml, actions-310.yaml, actions-311.yaml]
3131
pattern: ["not single_cpu", "single_cpu"]
32-
pyarrow_version: ["7", "8", "9", "10"]
32+
pyarrow_version: ["8", "9", "10"]
3333
include:
3434
- name: "Downstream Compat"
3535
env_file: actions-38-downstream_compat.yaml
@@ -79,23 +79,17 @@ jobs:
7979
test_args: "-W error::DeprecationWarning -W error::FutureWarning"
8080
error_on_warnings: "0"
8181
exclude:
82-
- env_file: actions-38.yaml
83-
pyarrow_version: "7"
8482
- env_file: actions-38.yaml
8583
pyarrow_version: "8"
8684
- env_file: actions-38.yaml
8785
pyarrow_version: "9"
88-
- env_file: actions-39.yaml
89-
pyarrow_version: "7"
9086
- env_file: actions-39.yaml
9187
pyarrow_version: "8"
9288
- env_file: actions-39.yaml
9389
pyarrow_version: "9"
94-
- env_file: actions-311.yaml
95-
pyarrow_version: "7"
96-
- env_file: actions-311.yaml
90+
- env_file: actions-310.yaml
9791
pyarrow_version: "8"
98-
- env_file: actions-311.yaml
92+
- env_file: actions-310.yaml
9993
pyarrow_version: "9"
10094
fail-fast: false
10195
name: ${{ matrix.name || format('{0} pyarrow={1} {2}', matrix.env_file, matrix.pyarrow_version, matrix.pattern) }}
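
To see which jobs the trimmed matrix still produces, the axes and `exclude` entries on the "+" side of this hunk can be expanded by hand. The sketch below is a rough standard-library model, not GitHub Actions' own algorithm: it assumes an `exclude` entry drops every combination that agrees with all keys the entry names, and it ignores the `include` entries, which add extra jobs. The helper name `expand_matrix` is hypothetical.

```python
from itertools import product

# Axes and exclusions copied from the "+" side of the hunk above.
axes = {
    "env_file": [
        "actions-38.yaml",
        "actions-39.yaml",
        "actions-310.yaml",
        "actions-311.yaml",
    ],
    "pattern": ["not single_cpu", "single_cpu"],
    "pyarrow_version": ["8", "9", "10"],
}
excludes = [
    {"env_file": env, "pyarrow_version": ver}
    for env in ("actions-38.yaml", "actions-39.yaml", "actions-310.yaml")
    for ver in ("8", "9")
]


def expand_matrix(axes, excludes):
    """Return the axis combinations that no exclude entry matches."""
    keys = list(axes)
    combos = (dict(zip(keys, values)) for values in product(*axes.values()))
    return [
        combo
        for combo in combos
        # An exclude entry matches when every key it names agrees with the combo.
        if not any(all(combo[k] == v for k, v in ex.items()) for ex in excludes)
    ]


jobs = expand_matrix(axes, excludes)
for job in jobs:
    print(job["env_file"], job["pyarrow_version"], job["pattern"])
```

Under these assumptions, only `actions-311.yaml` is still tested against pyarrow 8 and 9; the other three environment files run against pyarrow 10 only.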

doc/source/user_guide/io.rst

Lines changed: 12 additions & 0 deletions
@@ -2069,6 +2069,8 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series``
20692069
* ``lines`` : reads file as one json object per line.
20702070
* ``encoding`` : The encoding to use to decode py3 bytes.
20712071
* ``chunksize`` : when used in combination with ``lines=True``, return a JsonReader which reads in ``chunksize`` lines per iteration.
2072+
* ``engine``: Either ``"ujson"``, the built-in JSON parser, or ``"pyarrow"`` which dispatches to pyarrow's ``pyarrow.json.read_json``.
2073+
The ``"pyarrow"`` engine is only available when ``lines=True``.
20722074

20732075
The parser will raise one of ``ValueError/TypeError/AssertionError`` if the JSON is not parseable.
20742076

@@ -2250,6 +2252,16 @@ For line-delimited json files, pandas can also return an iterator which reads in
22502252
for chunk in reader:
22512253
print(chunk)
22522254
2255+
Line-delimited json can also be read using the pyarrow reader by specifying ``engine="pyarrow"``.
2256+
2257+
.. ipython:: python
2258+
2259+
from io import BytesIO
2260+
df = pd.read_json(BytesIO(jsonl.encode()), lines=True, engine="pyarrow")
2261+
df
2262+
2263+
.. versionadded:: 2.0.0
2264+
22532265
.. _io.table_schema:
22542266

22552267
Table schema
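
For readers unfamiliar with the format that `lines=True` expects, the sketch below parses line-delimited JSON with the standard library only. It illustrates the input shape (one JSON object per line, passed as a bytes buffer as in the documentation example above); it is not how pandas or pyarrow parse internally. The sample data and the helper `read_json_lines` are hypothetical.

```python
import json
from io import BytesIO

# Hypothetical sample data: one JSON object per line.
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'


def read_json_lines(buffer):
    """Parse a binary buffer holding one JSON object per line into a list of dicts."""
    records = []
    for raw_line in buffer:
        line = raw_line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


records = read_json_lines(BytesIO(jsonl.encode()))
print(records)
```

Each record corresponds to one row of the resulting frame, which is why a line-oriented reader such as `pyarrow.json.read_json` can only be used when `lines=True`.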

doc/source/user_guide/style.ipynb

Lines changed: 188 additions & 104 deletions
Large diffs are not rendered by default.

doc/source/whatsnew/v2.0.0.rst

Lines changed: 18 additions & 2 deletions
@@ -223,9 +223,12 @@ Copy-on-Write improvements
223223
- :meth:`DataFrame.to_period` / :meth:`Series.to_period`
224224
- :meth:`DataFrame.truncate`
225225
- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
226-
- :meth:`DataFrame.fillna` / :meth:`Series.fillna`
226+
- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
227+
- :meth:`DataFrame.ffill` / :meth:`Series.ffill`
228+
- :meth:`DataFrame.bfill` / :meth:`Series.bfill`
227229
- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
228230
- :meth:`DataFrame.astype` / :meth:`Series.astype`
231+
- :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
229232
- :func:`concat`
230233

231234
These methods return views when Copy-on-Write is enabled, which provides a significant
@@ -248,6 +251,9 @@ Copy-on-Write improvements
248251
can never update the original Series or DataFrame. Therefore, an informative
249252
error is raised to the user instead of silently doing nothing (:issue:`49467`)
250253

254+
- :meth:`DataFrame.replace` will now respect the Copy-on-Write mechanism
255+
when ``inplace=True``.
256+
251257
Copy-on-Write can be enabled through one of
252258

253259
.. code-block:: python
@@ -299,6 +305,7 @@ Other enhancements
299305
- Added :meth:`DatetimeIndex.as_unit` and :meth:`TimedeltaIndex.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`50616`)
300306
- Added :meth:`Series.dt.unit` and :meth:`Series.dt.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`51223`)
301307
- Added new argument ``dtype`` to :func:`read_sql` to be consistent with :func:`read_sql_query` (:issue:`50797`)
308+
- Added new argument ``engine`` to :func:`read_json` to support parsing JSON with pyarrow by specifying ``engine="pyarrow"`` (:issue:`48893`)
302309
- Added support for SQLAlchemy 2.0 (:issue:`40686`)
303310
-
304311

@@ -630,7 +637,9 @@ The arguments signature is similar, albeit ``col_space`` has been removed since
630637
it is ignored by LaTeX engines. This render engine also requires ``jinja2`` as a
631638
dependency which needs to be installed, since rendering is based upon jinja2 templates.
632639

633-
The pandas options below are no longer used and will be removed in future releases.
640+
The pandas latex options below are no longer used and have been removed. The generic
641+
max rows and columns arguments remain but for this functionality should be replaced
642+
by the Styler equivalents.
634643
The alternative options giving similar functionality are indicated below:
635644

636645
- ``display.latex.escape``: replaced with ``styler.format.escape``,
@@ -644,6 +653,13 @@ The alternative options giving similar functionality are indicated below:
644653
``styler.render.max_rows``, ``styler.render.max_columns`` and
645654
``styler.render.max_elements``.
646655

656+
Note that due to this change some defaults have also changed:
657+
658+
- ``multirow`` now defaults to *True*.
659+
- ``multirow_align`` defaults to *"r"* instead of *"l"*.
660+
- ``multicol_align`` defaults to *"r"* instead of *"l"*.
661+
- ``escape`` now defaults to *False*.
662+
647663
Note that the behaviour of ``_repr_latex_`` is also changed. Previously
648664
setting ``display.latex.repr`` would generate LaTeX only when using nbconvert for a
649665
JupyterNotebook, and not when the user is running the notebook. Now the

pandas/_typing.py

Lines changed: 3 additions & 0 deletions
@@ -324,6 +324,9 @@ def closed(self) -> bool:
324324
# read_csv engines
325325
CSVEngine = Literal["c", "python", "pyarrow", "python-fwf"]
326326

327+
# read_json engines
328+
JSONEngine = Literal["ujson", "pyarrow"]
329+
327330
# read_xml parsers
328331
XMLParsers = Literal["lxml", "etree"]
329332
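
A `Literal` alias like `JSONEngine` lets a type checker reject unknown engine names, and the same alias can drive a runtime check. The sketch below shows one way to do that, together with the documented constraint that `"pyarrow"` requires `lines=True`. The function `check_engine` and its error messages are hypothetical and are not taken from pandas.

```python
from typing import Literal, get_args

# Same alias as the one added to pandas/_typing.py in this commit.
JSONEngine = Literal["ujson", "pyarrow"]


def check_engine(engine: str, lines: bool) -> None:
    """Hypothetical validator; not pandas' actual implementation or messages."""
    allowed = get_args(JSONEngine)
    if engine not in allowed:
        raise ValueError(f"unknown engine {engine!r}; expected one of {allowed}")
    if engine == "pyarrow" and not lines:
        raise ValueError('engine="pyarrow" requires lines=True')


check_engine("ujson", lines=False)   # accepted
check_engine("pyarrow", lines=True)  # accepted
```

Deriving the allowed values from the alias with `get_args` keeps the runtime check and the static type in one place.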

pandas/core/config_init.py

Lines changed: 0 additions & 52 deletions
@@ -210,13 +210,6 @@ def use_numba_cb(key) -> None:
210210
(default: False)
211211
"""
212212

213-
pc_latex_repr_doc = """
214-
: boolean
215-
Whether to produce a latex DataFrame representation for jupyter
216-
environments that support it.
217-
(default: False)
218-
"""
219-
220213
pc_table_schema_doc = """
221214
: boolean
222215
Whether to publish a Table Schema representation for frontends
@@ -292,41 +285,6 @@ def use_numba_cb(key) -> None:
292285
df.info() is called. Valid values True,False,'deep'
293286
"""
294287

295-
pc_latex_escape = """
296-
: bool
297-
This specifies if the to_latex method of a Dataframe uses escapes special
298-
characters.
299-
Valid values: False,True
300-
"""
301-
302-
pc_latex_longtable = """
303-
:bool
304-
This specifies if the to_latex method of a Dataframe uses the longtable
305-
format.
306-
Valid values: False,True
307-
"""
308-
309-
pc_latex_multicolumn = """
310-
: bool
311-
This specifies if the to_latex method of a Dataframe uses multicolumns
312-
to pretty-print MultiIndex columns.
313-
Valid values: False,True
314-
"""
315-
316-
pc_latex_multicolumn_format = """
317-
: string
318-
This specifies the format for multicolumn headers.
319-
Can be surrounded with '|'.
320-
Valid values: 'l', 'c', 'r', 'p{<width>}'
321-
"""
322-
323-
pc_latex_multirow = """
324-
: bool
325-
This specifies if the to_latex method of a Dataframe uses multirows
326-
to pretty-print MultiIndex rows.
327-
Valid values: False,True
328-
"""
329-
330288

331289
def table_schema_cb(key) -> None:
332290
from pandas.io.formats.printing import enable_data_resource_formatter
@@ -425,16 +383,6 @@ def is_terminal() -> bool:
425383
cf.register_option(
426384
"unicode.ambiguous_as_wide", False, pc_east_asian_width_doc, validator=is_bool
427385
)
428-
cf.register_option("latex.repr", False, pc_latex_repr_doc, validator=is_bool)
429-
cf.register_option("latex.escape", True, pc_latex_escape, validator=is_bool)
430-
cf.register_option("latex.longtable", False, pc_latex_longtable, validator=is_bool)
431-
cf.register_option(
432-
"latex.multicolumn", True, pc_latex_multicolumn, validator=is_bool
433-
)
434-
cf.register_option(
435-
"latex.multicolumn_format", "l", pc_latex_multicolumn, validator=is_text
436-
)
437-
cf.register_option("latex.multirow", False, pc_latex_multirow, validator=is_bool)
438386
cf.register_option(
439387
"html.table_schema",
440388
False,

pandas/core/generic.py

Lines changed: 41 additions & 15 deletions
@@ -3233,30 +3233,54 @@ def to_latex(
32333233
columns. By default, 'l' will be used for all columns except
32343234
columns of numbers, which default to 'r'.
32353235
longtable : bool, optional
3236-
By default, the value will be read from the pandas config
3237-
module. Use a longtable environment instead of tabular. Requires
3236+
Use a longtable environment instead of tabular. Requires
32383237
adding a \usepackage{{longtable}} to your LaTeX preamble.
3238+
By default, the value will be read from the pandas config
3239+
module, and set to `True` if the option ``styler.latex.environment`` is
3240+
`"longtable"`.
3241+
3242+
.. versionchanged:: 2.0.0
3243+
The pandas option affecting this argument has changed.
32393244
escape : bool, optional
32403245
By default, the value will be read from the pandas config
3241-
module. When set to False prevents from escaping latex special
3246+
module and set to `True` if the option ``styler.format.escape`` is
3247+
`"latex"`. When set to False prevents from escaping latex special
32423248
characters in column names.
3249+
3250+
.. versionchanged:: 2.0.0
3251+
The pandas option affecting this argument has changed, as has the
3252+
default value to `False`.
32433253
encoding : str, optional
32443254
A string representing the encoding to use in the output file,
32453255
defaults to 'utf-8'.
32463256
decimal : str, default '.'
32473257
Character recognized as decimal separator, e.g. ',' in Europe.
32483258
multicolumn : bool, default True
32493259
Use \multicolumn to enhance MultiIndex columns.
3250-
The default will be read from the config module.
3251-
multicolumn_format : str, default 'l'
3260+
The default will be read from the config module, and is set
3261+
as the option ``styler.sparse.columns``.
3262+
3263+
.. versionchanged:: 2.0.0
3264+
The pandas option affecting this argument has changed.
3265+
multicolumn_format : str, default 'r'
32523266
The alignment for multicolumns, similar to `column_format`
3253-
The default will be read from the config module.
3254-
multirow : bool, default False
3267+
The default will be read from the config module, and is set as the option
3268+
``styler.latex.multicol_align``.
3269+
3270+
.. versionchanged:: 2.0.0
3271+
The pandas option affecting this argument has changed, as has the
3272+
default value to "r".
3273+
multirow : bool, default True
32553274
Use \multirow to enhance MultiIndex rows. Requires adding a
32563275
\usepackage{{multirow}} to your LaTeX preamble. Will print
32573276
centered labels (instead of top-aligned) across the contained
32583277
rows, separating groups via clines. The default will be read
3259-
from the pandas config module.
3278+
from the pandas config module, and is set as the option
3279+
``styler.sparse.index``.
3280+
3281+
.. versionchanged:: 2.0.0
3282+
The pandas option affecting this argument has changed, as has the
3283+
default value to `True`.
32603284
caption : str or tuple, optional
32613285
Tuple (full_caption, short_caption),
32623286
which results in ``\caption[short_caption]{{full_caption}}``;
@@ -3324,15 +3348,15 @@ def to_latex(
33243348
if self.ndim == 1:
33253349
self = self.to_frame()
33263350
if longtable is None:
3327-
longtable = config.get_option("display.latex.longtable")
3351+
longtable = config.get_option("styler.latex.environment") == "longtable"
33283352
if escape is None:
3329-
escape = config.get_option("display.latex.escape")
3353+
escape = config.get_option("styler.format.escape") == "latex"
33303354
if multicolumn is None:
3331-
multicolumn = config.get_option("display.latex.multicolumn")
3355+
multicolumn = config.get_option("styler.sparse.columns")
33323356
if multicolumn_format is None:
3333-
multicolumn_format = config.get_option("display.latex.multicolumn_format")
3357+
multicolumn_format = config.get_option("styler.latex.multicol_align")
33343358
if multirow is None:
3335-
multirow = config.get_option("display.latex.multirow")
3359+
multirow = config.get_option("styler.sparse.index")
33363360

33373361
if column_format is not None and not isinstance(column_format, str):
33383362
raise ValueError("`column_format` must be str or unicode")
@@ -3418,7 +3442,9 @@ def _wrap(x, alt_format_):
34183442
"label": label,
34193443
"position": position,
34203444
"column_format": column_format,
3421-
"clines": "skip-last;data" if multirow else None,
3445+
"clines": "skip-last;data"
3446+
if (multirow and isinstance(self.index, MultiIndex))
3447+
else None,
34223448
"bold_rows": bold_rows,
34233449
}
34243450

@@ -6647,7 +6673,7 @@ def convert_dtypes(
66476673
# https://github.com/python/mypy/issues/8354
66486674
return cast(NDFrameT, result)
66496675
else:
6650-
return self.copy()
6676+
return self.copy(deep=None)
66516677

66526678
# ----------------------------------------------------------------------
66536679
# Filling NA's
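
The `to_latex` hunk above changes where unset arguments get their defaults: each `None` argument is now filled from a Styler option rather than a `display.latex.*` option. The sketch below mirrors that fallback logic with a plain dictionary standing in for `config.get_option`. The option values in `OPTIONS` are assumptions chosen to agree with the defaults the 2.0.0 whatsnew describes, and `resolve_latex_defaults` is a hypothetical helper, not part of pandas.

```python
# Stand-in for pandas' option registry. The values are assumed defaults,
# chosen to agree with the defaults described in the 2.0.0 whatsnew.
OPTIONS = {
    "styler.latex.environment": None,
    "styler.format.escape": None,
    "styler.sparse.columns": True,
    "styler.latex.multicol_align": "r",
    "styler.sparse.index": True,
}


def resolve_latex_defaults(
    options,
    longtable=None,
    escape=None,
    multicolumn=None,
    multicolumn_format=None,
    multirow=None,
):
    """Fill unset to_latex arguments from Styler options, mirroring the hunk above."""
    if longtable is None:
        longtable = options["styler.latex.environment"] == "longtable"
    if escape is None:
        escape = options["styler.format.escape"] == "latex"
    if multicolumn is None:
        multicolumn = options["styler.sparse.columns"]
    if multicolumn_format is None:
        multicolumn_format = options["styler.latex.multicol_align"]
    if multirow is None:
        multirow = options["styler.sparse.index"]
    return {
        "longtable": longtable,
        "escape": escape,
        "multicolumn": multicolumn,
        "multicolumn_format": multicolumn_format,
        "multirow": multirow,
    }


defaults = resolve_latex_defaults(OPTIONS)
print(defaults)
```

With these assumed option values, `escape` resolves to `False`, `multirow` to `True` and `multicolumn_format` to `"r"`, which matches the changed defaults listed in the whatsnew entry. An explicitly passed argument always takes precedence over the option.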

pandas/core/groupby/generic.py

Lines changed: 19 additions & 22 deletions
@@ -219,16 +219,9 @@ def apply(self, func, *args, **kwargs) -> Series:
219219
def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs):
220220

221221
if maybe_use_numba(engine):
222-
data = self._obj_with_exclusions
223-
result = self._aggregate_with_numba(
224-
data.to_frame(), func, *args, engine_kwargs=engine_kwargs, **kwargs
222+
return self._aggregate_with_numba(
223+
func, *args, engine_kwargs=engine_kwargs, **kwargs
225224
)
226-
index = self.grouper.result_index
227-
result = self.obj._constructor(result.ravel(), index=index, name=data.name)
228-
if not self.as_index:
229-
result = self._insert_inaxis_grouper(result)
230-
result.index = default_index(len(result))
231-
return result
232225

233226
relabeling = func is None
234227
columns = None
@@ -1261,16 +1254,9 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
12611254
def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs):
12621255

12631256
if maybe_use_numba(engine):
1264-
data = self._obj_with_exclusions
1265-
result = self._aggregate_with_numba(
1266-
data, func, *args, engine_kwargs=engine_kwargs, **kwargs
1257+
return self._aggregate_with_numba(
1258+
func, *args, engine_kwargs=engine_kwargs, **kwargs
12671259
)
1268-
index = self.grouper.result_index
1269-
result = self.obj._constructor(result, index=index, columns=data.columns)
1270-
if not self.as_index:
1271-
result = self._insert_inaxis_grouper(result)
1272-
result.index = default_index(len(result))
1273-
return result
12741260

12751261
relabeling, func, columns, order = reconstruct_func(func, **kwargs)
12761262
func = maybe_mangle_lambdas(func)
@@ -1283,7 +1269,12 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
12831269
# this should be the only (non-raising) case with relabeling
12841270
# used reordered index of columns
12851271
result = result.iloc[:, order]
1286-
result.columns = columns
1272+
result = cast(DataFrame, result)
1273+
# error: Incompatible types in assignment (expression has type
1274+
# "Optional[List[str]]", variable has type
1275+
# "Union[Union[Union[ExtensionArray, ndarray[Any, Any]],
1276+
# Index, Series], Sequence[Any]]")
1277+
result.columns = columns # type: ignore[assignment]
12871278

12881279
if result is None:
12891280

@@ -1312,11 +1303,18 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
13121303
except ValueError as err:
13131304
if "No objects to concatenate" not in str(err):
13141305
raise
1306+
# _aggregate_frame can fail with e.g. func=Series.mode,
1307+
# where it expects 1D values but would be getting 2D values
1308+
# In other tests, using aggregate_frame instead of GroupByApply
1309+
# would give correct values but incorrect dtypes
1310+
# object vs float64 in test_cython_agg_empty_buckets
1311+
# float64 vs int64 in test_category_order_apply
13151312
result = self._aggregate_frame(func)
13161313

13171314
else:
13181315
# GH#32040, GH#35246
13191316
# e.g. test_groupby_as_index_select_column_sum_empty_df
1317+
result = cast(DataFrame, result)
13201318
result.columns = self._obj_with_exclusions.columns.copy()
13211319

13221320
if not self.as_index:
@@ -1502,8 +1500,7 @@ def arr_func(bvalues: ArrayLike) -> ArrayLike:
15021500
res_mgr.set_axis(1, mgr.axes[1])
15031501

15041502
res_df = self.obj._constructor(res_mgr)
1505-
if self.axis == 1:
1506-
res_df = res_df.T
1503+
res_df = self._maybe_transpose_result(res_df)
15071504
return res_df
15081505

15091506
def _transform_general(self, func, *args, **kwargs):
@@ -1830,7 +1827,7 @@ def _iterate_column_groupbys(self, obj: DataFrame | Series):
18301827
observed=self.observed,
18311828
)
18321829

1833-
def _apply_to_column_groupbys(self, func, obj: DataFrame | Series) -> DataFrame:
1830+
def _apply_to_column_groupbys(self, func, obj: DataFrame) -> DataFrame:
18341831
from pandas.core.reshape.concat import concat
18351832

18361833
columns = obj.columns
