Commit c76486e

Merge branch 'master' into GH36666
2 parents 8701f26 + 9700b5a commit c76486e

File tree

93 files changed

+2435
-1797
lines changed

.pre-commit-config.yaml

Lines changed: 23 additions & 1 deletion
@@ -41,7 +41,7 @@ repos:
       name: Generate pip dependency from conda
       description: This hook checks if the conda environment.yml and requirements-dev.txt are equal
       language: python
-      entry: python -m scripts.generate_pip_deps_from_conda
+      entry: python scripts/generate_pip_deps_from_conda.py
       files: ^(environment.yml|requirements-dev.txt)$
       pass_filenames: false
       additional_dependencies: [pyyaml]
@@ -99,6 +99,28 @@ repos:
       language: pygrep
       entry: (\.\. code-block ::|\.\. ipython ::)
       files: \.(py|pyx|rst)$
+    - id: unwanted-patterns-strings-to-concatenate
+      name: Check for use of not concatenated strings
+      language: python
+      entry: python scripts/validate_unwanted_patterns.py --validation-type="strings_to_concatenate"
+      files: \.(py|pyx|pxd|pxi)$
+    - id: unwanted-patterns-strings-with-wrong-placed-whitespace
+      name: Check for strings with wrong placed spaces
+      language: python
+      entry: python scripts/validate_unwanted_patterns.py --validation-type="strings_with_wrong_placed_whitespace"
+      files: \.(py|pyx|pxd|pxi)$
+    - id: unwanted-patterns-private-import-across-module
+      name: Check for import of private attributes across modules
+      language: python
+      entry: python scripts/validate_unwanted_patterns.py --validation-type="private_import_across_module"
+      types: [python]
+      exclude: ^(asv_bench|pandas/_vendored|pandas/tests|doc)/
+    - id: unwanted-patterns-private-function-across-module
+      name: Check for use of private functions across modules
+      language: python
+      entry: python scripts/validate_unwanted_patterns.py --validation-type="private_function_across_module"
+      types: [python]
+      exclude: ^(asv_bench|pandas/_vendored|pandas/tests|doc)/
 - repo: https://github.com/asottile/yesqa
   rev: v1.2.2
   hooks:
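
Note: for orientation, the kind of pattern the `strings_to_concatenate` hook looks for can be approximated with the standard-library `tokenize` module. This is a minimal sketch and not the code in `scripts/validate_unwanted_patterns.py`; the function name `find_adjacent_string_literals` and its message text are hypothetical, and it only flags two string literals that sit side by side on the same line.

```python
import io
import tokenize


def find_adjacent_string_literals(source):
    """Yield (line_number, message) for string literals placed side by side.

    Hypothetical helper for illustration; it is not the pandas script.
    """
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    for current, following in zip(tokens, tokens[1:]):
        if (
            current.type == tokenize.STRING
            and following.type == tokenize.STRING
            # Both literals must sit on the same physical line.
            and current.end[0] == following.start[0]
        ):
            yield current.start[0], "string literals could be merged into one"


if __name__ == "__main__":
    sample = 'msg = "foo" "bar"\n'
    for line_number, message in find_adjacent_string_literals(sample):
        print(f"{line_number}: {message}")
```

Working on tokens rather than raw text means string-like text inside comments is not reported, which is presumably why a token-based check is preferable to a plain regular expression here.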

asv_bench/benchmarks/strings.py

Lines changed: 17 additions & 1 deletion
@@ -2,7 +2,7 @@

 import numpy as np

-from pandas import DataFrame, Series
+from pandas import Categorical, DataFrame, Series

 from .pandas_vb_common import tm

@@ -16,6 +16,10 @@ def setup(self, dtype):
         self.series_arr = tm.rands_array(nchars=10, size=10 ** 5)
         self.frame_arr = self.series_arr.reshape((50_000, 2)).copy()

+        # GH37371. Testing construction of string series/frames from ExtensionArrays
+        self.series_cat_arr = Categorical(self.series_arr)
+        self.frame_cat_arr = Categorical(self.frame_arr)
+
     def time_series_construction(self, dtype):
         Series(self.series_arr, dtype=dtype)

@@ -28,6 +32,18 @@ def time_frame_construction(self, dtype):
     def peakmem_frame_construction(self, dtype):
         DataFrame(self.frame_arr, dtype=dtype)

+    def time_cat_series_construction(self, dtype):
+        Series(self.series_cat_arr, dtype=dtype)
+
+    def peakmem_cat_series_construction(self, dtype):
+        Series(self.series_cat_arr, dtype=dtype)
+
+    def time_cat_frame_construction(self, dtype):
+        DataFrame(self.frame_cat_arr, dtype=dtype)
+
+    def peakmem_cat_frame_construction(self, dtype):
+        DataFrame(self.frame_cat_arr, dtype=dtype)
+

 class Methods:
     def setup(self):
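
Note: outside of asv, the Series case added above can be timed by hand with `timeit`. This is a rough sketch under stated assumptions: the sample data, the name `build_string_series`, and the repeat count are made up for illustration, and `dtype=str` stands in for the parametrized `dtype` of the benchmark class.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data; the benchmark itself uses tm.rands_array.
rng = np.random.default_rng(0)
values = np.array(
    [f"s{i}" for i in rng.integers(0, 1_000, size=10_000)], dtype=object
)
categorical_values = pd.Categorical(values)


def build_string_series():
    # Same call shape as time_cat_series_construction, with dtype fixed to str.
    return pd.Series(categorical_values, dtype=str)


elapsed_seconds = timeit.timeit(build_string_series, number=5)
print(f"5 constructions took {elapsed_seconds:.4f} s")
```

Absolute timings from a snippet like this depend on the machine and pandas version, so they are only useful for comparing two builds on the same setup.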

ci/code_checks.sh

Lines changed: 0 additions & 32 deletions
@@ -73,38 +73,6 @@ if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then
     cpplint --quiet --extensions=c,h --headers=h --recursive --filter=-readability/casting,-runtime/int,-build/include_subdir pandas/_libs/src/*.h pandas/_libs/src/parser pandas/_libs/ujson pandas/_libs/tslibs/src/datetime pandas/_libs/*.cpp
     RET=$(($RET + $?)) ; echo $MSG "DONE"

-    MSG='Check for use of not concatenated strings' ; echo $MSG
-    if [[ "$GITHUB_ACTIONS" == "true" ]]; then
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="strings_to_concatenate" --format="##[error]{source_path}:{line_number}:{msg}" .
-    else
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="strings_to_concatenate" .
-    fi
-    RET=$(($RET + $?)) ; echo $MSG "DONE"
-
-    MSG='Check for strings with wrong placed spaces' ; echo $MSG
-    if [[ "$GITHUB_ACTIONS" == "true" ]]; then
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="strings_with_wrong_placed_whitespace" --format="##[error]{source_path}:{line_number}:{msg}" .
-    else
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="strings_with_wrong_placed_whitespace" .
-    fi
-    RET=$(($RET + $?)) ; echo $MSG "DONE"
-
-    MSG='Check for import of private attributes across modules' ; echo $MSG
-    if [[ "$GITHUB_ACTIONS" == "true" ]]; then
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_import_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored --format="##[error]{source_path}:{line_number}:{msg}" pandas/
-    else
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_import_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored pandas/
-    fi
-    RET=$(($RET + $?)) ; echo $MSG "DONE"
-
-    MSG='Check for use of private functions across modules' ; echo $MSG
-    if [[ "$GITHUB_ACTIONS" == "true" ]]; then
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_function_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored,doc/ --format="##[error]{source_path}:{line_number}:{msg}" pandas/
-    else
-        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_function_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored,doc/ pandas/
-    fi
-    RET=$(($RET + $?)) ; echo $MSG "DONE"
-
 fi

 ### PATTERNS ###

doc/source/development/contributing.rst

Lines changed: 1 addition & 1 deletion
@@ -598,7 +598,7 @@ Building master branch documentation

 When pull requests are merged into the pandas ``master`` branch, the main parts of
 the documentation are also built by Travis-CI. These docs are then hosted `here
-<https://dev.pandas.io>`__, see also
+<https://pandas.pydata.org/docs/dev/>`__, see also
 the :ref:`Continuous Integration <contributing.ci>` section.

 .. _contributing.code:

doc/source/getting_started/install.rst

Lines changed: 7 additions & 7 deletions
@@ -28,20 +28,20 @@ Installing pandas
 Installing with Anaconda
 ~~~~~~~~~~~~~~~~~~~~~~~~

-Installing pandas and the rest of the `NumPy <https://www.numpy.org/>`__ and
-`SciPy <https://www.scipy.org/>`__ stack can be a little
+Installing pandas and the rest of the `NumPy <https://numpy.org/>`__ and
+`SciPy <https://scipy.org/>`__ stack can be a little
 difficult for inexperienced users.

 The simplest way to install not only pandas, but Python and the most popular
-packages that make up the `SciPy <https://www.scipy.org/>`__ stack
-(`IPython <https://ipython.org/>`__, `NumPy <https://www.numpy.org/>`__,
+packages that make up the `SciPy <https://scipy.org/>`__ stack
+(`IPython <https://ipython.org/>`__, `NumPy <https://numpy.org/>`__,
 `Matplotlib <https://matplotlib.org/>`__, ...) is with
 `Anaconda <https://docs.continuum.io/anaconda/>`__, a cross-platform
-(Linux, Mac OS X, Windows) Python distribution for data analytics and
+(Linux, macOS, Windows) Python distribution for data analytics and
 scientific computing.

 After running the installer, the user will have access to pandas and the
-rest of the `SciPy <https://www.scipy.org/>`__ stack without needing to install
+rest of the `SciPy <https://scipy.org/>`__ stack without needing to install
 anything else, and without needing to wait for any software to be compiled.

 Installation instructions for `Anaconda <https://docs.continuum.io/anaconda/>`__
@@ -220,7 +220,7 @@ Dependencies
 Package                                                          Minimum supported version
 ================================================================ ==========================
 `setuptools <https://setuptools.readthedocs.io/en/latest/>`__    24.2.0
-`NumPy <https://www.numpy.org>`__                                1.16.5
+`NumPy <https://numpy.org>`__                                    1.16.5
 `python-dateutil <https://dateutil.readthedocs.io/en/stable/>`__ 2.7.3
 `pytz <https://pypi.org/project/pytz/>`__                        2017.3
 ================================================================ ==========================

doc/source/user_guide/io.rst

Lines changed: 4 additions & 5 deletions
@@ -5686,7 +5686,7 @@ ignored.
    dtypes: float64(1), int64(1)
    memory usage: 15.3 MB

-Given the next test set:
+The following test functions will be used below to compare the performance of several IO methods:

 .. code-block:: python

@@ -5791,7 +5791,7 @@ Given the next test set:
    def test_parquet_read():
        pd.read_parquet("test.parquet")

-When writing, the top-three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``.
+When writing, the top three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``.

 .. code-block:: ipython

@@ -5825,7 +5825,7 @@ When writing, the top-three functions in terms of speed are ``test_feather_write
    In [13]: %timeit test_parquet_write(df)
    67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

-When reading, the top three are ``test_feather_read``, ``test_pickle_read`` and
+When reading, the top three functions in terms of speed are ``test_feather_read``, ``test_pickle_read`` and
 ``test_hdf_fixed_read``.


@@ -5862,8 +5862,7 @@ When reading, the top three are ``test_feather_read``, ``test_pickle_read`` and
    24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


-For this test case ``test.pkl.compress``, ``test.parquet`` and ``test.feather`` took the least space on disk.
-Space on disk (in bytes)
+The files ``test.pkl.compress``, ``test.parquet`` and ``test.feather`` took the least space on disk (in bytes).

 .. code-block:: none


doc/source/whatsnew/v1.1.4.rst

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ Fixed regressions
 - Fixed regression in :class:`RollingGroupby` causing a segmentation fault with Index of dtype object (:issue:`36727`)
 - Fixed regression in :meth:`DataFrame.resample(...).apply(...)` raised ``AttributeError`` when input was a :class:`DataFrame` and only a :class:`Series` was evaluated (:issue:`36951`)
 - Fixed regression in :class:`PeriodDtype` comparing both equal and unequal to its string representation (:issue:`37265`)
+- Fixed regression where slicing :class:`DatetimeIndex` raised :exc:`AssertionError` on irregular time series with ``pd.NaT`` or on unsorted indices (:issue:`36953` and :issue:`35509`)
 - Fixed regression in certain offsets (:meth:`pd.offsets.Day() <pandas.tseries.offsets.Day>` and below) no longer being hashable (:issue:`37267`)
 - Fixed regression in :class:`StataReader` which required ``chunksize`` to be manually set when using an iterator to read a dataset (:issue:`37280`)

@@ -35,6 +36,7 @@ Bug fixes
 - Bug in :meth:`Series.isin` and :meth:`DataFrame.isin` raising a ``ValueError`` when the target was read-only (:issue:`37174`)
 - Bug in :meth:`GroupBy.fillna` that introduced a performance regression after 1.0.5 (:issue:`36757`)
 - Bug in :meth:`DataFrame.info` was raising a ``KeyError`` when the DataFrame has integer column names (:issue:`37245`)
+- Bug in :meth:`DataFrameGroupby.apply` would drop a :class:`CategoricalIndex` when grouped on (:issue:`35792`)

 .. ---------------------------------------------------------------------------


doc/source/whatsnew/v1.2.0.rst

Lines changed: 7 additions & 4 deletions
@@ -335,7 +335,7 @@ Deprecations
 Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

-- Performance improvements when creating DataFrame or Series with dtype ``str`` or :class:`StringDtype` from array with many string elements (:issue:`36304`, :issue:`36317`, :issue:`36325`, :issue:`36432`)
+- Performance improvements when creating DataFrame or Series with dtype ``str`` or :class:`StringDtype` from array with many string elements (:issue:`36304`, :issue:`36317`, :issue:`36325`, :issue:`36432`, :issue:`37371`)
 - Performance improvement in :meth:`GroupBy.agg` with the ``numba`` engine (:issue:`35759`)
 - Performance improvements when creating :meth:`pd.Series.map` from a huge dictionary (:issue:`34717`)
 - Performance improvement in :meth:`GroupBy.transform` with the ``numba`` engine (:issue:`36240`)
@@ -375,6 +375,7 @@ Datetimelike
 - Bug in :class:`DatetimeIndex.shift` incorrectly raising when shifting empty indexes (:issue:`14811`)
 - :class:`Timestamp` and :class:`DatetimeIndex` comparisons between timezone-aware and timezone-naive objects now follow the standard library ``datetime`` behavior, returning ``True``/``False`` for ``!=``/``==`` and raising for inequality comparisons (:issue:`28507`)
 - Bug in :meth:`DatetimeIndex.equals` and :meth:`TimedeltaIndex.equals` incorrectly considering ``int64`` indexes as equal (:issue:`36744`)
+- Bug in :meth:`TimedeltaIndex.sum` and :meth:`Series.sum` with ``timedelta64`` dtype on an empty index or series returning ``NaT`` instead of ``Timedelta(0)`` (:issue:`31751`)

 Timedelta
 ^^^^^^^^^
@@ -403,6 +404,7 @@ Numeric
 - Bug in :class:`DataFrame` arithmetic ops incorrectly accepting keyword arguments (:issue:`36843`)
 - Bug in :class:`IntervalArray` comparisons with :class:`Series` not returning :class:`Series` (:issue:`36908`)
 - Bug in :class:`DataFrame` allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ``ValueError`` (:issue:`36702`)
+- Bug in :meth:`DataFrame.std`` with ``timedelta64`` dtype and ``skipna=False`` (:issue:`37392`)

 Conversion
 ^^^^^^^^^^
@@ -416,7 +418,6 @@ Strings
 - Bug in :func:`to_numeric` raising a ``TypeError`` when attempting to convert a string dtype :class:`Series` containing only numeric strings and ``NA`` (:issue:`37262`)
 -

-
 Interval
 ^^^^^^^^

@@ -467,6 +468,7 @@ I/O
 - Bug in :func:`read_table` and :func:`read_csv` when ``delim_whitespace=True`` and ``sep=default`` (:issue:`36583`)
 - Bug in :meth:`to_json` with ``lines=True`` and ``orient='records'`` the last line of the record is not appended with 'new line character' (:issue:`36888`)
 - Bug in :meth:`read_parquet` with fixed offset timezones. String representation of timezones was not recognized (:issue:`35997`, :issue:`36004`)
+- Bug in :meth:`DataFrame.to_html`, :meth:`DataFrame.to_string`, and :meth:`DataFrame.to_latex` ignoring the ``na_rep`` argument when ``float_format`` was also specified (:issue:`9046`, :issue:`13828`)
 - Bug in output rendering of complex numbers showing too many trailing zeros (:issue:`36799`)
 - Bug in :class:`HDFStore` threw a ``TypeError`` when exporting an empty :class:`DataFrame` with ``datetime64[ns, tz]`` dtypes with a fixed HDF5 store (:issue:`20594`)

@@ -485,7 +487,6 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrame.resample(...)` that would throw a ``ValueError`` when resampling from "D" to "24H" over a transition into daylight savings time (DST) (:issue:`35219`)
 - Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and :meth:`DataFrame.interpolate` raising an ``TypeError`` (:issue:`35325`)
 - Bug in :meth:`DataFrameGroupBy.apply` where a non-nuisance grouping column would be dropped from the output columns if another groupby method was called before ``.apply()`` (:issue:`34656`)
-- Bug in :meth:`DataFrameGroupby.apply` would drop a :class:`CategoricalIndex` when grouped on. (:issue:`35792`)
 - Bug when subsetting columns on a :class:`~pandas.core.groupby.DataFrameGroupBy` (e.g. ``df.groupby('a')[['b']])``) would reset the attributes ``axis``, ``dropna``, ``group_keys``, ``level``, ``mutated``, ``sort``, and ``squeeze`` to their default values. (:issue:`9959`)
 - Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
 - Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
@@ -498,6 +499,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrame.groupby.rolling` returning wrong values with partial centered window (:issue:`36040`).
 - Bug in :meth:`DataFrameGroupBy.rolling` returned wrong values with timeaware window containing ``NaN``. Raises ``ValueError`` because windows are not monotonic now (:issue:`34617`)
 - Bug in :meth:`Rolling.__iter__` where a ``ValueError`` was not raised when ``min_periods`` was larger than ``window`` (:issue:`37156`)
+- Using :meth:`Rolling.var()` instead of :meth:`Rolling.std()` avoids numerical issues for :meth:`Rolling.corr()` when :meth:`Rolling.var()` is still within floating point precision while :meth:`Rolling.std()` is not (:issue:`31286`)

 Reshaping
 ^^^^^^^^^
@@ -530,9 +532,10 @@ Other
 - Bug in :meth:`DataFrame.replace` and :meth:`Series.replace` incorrectly raising ``AssertionError`` instead of ``ValueError`` when invalid parameter combinations are passed (:issue:`36045`)
 - Bug in :meth:`DataFrame.replace` and :meth:`Series.replace` with numeric values and string ``to_replace`` (:issue:`34789`)
 - Fixed bug in metadata propagation incorrectly copying DataFrame columns as metadata when the column name overlaps with the metadata name (:issue:`37037`)
-- Fixed metadata propagation in the :class:`Series.dt` and :class:`Series.str` accessors and :class:`DataFrame.duplicated` and ::class:`DataFrame.stack` methods (:issue:`28283`)
+- Fixed metadata propagation in the :class:`Series.dt` and :class:`Series.str` accessors and :class:`DataFrame.duplicated` and :class:`DataFrame.stack` and :class:`DataFrame.unstack` and :class:`DataFrame.pivot` methods (:issue:`28283`)
 - Bug in :meth:`Index.union` behaving differently depending on whether operand is a :class:`Index` or other list-like (:issue:`36384`)
 - Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError``, from a bare ``Exception`` previously (:issue:`35744`)
+- Bug in ``accessor.DirNamesMixin``, where ``dir(obj)`` wouldn't show attributes defined on the instance (:issue:`37173`).

 .. ---------------------------------------------------------------------------


pandas/_libs/lib.pyx

Lines changed: 5 additions & 0 deletions
@@ -651,6 +651,11 @@ cpdef ndarray[object] ensure_string_array(
     cdef:
         Py_ssize_t i = 0, n = len(arr)

+    if hasattr(arr, "to_numpy"):
+        arr = arr.to_numpy()
+    elif not isinstance(arr, np.ndarray):
+        arr = np.array(arr, dtype="object")
+
     result = np.asarray(arr, dtype="object")

     if copy and result is arr:
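
Note: the coercion step added in this hunk can be tried in plain Python. The sketch below is a stand-alone approximation, not the Cython function: `coerce_to_object_ndarray` and `FakeExtensionArray` are hypothetical names, and the class only stands in for an ExtensionArray (for example `Categorical`) by exposing `to_numpy()`.

```python
import numpy as np


class FakeExtensionArray:
    """Hypothetical stand-in for an ExtensionArray such as Categorical."""

    def __init__(self, values):
        self._values = list(values)

    def to_numpy(self):
        return np.array(self._values, dtype="object")


def coerce_to_object_ndarray(arr):
    # Hypothetical helper mirroring only the branch added in this hunk.
    if hasattr(arr, "to_numpy"):
        # Array-likes that know how to convert themselves go first.
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        # Lists, tuples and similar are wrapped in an object ndarray.
        arr = np.array(arr, dtype="object")
    return np.asarray(arr, dtype="object")


if __name__ == "__main__":
    print(coerce_to_object_ndarray(FakeExtensionArray(["a", "b"])))
    print(coerce_to_object_ndarray(["x", "y"]))
```

An input that is already an object-dtype `ndarray` passes through `np.asarray` without a copy, which is why the surrounding function checks `result is arr` before deciding whether to copy.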
