Commit 9b78d71

Merge branch 'master' into fix-17407
2 parents 7a179d6 + 6da85b3 commit 9b78d71

88 files changed: +1951 -1316 lines changed

asv_bench/benchmarks/sparse.py

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-from itertools import repeat
+import itertools

 from .pandas_vb_common import *
 import scipy.sparse
@@ -33,7 +33,7 @@ def time_sparse_from_scipy(self):
         SparseDataFrame(scipy.sparse.rand(1000, 1000, 0.005))

     def time_sparse_from_dict(self):
-        SparseDataFrame(dict(zip(range(1000), repeat([0]))))
+        SparseDataFrame(dict(zip(range(1000), itertools.repeat([0]))))


 class sparse_series_from_coo(object):
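
Review note: the benchmark change above only swaps the import style. The sketch below uses only the standard library and suggests the two spellings build the same mapping; the variable names are illustrative and not part of the commit.

```python
import itertools
from itertools import repeat

# Old spelling (bare name) and new spelling (module-qualified name).
old_style = dict(zip(range(1000), repeat([0])))
new_style = dict(zip(range(1000), itertools.repeat([0])))

# repeat() yields the same list object every time, so all values alias it.
shared = new_style[0] is new_style[999]
print(old_style == new_style, shared)
```

The aliasing is worth keeping in mind if such a mapping is ever mutated in place.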

asv_bench/benchmarks/timeseries.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ def setup(self):
         self.no_freq = self.rng7[:50000].append(self.rng7[50002:])
         self.d_freq = self.rng7[:50000].append(self.rng7[50000:])

-        self.rng8 = date_range(start='1/1/1700', freq='B', periods=100000)
+        self.rng8 = date_range(start='1/1/1700', freq='B', periods=75000)
         self.b_freq = self.rng8[:50000].append(self.rng8[50000:])

     def time_add_timedelta(self):

ci/requirements-3.6_NUMPY_DEV.build

Lines changed: 0 additions & 1 deletion
@@ -1,3 +1,2 @@
 python=3.6*
 pytz
-cython

ci/requirements-3.6_NUMPY_DEV.build.sh

Lines changed: 3 additions & 0 deletions
@@ -14,4 +14,7 @@ pip install --pre --upgrade --timeout=60 -f $PRE_WHEELS numpy scipy
 # install dateutil from master
 pip install -U git+git://github.com/dateutil/dateutil.git

+# cython via pip
+pip install cython
+
 true

doc/source/advanced.rst

Lines changed: 3 additions & 1 deletion
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

 .. ipython:: python

+   from pandas.api.types import CategoricalDtype
+
    df = pd.DataFrame({'A': np.arange(6),
                       'B': list('aabbca')})
-   df['B'] = df['B'].astype('category', categories=list('cab'))
+   df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
    df
    df.dtypes
    df.B.cat.categories
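
Review note: a minimal standalone sketch of the updated ``astype`` call in this hunk, assuming pandas 0.21 or later. The frame mirrors the documentation example; the explicit imports and the ``categories`` variable are added here so the snippet runs outside the docs build.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})

# Pass the categories through a CategoricalDtype instead of the
# astype('category', categories=...) keyword form replaced in this diff.
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

# The categories keep the order given to the dtype, not sorted order.
categories = list(df['B'].cat.categories)
print(categories)
```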

doc/source/api.rst

Lines changed: 4 additions & 1 deletion
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
 Categorical
 ~~~~~~~~~~~

-If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
+.. autoclass:: api.types.CategoricalDtype
+   :members: categories, ordered
+
+If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
 data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
 following usable methods and properties:


doc/source/categorical.rst

Lines changed: 95 additions & 8 deletions
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df

-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`~pandas.api.types.CategoricalDtype`.

 .. ipython:: python

-    s = pd.Series(["a","b","c","a"])
-    s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+    from pandas.api.types import CategoricalDtype
+
+    s = pd.Series(["a", "b", "c", "a"])
+    cat_type = CategoricalDtype(categories=["b", "c", "d"],
+                                ordered=True)
+    s_cat = s.astype(cat_type)
     s_cat

 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

+.. _categorical.categoricaldtype:
+
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by
+
+1. ``categories``: a sequence of unique values and no missing values
+2. ``ordered``: a boolean
+
+This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created. The categories are assumed to be unordered
+by default.
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+
+    CategoricalDtype(['a', 'b', 'c'])
+    CategoricalDtype(['a', 'b', 'c'], ordered=True)
+    CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a `dtype`. For example :func:`pandas.read_csv`,
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
+
+.. note::
+
+    As a convenience, you can use the string ``'category'`` in place of a
+    :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
+    the categories being unordered, and equal to the set values present in the
+    array. In other words, ``dtype='category'`` is equivalent to
+    ``dtype=CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
+whenever they have the same categories and orderedness. When comparing two
+unordered categoricals, the order of the ``categories`` is not considered
+
+.. ipython:: python
+
+    c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
+
+    # Equal, since order is not considered when ordered=False
+    c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
+
+    # Unequal, since the second CategoricalDtype is ordered
+    c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
+
+.. ipython:: python
+
+    c1 == 'category'
+
+.. warning::
+
+    Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+    and since all instances ``CategoricalDtype`` compare equal to ``'category'``,
+    all instances of ``CategoricalDtype`` compare equal to a
+    ``CategoricalDtype(None, False)``, regardless of ``categories`` or
+    ``ordered``.
+
 Description
 -----------

@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:

 .. ipython:: python

-    s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+    s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
     s

     # categories
@@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

     s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
     s.sort_values(inplace=True)
-    s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+    s = pd.Series(["a","b","c","a"]).astype(
+        CategoricalDtype(ordered=True)
+    )
     s.sort_values(inplace=True)
     s
     s.min(), s.max()
@@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

 .. ipython:: python

-    cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+    cat = pd.Series([1,2,3]).astype(
+        CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base = pd.Series([2,2,2]).astype(
+        CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base2 = pd.Series([2,2,2]).astype(
+        CategoricalDtype(ordered=True)
+    )

     cat
     cat_base
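
Review note: the "Equality Semantics" section added in this file can be exercised with a short script. This is a sketch assuming pandas 0.21 or later; the result variable names are illustrative. The comparison against ``CategoricalDtype(None, False)`` described in the warning is deliberately left out, since that behavior may differ between pandas versions.

```python
from pandas.api.types import CategoricalDtype

c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Unordered dtypes ignore the order in which the categories are listed.
same_set = c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)

# Orderedness is part of the type, so these two differ.
ordered_differs = c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)

# A CategoricalDtype compares equal to the string 'category'.
matches_string = c1 == 'category'

print(same_set, ordered_differs, matches_string)
```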

doc/source/groupby.rst

Lines changed: 2 additions & 2 deletions
@@ -1060,7 +1060,7 @@ To select from a DataFrame or Series the nth item, use the nth method. This is a
    g.nth(-1)
    g.nth(1)

-If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna, for a Series this just needs to be truthy.
+If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna:

 .. ipython:: python

@@ -1072,7 +1072,7 @@ If you want to select the nth not-null item, use the ``dropna`` kwarg. For a Dat
    g.nth(-1, dropna='any') # NaNs denote group exhausted when using dropna
    g.last()

-   g.B.nth(0, dropna=True)
+   g.B.nth(0, dropna='all')

 As with other methods, passing ``as_index=False``, will achieve a filtration, which returns the grouped row.


doc/source/io.rst

Lines changed: 2 additions & 2 deletions
@@ -113,8 +113,8 @@ header : int or list of ints, default ``'infer'``
   rather than the first line of the file.
 names : array-like, default ``None``
   List of column names to use. If file contains no header row, then you should
-  explicitly pass ``header=None``. Duplicates in this list are not allowed unless
-  ``mangle_dupe_cols=True``, which is the default.
+  explicitly pass ``header=None``. Duplicates in this list will cause
+  a ``UserWarning`` to be issued.
 index_col : int or sequence or ``False``, default ``None``
   Column to use as the row labels of the DataFrame. If a sequence is given, a
   MultiIndex is used. If you have a malformed file with delimiters at the end of

doc/source/merging.rst

Lines changed: 8 additions & 3 deletions
@@ -830,8 +830,10 @@ The left frame.

 .. ipython:: python

+   from pandas.api.types import CategoricalDtype
+
    X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-   X = X.astype('category', categories=['foo', 'bar'])
+   X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

    left = pd.DataFrame({'X': X,
                         'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.

 .. ipython:: python

-   right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-                         'Z': [1, 2]})
+   right = pd.DataFrame({
+       'X': pd.Series(['foo', 'bar'],
+                      dtype=CategoricalDtype(['foo', 'bar'])),
+       'Z': [1, 2]
+   })
    right
    right.dtypes

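
Review note: a hedged sketch of how the two frames built in these hunks behave when merged on the categorical key, assuming a recent pandas release. Fixed values replace the random draws used in the documentation so the result is reproducible; ``key_type`` and ``result`` are illustrative names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One dtype object shared by both frames, so the key categories match.
key_type = CategoricalDtype(categories=['foo', 'bar'])

left = pd.DataFrame({
    'X': pd.Series(['foo', 'bar', 'foo', 'bar'], dtype=key_type),
    'Y': ['one', 'two', 'three', 'one'],
})
right = pd.DataFrame({
    'X': pd.Series(['foo', 'bar'], dtype=key_type),
    'Z': [1, 2],
})

# Merge on the shared column 'X'; the key keeps its categorical dtype
# when both sides use the same categories.
result = pd.merge(left, right, how='outer')
print(result.dtypes)
```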

doc/source/whatsnew/v0.21.0.txt

Lines changed: 50 additions & 1 deletion
@@ -10,6 +10,8 @@ users upgrade to this version.
 Highlights include:

 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+  categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

 Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

@@ -89,6 +91,49 @@ This does not raise any obvious exceptions, but also does not create a new colum

 Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

+``drop`` now also accepts index/columns keywords
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :meth:`~DataFrame.drop` method has gained ``index``/``columns`` keywords as an
+alternative to specify the ``axis`` and to make it similar in usage to ``reindex``
+(:issue:`12392`).
+
+For example:
+
+.. ipython:: python
+
+    df = pd.DataFrame(np.arange(8).reshape(2,4),
+                      columns=['A', 'B', 'C', 'D'])
+    df
+    df.drop(['B', 'C'], axis=1)
+    # the following is now equivalent
+    df.drop(columns=['B', 'C'])
+
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
+:issue:`15078`, :issue:`16015`):
+
+.. ipython:: python
+
+   from pandas.api.types import CategoricalDtype
+
+   s = pd.Series(['a', 'b', 'c', 'a'])  # strings
+   dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+   s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:

 Other Enhancements
@@ -110,13 +155,14 @@ Other Enhancements
 - :func:`date_range` now accepts 'Y' in addition to 'A' as an alias for end of year (:issue:`9313`)
 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. (:issue:`15838`, :issue:`17438`)
 - :func:`DataFrame.add_prefix` and :func:`DataFrame.add_suffix` now accept strings containing the '%' character. (:issue:`17151`)
-- `read_*` methods can now infer compression from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
+- Read/write methods that infer compression (:func:`read_csv`, :func:`read_table`, :func:`read_pickle`, and :meth:`~DataFrame.to_pickle`) can now infer from non-string paths, such as ``pathlib.Path`` objects (:issue:`17206`).
 - :func:`pd.read_sas()` now recognizes much more of the most frequently used date (datetime) formats in SAS7BDAT files (:issue:`15871`).
 - :func:`DataFrame.items` and :func:`Series.items` is now present in both Python 2 and 3 and is lazy in all cases (:issue:`13918`, :issue:`17213`)
 - :func:`Styler.where` has been implemented. It is as a convenience for :func:`Styler.applymap` and enables simple DataFrame styling on the Jupyter notebook (:issue:`17474`).
 - :func:`MultiIndex.is_monotonic_decreasing` has been implemented. Previously returned ``False`` in all cases. (:issue:`16554`)
 - :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`)
 - :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`)
+- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names


 .. _whatsnew_0210.api_breaking:
@@ -422,6 +468,7 @@ Other API Changes
 - The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
 - Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
   raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
+- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)
 - :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
 - :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
 - :class:`pandas.HDFStore`'s string representation is now faster and less detailed. For the previous behavior, use ``pandas.HDFStore.info()``. (:issue:`16503`).
@@ -498,6 +545,7 @@ Conversion
 - Bug in :func:`Series.fillna` returns frame when ``inplace=True`` and ``value`` is dict (:issue:`16156`)
 - Bug in :attr:`Timestamp.weekday_name` returning a UTC-based weekday name when localized to a timezone (:issue:`17354`)
 - Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
+- Bug in ``Timedelta`` construction and arithmetic that would not propagate the ``Overflow`` exception (:issue:`17367`)

 Indexing
 ^^^^^^^^
@@ -517,6 +565,7 @@ Indexing
 - Bug in ``CategoricalIndex`` reindexing in which specified indices containing duplicates were not being respected (:issue:`17323`)
 - Bug in intersection of ``RangeIndex`` with negative step (:issue:`17296`)
 - Bug in ``IntervalIndex`` where performing a scalar lookup fails for included right endpoints of non-overlapping monotonic decreasing indexes (:issue:`16417`, :issue:`17271`)
+- Bug in :meth:`DataFrame.first_valid_index` and :meth:`DataFrame.last_valid_index` when no valid entry (:issue:`17400`)
 - Bug in ``Series.rename`` when called with a `callable` alters name of series rather than index of series. (:issue:`17407`)

 I/O
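
Review note: the ``drop`` entry added to the release notes in this file can be checked with a short script. A minimal sketch, assuming pandas 0.21 or later; the ``index=`` call and the result variable names are additions for illustration and do not appear in the diff.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  columns=['A', 'B', 'C', 'D'])

# Positional labels plus axis, and the columns keyword, drop the same columns.
by_axis = df.drop(['B', 'C'], axis=1)
by_keyword = df.drop(columns=['B', 'C'])

# Row labels work the same way through the index keyword.
without_first_row = df.drop(index=[0])
print(by_keyword)
```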

pandas/_libs/groupby.pyx

Lines changed: 0 additions & 2 deletions
@@ -7,8 +7,6 @@ cimport cython

 cnp.import_array()

-cimport util
-
 from numpy cimport (ndarray,
                     double_t,
                     int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,

pandas/_libs/join.pyx

Lines changed: 0 additions & 2 deletions
@@ -8,8 +8,6 @@ from cython cimport Py_ssize_t

 np.import_array()

-cimport util
-
 from numpy cimport (ndarray,
                     int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,
                     uint32_t, uint64_t, float16_t, float32_t, float64_t)

pandas/_libs/parsers.pyx

Lines changed: 1 addition & 1 deletion
@@ -255,7 +255,7 @@ cdef extern from "parser/tokenizer.h":

     # inline int to_complex(char *item, double *p_real,
     #                       double *p_imag, char sci, char decimal)
-    inline int to_longlong(char *item, long long *p_value) nogil
+    int to_longlong(char *item, long long *p_value) nogil
     # inline int to_longlong_thousands(char *item, long long *p_value,
     #                                  char tsep)
     int to_boolean(const char *item, uint8_t *val) nogil
