Commit 96adc2e

merge master
1 parent 87baa97 commit 96adc2e

376 files changed, +6677 -2995 lines changed

.travis.yml

Lines changed: 4 additions & 4 deletions
@@ -34,23 +34,23 @@ matrix:
3434
include:
3535
- dist: trusty
3636
env:
37-
- JOB="3.7" ENV_FILE="ci/deps/travis-37.yaml" PATTERN="not slow and not network"
37+
- JOB="3.7" ENV_FILE="ci/deps/travis-37.yaml" PATTERN="(not slow and not network)"
3838

3939
- dist: trusty
4040
env:
41-
- JOB="2.7" ENV_FILE="ci/deps/travis-27.yaml" PATTERN="not slow and db"
41+
- JOB="2.7" ENV_FILE="ci/deps/travis-27.yaml" PATTERN="(not slow or (single and db))"
4242
addons:
4343
apt:
4444
packages:
4545
- python-gtk2
4646

4747
- dist: trusty
4848
env:
49-
- JOB="3.6, locale" ENV_FILE="ci/deps/travis-36-locale.yaml" PATTERN="not slow and not network and db" LOCALE_OVERRIDE="zh_CN.UTF-8"
49+
- JOB="3.6, locale" ENV_FILE="ci/deps/travis-36-locale.yaml" PATTERN="((not slow and not network) or (single and db))" LOCALE_OVERRIDE="zh_CN.UTF-8"
5050

5151
- dist: trusty
5252
env:
53-
- JOB="3.6, coverage" ENV_FILE="ci/deps/travis-36.yaml" PATTERN="not slow and not network and db" PANDAS_TESTING_MODE="deprecate" COVERAGE=true
53+
- JOB="3.6, coverage" ENV_FILE="ci/deps/travis-36.yaml" PATTERN="((not slow and not network) or (single and db))" PANDAS_TESTING_MODE="deprecate" COVERAGE=true
5454

5555
# In allow_failures
5656
- dist: trusty
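
The effect of the new ``PATTERN`` values can be sketched in plain Python. This is a minimal illustration, assuming ``PATTERN`` is handed to pytest's ``-m`` marker expression option; the ``MarkerLookup`` class and ``matches`` function are hypothetical helpers written for this note, and pytest's real expression parser does not use ``eval``.

```python
class MarkerLookup:
    """Hypothetical mapping: a marker name resolves to True if the test carries it."""

    def __init__(self, markers):
        self.markers = set(markers)

    def __getitem__(self, name):
        return name in self.markers


def matches(pattern, markers):
    # Illustration only: evaluate the boolean expression with each marker
    # name looked up in the test's marker set. Never eval untrusted input.
    return bool(eval(pattern, {"__builtins__": {}}, MarkerLookup(markers)))


OLD_27 = "not slow and db"
NEW_27 = "(not slow or (single and db))"

for markers in [set(), {"db"}, {"slow"}, {"slow", "single", "db"}]:
    print(sorted(markers), matches(OLD_27, markers), matches(NEW_27, markers))
```

Under these assumptions, the old 2.7 pattern selects only non-slow tests marked ``db``, while the new one selects every non-slow test plus any test marked both ``single`` and ``db``, even a slow one.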

LICENSES/DATEUTIL_LICENSE

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
Copyright 2017- Paul Ganssle <paul@ganssle.io>
2+
Copyright 2017- dateutil contributors (see AUTHORS file)
3+
4+
Licensed under the Apache License, Version 2.0 (the "License");
5+
you may not use this file except in compliance with the License.
6+
You may obtain a copy of the License at
7+
8+
http://www.apache.org/licenses/LICENSE-2.0
9+
10+
Unless required by applicable law or agreed to in writing, software
11+
distributed under the License is distributed on an "AS IS" BASIS,
12+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
See the License for the specific language governing permissions and
14+
limitations under the License.
15+
16+
The above license applies to all contributions after 2017-12-01, as well as
17+
all contributions that have been re-licensed (see AUTHORS file for the list of
18+
contributors who have re-licensed their code).
19+
--------------------------------------------------------------------------------
20+
dateutil - Extensions to the standard Python datetime module.
21+
22+
Copyright (c) 2003-2011 - Gustavo Niemeyer <gustavo@niemeyer.net>
23+
Copyright (c) 2012-2014 - Tomi Pieviläinen <tomi.pievilainen@iki.fi>
24+
Copyright (c) 2014-2016 - Yaron de Leeuw <me@jarondl.net>
25+
Copyright (c) 2015- - Paul Ganssle <paul@ganssle.io>
26+
Copyright (c) 2015- - dateutil contributors (see AUTHORS file)
27+
28+
All rights reserved.
29+
30+
Redistribution and use in source and binary forms, with or without
31+
modification, are permitted provided that the following conditions are met:
32+
33+
* Redistributions of source code must retain the above copyright notice,
34+
this list of conditions and the following disclaimer.
35+
* Redistributions in binary form must reproduce the above copyright notice,
36+
this list of conditions and the following disclaimer in the documentation
37+
and/or other materials provided with the distribution.
38+
* Neither the name of the copyright holder nor the names of its
39+
contributors may be used to endorse or promote products derived from
40+
this software without specific prior written permission.
41+
42+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
43+
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
44+
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
45+
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
46+
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
47+
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
48+
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
49+
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
50+
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
51+
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
52+
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
53+
54+
The above BSD License Applies to all code, even that also covered by Apache 2.0.

doc/source/cookbook.rst

Lines changed: 1 addition & 1 deletion
@@ -1236,7 +1236,7 @@ the following Python code will read the binary file ``'binary.dat'`` into a
12361236
pandas ``DataFrame``, where each element of the struct corresponds to a column
12371237
in the frame:
12381238

1239-
.. ipython:: python
1239+
.. code-block:: python
12401240
12411241
names = 'count', 'avg', 'scale'
12421242

doc/source/gotchas.rst

Lines changed: 22 additions & 2 deletions
@@ -215,8 +215,28 @@ arrays. For example:
215215
s2.dtype
216216
217217
This trade-off is made largely for memory and performance reasons, and also so
218-
that the resulting ``Series`` continues to be "numeric". One possibility is to
219-
use ``dtype=object`` arrays instead.
218+
that the resulting ``Series`` continues to be "numeric".
219+
220+
If you need to represent integers with possibly missing values, use one of
221+
the nullable-integer extension dtypes provided by pandas:
222+
223+
* :class:`Int8Dtype`
224+
* :class:`Int16Dtype`
225+
* :class:`Int32Dtype`
226+
* :class:`Int64Dtype`
227+
228+
.. ipython:: python
229+
230+
s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
231+
dtype=pd.Int64Dtype())
232+
s_int
233+
s_int.dtype
234+
235+
s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
236+
s2_int
237+
s2_int.dtype
238+
239+
See :ref:`integer_na` for more.
220240

221241
``NA`` type promotions
222242
~~~~~~~~~~~~~~~~~~~~~~
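
As a standalone sketch of what the added ``gotchas.rst`` example demonstrates, the snippet below reindexes the same data twice, once with the default NumPy dtype and once with ``Int64``. It assumes a pandas version that ships the nullable integer dtypes (0.24.0 or later, per the ``versionadded`` note in this commit); the variable names are illustrative.

```python
import pandas as pd

labels = ["a", "b", "z"]  # "z" is not in the original index

# Default NumPy-backed integers: the missing label forces a cast to float64.
default = pd.Series([1, 2, 3], index=list("abc")).reindex(labels)

# Nullable integers: the missing label becomes a missing value, dtype is kept.
nullable = pd.Series([1, 2, 3], index=list("abc"), dtype="Int64").reindex(labels)

print(default.dtype)
print(nullable.dtype)
print(nullable.isna().tolist())
```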

doc/source/index.rst.template

Lines changed: 1 addition & 0 deletions
@@ -143,6 +143,7 @@ See the package overview for more detail about what's in the library.
143143
timeseries
144144
timedeltas
145145
categorical
146+
integer_na
146147
visualization
147148
style
148149
io

doc/source/integer_na.rst

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
1+
.. currentmodule:: pandas
2+
3+
{{ header }}
4+
5+
.. _integer_na:
6+
7+
**************************
8+
Nullable Integer Data Type
9+
**************************
10+
11+
.. versionadded:: 0.24.0
12+
13+
In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
14+
missing data. Because ``NaN`` is a float, this forces an array of integers with
15+
any missing values to become floating point. In some cases, this may not matter
16+
much. But if your integer column is, say, an identifier, casting to float can
17+
be problematic. Some integers cannot even be represented as floating point
18+
numbers.
19+
20+
Pandas can represent integer data with possibly missing values using
21+
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
22+
implemented within pandas. It is not the default dtype for integers, and will not be inferred;
23+
you must explicitly pass the dtype into :meth:`array` or :class:`Series`:
24+
25+
.. ipython:: python
26+
27+
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
28+
arr
29+
30+
Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from
31+
NumPy's ``'int64'`` dtype):
32+
33+
.. ipython:: python
34+
35+
pd.array([1, 2, np.nan], dtype="Int64")
36+
37+
This array can be stored in a :class:`DataFrame` or :class:`Series` like any
38+
NumPy array.
39+
40+
.. ipython:: python
41+
42+
pd.Series(arr)
43+
44+
You can also pass the list-like object to the :class:`Series` constructor
45+
with the dtype.
46+
47+
.. ipython:: python
48+
49+
s = pd.Series([1, 2, np.nan], dtype="Int64")
50+
s
51+
52+
By default (if you don't specify ``dtype``), NumPy is used, and you'll end
53+
up with a ``float64`` dtype Series:
54+
55+
.. ipython:: python
56+
57+
pd.Series([1, 2, np.nan])
58+
59+
Operations involving an integer array will behave similarly to NumPy arrays.
60+
Missing values will be propagated, and the data will be coerced to another
61+
dtype if needed.
62+
63+
.. ipython:: python
64+
65+
# arithmetic
66+
s + 1
67+
68+
# comparison
69+
s == 1
70+
71+
# indexing
72+
s.iloc[1:3]
73+
74+
# operate with other dtypes
75+
s + s.iloc[1:3].astype('Int8')
76+
77+
# coerce when needed
78+
s + 0.01
79+
80+
These dtypes can operate as part of a ``DataFrame``.
81+
82+
.. ipython:: python
83+
84+
df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
85+
df
86+
df.dtypes
87+
88+
89+
These dtypes can be merged, reshaped, and cast.
90+
91+
.. ipython:: python
92+
93+
pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
94+
df['A'].astype(float)
95+
96+
Reduction and groupby operations such as 'sum' work as well.
97+
98+
.. ipython:: python
99+
100+
df.sum()
101+
df.groupby('B').A.sum()
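
The new page's remark that some integers cannot be represented as floating point numbers can be checked with the standard library alone. The sketch below uses a hypothetical identifier just above 2**53, the point past which a 64-bit float can no longer hold every integer exactly.

```python
# A hypothetical identifier one past the largest exactly representable range.
big_id = 2**53 + 1

# Casting to float silently rounds to the nearest representable value.
as_float = float(big_id)
round_trip = int(as_float)

print(big_id)
print(round_trip)
print(big_id == round_trip)
```

This is the failure mode an identifier column risks when a single missing value forces the whole column to ``float64``.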

doc/source/io.rst

Lines changed: 10 additions & 6 deletions
@@ -362,16 +362,17 @@ columns:
362362

363363
.. ipython:: python
364364
365-
data = ('a,b,c\n'
366-
'1,2,3\n'
367-
'4,5,6\n'
368-
'7,8,9')
365+
data = ('a,b,c,d\n'
366+
'1,2,3,4\n'
367+
'5,6,7,8\n'
368+
'9,10,11')
369369
print(data)
370370
371371
df = pd.read_csv(StringIO(data), dtype=object)
372372
df
373373
df['a'][0]
374-
df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
374+
df = pd.read_csv(StringIO(data),
375+
dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
375376
df.dtypes
376377
377378
Fortunately, pandas offers more than one way to ensure that your column(s)
@@ -4646,6 +4647,7 @@ Write to a feather file.
46464647
Read from a feather file.
46474648

46484649
.. ipython:: python
4650+
:okwarning:
46494651
46504652
result = pd.read_feather('example.feather')
46514653
result
@@ -4720,6 +4722,7 @@ Write to a parquet file.
47204722
Read from a parquet file.
47214723

47224724
.. ipython:: python
4725+
:okwarning:
47234726
47244727
result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
47254728
result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
@@ -4790,6 +4793,7 @@ Partitioning Parquet files
47904793
Parquet supports partitioning of data based on the values of one or more columns.
47914794

47924795
.. ipython:: python
4796+
:okwarning:
47934797
47944798
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
47954799
df.to_parquet(fname='test', engine='pyarrow',
@@ -4879,7 +4883,7 @@ below and the SQLAlchemy `documentation <https://docs.sqlalchemy.org/en/latest/c
48794883
48804884
If you want to manage your own connections you can pass one of those instead:
48814885

4882-
.. ipython:: python
4886+
.. code-block:: python
48834887
48844888
with engine.connect() as conn, conn.begin():
48854889
data = pd.read_sql_table('data', conn)
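
The first ``io.rst`` hunk adds a fourth column whose last row is short, so column ``d`` has a missing value. The snippet below is a self-contained version of that example, assuming a pandas version with the nullable integer dtypes; it adds the imports the documentation snippet takes for granted.

```python
from io import StringIO

import numpy as np
import pandas as pd

# The final row has only three fields, so "d" is missing there.
data = ("a,b,c,d\n"
        "1,2,3,4\n"
        "5,6,7,8\n"
        "9,10,11")

df = pd.read_csv(StringIO(data),
                 dtype={"b": object, "c": np.float64, "d": "Int64"})

print(df.dtypes)
print(df["d"])
```

Without the explicit ``'Int64'`` entry, the missing field would instead push column ``d`` to a floating point dtype.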

doc/source/missing_data.rst

Lines changed: 38 additions & 26 deletions
@@ -19,32 +19,6 @@ pandas.
1919

2020
See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.
2121

22-
Missing data basics
23-
-------------------
24-
25-
When / why does data become missing?
26-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
27-
28-
Some might quibble over our usage of *missing*. By "missing" we simply mean
29-
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with
30-
missing data, either because it exists and was not collected or it never
31-
existed. For example, in a collection of financial time series, some of the time
32-
series might start on different dates. Thus, values prior to the start date
33-
would generally be marked as missing.
34-
35-
In pandas, one of the most common ways that missing data is **introduced** into
36-
a data set is by reindexing. For example:
37-
38-
.. ipython:: python
39-
40-
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
41-
columns=['one', 'two', 'three'])
42-
df['four'] = 'bar'
43-
df['five'] = df['one'] > 0
44-
df
45-
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
46-
df2
47-
4822
Values considered "missing"
4923
~~~~~~~~~~~~~~~~~~~~~~~~~~~
5024

@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA".
6236

6337
.. _missing.isna:
6438

39+
.. ipython:: python
40+
41+
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
42+
columns=['one', 'two', 'three'])
43+
df['four'] = 'bar'
44+
df['five'] = df['one'] > 0
45+
df
46+
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
47+
df2
48+
6549
To make detecting missing values easier (and across different array dtypes),
6650
pandas provides the :func:`isna` and
6751
:func:`notna` functions, which are also methods on
@@ -90,6 +74,23 @@ Series and DataFrame objects:
9074
9175
df2['one'] == np.nan
9276
77+
Integer Dtypes and Missing Data
78+
-------------------------------
79+
80+
Because ``NaN`` is a float, a column of integers with even one missing value
81+
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
82+
provides a nullable integer array, which can be used by explicitly requesting
83+
the dtype:
84+
85+
.. ipython:: python
86+
87+
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
88+
89+
Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
90+
used.
91+
92+
See :ref:`integer_na` for more.
93+
9394
Datetimes
9495
---------
9596

@@ -751,3 +752,14 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work
751752
752753
reindexed[crit.fillna(False)]
753754
reindexed[crit.fillna(True)]
755+
756+
Pandas provides a nullable integer dtype, but you must explicitly request it
757+
when creating the series or column. Notice that we use a capital "I" in
758+
the ``dtype="Int64"``.
759+
760+
.. ipython:: python
761+
762+
s = pd.Series([0, 1, np.nan, 3, 4], dtype="Int64")
763+
s
764+
765+
See :ref:`integer_na` for more.
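
The capital "I" matters because the lowercase alias names NumPy's dtype, which has no way to store a missing value. A minimal sketch, assuming current pandas behavior in which the NumPy-backed request raises a ``ValueError`` (or a subclass of it); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [0, 1, np.nan, 3, 4]

# Lowercase "int64" is NumPy's dtype and cannot hold NaN.
try:
    pd.Series(values, dtype="int64")
    numpy_int_failed = False
except ValueError as exc:
    numpy_int_failed = True
    print("int64 failed:", exc)

# Capital "Int64" is the pandas nullable integer dtype.
s = pd.Series(values, dtype="Int64")
print(s)
```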
