`_. The same is true
+for several of the storage backends, and you should follow the links
+at `fsimpl1`_ for implementations built into ``fsspec`` and `fsimpl2`_
+for those not included in the main ``fsspec``
+distribution.
+
+You can also pass parameters directly to the backend driver. For example,
+if you do *not* have S3 credentials, you can still access public data by
+specifying an anonymous connection, such as
+
+.. versionadded:: 1.2.0
+
+.. code-block:: python
+
+ pd.read_csv(
+ "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
+ "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
+ storage_options={"anon": True},
+ )
+
+``fsspec`` also allows complex URLs, for accessing data in compressed
+archives, local caching of files, and more. To locally cache the above
+example, you would modify the call to
+
+.. code-block:: python
+
+ pd.read_csv(
+ "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
+ "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
+ storage_options={"s3": {"anon": True}},
+ )
+
+where we specify that the "anon" parameter is meant for the "s3" part of
+the implementation, not for the caching implementation. Note that this caches to a temporary
+directory for the duration of the session only, but you can also specify
+a permanent store.
+
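+A minimal sketch of pointing the cache at a permanent location instead
+(the ``cache_storage`` path below is illustrative, and the optional
+``s3fs`` dependency is assumed to be installed):
+
+.. code-block:: python
+
+    pd.read_csv(
+        "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
+        "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
+        storage_options={
+            "s3": {"anon": True},
+            "simplecache": {"cache_storage": "/tmp/pandas-cache"},
+        },
+    )
+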
+.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/
+.. _fsimpl1: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
+.. _fsimpl2: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations
Writing out data
''''''''''''''''
@@ -1668,7 +1679,7 @@ The ``Series`` and ``DataFrame`` objects have an instance method ``to_csv`` whic
allows storing the contents of the object as a comma-separated-values file. The
function takes a number of arguments. Only the first is required.
-* ``path_or_buf``: A string path to the file to write or a file object. If a file object it must be opened with `newline=''`
+* ``path_or_buf``: A string path to the file to write or a file object. If a file object it must be opened with ``newline=''``
* ``sep`` : Field delimiter for the output file (default ",")
* ``na_rep``: A string representation of a missing value (default '')
* ``float_format``: Format string for floating point numbers
@@ -1676,13 +1687,13 @@ function takes a number of arguments. Only the first is required.
* ``header``: Whether to write out the column names (default True)
* ``index``: whether to write row (index) names (default True)
* ``index_label``: Column label(s) for index column(s) if desired. If None
- (default), and `header` and `index` are True, then the index names are
+ (default), and ``header`` and ``index`` are True, then the index names are
used. (A sequence should be given if the ``DataFrame`` uses MultiIndex).
* ``mode`` : Python write mode, default 'w'
* ``encoding``: a string representing the encoding to use if the contents are
non-ASCII, for Python versions prior to 3
-* ``line_terminator``: Character sequence denoting line end (default `os.linesep`)
-* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a `float_format` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
+* ``line_terminator``: Character sequence denoting line end (default ``os.linesep``)
+* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a ``float_format`` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
* ``quotechar``: Character used to quote fields (default '"')
* ``doublequote``: Control quoting of ``quotechar`` in fields (default True)
* ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when
@@ -1769,7 +1780,7 @@ Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datet
.. ipython:: python
- dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
+ dfj = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))
json = dfj.to_json()
json
@@ -1781,10 +1792,13 @@ file / string. Consider the following ``DataFrame`` and ``Series``:
.. ipython:: python
- dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
- columns=list('ABC'), index=list('xyz'))
+ dfjo = pd.DataFrame(
+ dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
+ columns=list("ABC"),
+ index=list("xyz"),
+ )
dfjo
- sjo = pd.Series(dict(x=15, y=16, z=17), name='D')
+ sjo = pd.Series(dict(x=15, y=16, z=17), name="D")
sjo
**Column oriented** (the default for ``DataFrame``) serializes the data as
@@ -1835,7 +1849,7 @@ preservation of metadata including but not limited to dtypes and index names.
Any orient option that encodes to a JSON object will not preserve the ordering of
index and column labels during round-trip serialization. If you wish to preserve
- label ordering use the `split` option as it uses ordered containers.
+ label ordering use the ``split`` option as it uses ordered containers.
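
For example, a round trip through ``orient='split'`` keeps the ``dfjo``
columns in their original ``A``, ``B``, ``C`` order (a minimal sketch using
the frame defined above):

.. code-block:: python

    pd.read_json(dfjo.to_json(orient="split"), orient="split")
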
Date handling
+++++++++++++
@@ -1844,24 +1858,24 @@ Writing in ISO date format:
.. ipython:: python
- dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
- dfd['date'] = pd.Timestamp('20130101')
+ dfd = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))
+ dfd["date"] = pd.Timestamp("20130101")
dfd = dfd.sort_index(1, ascending=False)
- json = dfd.to_json(date_format='iso')
+ json = dfd.to_json(date_format="iso")
json
Writing in ISO date format, with microseconds:
.. ipython:: python
- json = dfd.to_json(date_format='iso', date_unit='us')
+ json = dfd.to_json(date_format="iso", date_unit="us")
json
Epoch timestamps, in seconds:
.. ipython:: python
- json = dfd.to_json(date_format='epoch', date_unit='s')
+ json = dfd.to_json(date_format="epoch", date_unit="s")
json
Writing to a file, with a date index and a date column:
@@ -1869,13 +1883,13 @@ Writing to a file, with a date index and a date column:
.. ipython:: python
dfj2 = dfj.copy()
- dfj2['date'] = pd.Timestamp('20130101')
- dfj2['ints'] = list(range(5))
- dfj2['bools'] = True
- dfj2.index = pd.date_range('20130101', periods=5)
- dfj2.to_json('test.json')
+ dfj2["date"] = pd.Timestamp("20130101")
+ dfj2["ints"] = list(range(5))
+ dfj2["bools"] = True
+ dfj2.index = pd.date_range("20130101", periods=5)
+ dfj2.to_json("test.json")
- with open('test.json') as fh:
+ with open("test.json") as fh:
print(fh.read())
Fallback behavior
@@ -1884,7 +1898,7 @@ Fallback behavior
If the JSON serializer cannot handle the container contents directly it will
fall back in the following manner:
-* if the dtype is unsupported (e.g. ``np.complex``) then the ``default_handler``, if provided, will be called
+* if the dtype is unsupported (e.g. ``np.complex_``) then the ``default_handler``, if provided, will be called
for each value, otherwise an exception is raised.
* if an object is unsupported it will attempt the following:
@@ -2010,26 +2024,27 @@ Reading from a file:
.. ipython:: python
- pd.read_json('test.json')
+ pd.read_json("test.json")
Don't convert any data (but still convert axes and dates):
.. ipython:: python
- pd.read_json('test.json', dtype=object).dtypes
+ pd.read_json("test.json", dtype=object).dtypes
Specify dtypes for conversion:
.. ipython:: python
- pd.read_json('test.json', dtype={'A': 'float32', 'bools': 'int8'}).dtypes
+ pd.read_json("test.json", dtype={"A": "float32", "bools": "int8"}).dtypes
Preserve string indices:
.. ipython:: python
- si = pd.DataFrame(np.zeros((4, 4)), columns=list(range(4)),
- index=[str(i) for i in range(4)])
+ si = pd.DataFrame(
+ np.zeros((4, 4)), columns=list(range(4)), index=[str(i) for i in range(4)]
+ )
si
si.index
si.columns
@@ -2044,10 +2059,10 @@ Dates written in nanoseconds need to be read back in nanoseconds:
.. ipython:: python
- json = dfj2.to_json(date_unit='ns')
+ json = dfj2.to_json(date_unit="ns")
# Try to parse timestamps as milliseconds -> Won't Work
- dfju = pd.read_json(json, date_unit='ms')
+ dfju = pd.read_json(json, date_unit="ms")
dfju
# Let pandas detect the correct precision
@@ -2055,7 +2070,7 @@ Dates written in nanoseconds need to be read back in nanoseconds:
dfju
# Or specify that all timestamps are in nanoseconds
- dfju = pd.read_json(json, date_unit='ns')
+ dfju = pd.read_json(json, date_unit="ns")
dfju
The Numpy parameter
@@ -2077,7 +2092,7 @@ data:
randfloats = np.random.uniform(-100, 1000, 10000)
randfloats.shape = (1000, 10)
- dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))
+ dffloats = pd.DataFrame(randfloats, columns=list("ABCDEFGHIJ"))
jsonfloats = dffloats.to_json()
@@ -2124,7 +2139,7 @@ The speedup is less noticeable for smaller datasets:
.. ipython:: python
:suppress:
- os.remove('test.json')
+ os.remove("test.json")
.. _io.json_normalize:
@@ -2136,38 +2151,54 @@ into a flat table.
.. ipython:: python
- data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
- {'name': {'given': 'Mose', 'family': 'Regner'}},
- {'id': 2, 'name': 'Faye Raker'}]
+ data = [
+ {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
+ {"name": {"given": "Mose", "family": "Regner"}},
+ {"id": 2, "name": "Faye Raker"},
+ ]
pd.json_normalize(data)
.. ipython:: python
- data = [{'state': 'Florida',
- 'shortname': 'FL',
- 'info': {'governor': 'Rick Scott'},
- 'county': [{'name': 'Dade', 'population': 12345},
- {'name': 'Broward', 'population': 40000},
- {'name': 'Palm Beach', 'population': 60000}]},
- {'state': 'Ohio',
- 'shortname': 'OH',
- 'info': {'governor': 'John Kasich'},
- 'county': [{'name': 'Summit', 'population': 1234},
- {'name': 'Cuyahoga', 'population': 1337}]}]
-
- pd.json_normalize(data, 'county', ['state', 'shortname', ['info', 'governor']])
+ data = [
+ {
+ "state": "Florida",
+ "shortname": "FL",
+ "info": {"governor": "Rick Scott"},
+ "county": [
+ {"name": "Dade", "population": 12345},
+ {"name": "Broward", "population": 40000},
+ {"name": "Palm Beach", "population": 60000},
+ ],
+ },
+ {
+ "state": "Ohio",
+ "shortname": "OH",
+ "info": {"governor": "John Kasich"},
+ "county": [
+ {"name": "Summit", "population": 1234},
+ {"name": "Cuyahoga", "population": 1337},
+ ],
+ },
+ ]
+
+ pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]])
The ``max_level`` parameter provides more control over the level at which to end normalization.
With ``max_level=1`` the following snippet normalizes only the first nesting level of the provided dict.
.. ipython:: python
- data = [{'CreatedBy': {'Name': 'User001'},
- 'Lookup': {'TextField': 'Some text',
- 'UserField': {'Id': 'ID001',
- 'Name': 'Name001'}},
- 'Image': {'a': 'b'}
- }]
+ data = [
+ {
+ "CreatedBy": {"Name": "User001"},
+ "Lookup": {
+ "TextField": "Some text",
+ "UserField": {"Id": "ID001", "Name": "Name001"},
+ },
+ "Image": {"a": "b"},
+ }
+ ]
pd.json_normalize(data, max_level=1)
.. _io.jsonl:
@@ -2182,15 +2213,15 @@ For line-delimited json files, pandas can also return an iterator which reads in
.. ipython:: python
- jsonl = '''
+ jsonl = """
{"a": 1, "b": 2}
{"a": 3, "b": 4}
- '''
+ """
df = pd.read_json(jsonl, lines=True)
df
- df.to_json(orient='records', lines=True)
+ df.to_json(orient="records", lines=True)
- # reader is an iterator that returns `chunksize` lines each iteration
+ # reader is an iterator that returns ``chunksize`` lines each iteration
reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
reader
for chunk in reader:
@@ -2208,12 +2239,16 @@ a JSON string with two fields, ``schema`` and ``data``.
.. ipython:: python
- df = pd.DataFrame({'A': [1, 2, 3],
- 'B': ['a', 'b', 'c'],
- 'C': pd.date_range('2016-01-01', freq='d', periods=3)},
- index=pd.Index(range(3), name='idx'))
+ df = pd.DataFrame(
+ {
+ "A": [1, 2, 3],
+ "B": ["a", "b", "c"],
+ "C": pd.date_range("2016-01-01", freq="d", periods=3),
+ },
+ index=pd.Index(range(3), name="idx"),
+ )
df
- df.to_json(orient='table', date_format="iso")
+ df.to_json(orient="table", date_format="iso")
The ``schema`` field contains the ``fields`` key, which itself contains
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
@@ -2230,7 +2265,7 @@ The full list of types supported are described in the Table Schema
spec. This table shows the mapping from pandas types:
=============== =================
-Pandas type Table Schema type
+pandas type Table Schema type
=============== =================
int64 integer
float64 number
@@ -2252,7 +2287,8 @@ A few notes on the generated table schema:
.. ipython:: python
from pandas.io.json import build_table_schema
- s = pd.Series(pd.date_range('2016', periods=4))
+
+ s = pd.Series(pd.date_range("2016", periods=4))
build_table_schema(s)
* datetimes with a timezone (before serializing), include an additional field
@@ -2260,8 +2296,7 @@ A few notes on the generated table schema:
.. ipython:: python
- s_tz = pd.Series(pd.date_range('2016', periods=12,
- tz='US/Central'))
+ s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central"))
build_table_schema(s_tz)
* Periods are converted to timestamps before serialization, and so have the
@@ -2270,8 +2305,7 @@ A few notes on the generated table schema:
.. ipython:: python
- s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
- periods=4))
+ s_per = pd.Series(1, index=pd.period_range("2016", freq="A-DEC", periods=4))
build_table_schema(s_per)
* Categoricals use the ``any`` type and an ``enum`` constraint listing
@@ -2279,7 +2313,7 @@ A few notes on the generated table schema:
.. ipython:: python
- s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+ s_cat = pd.Series(pd.Categorical(["a", "b", "a"]))
build_table_schema(s_cat)
* A ``primaryKey`` field, containing an array of labels, is included
@@ -2295,8 +2329,7 @@ A few notes on the generated table schema:
.. ipython:: python
- s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
- (0, 1)]))
+ s_multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)]))
build_table_schema(s_multi)
* The default naming roughly follows these rules:
@@ -2310,24 +2343,26 @@ A few notes on the generated table schema:
then ``level_`` is used.
-.. versionadded:: 0.23.0
-
``read_json`` also accepts ``orient='table'`` as an argument. This allows for
the preservation of metadata such as dtypes and index names in a
round-trippable manner.
.. ipython:: python
- df = pd.DataFrame({'foo': [1, 2, 3, 4],
- 'bar': ['a', 'b', 'c', 'd'],
- 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
- 'qux': pd.Categorical(['a', 'b', 'c', 'c'])
- }, index=pd.Index(range(4), name='idx'))
+ df = pd.DataFrame(
+ {
+ "foo": [1, 2, 3, 4],
+ "bar": ["a", "b", "c", "d"],
+ "baz": pd.date_range("2018-01-01", freq="d", periods=4),
+ "qux": pd.Categorical(["a", "b", "c", "c"]),
+ },
+ index=pd.Index(range(4), name="idx"),
+ )
df
df.dtypes
- df.to_json('test.json', orient='table')
- new_df = pd.read_json('test.json', orient='table')
+ df.to_json("test.json", orient="table")
+ new_df = pd.read_json("test.json", orient="table")
new_df
new_df.dtypes
@@ -2339,17 +2374,17 @@ indicate missing values and the subsequent read cannot distinguish the intent.
.. ipython:: python
:okwarning:
- df.index.name = 'index'
- df.to_json('test.json', orient='table')
- new_df = pd.read_json('test.json', orient='table')
+ df.index.name = "index"
+ df.to_json("test.json", orient="table")
+ new_df = pd.read_json("test.json", orient="table")
print(new_df.index.name)
.. ipython:: python
:suppress:
- os.remove('test.json')
+ os.remove("test.json")
-.. _Table Schema: https://specs.frictionlessdata.io/json-table-schema/
+.. _Table Schema: https://specs.frictionlessdata.io/table-schema/
HTML
----
@@ -2377,7 +2412,7 @@ Read a URL with no options:
.. ipython:: python
- url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
+ url = "https://www.fdic.gov/bank/individual/failed/banklist.html"
dfs = pd.read_html(url)
dfs
@@ -2392,11 +2427,11 @@ as a string:
.. ipython:: python
:suppress:
- file_path = os.path.abspath(os.path.join('source', '_static', 'banklist.html'))
+ file_path = os.path.abspath(os.path.join("source", "_static", "banklist.html"))
.. ipython:: python
- with open(file_path, 'r') as f:
+ with open(file_path, "r") as f:
dfs = pd.read_html(f.read())
dfs
@@ -2404,7 +2439,7 @@ You can even pass in an instance of ``StringIO`` if you so desire:
.. ipython:: python
- with open(file_path, 'r') as f:
+ with open(file_path, "r") as f:
sio = StringIO(f.read())
dfs = pd.read_html(sio)
@@ -2423,7 +2458,7 @@ Read a URL and match a table that contains specific text:
.. code-block:: python
- match = 'Metcalf Bank'
+ match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
Specify a header row (by default ``<th>`` or ``<td>`` elements located within a
@@ -2458,15 +2493,15 @@ Specify an HTML attribute:
.. code-block:: python
- dfs1 = pd.read_html(url, attrs={'id': 'table'})
- dfs2 = pd.read_html(url, attrs={'class': 'sortable'})
+ dfs1 = pd.read_html(url, attrs={"id": "table"})
+ dfs2 = pd.read_html(url, attrs={"class": "sortable"})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True
Specify values that should be converted to NaN:
.. code-block:: python
- dfs = pd.read_html(url, na_values=['No Acquirer'])
+ dfs = pd.read_html(url, na_values=["No Acquirer"])
Specify whether to keep the default set of NaN values:
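
A sketch of such a call, reusing the ``url`` from the examples above:

.. code-block:: python

    dfs = pd.read_html(url, keep_default_na=False)
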
@@ -2481,22 +2516,21 @@ columns to strings.
.. code-block:: python
- url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
- dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0,
- converters={'MNC': str})
+ url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
+ dfs = pd.read_html(url_mcc, match="Telekom Albania", header=0, converters={"MNC": str})
Use some combination of the above:
.. code-block:: python
- dfs = pd.read_html(url, match='Metcalf Bank', index_col=0)
+ dfs = pd.read_html(url, match="Metcalf Bank", index_col=0)
Read in pandas ``to_html`` output (with some loss of floating point precision):
.. code-block:: python
df = pd.DataFrame(np.random.randn(2, 2))
- s = df.to_html(float_format='{0:.40g}'.format)
+ s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)
The ``lxml`` backend will raise an error on a failed parse if that is the only
@@ -2506,13 +2540,13 @@ for example, the function expects a sequence of strings. You may use:
.. code-block:: python
- dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml'])
+ dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"])
Or you could pass ``flavor='lxml'`` without a list:
.. code-block:: python
- dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor='lxml')
+ dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml")
However, if you have bs4 and html5lib installed and pass ``None`` or ``['lxml',
'bs4']`` then the parse will most likely succeed. Note that *as soon as a parse
@@ -2520,7 +2554,7 @@ succeeds, the function will return*.
.. code-block:: python
- dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])
+ dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])
.. _io.html:
@@ -2542,8 +2576,8 @@ in the method ``to_string`` described above.
:suppress:
def write_html(df, filename, *args, **kwargs):
- static = os.path.abspath(os.path.join('source', '_static'))
- with open(os.path.join(static, filename + '.html'), 'w') as f:
+ static = os.path.abspath(os.path.join("source", "_static"))
+ with open(os.path.join(static, filename + ".html"), "w") as f:
df.to_html(f, *args, **kwargs)
.. ipython:: python
@@ -2555,7 +2589,7 @@ in the method ``to_string`` described above.
.. ipython:: python
:suppress:
- write_html(df, 'basic')
+ write_html(df, "basic")
HTML:
@@ -2571,7 +2605,7 @@ The ``columns`` argument will limit the columns shown:
.. ipython:: python
:suppress:
- write_html(df, 'columns', columns=[0])
+ write_html(df, "columns", columns=[0])
HTML:
@@ -2583,12 +2617,12 @@ point values:
.. ipython:: python
- print(df.to_html(float_format='{0:.10f}'.format))
+ print(df.to_html(float_format="{0:.10f}".format))
.. ipython:: python
:suppress:
- write_html(df, 'float_format', float_format='{0:.10f}'.format)
+ write_html(df, "float_format", float_format="{0:.10f}".format)
HTML:
@@ -2605,7 +2639,7 @@ off:
.. ipython:: python
:suppress:
- write_html(df, 'nobold', bold_rows=False)
+ write_html(df, "nobold", bold_rows=False)
.. raw:: html
:file: ../_static/nobold.html
@@ -2616,7 +2650,7 @@ table CSS classes. Note that these classes are *appended* to the existing
.. ipython:: python
- print(df.to_html(classes=['awesome_table_class', 'even_more_awesome_class']))
+ print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
The ``render_links`` argument provides the ability to add hyperlinks to cells
that contain URLs.
@@ -2625,15 +2659,18 @@ that contain URLs.
.. ipython:: python
- url_df = pd.DataFrame({
- 'name': ['Python', 'Pandas'],
- 'url': ['https://www.python.org/', 'https://pandas.pydata.org']})
+ url_df = pd.DataFrame(
+ {
+ "name": ["Python", "pandas"],
+ "url": ["https://www.python.org/", "https://pandas.pydata.org"],
+ }
+ )
print(url_df.to_html(render_links=True))
.. ipython:: python
:suppress:
- write_html(url_df, 'render_links', render_links=True)
+ write_html(url_df, "render_links", render_links=True)
HTML:
@@ -2646,14 +2683,14 @@ Finally, the ``escape`` argument allows you to control whether the
.. ipython:: python
- df = pd.DataFrame({'a': list('&<>'), 'b': np.random.randn(3)})
+ df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
.. ipython:: python
:suppress:
- write_html(df, 'escape')
- write_html(df, 'noescape', escape=False)
+ write_html(df, "escape")
+ write_html(df, "noescape", escape=False)
Escaped:
@@ -2780,7 +2817,7 @@ file, and the ``sheet_name`` indicating which sheet to parse.
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
+ pd.read_excel("path_to_file.xls", sheet_name="Sheet1")
.. _io.excel.excelfile_class:
@@ -2795,16 +2832,16 @@ read into memory only once.
.. code-block:: python
- xlsx = pd.ExcelFile('path_to_file.xls')
- df = pd.read_excel(xlsx, 'Sheet1')
+ xlsx = pd.ExcelFile("path_to_file.xls")
+ df = pd.read_excel(xlsx, "Sheet1")
The ``ExcelFile`` class can also be used as a context manager.
.. code-block:: python
- with pd.ExcelFile('path_to_file.xls') as xls:
- df1 = pd.read_excel(xls, 'Sheet1')
- df2 = pd.read_excel(xls, 'Sheet2')
+ with pd.ExcelFile("path_to_file.xls") as xls:
+ df1 = pd.read_excel(xls, "Sheet1")
+ df2 = pd.read_excel(xls, "Sheet2")
The ``sheet_names`` property will generate
a list of the sheet names in the file.
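
For example, a minimal sketch reusing the illustrative file name from above:

.. code-block:: python

    xlsx = pd.ExcelFile("path_to_file.xls")
    xlsx.sheet_names  # e.g. ['Sheet1', 'Sheet2']
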
@@ -2816,10 +2853,9 @@ different parameters:
data = {}
# For when Sheet1's format differs from Sheet2
- with pd.ExcelFile('path_to_file.xls') as xls:
- data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
- na_values=['NA'])
- data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
+ with pd.ExcelFile("path_to_file.xls") as xls:
+ data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
+ data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)
Note that if the same parsing parameters are used for all sheets, a list
of sheet names can simply be passed to ``read_excel`` with no loss in performance.
@@ -2828,15 +2864,14 @@ of sheet names can simply be passed to ``read_excel`` with no loss in performanc
# using the ExcelFile class
data = {}
- with pd.ExcelFile('path_to_file.xls') as xls:
- data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
- na_values=['NA'])
- data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=None,
- na_values=['NA'])
+ with pd.ExcelFile("path_to_file.xls") as xls:
+ data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
+ data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"])
# equivalent using the read_excel function
- data = pd.read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'],
- index_col=None, na_values=['NA'])
+ data = pd.read_excel(
+ "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"]
+ )
``ExcelFile`` can also be called with an ``xlrd.book.Book`` object
as a parameter. This allows the user to control how the excel file is read.
@@ -2846,10 +2881,11 @@ with ``on_demand=True``.
.. code-block:: python
import xlrd
- xlrd_book = xlrd.open_workbook('path_to_file.xls', on_demand=True)
+
+ xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
- df1 = pd.read_excel(xls, 'Sheet1')
- df2 = pd.read_excel(xls, 'Sheet2')
+ df1 = pd.read_excel(xls, "Sheet1")
+ df2 = pd.read_excel(xls, "Sheet2")
.. _io.excel.specifying_sheets:
@@ -2871,35 +2907,35 @@ Specifying sheets
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
+ pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"])
Using the sheet index:
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
+ pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"])
Using all default values:
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.xls')
+ pd.read_excel("path_to_file.xls")
Using None to get all sheets:
.. code-block:: python
# Returns a dictionary of DataFrames
- pd.read_excel('path_to_file.xls', sheet_name=None)
+ pd.read_excel("path_to_file.xls", sheet_name=None)
Using a list to get multiple sheets:
.. code-block:: python
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
- pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 3])
+ pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3])
``read_excel`` can read more than one sheet, by setting ``sheet_name`` to either
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.
@@ -2920,10 +2956,12 @@ For example, to read in a ``MultiIndex`` index without names:
.. ipython:: python
- df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
- index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
- df.to_excel('path_to_file.xlsx')
- df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])
+ df = pd.DataFrame(
+ {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
+ index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]),
+ )
+ df.to_excel("path_to_file.xlsx")
+ df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])
df
If the index has level names, they will be parsed as well, using the same
@@ -2931,9 +2969,9 @@ parameters.
.. ipython:: python
- df.index = df.index.set_names(['lvl1', 'lvl2'])
- df.to_excel('path_to_file.xlsx')
- df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])
+ df.index = df.index.set_names(["lvl1", "lvl2"])
+ df.to_excel("path_to_file.xlsx")
+ df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])
df
@@ -2942,16 +2980,15 @@ should be passed to ``index_col`` and ``header``:
.. ipython:: python
- df.columns = pd.MultiIndex.from_product([['a'], ['b', 'd']],
- names=['c1', 'c2'])
- df.to_excel('path_to_file.xlsx')
- df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])
+ df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"])
+ df.to_excel("path_to_file.xlsx")
+ df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1], header=[0, 1])
df
.. ipython:: python
:suppress:
- os.remove('path_to_file.xlsx')
+ os.remove("path_to_file.xlsx")
Parsing specific columns
@@ -2961,30 +2998,23 @@ It is often the case that users will insert columns to do temporary computations
in Excel and you may not want to read in those columns. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.
-.. deprecated:: 0.24.0
+.. versionchanged:: 1.0.0
-Passing in an integer for ``usecols`` has been deprecated. Please pass in a list
+Passing in an integer for ``usecols`` will no longer work. Please pass in a list
of ints from 0 to ``usecols`` inclusive instead.
-If ``usecols`` is an integer, then it is assumed to indicate the last column
-to be parsed.
+You can specify a comma-delimited set of Excel columns and ranges as a string:
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', usecols=2)
-
-You can also specify a comma-delimited set of Excel columns and ranges as a string:
-
-.. code-block:: python
-
- pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
+ pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E")
If ``usecols`` is a list of integers, then it is assumed to be the file column
indices to be parsed.
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
+ pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3])
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
@@ -2996,7 +3026,7 @@ document header row(s). Those strings define which columns will be parsed:
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+ pd.read_excel("path_to_file.xls", "Sheet1", usecols=["foo", "bar"])
Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
@@ -3007,7 +3037,7 @@ the column names, returning names where the callable function evaluates to ``Tru
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+ pd.read_excel("path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha())
Parsing dates
+++++++++++++
@@ -3019,7 +3049,7 @@ use the ``parse_dates`` keyword to parse those strings to datetimes:
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])
+ pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"])
Cell converters
@@ -3030,7 +3060,7 @@ option. For instance, to convert a column to boolean:
.. code-block:: python
- pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
+ pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool})
This option handles missing values and treats exceptions in the converters
as missing data. Transformations are applied cell by cell rather than to the
@@ -3045,19 +3075,19 @@ missing data to recover integer dtype:
return int(x) if x else -1
- pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
+ pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun})
Dtype specifications
++++++++++++++++++++
As an alternative to converters, the type for an entire column can
-be specified using the `dtype` keyword, which takes a dictionary
+be specified using the ``dtype`` keyword, which takes a dictionary
mapping column names to types. To interpret data with
no type inference, use the type ``str`` or ``object``.
.. code-block:: python
- pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})
+ pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str})
.. _io.excel_writer:
@@ -3075,7 +3105,7 @@ written. For example:
.. code-block:: python
- df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
+ df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
Files with a ``.xls`` extension will be written using ``xlwt`` and those with a
``.xlsx`` extension will be written using ``xlsxwriter`` (if available) or
@@ -3088,16 +3118,16 @@ row instead of the first. You can place it in the first row by setting the
.. code-block:: python
- df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)
+ df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False)
In order to write separate ``DataFrames`` to separate sheets in a single Excel file,
one can pass an :class:`~pandas.io.excel.ExcelWriter`.
.. code-block:: python
- with pd.ExcelWriter('path_to_file.xlsx') as writer:
- df1.to_excel(writer, sheet_name='Sheet1')
- df2.to_excel(writer, sheet_name='Sheet2')
+ with pd.ExcelWriter("path_to_file.xlsx") as writer:
+ df1.to_excel(writer, sheet_name="Sheet1")
+ df2.to_excel(writer, sheet_name="Sheet2")
.. note::
@@ -3113,7 +3143,7 @@ one can pass an :class:`~pandas.io.excel.ExcelWriter`.
Writing Excel files to memory
+++++++++++++++++++++++++++++
-Pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or
+pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or
``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`.
.. code-block:: python
@@ -3123,8 +3153,8 @@ Pandas supports writing Excel files to buffer-like objects such as ``StringIO``
bio = BytesIO()
# By setting the 'engine' in the ExcelWriter constructor.
- writer = pd.ExcelWriter(bio, engine='xlsxwriter')
- df.to_excel(writer, sheet_name='Sheet1')
+ writer = pd.ExcelWriter(bio, engine="xlsxwriter")
+ df.to_excel(writer, sheet_name="Sheet1")
# Save the workbook
writer.save()
@@ -3147,7 +3177,7 @@ Pandas supports writing Excel files to buffer-like objects such as ``StringIO``
Excel writer engines
''''''''''''''''''''
-Pandas chooses an Excel writer via two methods:
+pandas chooses an Excel writer via two methods:
1. the ``engine`` keyword argument
2. the filename extension (via the default specified in config options)
@@ -3173,16 +3203,17 @@ argument to ``to_excel`` and to ``ExcelWriter``. The built-in engines are:
.. code-block:: python
# By setting the 'engine' in the DataFrame 'to_excel()' methods.
- df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', engine='xlsxwriter')
+ df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter")
# By setting the 'engine' in the ExcelWriter constructor.
- writer = pd.ExcelWriter('path_to_file.xlsx', engine='xlsxwriter')
+ writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter")
# Or via pandas configuration.
from pandas import options # noqa: E402
- options.io.excel.xlsx.writer = 'xlsxwriter'
- df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
+ options.io.excel.xlsx.writer = "xlsxwriter"
+
+ df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
.. _io.excel.style:
@@ -3213,7 +3244,7 @@ OpenDocument spreadsheets match what can be done for `Excel files`_ using
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.ods', engine='odf')
+ pd.read_excel("path_to_file.ods", engine="odf")
.. note::
@@ -3236,7 +3267,7 @@ in files and will return floats instead.
.. code-block:: python
# Returns a DataFrame
- pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
+ pd.read_excel("path_to_file.xlsb", engine="pyxlsb")
.. note::
@@ -3279,10 +3310,10 @@ applications (CTRL-V on many operating systems). Here we illustrate writing a
.. code-block:: python
- >>> df = pd.DataFrame({'A': [1, 2, 3],
- ... 'B': [4, 5, 6],
- ... 'C': ['p', 'q', 'r']},
- ... index=['x', 'y', 'z'])
+ >>> df = pd.DataFrame(
+ ... {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"]
+ ... )
+
>>> df
A B C
x 1 4 p
@@ -3312,7 +3343,7 @@ All pandas objects are equipped with ``to_pickle`` methods which use Python's
.. ipython:: python
df
- df.to_pickle('foo.pkl')
+ df.to_pickle("foo.pkl")
The ``read_pickle`` function in the ``pandas`` namespace can be used to load
any pickled pandas object (or any other pickled object) from file:
@@ -3320,12 +3351,12 @@ any pickled pandas object (or any other pickled object) from file:
.. ipython:: python
- pd.read_pickle('foo.pkl')
+ pd.read_pickle("foo.pkl")
.. ipython:: python
:suppress:
- os.remove('foo.pkl')
+ os.remove("foo.pkl")
.. warning::
@@ -3359,10 +3390,13 @@ the underlying compression library.
.. ipython:: python
- df = pd.DataFrame({
- 'A': np.random.randn(1000),
- 'B': 'foo',
- 'C': pd.date_range('20130101', periods=1000, freq='s')})
+ df = pd.DataFrame(
+ {
+ "A": np.random.randn(1000),
+ "B": "foo",
+ "C": pd.date_range("20130101", periods=1000, freq="s"),
+ }
+ )
df
Using an explicit compression type:
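
A minimal sketch, with an illustrative file name:

.. code-block:: python

    df.to_pickle("data.pkl.gz", compression="gzip")
    rt = pd.read_pickle("data.pkl.gz", compression="gzip")
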
@@ -3397,10 +3431,7 @@ Passing options to the compression protocol in order to speed up compression:
.. ipython:: python
- df.to_pickle(
- "data.pkl.gz",
- compression={"method": "gzip", 'compresslevel': 1}
- )
+ df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1})
.. ipython:: python
:suppress:
@@ -3421,11 +3452,13 @@ Example pyarrow usage:
.. code-block:: python
- >>> import pandas as pd
- >>> import pyarrow as pa
- >>> df = pd.DataFrame({'A': [1, 2, 3]})
- >>> context = pa.default_serialization_context()
- >>> df_bytestring = context.serialize(df).to_buffer().to_pybytes()
+ import pandas as pd
+ import pyarrow as pa
+
+ df = pd.DataFrame({"A": [1, 2, 3]})
+
+ context = pa.default_serialization_context()
+ df_bytestring = context.serialize(df).to_buffer().to_pybytes()
For documentation on pyarrow, see `here `__.
@@ -3441,20 +3474,21 @@ for some advanced strategies
.. warning::
- pandas requires ``PyTables`` >= 3.0.0.
- There is a indexing bug in ``PyTables`` < 3.2 which may appear when querying stores using an index.
- If you see a subset of results being returned, upgrade to ``PyTables`` >= 3.2.
- Stores created previously will need to be rewritten using the updated version.
+ pandas uses PyTables for reading and writing HDF5 files, which allows
+ serializing object-dtype data with pickle. Loading pickled data received from
+ untrusted sources can be unsafe.
+
+ See: https://docs.python.org/3/library/pickle.html for more.
.. ipython:: python
:suppress:
:okexcept:
- os.remove('store.h5')
+ os.remove("store.h5")
.. ipython:: python
- store = pd.HDFStore('store.h5')
+ store = pd.HDFStore("store.h5")
print(store)
Objects can be written to the file just like adding key-value pairs to a
@@ -3462,15 +3496,14 @@ dict:
.. ipython:: python
- index = pd.date_range('1/1/2000', periods=8)
- s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
- df = pd.DataFrame(np.random.randn(8, 3), index=index,
- columns=['A', 'B', 'C'])
+ index = pd.date_range("1/1/2000", periods=8)
+ s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
+ df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
# store.put('s', s) is an equivalent method
- store['s'] = s
+ store["s"] = s
- store['df'] = df
+ store["df"] = df
store
@@ -3479,7 +3512,7 @@ In a current or later Python session, you can retrieve stored objects:
.. ipython:: python
# store.get('df') is an equivalent method
- store['df']
+ store["df"]
# dotted (attribute) access provides get as well
store.df
@@ -3489,7 +3522,7 @@ Deletion of the object specified by the key:
.. ipython:: python
# store.remove('df') is an equivalent method
- del store['df']
+ del store["df"]
store
@@ -3502,14 +3535,14 @@ Closing a Store and using a context manager:
store.is_open
# Working with, and automatically closing the store using a context manager
- with pd.HDFStore('store.h5') as store:
+ with pd.HDFStore("store.h5") as store:
store.keys()
.. ipython:: python
:suppress:
store.close()
- os.remove('store.h5')
+ os.remove("store.h5")
@@ -3521,15 +3554,15 @@ similar to how ``read_csv`` and ``to_csv`` work.
.. ipython:: python
- df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
- df_tl.to_hdf('store_tl.h5', 'table', append=True)
- pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
+ df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})
+ df_tl.to_hdf("store_tl.h5", "table", append=True)
+ pd.read_hdf("store_tl.h5", "table", where=["index>2"])
.. ipython:: python
:suppress:
:okexcept:
- os.remove('store_tl.h5')
+ os.remove("store_tl.h5")
HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting ``dropna=True``.
@@ -3537,24 +3570,23 @@ HDFStore will by default not drop rows that are all missing. This behavior can b
.. ipython:: python
- df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
- 'col2': [1, np.nan, np.nan]})
+ df_with_missing = pd.DataFrame({"col1": [0, np.nan, 2], "col2": [1, np.nan, np.nan]})
df_with_missing
- df_with_missing.to_hdf('file.h5', 'df_with_missing',
- format='table', mode='w')
+ df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")
- pd.read_hdf('file.h5', 'df_with_missing')
+ pd.read_hdf("file.h5", "df_with_missing")
- df_with_missing.to_hdf('file.h5', 'df_with_missing',
- format='table', mode='w', dropna=True)
- pd.read_hdf('file.h5', 'df_with_missing')
+ df_with_missing.to_hdf(
+ "file.h5", "df_with_missing", format="table", mode="w", dropna=True
+ )
+ pd.read_hdf("file.h5", "df_with_missing")
.. ipython:: python
:suppress:
- os.remove('file.h5')
+ os.remove("file.h5")
.. _io.hdf5-fixed:
@@ -3575,8 +3607,8 @@ This format is specified by default when using ``put`` or ``to_hdf`` or by ``for
.. code-block:: python
- >>> pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
- >>> pd.read_hdf('test_fixed.h5', 'df', where='index>5')
+ >>> pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", "df")
+ >>> pd.read_hdf("test_fixed.h5", "df", where="index>5")
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety
@@ -3600,21 +3632,21 @@ enable ``put/append/to_hdf`` to by default store in the ``table`` format.
:suppress:
:okexcept:
- os.remove('store.h5')
+ os.remove("store.h5")
.. ipython:: python
- store = pd.HDFStore('store.h5')
+ store = pd.HDFStore("store.h5")
df1 = df[0:4]
df2 = df[4:]
# append data (creates a table automatically)
- store.append('df', df1)
- store.append('df', df2)
+ store.append("df", df1)
+ store.append("df", df2)
store
# select the entire object
- store.select('df')
+ store.select("df")
# the type of stored data
store.root.df._v_attrs.pandas_type
@@ -3637,16 +3669,16 @@ everything in the sub-store and **below**, so be *careful*.
.. ipython:: python
- store.put('foo/bar/bah', df)
- store.append('food/orange', df)
- store.append('food/apple', df)
+ store.put("foo/bar/bah", df)
+ store.append("food/orange", df)
+ store.append("food/apple", df)
store
 # a list of keys is returned
store.keys()
# remove all nodes under this level
- store.remove('food')
+ store.remove("food")
store
@@ -3660,10 +3692,10 @@ will yield a tuple for each group key along with the relative keys of its conten
for (path, subgroups, subkeys) in store.walk():
for subgroup in subgroups:
- print('GROUP: {}/{}'.format(path, subgroup))
+ print("GROUP: {}/{}".format(path, subgroup))
for subkey in subkeys:
- key = '/'.join([path, subkey])
- print('KEY: {}'.format(key))
+ key = "/".join([path, subkey])
+ print("KEY: {}".format(key))
print(store.get(key))
@@ -3687,7 +3719,7 @@ will yield a tuple for each group key along with the relative keys of its conten
.. ipython:: python
- store['foo/bar/bah']
+ store["foo/bar/bah"]
.. _io.hdf5-types:
@@ -3706,24 +3738,27 @@ Passing ``min_itemsize={`values`: size}`` as a parameter to append
will set a larger minimum for the string columns. Storing ``floats,
strings, ints, bools, datetime64`` is currently supported. For string
columns, passing ``nan_rep = 'nan'`` to append will change the default
-nan representation on disk (which converts to/from `np.nan`), this
-defaults to `nan`.
-
-.. ipython:: python
-
- df_mixed = pd.DataFrame({'A': np.random.randn(8),
- 'B': np.random.randn(8),
- 'C': np.array(np.random.randn(8), dtype='float32'),
- 'string': 'string',
- 'int': 1,
- 'bool': True,
- 'datetime64': pd.Timestamp('20010102')},
- index=list(range(8)))
- df_mixed.loc[df_mixed.index[3:5],
- ['A', 'B', 'string', 'datetime64']] = np.nan
-
- store.append('df_mixed', df_mixed, min_itemsize={'values': 50})
- df_mixed1 = store.select('df_mixed')
+nan representation on disk (which converts to/from ``np.nan``), this
+defaults to ``nan``.
+
+.. ipython:: python
+
+ df_mixed = pd.DataFrame(
+ {
+ "A": np.random.randn(8),
+ "B": np.random.randn(8),
+ "C": np.array(np.random.randn(8), dtype="float32"),
+ "string": "string",
+ "int": 1,
+ "bool": True,
+ "datetime64": pd.Timestamp("20010102"),
+ },
+ index=list(range(8)),
+ )
+ df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan
+
+ store.append("df_mixed", df_mixed, min_itemsize={"values": 50})
+ df_mixed1 = store.select("df_mixed")
df_mixed1
df_mixed1.dtypes.value_counts()
@@ -3738,20 +3773,19 @@ storing/selecting from homogeneous index ``DataFrames``.
.. ipython:: python
- index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
- ['one', 'two', 'three']],
- codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
- [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
- names=['foo', 'bar'])
- df_mi = pd.DataFrame(np.random.randn(10, 3), index=index,
- columns=['A', 'B', 'C'])
+ index = pd.MultiIndex(
+ levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
+ codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
+ names=["foo", "bar"],
+ )
+ df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
df_mi
- store.append('df_mi', df_mi)
- store.select('df_mi')
+ store.append("df_mi", df_mi)
+ store.select("df_mi")
# the levels are automatically included as data columns
- store.select('df_mi', 'foo=bar')
+ store.select("df_mi", "foo=bar")
.. note::
   The ``index`` keyword is reserved and cannot be used as a level name.
@@ -3828,7 +3862,7 @@ The right-hand side of the sub-expression (after a comparison operator) can be:
.. code-block:: python
string = "HolyMoly'"
- store.select('df', 'index == string')
+ store.select("df", "index == string")
instead of this
@@ -3845,7 +3879,7 @@ The right-hand side of the sub-expression (after a comparison operator) can be:
.. code-block:: python
- store.select('df', 'index == %r' % string)
+ store.select("df", "index == %r" % string)
which will quote ``string``.
@@ -3854,21 +3888,24 @@ Here are some examples:
.. ipython:: python
- dfq = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'),
- index=pd.date_range('20130101', periods=10))
- store.append('dfq', dfq, format='table', data_columns=True)
+ dfq = pd.DataFrame(
+ np.random.randn(10, 4),
+ columns=list("ABCD"),
+ index=pd.date_range("20130101", periods=10),
+ )
+ store.append("dfq", dfq, format="table", data_columns=True)
Use boolean expressions, with in-line function evaluation.
.. ipython:: python
- store.select('dfq', "index>pd.Timestamp('20130104') & columns=['A', 'B']")
+ store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
Use inline column reference.
.. ipython:: python
- store.select('dfq', where="A>0 or C>0")
+ store.select("dfq", where="A>0 or C>0")
The ``columns`` keyword can be supplied to select a list of columns to be
returned, this is equivalent to passing a
@@ -3876,7 +3913,7 @@ returned, this is equivalent to passing a
.. ipython:: python
- store.select('df', "columns=['A', 'B']")
+ store.select("df", "columns=['A', 'B']")
``start`` and ``stop`` parameters can be specified to limit the total search
space. These are in terms of the total number of rows in a table.
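
For example, a sketch that returns only the first three rows of the ``df``
table stored above:

.. code-block:: python

    store.select("df", "columns=['A', 'B']", start=0, stop=3)
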
@@ -3902,14 +3939,19 @@ specified in the format: ``()``, where float may be signed (and fra
.. ipython:: python
from datetime import timedelta
- dftd = pd.DataFrame({'A': pd.Timestamp('20130101'),
- 'B': [pd.Timestamp('20130101') + timedelta(days=i,
- seconds=10)
- for i in range(10)]})
- dftd['C'] = dftd['A'] - dftd['B']
+
+ dftd = pd.DataFrame(
+ {
+ "A": pd.Timestamp("20130101"),
+ "B": [
+ pd.Timestamp("20130101") + timedelta(days=i, seconds=10) for i in range(10)
+ ],
+ }
+ )
+ dftd["C"] = dftd["A"] - dftd["B"]
dftd
- store.append('dftd', dftd, data_columns=True)
- store.select('dftd', "C<'-3.5D'")
+ store.append("dftd", dftd, data_columns=True)
+ store.select("dftd", "C<'-3.5D'")
.. _io.query_multi:
@@ -3921,7 +3963,7 @@ Selecting from a ``MultiIndex`` can be achieved by using the name of the level.
.. ipython:: python
df_mi.index.names
- store.select('df_mi', "foo=baz and bar=two")
+ store.select("df_mi", "foo=baz and bar=two")
If the ``MultiIndex`` level names are ``None``, the levels are automatically made available via
the ``level_n`` keyword with ``n`` the level of the ``MultiIndex`` you want to select from.
@@ -3932,8 +3974,7 @@ the ``level_n`` keyword with ``n`` the level of the ``MultiIndex`` you want to s
levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
)
- df_mi_2 = pd.DataFrame(np.random.randn(10, 3),
- index=index, columns=["A", "B", "C"])
+ df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
df_mi_2
store.append("df_mi_2", df_mi_2)
@@ -3964,7 +4005,7 @@ indexed dimension as the ``where``.
i.optlevel, i.kind
# change an index by passing new parameters
- store.create_table_index('df', optlevel=9, kind='full')
+ store.create_table_index("df", optlevel=9, kind="full")
i = store.root.df.table.cols.index.index
i.optlevel, i.kind
@@ -3972,20 +4013,20 @@ Oftentimes when appending large amounts of data to a store, it is useful to turn
.. ipython:: python
- df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
- df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
+ df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
+ df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
- st = pd.HDFStore('appends.h5', mode='w')
- st.append('df', df_1, data_columns=['B'], index=False)
- st.append('df', df_2, data_columns=['B'], index=False)
- st.get_storer('df').table
+ st = pd.HDFStore("appends.h5", mode="w")
+ st.append("df", df_1, data_columns=["B"], index=False)
+ st.append("df", df_2, data_columns=["B"], index=False)
+ st.get_storer("df").table
Then create the index when finished appending.
.. ipython:: python
- st.create_table_index('df', columns=['B'], optlevel=9, kind='full')
- st.get_storer('df').table
+ st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
+ st.get_storer("df").table
st.close()
@@ -3993,7 +4034,7 @@ Then create the index when finished appending.
:suppress:
:okexcept:
- os.remove('appends.h5')
+ os.remove("appends.h5")
See `here `__ for how to create a completely-sorted-index (CSI) on an existing store.
@@ -4003,7 +4044,7 @@ Query via data columns
++++++++++++++++++++++
You can designate (and index) certain columns that you want to be able
-to perform queries (other than the `indexable` columns, which you can
+to perform queries on (other than the ``indexable`` columns, which you can
always query). For instance say you want to perform this common
operation, on-disk, and return just the frame that matches this
query. You can specify ``data_columns = True`` to force all columns to
@@ -4012,29 +4053,29 @@ be ``data_columns``.
.. ipython:: python
df_dc = df.copy()
- df_dc['string'] = 'foo'
- df_dc.loc[df_dc.index[4:6], 'string'] = np.nan
- df_dc.loc[df_dc.index[7:9], 'string'] = 'bar'
- df_dc['string2'] = 'cool'
- df_dc.loc[df_dc.index[1:3], ['B', 'C']] = 1.0
+ df_dc["string"] = "foo"
+ df_dc.loc[df_dc.index[4:6], "string"] = np.nan
+ df_dc.loc[df_dc.index[7:9], "string"] = "bar"
+ df_dc["string2"] = "cool"
+ df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0
df_dc
# on-disk operations
- store.append('df_dc', df_dc, data_columns=['B', 'C', 'string', 'string2'])
- store.select('df_dc', where='B > 0')
+ store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
+ store.select("df_dc", where="B > 0")
# getting creative
- store.select('df_dc', 'B > 0 & C > 0 & string == foo')
+ store.select("df_dc", "B > 0 & C > 0 & string == foo")
# this is in-memory version of this type of selection
- df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == 'foo')]
+ df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
store.root.df_dc.table
There is some performance degradation by making lots of columns into
-`data columns`, so it is up to the user to designate these. In addition,
+``data columns``, so it is up to the user to designate these. In addition,
you cannot change data columns (nor indexables) after the first
append/put operation (Of course you can simply read in the data and
create a new table!).
@@ -4048,7 +4089,7 @@ The default is 50,000 rows returned in a chunk.
.. ipython:: python
- for df in store.select('df', chunksize=3):
+ for df in store.select("df", chunksize=3):
print(df)
.. note::
@@ -4058,7 +4099,7 @@ The default is 50,000 rows returned in a chunk.
.. code-block:: python
- for df in pd.read_hdf('store.h5', 'df', chunksize=3):
+ for df in pd.read_hdf("store.h5", "df", chunksize=3):
print(df)
Note that the chunksize keyword applies to the **source** rows. So if you
@@ -4070,18 +4111,20 @@ chunks.
.. ipython:: python
- dfeq = pd.DataFrame({'number': np.arange(1, 11)})
+ dfeq = pd.DataFrame({"number": np.arange(1, 11)})
dfeq
- store.append('dfeq', dfeq, data_columns=['number'])
+ store.append("dfeq", dfeq, data_columns=["number"])
+
def chunks(l, n):
- return [l[i:i + n] for i in range(0, len(l), n)]
+ return [l[i: i + n] for i in range(0, len(l), n)]
+
evens = [2, 4, 6, 8, 10]
- coordinates = store.select_as_coordinates('dfeq', 'number=evens')
+ coordinates = store.select_as_coordinates("dfeq", "number=evens")
for c in chunks(coordinates, 2):
- print(store.select('dfeq', where=c))
+ print(store.select("dfeq", where=c))
Advanced queries
++++++++++++++++
@@ -4096,8 +4139,8 @@ These do not currently accept the ``where`` selector.
.. ipython:: python
- store.select_column('df_dc', 'index')
- store.select_column('df_dc', 'string')
+ store.select_column("df_dc", "index")
+ store.select_column("df_dc", "string")
.. _io.hdf5-selecting_coordinates:
@@ -4110,12 +4153,13 @@ Sometimes you want to get the coordinates (a.k.a the index locations) of your qu
.. ipython:: python
- df_coord = pd.DataFrame(np.random.randn(1000, 2),
- index=pd.date_range('20000101', periods=1000))
- store.append('df_coord', df_coord)
- c = store.select_as_coordinates('df_coord', 'index > 20020101')
+ df_coord = pd.DataFrame(
+ np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
+ )
+ store.append("df_coord", df_coord)
+ c = store.select_as_coordinates("df_coord", "index > 20020101")
c
- store.select('df_coord', where=c)
+ store.select("df_coord", where=c)
.. _io.hdf5-where_mask:
@@ -4128,12 +4172,13 @@ a datetimeindex which are 5.
.. ipython:: python
- df_mask = pd.DataFrame(np.random.randn(1000, 2),
- index=pd.date_range('20000101', periods=1000))
- store.append('df_mask', df_mask)
- c = store.select_column('df_mask', 'index')
+ df_mask = pd.DataFrame(
+ np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
+ )
+ store.append("df_mask", df_mask)
+ c = store.select_column("df_mask", "index")
where = c[pd.DatetimeIndex(c).month == 5].index
- store.select('df_mask', where=where)
+ store.select("df_mask", where=where)
Storer object
^^^^^^^^^^^^^
@@ -4144,7 +4189,7 @@ of rows in an object.
.. ipython:: python
- store.get_storer('df_dc').nrows
+ store.get_storer("df_dc").nrows
Multiple table queries
@@ -4161,7 +4206,7 @@ having a very wide table, but enables more efficient queries.
The ``append_to_multiple`` method splits a given single DataFrame
into multiple tables according to ``d``, a dictionary that maps the
-table names to a list of 'columns' you want in that table. If `None`
+table names to a list of 'columns' you want in that table. If ``None``
is used in place of a list, that table will have the remaining
unspecified columns of the given DataFrame. The argument ``selector``
defines which table is the selector table (which you can make queries from).
@@ -4177,24 +4222,26 @@ results.
.. ipython:: python
- df_mt = pd.DataFrame(np.random.randn(8, 6),
- index=pd.date_range('1/1/2000', periods=8),
- columns=['A', 'B', 'C', 'D', 'E', 'F'])
- df_mt['foo'] = 'bar'
- df_mt.loc[df_mt.index[1], ('A', 'B')] = np.nan
+ df_mt = pd.DataFrame(
+ np.random.randn(8, 6),
+ index=pd.date_range("1/1/2000", periods=8),
+ columns=["A", "B", "C", "D", "E", "F"],
+ )
+ df_mt["foo"] = "bar"
+ df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan
# you can also create the tables individually
- store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
- df_mt, selector='df1_mt')
+ store.append_to_multiple(
+ {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
+ )
store
# individual tables were created
- store.select('df1_mt')
- store.select('df2_mt')
+ store.select("df1_mt")
+ store.select("df2_mt")
# as a multiple
- store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
- selector='df1_mt')
+ store.select_as_multiple(["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt")
Delete from a table
@@ -4303,14 +4350,15 @@ Enable compression for all objects within the file:
.. code-block:: python
- store_compressed = pd.HDFStore('store_compressed.h5', complevel=9,
- complib='blosc:blosclz')
+ store_compressed = pd.HDFStore(
+ "store_compressed.h5", complevel=9, complib="blosc:blosclz"
+ )
Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
.. code-block:: python
- store.append('df', df, complib='zlib', complevel=5)
+ store.append("df", df, complib="zlib", complevel=5)
.. _io.hdf5-ptrepack:
@@ -4399,13 +4447,14 @@ stored in a more efficient manner.
.. ipython:: python
- dfcat = pd.DataFrame({'A': pd.Series(list('aabbcdba')).astype('category'),
- 'B': np.random.randn(8)})
+ dfcat = pd.DataFrame(
+ {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
+ )
dfcat
dfcat.dtypes
- cstore = pd.HDFStore('cats.h5', mode='w')
- cstore.append('dfcat', dfcat, format='table', data_columns=['A'])
- result = cstore.select('dfcat', where="A in ['b', 'c']")
+ cstore = pd.HDFStore("cats.h5", mode="w")
+ cstore.append("dfcat", dfcat, format="table", data_columns=["A"])
+ result = cstore.select("dfcat", where="A in ['b', 'c']")
result
result.dtypes
@@ -4414,7 +4463,7 @@ stored in a more efficient manner.
:okexcept:
cstore.close()
- os.remove('cats.h5')
+ os.remove("cats.h5")
String columns
@@ -4441,17 +4490,17 @@ Passing a ``min_itemsize`` dict will cause all passed columns to be created as *
.. ipython:: python
- dfs = pd.DataFrame({'A': 'foo', 'B': 'bar'}, index=list(range(5)))
+ dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))
dfs
# A and B have a size of 30
- store.append('dfs', dfs, min_itemsize=30)
- store.get_storer('dfs').table
+ store.append("dfs", dfs, min_itemsize=30)
+ store.get_storer("dfs").table
# A is created as a data_column with a size of 30
    # B's size is calculated
- store.append('dfs2', dfs, min_itemsize={'A': 30})
- store.get_storer('dfs2').table
+ store.append("dfs2", dfs, min_itemsize={"A": 30})
+ store.get_storer("dfs2").table
**nan_rep**
@@ -4460,15 +4509,15 @@ You could inadvertently turn an actual ``nan`` value into a missing value.
.. ipython:: python
- dfss = pd.DataFrame({'A': ['foo', 'bar', 'nan']})
+ dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})
dfss
- store.append('dfss', dfss)
- store.select('dfss')
+ store.append("dfss", dfss)
+ store.select("dfss")
# here you need to specify a different nan rep
- store.append('dfss2', dfss, nan_rep='_nan_')
- store.select('dfss2')
+ store.append("dfss2", dfss, nan_rep="_nan_")
+ store.select("dfss2")
.. _io.external_compatibility:
@@ -4487,21 +4536,25 @@ It is possible to write an ``HDFStore`` object that can easily be imported into
.. ipython:: python
- df_for_r = pd.DataFrame({"first": np.random.rand(100),
- "second": np.random.rand(100),
- "class": np.random.randint(0, 2, (100, ))},
- index=range(100))
+ df_for_r = pd.DataFrame(
+ {
+ "first": np.random.rand(100),
+ "second": np.random.rand(100),
+ "class": np.random.randint(0, 2, (100,)),
+ },
+ index=range(100),
+ )
df_for_r.head()
- store_export = pd.HDFStore('export.h5')
- store_export.append('df_for_r', df_for_r, data_columns=df_dc.columns)
+ store_export = pd.HDFStore("export.h5")
+ store_export.append("df_for_r", df_for_r, data_columns=df_dc.columns)
store_export
.. ipython:: python
:suppress:
store_export.close()
- os.remove('export.h5')
+ os.remove("export.h5")
In R this file can be read into a ``data.frame`` object using the ``rhdf5``
library. The following example function reads the corresponding column names
@@ -4588,7 +4641,7 @@ Performance
:suppress:
store.close()
- os.remove('store.h5')
+ os.remove("store.h5")
.. _io.feather:
@@ -4618,21 +4671,26 @@ See the `Full Documentation `__.
:suppress:
import warnings
+
# This can be removed once building with pyarrow >=0.15.0
warnings.filterwarnings("ignore", "The Sparse", FutureWarning)
.. ipython:: python
- df = pd.DataFrame({'a': list('abc'),
- 'b': list(range(1, 4)),
- 'c': np.arange(3, 6).astype('u1'),
- 'd': np.arange(4.0, 7.0, dtype='float64'),
- 'e': [True, False, True],
- 'f': pd.Categorical(list('abc')),
- 'g': pd.date_range('20130101', periods=3),
- 'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
- 'i': pd.date_range('20130101', periods=3, freq='ns')})
+ df = pd.DataFrame(
+ {
+ "a": list("abc"),
+ "b": list(range(1, 4)),
+ "c": np.arange(3, 6).astype("u1"),
+ "d": np.arange(4.0, 7.0, dtype="float64"),
+ "e": [True, False, True],
+ "f": pd.Categorical(list("abc")),
+ "g": pd.date_range("20130101", periods=3),
+ "h": pd.date_range("20130101", periods=3, tz="US/Eastern"),
+ "i": pd.date_range("20130101", periods=3, freq="ns"),
+ }
+ )
df
df.dtypes
@@ -4641,13 +4699,13 @@ Write to a feather file.
.. ipython:: python
- df.to_feather('example.feather')
+ df.to_feather("example.feather")
Read from a feather file.
.. ipython:: python
- result = pd.read_feather('example.feather')
+ result = pd.read_feather("example.feather")
result
# we preserve dtypes
@@ -4656,7 +4714,7 @@ Read from a feather file.
.. ipython:: python
:suppress:
- os.remove('example.feather')
+ os.remove("example.feather")
.. _io.parquet:
@@ -4676,7 +4734,7 @@ Several caveats.
* Duplicate column names and non-string columns names are not supported.
* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
- indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
+ indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can
force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
* Index level names, if specified, must be strings.
* In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
@@ -4701,15 +4759,19 @@ See the documentation for `pyarrow `__ an
.. ipython:: python
- df = pd.DataFrame({'a': list('abc'),
- 'b': list(range(1, 4)),
- 'c': np.arange(3, 6).astype('u1'),
- 'd': np.arange(4.0, 7.0, dtype='float64'),
- 'e': [True, False, True],
- 'f': pd.date_range('20130101', periods=3),
- 'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
- 'h': pd.Categorical(list('abc')),
- 'i': pd.Categorical(list('abc'), ordered=True)})
+ df = pd.DataFrame(
+ {
+ "a": list("abc"),
+ "b": list(range(1, 4)),
+ "c": np.arange(3, 6).astype("u1"),
+ "d": np.arange(4.0, 7.0, dtype="float64"),
+ "e": [True, False, True],
+ "f": pd.date_range("20130101", periods=3),
+ "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
+ "h": pd.Categorical(list("abc")),
+ "i": pd.Categorical(list("abc"), ordered=True),
+ }
+ )
df
df.dtypes
@@ -4719,15 +4781,15 @@ Write to a parquet file.
.. ipython:: python
:okwarning:
- df.to_parquet('example_pa.parquet', engine='pyarrow')
- df.to_parquet('example_fp.parquet', engine='fastparquet')
+ df.to_parquet("example_pa.parquet", engine="pyarrow")
+ df.to_parquet("example_fp.parquet", engine="fastparquet")
Read from a parquet file.
.. ipython:: python
- result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
- result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+ result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
+ result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
result.dtypes
@@ -4735,18 +4797,16 @@ Read only certain columns of a parquet file.
.. ipython:: python
- result = pd.read_parquet('example_fp.parquet',
- engine='fastparquet', columns=['a', 'b'])
- result = pd.read_parquet('example_pa.parquet',
- engine='pyarrow', columns=['a', 'b'])
+ result = pd.read_parquet("example_fp.parquet", engine="fastparquet", columns=["a", "b"])
+ result = pd.read_parquet("example_pa.parquet", engine="pyarrow", columns=["a", "b"])
result.dtypes
.. ipython:: python
:suppress:
- os.remove('example_pa.parquet')
- os.remove('example_fp.parquet')
+ os.remove("example_pa.parquet")
+ os.remove("example_fp.parquet")
Handling indexes
@@ -4757,8 +4817,8 @@ more columns in the output file. Thus, this code:
.. ipython:: python
- df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
- df.to_parquet('test.parquet', engine='pyarrow')
+ df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+ df.to_parquet("test.parquet", engine="pyarrow")
creates a parquet file with *three* columns if you use ``pyarrow`` for serialization:
``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the
@@ -4773,7 +4833,7 @@ If you want to omit a dataframe's indexes when writing, pass ``index=False`` to
.. ipython:: python
- df.to_parquet('test.parquet', index=False)
+ df.to_parquet("test.parquet", index=False)
This creates a parquet file with just the two expected columns, ``a`` and ``b``.
If your ``DataFrame`` has a custom index, you won't get it back when you load
@@ -4785,7 +4845,7 @@ underlying engine's default behavior.
.. ipython:: python
:suppress:
- os.remove('test.parquet')
+ os.remove("test.parquet")
Partitioning Parquet files
@@ -4797,12 +4857,11 @@ Parquet supports partitioning of data based on the values of one or more columns
.. ipython:: python
- df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
- df.to_parquet(path='test', engine='pyarrow',
- partition_cols=['a'], compression=None)
+ df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
+ df.to_parquet(path="test", engine="pyarrow", partition_cols=["a"], compression=None)
-The `path` specifies the parent directory to which data will be saved.
-The `partition_cols` are the column names by which the dataset will be partitioned.
+The ``path`` specifies the parent directory to which data will be saved.
+The ``partition_cols`` are the column names by which the dataset will be partitioned.
Columns are partitioned in the order they are given. The partition splits are
determined by the unique values in the partition columns.
The above example creates a partitioned dataset that may look like:
@@ -4821,8 +4880,9 @@ The above example creates a partitioned dataset that may look like:
:suppress:
from shutil import rmtree
+
try:
- rmtree('test')
+ rmtree("test")
except OSError:
pass
@@ -4834,7 +4894,7 @@ ORC
.. versionadded:: 1.0.0
Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization
-for data frames. It is designed to make reading data frames efficient. Pandas provides *only* a reader for the
+for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow `__ library.
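As a quick, illustrative sketch (the file name ``example.orc`` is hypothetical and
assumes such a file already exists):

.. code-block:: python

    # Requires pyarrow; returns a DataFrame. Column selection is optional.
    result = pd.read_orc("example.orc")
    subset = pd.read_orc("example.orc", columns=["a", "b"])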
.. _io.sql:
@@ -4890,15 +4950,16 @@ below and the SQLAlchemy `documentation /
# where is relative:
- engine = create_engine('sqlite:///foo.db')
+ engine = create_engine("sqlite:///foo.db")
# or absolute, starting with a slash:
- engine = create_engine('sqlite:////absolute/path/to/foo.db')
+ engine = create_engine("sqlite:////absolute/path/to/foo.db")
For more information see the examples in the SQLAlchemy `documentation `__
@@ -5215,21 +5280,25 @@ Use :func:`sqlalchemy.text` to specify query parameters in a backend-neutral way
.. ipython:: python
import sqlalchemy as sa
- pd.read_sql(sa.text('SELECT * FROM data where Col_1=:col1'),
- engine, params={'col1': 'X'})
+
+ pd.read_sql(
+ sa.text("SELECT * FROM data where Col_1=:col1"), engine, params={"col1": "X"}
+ )
If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions
.. ipython:: python
metadata = sa.MetaData()
- data_table = sa.Table('data', metadata,
- sa.Column('index', sa.Integer),
- sa.Column('Date', sa.DateTime),
- sa.Column('Col_1', sa.String),
- sa.Column('Col_2', sa.Float),
- sa.Column('Col_3', sa.Boolean),
- )
+ data_table = sa.Table(
+ "data",
+ metadata,
+ sa.Column("index", sa.Integer),
+ sa.Column("Date", sa.DateTime),
+ sa.Column("Col_1", sa.String),
+ sa.Column("Col_2", sa.Float),
+ sa.Column("Col_3", sa.Boolean),
+ )
    pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3.is_(True)), engine)
@@ -5238,8 +5307,9 @@ You can combine SQLAlchemy expressions with parameters passed to :func:`read_sql
.. ipython:: python
import datetime as dt
- expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam('date'))
- pd.read_sql(expr, engine, params={'date': dt.datetime(2010, 10, 18)})
+
+ expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam("date"))
+ pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)})
Sqlite fallback
@@ -5254,13 +5324,14 @@ You can create connections like so:
.. code-block:: python
import sqlite3
- con = sqlite3.connect(':memory:')
+
+ con = sqlite3.connect(":memory:")
And then issue the following queries:
.. code-block:: python
- data.to_sql('data', con)
+ data.to_sql("data", con)
pd.read_sql_query("SELECT * FROM data", con)
@@ -5297,8 +5368,8 @@ into a .dta file. The format version of this file is always 115 (Stata 12).
.. ipython:: python
- df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
- df.to_stata('stata.dta')
+ df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
+ df.to_stata("stata.dta")
*Stata* data files have limited data type support; only strings with
244 or fewer characters, ``int8``, ``int16``, ``int32``, ``float32``
@@ -5348,7 +5419,7 @@ be used to read the file incrementally.
.. ipython:: python
- pd.read_stata('stata.dta')
+ pd.read_stata("stata.dta")
Specifying a ``chunksize`` yields a
:class:`~pandas.io.stata.StataReader` instance that can be used to
@@ -5357,7 +5428,7 @@ object can be used as an iterator.
.. ipython:: python
- reader = pd.read_stata('stata.dta', chunksize=3)
+ reader = pd.read_stata("stata.dta", chunksize=3)
for df in reader:
print(df.shape)
@@ -5367,7 +5438,7 @@ For more fine-grained control, use ``iterator=True`` and specify
.. ipython:: python
- reader = pd.read_stata('stata.dta', iterator=True)
+ reader = pd.read_stata("stata.dta", iterator=True)
chunk1 = reader.read(5)
chunk2 = reader.read(5)
@@ -5399,7 +5470,7 @@ values will have ``object`` data type.
.. ipython:: python
:suppress:
- os.remove('stata.dta')
+ os.remove("stata.dta")
.. _io.stata-categorical:
@@ -5453,7 +5524,7 @@ SAS formats
-----------
The top-level function :func:`read_sas` can read (but not write) SAS
-`xport` (.XPT) and (since *v0.18.0*) `SAS7BDAT` (.sas7bdat) format files.
+XPORT (.xpt) and (since *v0.18.0*) SAS7BDAT (.sas7bdat) format files.
SAS files only contain two value types: ASCII text and floating point
values (usually 8 bytes but sometimes truncated). For xport files,
@@ -5471,7 +5542,7 @@ Read a SAS7BDAT file:
.. code-block:: python
- df = pd.read_sas('sas_data.sas7bdat')
+ df = pd.read_sas("sas_data.sas7bdat")
Obtain an iterator and read an XPORT file 100,000 lines at a time:
@@ -5480,7 +5551,8 @@ Obtain an iterator and read an XPORT file 100,000 lines at a time:
def do_something(chunk):
pass
- rdr = pd.read_sas('sas_xport.xpt', chunk=100000)
+
+    rdr = pd.read_sas("sas_xport.xpt", chunksize=100000)
for chunk in rdr:
do_something(chunk)
@@ -5501,7 +5573,7 @@ SPSS formats
.. versionadded:: 0.25.0
The top-level function :func:`read_spss` can read (but not write) SPSS
-`sav` (.sav) and `zsav` (.zsav) format files.
+SAV (.sav) and ZSAV (.zsav) format files.
SPSS files contain column names. By default the
whole file is read, categorical columns are converted into ``pd.Categorical``,
@@ -5514,17 +5586,16 @@ Read an SPSS file:
.. code-block:: python
- df = pd.read_spss('spss_data.sav')
+ df = pd.read_spss("spss_data.sav")
Extract a subset of columns contained in ``usecols`` from an SPSS file and
avoid converting categorical columns into ``pd.Categorical``:
.. code-block:: python
- df = pd.read_spss('spss_data.sav', usecols=['foo', 'bar'],
- convert_categoricals=False)
+ df = pd.read_spss("spss_data.sav", usecols=["foo", "bar"], convert_categoricals=False)
-More information about the `sav` and `zsav` file format is available here_.
+More information about the SAV and ZSAV file formats is available here_.
.. _here: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/savedatatypes.htm
@@ -5580,78 +5651,99 @@ Given the next test set:
import os
sz = 1000000
- df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
+ df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})
sz = 1000000
np.random.seed(42)
- df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
+ df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})
+
def test_sql_write(df):
- if os.path.exists('test.sql'):
- os.remove('test.sql')
- sql_db = sqlite3.connect('test.sql')
- df.to_sql(name='test_table', con=sql_db)
+ if os.path.exists("test.sql"):
+ os.remove("test.sql")
+ sql_db = sqlite3.connect("test.sql")
+ df.to_sql(name="test_table", con=sql_db)
sql_db.close()
+
def test_sql_read():
- sql_db = sqlite3.connect('test.sql')
+ sql_db = sqlite3.connect("test.sql")
pd.read_sql_query("select * from test_table", sql_db)
sql_db.close()
+
def test_hdf_fixed_write(df):
- df.to_hdf('test_fixed.hdf', 'test', mode='w')
+ df.to_hdf("test_fixed.hdf", "test", mode="w")
+
def test_hdf_fixed_read():
- pd.read_hdf('test_fixed.hdf', 'test')
+ pd.read_hdf("test_fixed.hdf", "test")
+
def test_hdf_fixed_write_compress(df):
- df.to_hdf('test_fixed_compress.hdf', 'test', mode='w', complib='blosc')
+ df.to_hdf("test_fixed_compress.hdf", "test", mode="w", complib="blosc")
+
def test_hdf_fixed_read_compress():
- pd.read_hdf('test_fixed_compress.hdf', 'test')
+ pd.read_hdf("test_fixed_compress.hdf", "test")
+
def test_hdf_table_write(df):
- df.to_hdf('test_table.hdf', 'test', mode='w', format='table')
+ df.to_hdf("test_table.hdf", "test", mode="w", format="table")
+
def test_hdf_table_read():
- pd.read_hdf('test_table.hdf', 'test')
+ pd.read_hdf("test_table.hdf", "test")
+
def test_hdf_table_write_compress(df):
- df.to_hdf('test_table_compress.hdf', 'test', mode='w',
- complib='blosc', format='table')
+ df.to_hdf(
+ "test_table_compress.hdf", "test", mode="w", complib="blosc", format="table"
+ )
+
def test_hdf_table_read_compress():
- pd.read_hdf('test_table_compress.hdf', 'test')
+ pd.read_hdf("test_table_compress.hdf", "test")
+
def test_csv_write(df):
- df.to_csv('test.csv', mode='w')
+ df.to_csv("test.csv", mode="w")
+
def test_csv_read():
- pd.read_csv('test.csv', index_col=0)
+ pd.read_csv("test.csv", index_col=0)
+
def test_feather_write(df):
- df.to_feather('test.feather')
+ df.to_feather("test.feather")
+
def test_feather_read():
- pd.read_feather('test.feather')
+ pd.read_feather("test.feather")
+
def test_pickle_write(df):
- df.to_pickle('test.pkl')
+ df.to_pickle("test.pkl")
+
def test_pickle_read():
- pd.read_pickle('test.pkl')
+ pd.read_pickle("test.pkl")
+
def test_pickle_write_compress(df):
- df.to_pickle('test.pkl.compress', compression='xz')
+ df.to_pickle("test.pkl.compress", compression="xz")
+
def test_pickle_read_compress():
- pd.read_pickle('test.pkl.compress', compression='xz')
+ pd.read_pickle("test.pkl.compress", compression="xz")
+
def test_parquet_write(df):
- df.to_parquet('test.parquet')
+ df.to_parquet("test.parquet")
+
def test_parquet_read():
- pd.read_parquet('test.parquet')
+ pd.read_parquet("test.parquet")
When writing, the top-three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``.
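As a rough sketch of how such timings can be reproduced (this is not part of the
original benchmark run, and absolute numbers depend heavily on hardware and
library versions):

.. code-block:: python

    import timeit

    # One full write or read per call; increase ``number`` for more stable figures.
    print(timeit.timeit("test_feather_write(df)", globals=globals(), number=1))
    print(timeit.timeit("test_feather_read()", globals=globals(), number=1))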
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index 56ff8c1fc7c9b..8dbfc261e6fa8 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -7,6 +7,7 @@
from matplotlib import pyplot as plt
import pandas.util._doctools as doctools
+
p = doctools.TablePlotter()
@@ -38,23 +39,35 @@ a simple example:
.. ipython:: python
- df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3'],
- 'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3']},
- index=[0, 1, 2, 3])
+ df1 = pd.DataFrame(
+ {
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ "C": ["C0", "C1", "C2", "C3"],
+ "D": ["D0", "D1", "D2", "D3"],
+ },
+ index=[0, 1, 2, 3],
+ )
- df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
- 'B': ['B4', 'B5', 'B6', 'B7'],
- 'C': ['C4', 'C5', 'C6', 'C7'],
- 'D': ['D4', 'D5', 'D6', 'D7']},
- index=[4, 5, 6, 7])
+ df2 = pd.DataFrame(
+ {
+ "A": ["A4", "A5", "A6", "A7"],
+ "B": ["B4", "B5", "B6", "B7"],
+ "C": ["C4", "C5", "C6", "C7"],
+ "D": ["D4", "D5", "D6", "D7"],
+ },
+ index=[4, 5, 6, 7],
+ )
- df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
- 'B': ['B8', 'B9', 'B10', 'B11'],
- 'C': ['C8', 'C9', 'C10', 'C11'],
- 'D': ['D8', 'D9', 'D10', 'D11']},
- index=[8, 9, 10, 11])
+ df3 = pd.DataFrame(
+ {
+ "A": ["A8", "A9", "A10", "A11"],
+ "B": ["B8", "B9", "B10", "B11"],
+ "C": ["C8", "C9", "C10", "C11"],
+ "D": ["D8", "D9", "D10", "D11"],
+ },
+ index=[8, 9, 10, 11],
+ )
frames = [df1, df2, df3]
result = pd.concat(frames)
@@ -77,7 +90,7 @@ some configurable handling of "what to do with the other axes":
levels=None, names=None, verify_integrity=False, copy=True)
* ``objs`` : a sequence or mapping of Series or DataFrame objects. If a
- dict is passed, the sorted keys will be used as the `keys` argument, unless
+ dict is passed, the sorted keys will be used as the ``keys`` argument, unless
it is passed, in which case the values will be selected (see below). Any None
objects will be dropped silently unless they are all None in which case a
ValueError will be raised.
@@ -109,7 +122,7 @@ with each of the pieces of the chopped up DataFrame. We can do this using the
.. ipython:: python
- result = pd.concat(frames, keys=['x', 'y', 'z'])
+ result = pd.concat(frames, keys=["x", "y", "z"])
.. ipython:: python
:suppress:
@@ -125,7 +138,7 @@ means that we can now select out each chunk by key:
.. ipython:: python
- result.loc['y']
+ result.loc["y"]
It's not a stretch to see how this can be very useful. More detail on this
functionality below.
@@ -158,10 +171,14 @@ behavior:
.. ipython:: python
- df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
- 'D': ['D2', 'D3', 'D6', 'D7'],
- 'F': ['F2', 'F3', 'F6', 'F7']},
- index=[2, 3, 6, 7])
+ df4 = pd.DataFrame(
+ {
+ "B": ["B2", "B3", "B6", "B7"],
+ "D": ["D2", "D3", "D6", "D7"],
+ "F": ["F2", "F3", "F6", "F7"],
+ },
+ index=[2, 3, 6, 7],
+ )
result = pd.concat([df1, df4], axis=1, sort=False)
@@ -175,8 +192,6 @@ behavior:
.. warning::
- .. versionchanged:: 0.23.0
-
The default behavior with ``join='outer'`` is to sort the other axis
(columns in this case). In a future version of pandas, the default will
be to not sort. We specified ``sort=False`` to opt in to the new
@@ -186,7 +201,7 @@ Here is the same thing with ``join='inner'``:
.. ipython:: python
- result = pd.concat([df1, df4], axis=1, join='inner')
+ result = pd.concat([df1, df4], axis=1, join="inner")
.. ipython:: python
:suppress:
@@ -318,7 +333,7 @@ the name of the ``Series``.
.. ipython:: python
- s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
+ s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")
result = pd.concat([df1, s1], axis=1)
.. ipython:: python
@@ -340,7 +355,7 @@ If unnamed ``Series`` are passed they will be numbered consecutively.
.. ipython:: python
- s2 = pd.Series(['_0', '_1', '_2', '_3'])
+ s2 = pd.Series(["_0", "_1", "_2", "_3"])
result = pd.concat([df1, s2, s2, s2], axis=1)
.. ipython:: python
@@ -375,7 +390,7 @@ inherit the parent ``Series``' name, when these existed.
.. ipython:: python
- s3 = pd.Series([0, 1, 2, 3], name='foo')
+ s3 = pd.Series([0, 1, 2, 3], name="foo")
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])
@@ -385,13 +400,13 @@ Through the ``keys`` argument we can override the existing column names.
.. ipython:: python
- pd.concat([s3, s4, s5], axis=1, keys=['red', 'blue', 'yellow'])
+ pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Let's consider a variation of the very first example presented:
.. ipython:: python
- result = pd.concat(frames, keys=['x', 'y', 'z'])
+ result = pd.concat(frames, keys=["x", "y", "z"])
.. ipython:: python
:suppress:
@@ -406,7 +421,7 @@ for the ``keys`` argument (unless other keys are specified):
.. ipython:: python
- pieces = {'x': df1, 'y': df2, 'z': df3}
+ pieces = {"x": df1, "y": df2, "z": df3}
result = pd.concat(pieces)
.. ipython:: python
@@ -419,7 +434,7 @@ for the ``keys`` argument (unless other keys are specified):
.. ipython:: python
- result = pd.concat(pieces, keys=['z', 'y'])
+ result = pd.concat(pieces, keys=["z", "y"])
.. ipython:: python
:suppress:
@@ -441,9 +456,9 @@ do so using the ``levels`` argument:
.. ipython:: python
- result = pd.concat(pieces, keys=['x', 'y', 'z'],
- levels=[['z', 'y', 'x', 'w']],
- names=['group_key'])
+ result = pd.concat(
+ pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
+ )
.. ipython:: python
:suppress:
@@ -471,7 +486,7 @@ append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
.. ipython:: python
- s2 = pd.Series(['X0', 'X1', 'X2', 'X3'], index=['A', 'B', 'C', 'D'])
+ s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
result = df1.append(s2, ignore_index=True)
.. ipython:: python
@@ -490,8 +505,7 @@ You can also pass a list of dicts or Series:
.. ipython:: python
- dicts = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
- {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
+ dicts = [{"A": 1, "B": 2, "C": 3, "X": 4}, {"A": 5, "B": 6, "C": 7, "Y": 8}]
result = df1.append(dicts, ignore_index=True, sort=False)
.. ipython:: python
@@ -621,14 +635,22 @@ key combination:
.. ipython:: python
- left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
- 'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3']})
+ left = pd.DataFrame(
+ {
+ "key": ["K0", "K1", "K2", "K3"],
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ }
+ )
- right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
- 'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3']})
- result = pd.merge(left, right, on='key')
+ right = pd.DataFrame(
+ {
+ "key": ["K0", "K1", "K2", "K3"],
+ "C": ["C0", "C1", "C2", "C3"],
+ "D": ["D0", "D1", "D2", "D3"],
+ }
+ )
+ result = pd.merge(left, right, on="key")
.. ipython:: python
:suppress:
@@ -644,17 +666,25 @@ appearing in ``left`` and ``right`` are present (the intersection), since
.. ipython:: python
- left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
- 'key2': ['K0', 'K1', 'K0', 'K1'],
- 'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3']})
+ left = pd.DataFrame(
+ {
+ "key1": ["K0", "K0", "K1", "K2"],
+ "key2": ["K0", "K1", "K0", "K1"],
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ }
+ )
- right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
- 'key2': ['K0', 'K0', 'K0', 'K0'],
- 'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3']})
+ right = pd.DataFrame(
+ {
+ "key1": ["K0", "K1", "K1", "K2"],
+ "key2": ["K0", "K0", "K0", "K0"],
+ "C": ["C0", "C1", "C2", "C3"],
+ "D": ["D0", "D1", "D2", "D3"],
+ }
+ )
- result = pd.merge(left, right, on=['key1', 'key2'])
+ result = pd.merge(left, right, on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -680,7 +710,7 @@ either the left or right tables, the values in the joined table will be
.. ipython:: python
- result = pd.merge(left, right, how='left', on=['key1', 'key2'])
+ result = pd.merge(left, right, how="left", on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -692,7 +722,7 @@ either the left or right tables, the values in the joined table will be
.. ipython:: python
- result = pd.merge(left, right, how='right', on=['key1', 'key2'])
+ result = pd.merge(left, right, how="right", on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -703,7 +733,7 @@ either the left or right tables, the values in the joined table will be
.. ipython:: python
- result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
+ result = pd.merge(left, right, how="outer", on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -715,7 +745,7 @@ either the left or right tables, the values in the joined table will be
.. ipython:: python
- result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
+ result = pd.merge(left, right, how="inner", on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -743,18 +773,18 @@ as shown in the following example.
)
ser
- pd.merge(df, ser.reset_index(), on=['Let', 'Num'])
+ pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Here is another example with duplicate join keys in DataFrames:
.. ipython:: python
- left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
+ left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
- right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})
+ right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
- result = pd.merge(left, right, on='B', how='outer')
+ result = pd.merge(left, right, on="B", how="outer")
.. ipython:: python
:suppress:
@@ -786,8 +816,8 @@ In the following example, there are duplicate values of ``B`` in the right
.. ipython:: python
- left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
- right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+ left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
+ right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
.. code-block:: ipython
@@ -801,7 +831,7 @@ ensure there are no duplicates in the left DataFrame, one can use the
.. ipython:: python
- pd.merge(left, right, on='B', how='outer', validate="one_to_many")
+ pd.merge(left, right, on="B", how="outer", validate="one_to_many")
.. _merging.indicator:
@@ -823,15 +853,15 @@ that takes on values:
.. ipython:: python
- df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
- df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
- pd.merge(df1, df2, on='col1', how='outer', indicator=True)
+ df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
+ df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
+ pd.merge(df1, df2, on="col1", how="outer", indicator=True)
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.
.. ipython:: python
- pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
+ pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
.. _merging.dtypes:
@@ -843,25 +873,25 @@ Merging will preserve the dtype of the join keys.
.. ipython:: python
- left = pd.DataFrame({'key': [1], 'v1': [10]})
+ left = pd.DataFrame({"key": [1], "v1": [10]})
left
- right = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})
+ right = pd.DataFrame({"key": [1, 2], "v1": [20, 30]})
right
We are able to preserve the join keys:
.. ipython:: python
- pd.merge(left, right, how='outer')
- pd.merge(left, right, how='outer').dtypes
+ pd.merge(left, right, how="outer")
+ pd.merge(left, right, how="outer").dtypes
Of course, if missing values are introduced, then the
resulting dtype will be upcast.
.. ipython:: python
- pd.merge(left, right, how='outer', on='key')
- pd.merge(left, right, how='outer', on='key').dtypes
+ pd.merge(left, right, how="outer", on="key")
+ pd.merge(left, right, how="outer", on="key").dtypes
Merging will preserve ``category`` dtypes of the mergands. See also the section on :ref:`categoricals `.
@@ -871,12 +901,12 @@ The left frame.
from pandas.api.types import CategoricalDtype
- X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
- X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
+ X = pd.Series(np.random.choice(["foo", "bar"], size=(10,)))
+ X = X.astype(CategoricalDtype(categories=["foo", "bar"]))
- left = pd.DataFrame({'X': X,
- 'Y': np.random.choice(['one', 'two', 'three'],
- size=(10,))})
+ left = pd.DataFrame(
+ {"X": X, "Y": np.random.choice(["one", "two", "three"], size=(10,))}
+ )
left
left.dtypes
@@ -884,9 +914,12 @@ The right frame.
.. ipython:: python
- right = pd.DataFrame({'X': pd.Series(['foo', 'bar'],
- dtype=CategoricalDtype(['foo', 'bar'])),
- 'Z': [1, 2]})
+ right = pd.DataFrame(
+ {
+ "X": pd.Series(["foo", "bar"], dtype=CategoricalDtype(["foo", "bar"])),
+ "Z": [1, 2],
+ }
+ )
right
right.dtypes
@@ -894,7 +927,7 @@ The merged result:
.. ipython:: python
- result = pd.merge(left, right, how='outer')
+ result = pd.merge(left, right, how="outer")
result
result.dtypes
@@ -918,13 +951,13 @@ potentially differently-indexed ``DataFrames`` into a single result
.. ipython:: python
- left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
- 'B': ['B0', 'B1', 'B2']},
- index=['K0', 'K1', 'K2'])
+ left = pd.DataFrame(
+ {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
+ )
- right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
- 'D': ['D0', 'D2', 'D3']},
- index=['K0', 'K2', 'K3'])
+ right = pd.DataFrame(
+ {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
+ )
result = left.join(right)
@@ -938,7 +971,7 @@ potentially differently-indexed ``DataFrames`` into a single result
.. ipython:: python
- result = left.join(right, how='outer')
+ result = left.join(right, how="outer")
.. ipython:: python
:suppress:
@@ -952,7 +985,7 @@ The same as above, but with ``how='inner'``.
.. ipython:: python
- result = left.join(right, how='inner')
+ result = left.join(right, how="inner")
.. ipython:: python
:suppress:
@@ -968,7 +1001,7 @@ indexes:
.. ipython:: python
- result = pd.merge(left, right, left_index=True, right_index=True, how='outer')
+ result = pd.merge(left, right, left_index=True, right_index=True, how="outer")
.. ipython:: python
:suppress:
@@ -980,7 +1013,7 @@ indexes:
.. ipython:: python
- result = pd.merge(left, right, left_index=True, right_index=True, how='inner');
+ result = pd.merge(left, right, left_index=True, right_index=True, how="inner")
.. ipython:: python
:suppress:
@@ -1010,15 +1043,17 @@ join key), using ``join`` may be more convenient. Here is a simple example:
.. ipython:: python
- left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3'],
- 'key': ['K0', 'K1', 'K0', 'K1']})
+ left = pd.DataFrame(
+ {
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ "key": ["K0", "K1", "K0", "K1"],
+ }
+ )
- right = pd.DataFrame({'C': ['C0', 'C1'],
- 'D': ['D0', 'D1']},
- index=['K0', 'K1'])
+ right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
- result = left.join(right, on='key')
+ result = left.join(right, on="key")
.. ipython:: python
:suppress:
@@ -1030,8 +1065,7 @@ join key), using ``join`` may be more convenient. Here is a simple example:
.. ipython:: python
- result = pd.merge(left, right, left_on='key', right_index=True,
- how='left', sort=False);
+ result = pd.merge(left, right, left_on="key", right_index=True, how="left", sort=False)
.. ipython:: python
:suppress:
@@ -1047,22 +1081,27 @@ To join on multiple keys, the passed DataFrame must have a ``MultiIndex``:
.. ipython:: python
- left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3'],
- 'key1': ['K0', 'K0', 'K1', 'K2'],
- 'key2': ['K0', 'K1', 'K0', 'K1']})
+ left = pd.DataFrame(
+ {
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ "key1": ["K0", "K0", "K1", "K2"],
+ "key2": ["K0", "K1", "K0", "K1"],
+ }
+ )
- index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
- ('K2', 'K0'), ('K2', 'K1')])
- right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3']},
- index=index)
+ index = pd.MultiIndex.from_tuples(
+ [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
+ )
+ right = pd.DataFrame(
+ {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
+ )
Now this can be joined by passing the two key column names:
.. ipython:: python
- result = left.join(right, on=['key1', 'key2'])
+ result = left.join(right, on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -1081,7 +1120,7 @@ easily performed:
.. ipython:: python
- result = left.join(right, on=['key1', 'key2'], how='inner')
+ result = left.join(right, on=["key1", "key2"], how="inner")
.. ipython:: python
:suppress:
@@ -1151,39 +1190,38 @@ the left argument, as in this example:
.. ipython:: python
- leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
- names=['abc', 'xy', 'num'])
- left = pd.DataFrame({'v1': range(12)}, index=leftindex)
+ leftindex = pd.MultiIndex.from_product(
+ [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
+ )
+ left = pd.DataFrame({"v1": range(12)}, index=leftindex)
left
- rightindex = pd.MultiIndex.from_product([list('abc'), list('xy')],
- names=['abc', 'xy'])
- right = pd.DataFrame({'v2': [100 * i for i in range(1, 7)]}, index=rightindex)
+ rightindex = pd.MultiIndex.from_product([list("abc"), list("xy")], names=["abc", "xy"])
+ right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
right
- left.join(right, on=['abc', 'xy'], how='inner')
+ left.join(right, on=["abc", "xy"], how="inner")
If that condition is not satisfied, a join with two multi-indexes can be
done using the following code.
.. ipython:: python
- leftindex = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
- ('K1', 'X2')],
- names=['key', 'X'])
- left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
- 'B': ['B0', 'B1', 'B2']},
- index=leftindex)
+ leftindex = pd.MultiIndex.from_tuples(
+ [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
+ )
+ left = pd.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex)
- rightindex = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
- ('K2', 'Y2'), ('K2', 'Y3')],
- names=['key', 'Y'])
- right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3']},
- index=rightindex)
+ rightindex = pd.MultiIndex.from_tuples(
+ [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
+ )
+ right = pd.DataFrame(
+ {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
+ )
- result = pd.merge(left.reset_index(), right.reset_index(),
- on=['key'], how='inner').set_index(['key', 'X', 'Y'])
+ result = pd.merge(
+ left.reset_index(), right.reset_index(), on=["key"], how="inner"
+ ).set_index(["key", "X", "Y"])
.. ipython:: python
:suppress:
@@ -1198,8 +1236,6 @@ done using the following code.
Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. versionadded:: 0.23
-
Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables merging
``DataFrame`` instances on a combination of index levels and columns without
@@ -1207,21 +1243,29 @@ resetting indexes.
.. ipython:: python
- left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
+ left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
- left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
- 'B': ['B0', 'B1', 'B2', 'B3'],
- 'key2': ['K0', 'K1', 'K0', 'K1']},
- index=left_index)
+ left = pd.DataFrame(
+ {
+ "A": ["A0", "A1", "A2", "A3"],
+ "B": ["B0", "B1", "B2", "B3"],
+ "key2": ["K0", "K1", "K0", "K1"],
+ },
+ index=left_index,
+ )
- right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
+ right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")
- right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
- 'D': ['D0', 'D1', 'D2', 'D3'],
- 'key2': ['K0', 'K0', 'K0', 'K1']},
- index=right_index)
+ right = pd.DataFrame(
+ {
+ "C": ["C0", "C1", "C2", "C3"],
+ "D": ["D0", "D1", "D2", "D3"],
+ "key2": ["K0", "K0", "K0", "K1"],
+ },
+ index=right_index,
+ )
- result = left.merge(right, on=['key1', 'key2'])
+ result = left.merge(right, on=["key1", "key2"])
.. ipython:: python
:suppress:
@@ -1238,7 +1282,7 @@ resetting indexes.
DataFrame.
.. note::
- When DataFrames are merged using only some of the levels of a `MultiIndex`,
+ When DataFrames are merged using only some of the levels of a ``MultiIndex``,
the extra levels will be dropped from the resulting merge. In order to
preserve those levels, use ``reset_index`` on those level names to move
those levels to columns prior to doing the merge.
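As a rough sketch of that workaround (the frames and level names below are
illustrative and not taken from the examples above):

.. code-block:: python

    # Illustrative frames: ``lhs`` carries an extra index level "X" that a merge
    # on "key" alone would otherwise drop from the result.
    lhs = pd.DataFrame(
        {"v1": [1, 2, 3]},
        index=pd.MultiIndex.from_tuples(
            [("K0", "X0"), ("K0", "X1"), ("K1", "X0")], names=["key", "X"]
        ),
    )
    rhs = pd.DataFrame({"v2": [10, 20]}, index=pd.Index(["K0", "K1"], name="key"))

    # Move the index levels to columns, merge on the shared key, then restore the index.
    result = (
        lhs.reset_index()
        .merge(rhs.reset_index(), on="key")
        .set_index(["key", "X"])
    )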
@@ -1258,10 +1302,10 @@ columns:
.. ipython:: python
- left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
- right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})
+ left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
+ right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
- result = pd.merge(left, right, on='k')
+ result = pd.merge(left, right, on="k")
.. ipython:: python
:suppress:
@@ -1273,7 +1317,7 @@ columns:
.. ipython:: python
- result = pd.merge(left, right, on='k', suffixes=['_l', '_r'])
+ result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))
.. ipython:: python
:suppress:
@@ -1288,9 +1332,9 @@ similarly.
.. ipython:: python
- left = left.set_index('k')
- right = right.set_index('k')
- result = left.join(right, lsuffix='_l', rsuffix='_r')
+ left = left.set_index("k")
+ right = right.set_index("k")
+ result = left.join(right, lsuffix="_l", rsuffix="_r")
.. ipython:: python
:suppress:
@@ -1310,7 +1354,7 @@ to join them together on their indexes.
.. ipython:: python
- right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
+ right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])
result = left.join([right, right2])
.. ipython:: python
@@ -1332,10 +1376,8 @@ one object from values for matching indices in the other. Here is an example:
.. ipython:: python
- df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],
- [np.nan, 7., np.nan]])
- df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
- index=[1, 2])
+ df1 = pd.DataFrame([[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]])
+ df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
For this, use the :meth:`~DataFrame.combine_first` method:
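A minimal sketch of the call with the two frames above:

.. code-block:: python

    # Values from df1 take precedence; missing entries are filled from df2
    # wherever the row and column labels align.
    combined = df1.combine_first(df2)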
@@ -1388,14 +1430,13 @@ fill/interpolate missing data:
.. ipython:: python
- left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
- 'lv': [1, 2, 3, 4],
- 's': ['a', 'b', 'c', 'd']})
+ left = pd.DataFrame(
+ {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
+ )
- right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
- 'rv': [1, 2, 3]})
+ right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})
- pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
+ pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
.. _merging.merge_asof:
@@ -1415,37 +1456,44 @@ merge them.
.. ipython:: python
- trades = pd.DataFrame({
- 'time': pd.to_datetime(['20160525 13:30:00.023',
- '20160525 13:30:00.038',
- '20160525 13:30:00.048',
- '20160525 13:30:00.048',
- '20160525 13:30:00.048']),
- 'ticker': ['MSFT', 'MSFT',
- 'GOOG', 'GOOG', 'AAPL'],
- 'price': [51.95, 51.95,
- 720.77, 720.92, 98.00],
- 'quantity': [75, 155,
- 100, 100, 100]},
- columns=['time', 'ticker', 'price', 'quantity'])
-
- quotes = pd.DataFrame({
- 'time': pd.to_datetime(['20160525 13:30:00.023',
- '20160525 13:30:00.023',
- '20160525 13:30:00.030',
- '20160525 13:30:00.041',
- '20160525 13:30:00.048',
- '20160525 13:30:00.049',
- '20160525 13:30:00.072',
- '20160525 13:30:00.075']),
- 'ticker': ['GOOG', 'MSFT', 'MSFT',
- 'MSFT', 'GOOG', 'AAPL', 'GOOG',
- 'MSFT'],
- 'bid': [720.50, 51.95, 51.97, 51.99,
- 720.50, 97.99, 720.50, 52.01],
- 'ask': [720.93, 51.96, 51.98, 52.00,
- 720.93, 98.01, 720.88, 52.03]},
- columns=['time', 'ticker', 'bid', 'ask'])
+ trades = pd.DataFrame(
+ {
+ "time": pd.to_datetime(
+ [
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.038",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.048",
+ ]
+ ),
+ "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
+ "price": [51.95, 51.95, 720.77, 720.92, 98.00],
+ "quantity": [75, 155, 100, 100, 100],
+ },
+ columns=["time", "ticker", "price", "quantity"],
+ )
+
+ quotes = pd.DataFrame(
+ {
+ "time": pd.to_datetime(
+ [
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.030",
+ "20160525 13:30:00.041",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.049",
+ "20160525 13:30:00.072",
+ "20160525 13:30:00.075",
+ ]
+ ),
+ "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
+ "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
+ "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
+ },
+ columns=["time", "ticker", "bid", "ask"],
+ )
.. ipython:: python
@@ -1456,18 +1504,13 @@ By default we are taking the asof of the quotes.
.. ipython:: python
- pd.merge_asof(trades, quotes,
- on='time',
- by='ticker')
+ pd.merge_asof(trades, quotes, on="time", by="ticker")
We only asof within ``2ms`` between the quote time and the trade time.
.. ipython:: python
- pd.merge_asof(trades, quotes,
- on='time',
- by='ticker',
- tolerance=pd.Timedelta('2ms'))
+ pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
We only asof within ``10ms`` between the quote time and the trade time and we
exclude exact matches on time. Note that though we exclude the exact matches
@@ -1475,11 +1518,14 @@ exclude exact matches on time. Note that though we exclude the exact matches
.. ipython:: python
- pd.merge_asof(trades, quotes,
- on='time',
- by='ticker',
- tolerance=pd.Timedelta('10ms'),
- allow_exact_matches=False)
+ pd.merge_asof(
+ trades,
+ quotes,
+ on="time",
+ by="ticker",
+ tolerance=pd.Timedelta("10ms"),
+ allow_exact_matches=False,
+ )
.. _merging.compare:
@@ -1491,7 +1537,7 @@ compare two DataFrame or Series, respectively, and summarize their differences.
This feature was added in :ref:`V1.1.0 `.
-For example, you might want to compare two `DataFrame` and stack their differences
+For example, you might want to compare two ``DataFrame`` objects and stack their differences
side by side.
.. ipython:: python
@@ -1500,7 +1546,7 @@ side by side.
{
"col1": ["a", "a", "b", "b", "a"],
"col2": [1.0, 2.0, 3.0, np.nan, 5.0],
- "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
+ "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
},
columns=["col1", "col2", "col3"],
)
@@ -1509,8 +1555,8 @@ side by side.
.. ipython:: python
df2 = df.copy()
- df2.loc[0, 'col1'] = 'c'
- df2.loc[2, 'col3'] = 4.0
+ df2.loc[0, "col1"] = "c"
+ df2.loc[2, "col3"] = 4.0
df2
.. ipython:: python
@@ -1527,7 +1573,7 @@ If you wish, you may choose to stack the differences on rows.
df.compare(df2, align_axis=0)
-If you wish to keep all original rows and columns, set `keep_shape` argument
+If you wish to keep all original rows and columns, set the ``keep_shape`` argument
to ``True``.
.. ipython:: python
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
index 2e68a0598bb71..7eb377694910b 100644
--- a/doc/source/user_guide/missing_data.rst
+++ b/doc/source/user_guide/missing_data.rst
@@ -38,12 +38,15 @@ arise and we wish to also consider that "missing" or "not available" or "NA".
.. ipython:: python
- df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
- columns=['one', 'two', 'three'])
- df['four'] = 'bar'
- df['five'] = df['one'] > 0
+ df = pd.DataFrame(
+ np.random.randn(5, 3),
+ index=["a", "c", "e", "f", "h"],
+ columns=["one", "two", "three"],
+ )
+ df["four"] = "bar"
+ df["five"] = df["one"] > 0
df
- df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
+ df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
df2
To make detecting missing values easier (and across different array dtypes),
@@ -53,9 +56,9 @@ Series and DataFrame objects:
.. ipython:: python
- df2['one']
- pd.isna(df2['one'])
- df2['four'].notna()
+ df2["one"]
+ pd.isna(df2["one"])
+ df2["four"].notna()
df2.isna()
.. warning::
@@ -65,20 +68,20 @@ Series and DataFrame objects:
.. ipython:: python
- None == None # noqa: E711
+ None == None # noqa: E711
np.nan == np.nan
So as compared to above, a scalar equality comparison versus a ``None/np.nan`` doesn't provide useful information.
.. ipython:: python
- df2['one'] == np.nan
+ df2["one"] == np.nan
Integer dtypes and missing data
-------------------------------
Because ``NaN`` is a float, a column of integers with even one missing value
-is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
+is cast to floating-point dtype (see :ref:`gotchas.intna` for more). pandas
provides a nullable integer array, which can be used by explicitly requesting
the dtype:
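A short sketch of requesting the nullable dtype (the values are illustrative):

.. code-block:: python

    # The missing entry is kept as <NA> instead of forcing the column to float64.
    s = pd.Series([1, 2, np.nan, 4], dtype="Int64")
    s.dtype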
@@ -101,9 +104,9 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``.
.. ipython:: python
df2 = df.copy()
- df2['timestamp'] = pd.Timestamp('20120101')
+ df2["timestamp"] = pd.Timestamp("20120101")
df2
- df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan
+ df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan
df2
df2.dtypes.value_counts()
@@ -146,9 +149,9 @@ objects.
.. ipython:: python
:suppress:
- df = df2.loc[:, ['one', 'two', 'three']]
- a = df2.loc[df2.index[:5], ['one', 'two']].fillna(method='pad')
- b = df2.loc[df2.index[:5], ['one', 'two', 'three']]
+ df = df2.loc[:, ["one", "two", "three"]]
+ a = df2.loc[df2.index[:5], ["one", "two"]].fillna(method="pad")
+ b = df2.loc[df2.index[:5], ["one", "two", "three"]]
.. ipython:: python
@@ -168,7 +171,7 @@ account for missing data. For example:
.. ipython:: python
df
- df['one'].sum()
+ df["one"].sum()
df.mean(1)
df.cumsum()
df.cumsum(skipna=False)
@@ -210,7 +213,7 @@ with R, for example:
.. ipython:: python
df
- df.groupby('one').mean()
+ df.groupby("one").mean()
See the groupby section :ref:`here ` for more information.
@@ -234,7 +237,7 @@ of ways, which we illustrate:
df2
df2.fillna(0)
- df2['one'].fillna('missing')
+ df2["one"].fillna("missing")
**Fill gaps forward or backward**
@@ -244,14 +247,14 @@ can propagate non-NA values forward or backward:
.. ipython:: python
df
- df.fillna(method='pad')
+ df.fillna(method="pad")
.. _missing_data.fillna.limit:
**Limit the amount of filling**
If we only want consecutive gaps filled up to a certain number of data points,
-we can use the `limit` keyword:
+we can use the ``limit`` keyword:
.. ipython:: python
:suppress:
@@ -261,7 +264,7 @@ we can use the `limit` keyword:
.. ipython:: python
df
- df.fillna(method='pad', limit=1)
+ df.fillna(method="pad", limit=1)
To remind you, these are the available filling methods:
@@ -289,21 +292,21 @@ use case of this is to fill a DataFrame with the mean of that column.
.. ipython:: python
- dff = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
+ dff = pd.DataFrame(np.random.randn(10, 3), columns=list("ABC"))
dff.iloc[3:5, 0] = np.nan
dff.iloc[4:6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
dff
dff.fillna(dff.mean())
- dff.fillna(dff.mean()['B':'C'])
+ dff.fillna(dff.mean()["B":"C"])
Same result as above, but this aligns the 'fill' value, which is
a Series in this case.
.. ipython:: python
- dff.where(pd.notna(dff), dff.mean(), axis='columns')
+ dff.where(pd.notna(dff), dff.mean(), axis="columns")
.. _missing_data.dropna:
@@ -317,15 +320,15 @@ data. To do this, use :meth:`~DataFrame.dropna`:
.. ipython:: python
:suppress:
- df['two'] = df['two'].fillna(0)
- df['three'] = df['three'].fillna(0)
+ df["two"] = df["two"].fillna(0)
+ df["three"] = df["three"].fillna(0)
.. ipython:: python
df
df.dropna(axis=0)
df.dropna(axis=1)
- df['one'].dropna()
+ df["one"].dropna()
An equivalent :meth:`~Series.dropna` is available for Series.
DataFrame.dropna has considerably more options than Series.dropna, which can be
@@ -336,10 +339,6 @@ examined :ref:`in the API `.
Interpolation
~~~~~~~~~~~~~
-.. versionadded:: 0.23.0
-
- The ``limit_area`` keyword argument was added.
-
Both Series and DataFrame objects have :meth:`~DataFrame.interpolate`
that, by default, performs linear interpolation at missing data points.
@@ -347,7 +346,7 @@ that, by default, performs linear interpolation at missing data points.
:suppress:
np.random.seed(123456)
- idx = pd.date_range('1/1/2000', periods=100, freq='BM')
+ idx = pd.date_range("1/1/2000", periods=100, freq="BM")
ts = pd.Series(np.random.randn(100), index=idx)
ts[1:5] = np.nan
ts[20:30] = np.nan
@@ -380,28 +379,29 @@ Index aware interpolation is available via the ``method`` keyword:
ts2
ts2.interpolate()
- ts2.interpolate(method='time')
+ ts2.interpolate(method="time")
For a floating-point index, use ``method='values'``:
.. ipython:: python
:suppress:
- idx = [0., 1., 10.]
- ser = pd.Series([0., np.nan, 10.], idx)
+ idx = [0.0, 1.0, 10.0]
+ ser = pd.Series([0.0, np.nan, 10.0], idx)
.. ipython:: python
ser
ser.interpolate()
- ser.interpolate(method='values')
+ ser.interpolate(method="values")
You can also interpolate with a DataFrame:
.. ipython:: python
- df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
- 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
+ df = pd.DataFrame(
+ {"A": [1, 2.1, np.nan, 4.7, 5.6, 6.8], "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4]}
+ )
df
df.interpolate()
@@ -422,20 +422,20 @@ The appropriate interpolation method will depend on the type of data you are wor
.. ipython:: python
- df.interpolate(method='barycentric')
+ df.interpolate(method="barycentric")
- df.interpolate(method='pchip')
+ df.interpolate(method="pchip")
- df.interpolate(method='akima')
+ df.interpolate(method="akima")
When interpolating via a polynomial or spline approximation, you must also specify
the degree or order of the approximation:
.. ipython:: python
- df.interpolate(method='spline', order=2)
+ df.interpolate(method="spline", order=2)
- df.interpolate(method='polynomial', order=2)
+ df.interpolate(method="polynomial", order=2)
Compare several methods:
@@ -443,10 +443,10 @@ Compare several methods:
np.random.seed(2)
- ser = pd.Series(np.arange(1, 10.1, .25) ** 2 + np.random.randn(37))
+ ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))
missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
ser[missing] = np.nan
- methods = ['linear', 'quadratic', 'cubic']
+ methods = ["linear", "quadratic", "cubic"]
df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
@savefig compare_interpolations.png
@@ -464,7 +464,7 @@ at the new values.
# interpolate at new_index
new_index = ser.index | pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75])
- interp_s = ser.reindex(new_index).interpolate(method='pchip')
+ interp_s = ser.reindex(new_index).interpolate(method="pchip")
interp_s[49:51]
.. _scipy: https://www.scipy.org
@@ -482,8 +482,7 @@ filled since the last valid observation:
.. ipython:: python
- ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
- np.nan, 13, np.nan, np.nan])
+ ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
ser
# fill all consecutive values in a forward direction
@@ -498,28 +497,28 @@ By default, ``NaN`` values are filled in a ``forward`` direction. Use
.. ipython:: python
# fill one consecutive value backwards
- ser.interpolate(limit=1, limit_direction='backward')
+ ser.interpolate(limit=1, limit_direction="backward")
# fill one consecutive value in both directions
- ser.interpolate(limit=1, limit_direction='both')
+ ser.interpolate(limit=1, limit_direction="both")
# fill all consecutive values in both directions
- ser.interpolate(limit_direction='both')
+ ser.interpolate(limit_direction="both")
By default, ``NaN`` values are filled whether they are inside (surrounded by)
-existing valid values, or outside existing valid values. Introduced in v0.23
-the ``limit_area`` parameter restricts filling to either inside or outside values.
+existing valid values, or outside existing valid values. The ``limit_area``
+parameter restricts filling to either inside or outside values.
.. ipython:: python
# fill one consecutive inside value in both directions
- ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
+ ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
# fill all consecutive outside values backward
- ser.interpolate(limit_direction='backward', limit_area='outside')
+ ser.interpolate(limit_direction="backward", limit_area="outside")
# fill all consecutive outside values in both directions
- ser.interpolate(limit_direction='both', limit_area='outside')
+ ser.interpolate(limit_direction="both", limit_area="outside")
.. _missing_data.replace:
@@ -535,7 +534,7 @@ value:
.. ipython:: python
- ser = pd.Series([0., 1., 2., 3., 4.])
+ ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
ser.replace(0, 5)
@@ -555,16 +554,16 @@ For a DataFrame, you can specify individual values by column:
.. ipython:: python
- df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
+ df = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})
- df.replace({'a': 0, 'b': 5}, 100)
+ df.replace({"a": 0, "b": 5}, 100)
Instead of replacing with specified values, you can treat all given values as
missing and interpolate over them:
.. ipython:: python
- ser.replace([1, 2, 3], method='pad')
+ ser.replace([1, 2, 3], method="pad")
.. _missing_data.replace_expression:
@@ -585,67 +584,67 @@ Replace the '.' with ``NaN`` (str -> str):
.. ipython:: python
- d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
+ d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
df = pd.DataFrame(d)
- df.replace('.', np.nan)
+ df.replace(".", np.nan)
Now do it with a regular expression that removes surrounding whitespace
(regex -> regex):
.. ipython:: python
- df.replace(r'\s*\.\s*', np.nan, regex=True)
+ df.replace(r"\s*\.\s*", np.nan, regex=True)
Replace a few different values (list -> list):
.. ipython:: python
- df.replace(['a', '.'], ['b', np.nan])
+ df.replace(["a", "."], ["b", np.nan])
list of regex -> list of regex:
.. ipython:: python
- df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
+ df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Only search in column ``'b'`` (dict -> dict):
.. ipython:: python
- df.replace({'b': '.'}, {'b': np.nan})
+ df.replace({"b": "."}, {"b": np.nan})
Same as the previous example, but use a regular expression for
searching instead (dict of regex -> dict):
.. ipython:: python
- df.replace({'b': r'\s*\.\s*'}, {'b': np.nan}, regex=True)
+ df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
You can pass nested dictionaries of regular expressions that use ``regex=True``:
.. ipython:: python
- df.replace({'b': {'b': r''}}, regex=True)
+ df.replace({"b": {"b": r""}}, regex=True)
Alternatively, you can pass the nested dictionary like so:
.. ipython:: python
- df.replace(regex={'b': {r'\s*\.\s*': np.nan}})
+ df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
You can also use the group of a regular expression match when replacing (dict
of regex -> dict of regex); this works for lists as well.
.. ipython:: python
- df.replace({'b': r'\s*(\.)\s*'}, {'b': r'\1ty'}, regex=True)
+ df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
You can pass a list of regular expressions, of which those that match
will be replaced with a scalar (list of regex -> regex).
.. ipython:: python
- df.replace([r'\s*\.\s*', r'a|b'], np.nan, regex=True)
+ df.replace([r"\s*\.\s*", r"a|b"], np.nan, regex=True)
All of the regular expression examples can also be passed with the
``to_replace`` argument as the ``regex`` argument. In this case the ``value``
@@ -654,7 +653,7 @@ dictionary. The previous example, in this case, would then be:
.. ipython:: python
- df.replace(regex=[r'\s*\.\s*', r'a|b'], value=np.nan)
+ df.replace(regex=[r"\s*\.\s*", r"a|b"], value=np.nan)
This can be convenient if you do not want to pass ``regex=True`` every time you
want to use a regular expression.
@@ -680,7 +679,7 @@ Replacing more than one value is possible by passing a list.
.. ipython:: python
df00 = df.iloc[0, 0]
- df.replace([1.5, df00], [np.nan, 'a'])
+ df.replace([1.5, df00], [np.nan, "a"])
df[1].dtype
You can also operate on the DataFrame in place:
@@ -689,32 +688,6 @@ You can also operate on the DataFrame in place:
df.replace(1.5, np.nan, inplace=True)
-.. warning::
-
- When replacing multiple ``bool`` or ``datetime64`` objects, the first
- argument to ``replace`` (``to_replace``) must match the type of the value
- being replaced. For example,
-
- .. code-block:: python
-
- >>> s = pd.Series([True, False, True])
- >>> s.replace({'a string': 'new value', True: False}) # raises
- TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
-
- will raise a ``TypeError`` because one of the ``dict`` keys is not of the
- correct type for replacement.
-
- However, when replacing a *single* object such as,
-
- .. ipython:: python
-
- s = pd.Series([True, False, True])
- s.replace('a string', 'another string')
-
- the original ``NDFrame`` object will be returned untouched. We're working on
- unifying this API, but for backwards compatibility reasons we cannot break
- the latter behavior. See :issue:`6354` for more details.
-
Missing data casting rules and indexing
---------------------------------------
@@ -762,7 +735,7 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work
reindexed[crit.fillna(False)]
reindexed[crit.fillna(True)]
-Pandas provides a nullable integer dtype, but you must explicitly request it
+pandas provides a nullable integer dtype, but you must explicitly request it
when creating the series or column. Notice that we use a capital "I" in
``dtype="Int64"``.
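As a minimal sketch (the data below are purely illustrative), requesting the
nullable dtype looks like this:
.. code-block:: python
    # "Int64" (capital "I") is the nullable extension dtype;
    # "int64" (lowercase) is the plain NumPy dtype
    s = pd.Series([1, 2, None], dtype="Int64")
    s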
@@ -962,7 +935,7 @@ the first 10 columns.
.. ipython:: python
- bb = pd.read_csv('data/baseball.csv', index_col='id')
+ bb = pd.read_csv("data/baseball.csv", index_col="id")
bb[bb.columns[:10]].dtypes
.. ipython:: python
diff --git a/doc/source/user_guide/options.rst b/doc/source/user_guide/options.rst
index 398336960e769..d222297abc70b 100644
--- a/doc/source/user_guide/options.rst
+++ b/doc/source/user_guide/options.rst
@@ -17,6 +17,7 @@ You can get/set options directly as attributes of the top-level ``options`` attr
.. ipython:: python
import pandas as pd
+
pd.options.display.max_rows
pd.options.display.max_rows = 999
pd.options.display.max_rows
@@ -77,9 +78,9 @@ are available from the pandas namespace. To change an option, call
.. ipython:: python
- pd.get_option('mode.sim_interactive')
- pd.set_option('mode.sim_interactive', True)
- pd.get_option('mode.sim_interactive')
+ pd.get_option("mode.sim_interactive")
+ pd.set_option("mode.sim_interactive", True)
+ pd.get_option("mode.sim_interactive")
**Note:** The option 'mode.sim_interactive' is mostly used for debugging purposes.
@@ -109,7 +110,7 @@ It's also possible to reset multiple options at once (using a regex):
``option_context`` context manager has been exposed through
the top-level API, allowing you to execute code with given option values. Option values
-are restored automatically when you exit the `with` block:
+are restored automatically when you exit the ``with`` block:
.. ipython:: python
@@ -135,8 +136,9 @@ More information can be found in the `ipython documentation
.. code-block:: python
import pandas as pd
- pd.set_option('display.max_rows', 999)
- pd.set_option('precision', 5)
+
+ pd.set_option("display.max_rows", 999)
+ pd.set_option("precision", 5)
.. _options.frequently_used:
@@ -151,27 +153,27 @@ lines are replaced by an ellipsis.
.. ipython:: python
df = pd.DataFrame(np.random.randn(7, 2))
- pd.set_option('max_rows', 7)
+ pd.set_option("max_rows", 7)
df
- pd.set_option('max_rows', 5)
+ pd.set_option("max_rows", 5)
df
- pd.reset_option('max_rows')
+ pd.reset_option("max_rows")
Once ``display.max_rows`` is exceeded, the ``display.min_rows`` option
determines how many rows are shown in the truncated repr.
.. ipython:: python
- pd.set_option('max_rows', 8)
- pd.set_option('min_rows', 4)
+ pd.set_option("max_rows", 8)
+ pd.set_option("min_rows", 4)
# below max_rows -> all rows shown
df = pd.DataFrame(np.random.randn(7, 2))
df
# above max_rows -> only min_rows (4) rows shown
df = pd.DataFrame(np.random.randn(9, 2))
df
- pd.reset_option('max_rows')
- pd.reset_option('min_rows')
+ pd.reset_option("max_rows")
+ pd.reset_option("min_rows")
``display.expand_frame_repr`` allows for the representation of
dataframes to stretch across pages, wrapped over the full column vs row-wise.
@@ -179,11 +181,11 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise.
.. ipython:: python
df = pd.DataFrame(np.random.randn(5, 10))
- pd.set_option('expand_frame_repr', True)
+ pd.set_option("expand_frame_repr", True)
df
- pd.set_option('expand_frame_repr', False)
+ pd.set_option("expand_frame_repr", False)
df
- pd.reset_option('expand_frame_repr')
+ pd.reset_option("expand_frame_repr")
``display.large_repr`` lets you select whether to display dataframes that exceed
``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
@@ -191,26 +193,32 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise.
.. ipython:: python
df = pd.DataFrame(np.random.randn(10, 10))
- pd.set_option('max_rows', 5)
- pd.set_option('large_repr', 'truncate')
+ pd.set_option("max_rows", 5)
+ pd.set_option("large_repr", "truncate")
df
- pd.set_option('large_repr', 'info')
+ pd.set_option("large_repr", "info")
df
- pd.reset_option('large_repr')
- pd.reset_option('max_rows')
+ pd.reset_option("large_repr")
+ pd.reset_option("max_rows")
``display.max_colwidth`` sets the maximum width of columns. Cells
of this length or longer will be truncated with an ellipsis.
.. ipython:: python
- df = pd.DataFrame(np.array([['foo', 'bar', 'bim', 'uncomfortably long string'],
- ['horse', 'cow', 'banana', 'apple']]))
- pd.set_option('max_colwidth', 40)
+ df = pd.DataFrame(
+ np.array(
+ [
+ ["foo", "bar", "bim", "uncomfortably long string"],
+ ["horse", "cow", "banana", "apple"],
+ ]
+ )
+ )
+ pd.set_option("max_colwidth", 40)
df
- pd.set_option('max_colwidth', 6)
+ pd.set_option("max_colwidth", 6)
df
- pd.reset_option('max_colwidth')
+ pd.reset_option("max_colwidth")
``display.max_info_columns`` sets a threshold for when by-column info
will be given.
@@ -218,11 +226,11 @@ will be given.
.. ipython:: python
df = pd.DataFrame(np.random.randn(10, 10))
- pd.set_option('max_info_columns', 11)
+ pd.set_option("max_info_columns", 11)
df.info()
- pd.set_option('max_info_columns', 5)
+ pd.set_option("max_info_columns", 5)
df.info()
- pd.reset_option('max_info_columns')
+ pd.reset_option("max_info_columns")
``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
For large frames this can be quite slow. ``max_info_rows`` and ``max_info_cols``
@@ -233,11 +241,11 @@ can specify the option ``df.info(null_counts=True)`` to override on showing a pa
df = pd.DataFrame(np.random.choice([0, 1, np.nan], size=(10, 10)))
df
- pd.set_option('max_info_rows', 11)
+ pd.set_option("max_info_rows", 11)
df.info()
- pd.set_option('max_info_rows', 5)
+ pd.set_option("max_info_rows", 5)
df.info()
- pd.reset_option('max_info_rows')
+ pd.reset_option("max_info_rows")
``display.precision`` sets the output display precision in terms of decimal places.
This is only a suggestion.
@@ -245,9 +253,9 @@ This is only a suggestion.
.. ipython:: python
df = pd.DataFrame(np.random.randn(5, 5))
- pd.set_option('precision', 7)
+ pd.set_option("precision", 7)
df
- pd.set_option('precision', 4)
+ pd.set_option("precision", 4)
df
``display.chop_threshold`` sets at what level pandas rounds to zero when
@@ -257,26 +265,27 @@ precision at which the number is stored.
.. ipython:: python
df = pd.DataFrame(np.random.randn(6, 6))
- pd.set_option('chop_threshold', 0)
+ pd.set_option("chop_threshold", 0)
df
- pd.set_option('chop_threshold', .5)
+ pd.set_option("chop_threshold", 0.5)
df
- pd.reset_option('chop_threshold')
+ pd.reset_option("chop_threshold")
``display.colheader_justify`` controls the justification of the headers.
The options are 'right' and 'left'.
.. ipython:: python
- df = pd.DataFrame(np.array([np.random.randn(6),
- np.random.randint(1, 9, 6) * .1,
- np.zeros(6)]).T,
- columns=['A', 'B', 'C'], dtype='float')
- pd.set_option('colheader_justify', 'right')
+ df = pd.DataFrame(
+ np.array([np.random.randn(6), np.random.randint(1, 9, 6) * 0.1, np.zeros(6)]).T,
+ columns=["A", "B", "C"],
+ dtype="float",
+ )
+ pd.set_option("colheader_justify", "right")
df
- pd.set_option('colheader_justify', 'left')
+ pd.set_option("colheader_justify", "left")
df
- pd.reset_option('colheader_justify')
+ pd.reset_option("colheader_justify")
@@ -306,10 +315,10 @@ display.encoding UTF-8 Defaults to the detected en
meant to be displayed on the console.
display.expand_frame_repr True Whether to print out the full DataFrame
repr for wide DataFrames across
- multiple lines, `max_columns` is
+ multiple lines, ``max_columns`` is
still respected, but the output will
wrap-around across multiple "pages"
- if its width exceeds `display.width`.
+ if its width exceeds ``display.width``.
display.float_format None The callable should accept a floating
point number and return a string with
the desired format of the number.
@@ -371,11 +380,11 @@ display.max_rows 60 This sets the maximum numbe
fully or just a truncated or summary repr.
'None' value means unlimited.
display.min_rows 10 The numbers of rows to show in a truncated
- repr (when `max_rows` is exceeded). Ignored
- when `max_rows` is set to None or 0. When set
- to None, follows the value of `max_rows`.
+ repr (when ``max_rows`` is exceeded). Ignored
+ when ``max_rows`` is set to None or 0. When set
+ to None, follows the value of ``max_rows``.
display.max_seq_items 100 when pretty-printing a long sequence,
- no more then `max_seq_items` will
+ no more than ``max_seq_items`` will
be printed. If items are omitted,
they will be denoted by the addition
of "..." to the resulting string.
@@ -481,9 +490,9 @@ For instance:
import numpy as np
pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
- s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
- s / 1.e3
- s / 1.e6
+ s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
+ s / 1.0e3
+ s / 1.0e6
.. ipython:: python
:suppress:
@@ -510,7 +519,7 @@ If a DataFrame or Series contains these characters, the default output mode may
.. ipython:: python
- df = pd.DataFrame({'国籍': ['UK', '日本'], '名前': ['Alice', 'しのぶ']})
+ df = pd.DataFrame({"国籍": ["UK", "日本"], "名前": ["Alice", "しのぶ"]})
df
.. image:: ../_static/option_unicode01.png
@@ -521,7 +530,7 @@ times than the standard ``len`` function.
.. ipython:: python
- pd.set_option('display.unicode.east_asian_width', True)
+ pd.set_option("display.unicode.east_asian_width", True)
df
.. image:: ../_static/option_unicode02.png
@@ -533,7 +542,7 @@ By default, an "Ambiguous" character's width, such as "¡" (inverted exclamation
.. ipython:: python
- df = pd.DataFrame({'a': ['xxx', '¡¡'], 'b': ['yyy', '¡¡']})
+ df = pd.DataFrame({"a": ["xxx", "¡¡"], "b": ["yyy", "¡¡"]})
df
.. image:: ../_static/option_unicode03.png
@@ -545,7 +554,7 @@ However, setting this option incorrectly for your terminal will cause these char
.. ipython:: python
- pd.set_option('display.unicode.ambiguous_as_wide', True)
+ pd.set_option("display.unicode.ambiguous_as_wide", True)
df
.. image:: ../_static/option_unicode04.png
@@ -553,8 +562,8 @@ However, setting this option incorrectly for your terminal will cause these char
.. ipython:: python
:suppress:
- pd.set_option('display.unicode.east_asian_width', False)
- pd.set_option('display.unicode.ambiguous_as_wide', False)
+ pd.set_option("display.unicode.east_asian_width", False)
+ pd.set_option("display.unicode.ambiguous_as_wide", False)
.. _options.table_schema:
@@ -567,7 +576,7 @@ by default. False by default, this can be enabled globally with the
.. ipython:: python
- pd.set_option('display.html.table_schema', True)
+ pd.set_option("display.html.table_schema", True)
Only ``'display.max_rows'`` is serialized and published.
@@ -575,4 +584,4 @@ Only ``'display.max_rows'`` are serialized and published.
.. ipython:: python
:suppress:
- pd.reset_option('display.html.table_schema')
+ pd.reset_option("display.html.table_schema")
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
index c476e33b8ddde..2061185b25416 100644
--- a/doc/source/user_guide/reshaping.rst
+++ b/doc/source/user_guide/reshaping.rst
@@ -18,14 +18,18 @@ Reshaping by pivoting DataFrame objects
import pandas._testing as tm
+
def unpivot(frame):
N, K = frame.shape
- data = {'value': frame.to_numpy().ravel('F'),
- 'variable': np.asarray(frame.columns).repeat(N),
- 'date': np.tile(np.asarray(frame.index), K)}
- columns = ['date', 'variable', 'value']
+ data = {
+ "value": frame.to_numpy().ravel("F"),
+ "variable": np.asarray(frame.columns).repeat(N),
+ "date": np.tile(np.asarray(frame.index), K),
+ }
+ columns = ["date", "variable", "value"]
return pd.DataFrame(data, columns=columns)
+
df = unpivot(tm.makeTimeDataFrame(3))
Data is often stored in so-called "stacked" or "record" format:
@@ -41,12 +45,15 @@ For the curious here is how the above ``DataFrame`` was created:
import pandas._testing as tm
+
def unpivot(frame):
N, K = frame.shape
- data = {'value': frame.to_numpy().ravel('F'),
- 'variable': np.asarray(frame.columns).repeat(N),
- 'date': np.tile(np.asarray(frame.index), K)}
- return pd.DataFrame(data, columns=['date', 'variable', 'value'])
+ data = {
+ "value": frame.to_numpy().ravel("F"),
+ "variable": np.asarray(frame.columns).repeat(N),
+ "date": np.tile(np.asarray(frame.index), K),
+ }
+ return pd.DataFrame(data, columns=["date", "variable", "value"])
df = unpivot(tm.makeTimeDataFrame(3))
@@ -55,7 +62,7 @@ To select out everything for variable ``A`` we could do:
.. ipython:: python
- df[df['variable'] == 'A']
+ df[df["variable"] == "A"]
But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
@@ -65,7 +72,7 @@ top level function :func:`~pandas.pivot`):
.. ipython:: python
- df.pivot(index='date', columns='variable', values='value')
+ df.pivot(index="date", columns="variable", values="value")
If the ``values`` argument is omitted, and the input ``DataFrame`` has more than
one column of values which are not used as column or index inputs to ``pivot``,
@@ -75,15 +82,15 @@ column:
.. ipython:: python
- df['value2'] = df['value'] * 2
- pivoted = df.pivot(index='date', columns='variable')
+ df["value2"] = df["value"] * 2
+ pivoted = df.pivot(index="date", columns="variable")
pivoted
You can then select subsets from the pivoted ``DataFrame``:
.. ipython:: python
- pivoted['value2']
+ pivoted["value2"]
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.
@@ -121,12 +128,16 @@ from the hierarchical indexing section:
.. ipython:: python
- tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
- 'foo', 'foo', 'qux', 'qux'],
- ['one', 'two', 'one', 'two',
- 'one', 'two', 'one', 'two']]))
- index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
- df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
+ tuples = list(
+ zip(
+ *[
+ ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
+ ["one", "two", "one", "two", "one", "two", "one", "two"],
+ ]
+ )
+ )
+ index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
+ df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
@@ -163,7 +174,7 @@ the level numbers:
.. ipython:: python
- stacked.unstack('second')
+ stacked.unstack("second")
.. image:: ../_static/reshaping_unstack_0.png
@@ -174,8 +185,8 @@ will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
.. ipython:: python
- index = pd.MultiIndex.from_product([[2, 1], ['a', 'b']])
- df = pd.DataFrame(np.random.randn(4), index=index, columns=['A'])
+ index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
+ df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])
df
all(df.unstack().stack() == df.sort_index())
@@ -193,15 +204,19 @@ processed individually.
.. ipython:: python
- columns = pd.MultiIndex.from_tuples([
- ('A', 'cat', 'long'), ('B', 'cat', 'long'),
- ('A', 'dog', 'short'), ('B', 'dog', 'short')],
- names=['exp', 'animal', 'hair_length']
+ columns = pd.MultiIndex.from_tuples(
+ [
+ ("A", "cat", "long"),
+ ("B", "cat", "long"),
+ ("A", "dog", "short"),
+ ("B", "dog", "short"),
+ ],
+ names=["exp", "animal", "hair_length"],
)
df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
df
- df.stack(level=['animal', 'hair_length'])
+ df.stack(level=["animal", "hair_length"])
The list of levels can contain either level names or level numbers (but
not a mixture of the two).
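For example, a sketch of the equivalent call using level numbers instead of
names (assuming the ``df`` with the ``exp``/``animal``/``hair_length`` column
``MultiIndex`` built above):
.. code-block:: python
    # levels 1 and 2 of the column MultiIndex are "animal" and "hair_length"
    df.stack(level=[1, 2])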
@@ -222,12 +237,12 @@ calling ``sort_index``, of course). Here is a more complex example:
.. ipython:: python
- columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
- ('B', 'cat'), ('A', 'dog')],
- names=['exp', 'animal'])
- index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
- ('one', 'two')],
- names=['first', 'second'])
+ columns = pd.MultiIndex.from_tuples(
+ [("A", "cat"), ("B", "dog"), ("B", "cat"), ("A", "dog")], names=["exp", "animal"]
+ )
+ index = pd.MultiIndex.from_product(
+ [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
+ )
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
df2
@@ -237,8 +252,8 @@ which level in the columns to stack:
.. ipython:: python
- df2.stack('exp')
- df2.stack('animal')
+ df2.stack("exp")
+ df2.stack("animal")
Unstacking can result in missing values if subgroups do not have the same
set of labels. By default, missing values will be replaced with the default
@@ -288,13 +303,37 @@ For instance,
.. ipython:: python
- cheese = pd.DataFrame({'first': ['John', 'Mary'],
- 'last': ['Doe', 'Bo'],
- 'height': [5.5, 6.0],
- 'weight': [130, 150]})
+ cheese = pd.DataFrame(
+ {
+ "first": ["John", "Mary"],
+ "last": ["Doe", "Bo"],
+ "height": [5.5, 6.0],
+ "weight": [130, 150],
+ }
+ )
+ cheese
+ cheese.melt(id_vars=["first", "last"])
+ cheese.melt(id_vars=["first", "last"], var_name="quantity")
+
+When transforming a DataFrame using :func:`~pandas.melt`, the index will be ignored. The original index values can be kept around by setting the ``ignore_index`` parameter to ``False`` (default is ``True``). This will, however, duplicate them.
+
+.. versionadded:: 1.1.0
+
+.. ipython:: python
+
+ index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
+ cheese = pd.DataFrame(
+ {
+ "first": ["John", "Mary"],
+ "last": ["Doe", "Bo"],
+ "height": [5.5, 6.0],
+ "weight": [130, 150],
+ },
+ index=index,
+ )
cheese
- cheese.melt(id_vars=['first', 'last'])
- cheese.melt(id_vars=['first', 'last'], var_name='quantity')
+ cheese.melt(id_vars=["first", "last"])
+ cheese.melt(id_vars=["first", "last"], ignore_index=False)
Another way to transform is to use the :func:`~pandas.wide_to_long` panel data
convenience function. It is less flexible than :func:`~pandas.melt`, but more
@@ -302,12 +341,15 @@ user-friendly.
.. ipython:: python
- dft = pd.DataFrame({"A1970": {0: "a", 1: "b", 2: "c"},
- "A1980": {0: "d", 1: "e", 2: "f"},
- "B1970": {0: 2.5, 1: 1.2, 2: .7},
- "B1980": {0: 3.2, 1: 1.3, 2: .1},
- "X": dict(zip(range(3), np.random.randn(3)))
- })
+ dft = pd.DataFrame(
+ {
+ "A1970": {0: "a", 1: "b", 2: "c"},
+ "A1980": {0: "d", 1: "e", 2: "f"},
+ "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
+ "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
+ "X": dict(zip(range(3), np.random.randn(3))),
+ }
+ )
dft["id"] = dft.index
dft
pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
@@ -364,23 +406,27 @@ Consider a data set like this:
.. ipython:: python
import datetime
- df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
- 'B': ['A', 'B', 'C'] * 8,
- 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
- 'D': np.random.randn(24),
- 'E': np.random.randn(24),
- 'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)]
- + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
+
+ df = pd.DataFrame(
+ {
+ "A": ["one", "one", "two", "three"] * 6,
+ "B": ["A", "B", "C"] * 8,
+ "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
+ "D": np.random.randn(24),
+ "E": np.random.randn(24),
+ "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
+ + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
+ }
+ )
df
We can produce pivot tables from this data very easily:
.. ipython:: python
- pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
- pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)
- pd.pivot_table(df, values=['D', 'E'], index=['B'], columns=['A', 'C'],
- aggfunc=np.sum)
+ pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
+ pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
+ pd.pivot_table(df, values=["D", "E"], index=["B"], columns=["A", "C"], aggfunc=np.sum)
The result object is a ``DataFrame`` having potentially hierarchical indexes on the
rows and columns. If the ``values`` column name is not given, the pivot table
@@ -389,22 +435,21 @@ hierarchy in the columns:
.. ipython:: python
- pd.pivot_table(df, index=['A', 'B'], columns=['C'])
+ pd.pivot_table(df, index=["A", "B"], columns=["C"])
Also, you can use ``Grouper`` for the ``index`` and ``columns`` keywords. For details on ``Grouper``, see :ref:`Grouping with a Grouper specification `.
.. ipython:: python
- pd.pivot_table(df, values='D', index=pd.Grouper(freq='M', key='F'),
- columns='C')
+ pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
You can render a nice output of the table omitting the missing values by
calling ``to_string`` if you wish:
.. ipython:: python
- table = pd.pivot_table(df, index=['A', 'B'], columns=['C'])
- print(table.to_string(na_rep=''))
+ table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
+ print(table.to_string(na_rep=""))
Note that ``pivot_table`` is also available as an instance method on DataFrame,
i.e. :meth:`DataFrame.pivot_table`.
@@ -420,7 +465,7 @@ rows and columns:
.. ipython:: python
- df.pivot_table(index=['A', 'B'], columns='C', margins=True, aggfunc=np.std)
+ df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
.. _reshaping.crosstabulations:
@@ -454,30 +499,31 @@ For example:
.. ipython:: python
- foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'
+ foo, bar, dull, shiny, one, two = "foo", "bar", "dull", "shiny", "one", "two"
a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
b = np.array([one, one, two, one, two, one], dtype=object)
c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
- pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
+ pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
If ``crosstab`` receives only two Series, it will provide a frequency table.
.. ipython:: python
- df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
- 'C': [1, 1, np.nan, 1, 1]})
+ df = pd.DataFrame(
+ {"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4], "C": [1, 1, np.nan, 1, 1]}
+ )
df
- pd.crosstab(df['A'], df['B'])
+ pd.crosstab(df["A"], df["B"])
``crosstab`` can also be applied
to ``Categorical`` data.
.. ipython:: python
- foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
- bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
+ foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
+ bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])
pd.crosstab(foo, bar)
If you want to include **all** of the data categories even if the actual data does
@@ -497,13 +543,13 @@ using the ``normalize`` argument:
.. ipython:: python
- pd.crosstab(df['A'], df['B'], normalize=True)
+ pd.crosstab(df["A"], df["B"], normalize=True)
``normalize`` can also normalize values within each row or within each column:
.. ipython:: python
- pd.crosstab(df['A'], df['B'], normalize='columns')
+ pd.crosstab(df["A"], df["B"], normalize="columns")
``crosstab`` can also be passed a third ``Series`` and an aggregation function
(``aggfunc``) that will be applied to the values of the third ``Series`` within
@@ -511,7 +557,7 @@ each group defined by the first two ``Series``:
.. ipython:: python
- pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum)
+ pd.crosstab(df["A"], df["B"], values=df["C"], aggfunc=np.sum)
Adding margins
~~~~~~~~~~~~~~
@@ -520,8 +566,9 @@ Finally, one can also add margins or normalize this output.
.. ipython:: python
- pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum, normalize=True,
- margins=True)
+ pd.crosstab(
+ df["A"], df["B"], values=df["C"], aggfunc=np.sum, normalize=True, margins=True
+ )
.. _reshaping.tile:
.. _reshaping.tile.cut:
@@ -565,19 +612,19 @@ values, can derive a ``DataFrame`` containing ``k`` columns of 1s and 0s using
.. ipython:: python
- df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
+ df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
- pd.get_dummies(df['key'])
+ pd.get_dummies(df["key"])
Sometimes it's useful to prefix the column names, for example when merging the result
with the original ``DataFrame``:
.. ipython:: python
- dummies = pd.get_dummies(df['key'], prefix='key')
+ dummies = pd.get_dummies(df["key"], prefix="key")
dummies
- df[['data1']].join(dummies)
+ df[["data1"]].join(dummies)
This function is often used along with discretization functions like ``cut``:
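A sketch of that pattern (the values and bins below are chosen arbitrarily for
illustration):
.. code-block:: python
    values = np.random.randn(10)
    bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
    pd.get_dummies(pd.cut(values, bins))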
@@ -593,14 +640,13 @@ This function is often used along with discretization functions like ``cut``:
See also :func:`Series.str.get_dummies `.
:func:`get_dummies` also accepts a ``DataFrame``. By default all categorical
-variables (categorical in the statistical sense, those with `object` or
-`categorical` dtype) are encoded as dummy variables.
+variables (categorical in the statistical sense, those with ``object`` or
+``categorical`` dtype) are encoded as dummy variables.
.. ipython:: python
- df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
- 'C': [1, 2, 3]})
+ df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
pd.get_dummies(df)
All non-object columns are included untouched in the output. You can control
@@ -608,7 +654,7 @@ the columns that are encoded with the ``columns`` keyword.
.. ipython:: python
- pd.get_dummies(df, columns=['A'])
+ pd.get_dummies(df, columns=["A"])
Notice that the ``B`` column is still included in the output; it just hasn't
been encoded. You can drop ``B`` before calling ``get_dummies`` if you don't
@@ -625,11 +671,11 @@ the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways:
.. ipython:: python
- simple = pd.get_dummies(df, prefix='new_prefix')
+ simple = pd.get_dummies(df, prefix="new_prefix")
simple
- from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
+ from_list = pd.get_dummies(df, prefix=["from_A", "from_B"])
from_list
- from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
+ from_dict = pd.get_dummies(df, prefix={"B": "from_B", "A": "from_A"})
from_dict
Sometimes it will be useful to only keep k-1 levels of a categorical
@@ -638,7 +684,7 @@ You can switch to this mode by turning on ``drop_first``.
.. ipython:: python
- s = pd.Series(list('abcaa'))
+ s = pd.Series(list("abcaa"))
pd.get_dummies(s)
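    # illustrative sketch: keep only k-1 dummy columns
    pd.get_dummies(s, drop_first=True)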
@@ -648,7 +694,7 @@ When a column contains only one level, it will be omitted in the result.
.. ipython:: python
- df = pd.DataFrame({'A': list('aaaaa'), 'B': list('ababc')})
+ df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})
pd.get_dummies(df)
@@ -659,12 +705,10 @@ To choose another dtype, use the ``dtype`` argument:
.. ipython:: python
- df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})
+ df = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})
pd.get_dummies(df, dtype=bool).dtypes
-.. versionadded:: 0.23.0
-
.. _reshaping.factorize:
@@ -675,7 +719,7 @@ To encode 1-d values as an enumerated type use :func:`~pandas.factorize`:
.. ipython:: python
- x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
+ x = pd.Series(["A", "A", np.nan, "B", 3.14, np.inf])
x
labels, uniques = pd.factorize(x)
labels
@@ -719,11 +763,12 @@ DataFrame will be pivoted in the answers below.
np.random.seed([3, 1415])
n = 20
- cols = np.array(['key', 'row', 'item', 'col'])
- df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4))
- // [2, 1, 2, 1]).astype(str))
+ cols = np.array(["key", "row", "item", "col"])
+ df = cols + pd.DataFrame(
+ (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
+ )
df.columns = cols
- df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+ df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix("val"))
df
@@ -748,24 +793,21 @@ This solution uses :func:`~pandas.pivot_table`. Also note that
.. ipython:: python
- df.pivot_table(
- values='val0', index='row', columns='col', aggfunc='mean')
+ df.pivot_table(values="val0", index="row", columns="col", aggfunc="mean")
Note that we can also replace the missing values by using the ``fill_value``
parameter.
.. ipython:: python
- df.pivot_table(
- values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+ df.pivot_table(values="val0", index="row", columns="col", aggfunc="mean", fill_value=0)
Note that we can pass in other aggregation functions as well. For example,
we can also pass in ``sum``.
.. ipython:: python
- df.pivot_table(
- values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+ df.pivot_table(values="val0", index="row", columns="col", aggfunc="sum", fill_value=0)
Another aggregation we can do is calculate the frequency with which the columns
and rows occur together, a.k.a. a "cross tabulation". To do this, we can pass
@@ -773,7 +815,7 @@ and rows occur together a.k.a. "cross tabulation". To do this, we can pass
.. ipython:: python
- df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+ df.pivot_table(index="row", columns="col", fill_value=0, aggfunc="size")
Pivoting with multiple aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -783,24 +825,21 @@ We can also perform multiple aggregations. For example, to perform both a
.. ipython:: python
- df.pivot_table(
- values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+ df.pivot_table(values="val0", index="row", columns="col", aggfunc=["mean", "sum"])
Note that to aggregate over multiple value columns, we can pass in a list to the
``values`` parameter.
.. ipython:: python
- df.pivot_table(
- values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+ df.pivot_table(values=["val0", "val1"], index="row", columns="col", aggfunc=["mean"])
Note that to subdivide over multiple columns, we can pass in a list to the
``columns`` parameter.
.. ipython:: python
- df.pivot_table(
- values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
+ df.pivot_table(values=["val0"], index="row", columns=["item", "col"], aggfunc=["mean"])
.. _reshaping.explode:
@@ -813,28 +852,28 @@ Sometimes the values in a column are list-like.
.. ipython:: python
- keys = ['panda1', 'panda2', 'panda3']
- values = [['eats', 'shoots'], ['shoots', 'leaves'], ['eats', 'leaves']]
- df = pd.DataFrame({'keys': keys, 'values': values})
+ keys = ["panda1", "panda2", "panda3"]
+ values = [["eats", "shoots"], ["shoots", "leaves"], ["eats", "leaves"]]
+ df = pd.DataFrame({"keys": keys, "values": values})
df
We can 'explode' the ``values`` column, transforming each list-like to a separate row, by using :meth:`~Series.explode`. This will replicate the index values from the original row:
.. ipython:: python
- df['values'].explode()
+ df["values"].explode()
You can also explode the column in the ``DataFrame``.
.. ipython:: python
- df.explode('values')
+ df.explode("values")
:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
.. ipython:: python
- s = pd.Series([[1, 2, 3], 'foo', [], ['a', 'b']])
+ s = pd.Series([[1, 2, 3], "foo", [], ["a", "b"]])
s
s.explode()
@@ -842,12 +881,11 @@ Here is a typical use case. You have comma-separated strings in a column and want
.. ipython:: python
- df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
- {'var1': 'd,e,f', 'var2': 2}])
+ df = pd.DataFrame([{"var1": "a,b,c", "var2": 1}, {"var1": "d,e,f", "var2": 2}])
df
Creating a long-form ``DataFrame`` is now straightforward using ``explode`` and chained operations:
.. ipython:: python
- df.assign(var1=df.var1.str.split(',')).explode('var1')
+ df.assign(var1=df.var1.str.split(",")).explode("var1")
diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst
index cddc3cb2600fd..7f2419bc7f19d 100644
--- a/doc/source/user_guide/scale.rst
+++ b/doc/source/user_guide/scale.rst
@@ -4,7 +4,7 @@
Scaling to large datasets
*************************
-Pandas provides data structures for in-memory analytics, which makes using pandas
+pandas provides data structures for in-memory analytics, which makes using pandas
to analyze datasets that are larger than memory somewhat tricky. Even datasets
that are a sizable fraction of memory become unwieldy, as some pandas operations need
to make intermediate copies.
@@ -13,7 +13,7 @@ This document provides a few recommendations for scaling your analysis to larger
It's a complement to :ref:`enhancingperf`, which focuses on speeding up analysis
for datasets that fit in memory.
-But first, it's worth considering *not using pandas*. Pandas isn't the right
+But first, it's worth considering *not using pandas*. pandas isn't the right
tool for all situations. If you're working with very large datasets and a tool
like PostgreSQL fits your needs, then you should probably be using that.
Assuming you want or need the expressiveness and power of pandas, let's carry on.
@@ -72,7 +72,7 @@ Option 1 loads in all the data and then filters to what we need.
.. ipython:: python
- columns = ['id_0', 'name_0', 'x_0', 'y_0']
+ columns = ["id_0", "name_0", "x_0", "y_0"]
pd.read_parquet("timeseries_wide.parquet")[columns]
@@ -123,7 +123,7 @@ space-efficient integers to know which specific name is used in each row.
.. ipython:: python
ts2 = ts.copy()
- ts2['name'] = ts2['name'].astype('category')
+ ts2["name"] = ts2["name"].astype("category")
ts2.memory_usage(deep=True)
We can go a bit further and downcast the numeric columns to their smallest types
@@ -131,8 +131,8 @@ using :func:`pandas.to_numeric`.
.. ipython:: python
- ts2['id'] = pd.to_numeric(ts2['id'], downcast='unsigned')
- ts2[['x', 'y']] = ts2[['x', 'y']].apply(pd.to_numeric, downcast='float')
+ ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
+ ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")
ts2.dtypes
.. ipython:: python
@@ -141,8 +141,7 @@ using :func:`pandas.to_numeric`.
.. ipython:: python
- reduction = (ts2.memory_usage(deep=True).sum()
- / ts.memory_usage(deep=True).sum())
+ reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()
print(f"{reduction:0.2f}")
In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
@@ -174,13 +173,13 @@ files. Each file in the directory represents a different year of the entire data
import pathlib
N = 12
- starts = [f'20{i:>02d}-01-01' for i in range(N)]
- ends = [f'20{i:>02d}-12-13' for i in range(N)]
+ starts = [f"20{i:>02d}-01-01" for i in range(N)]
+ ends = [f"20{i:>02d}-12-13" for i in range(N)]
pathlib.Path("data/timeseries").mkdir(exist_ok=True)
for i, (start, end) in enumerate(zip(starts, ends)):
- ts = _make_timeseries(start=start, end=end, freq='1T', seed=i)
+ ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
@@ -214,8 +213,8 @@ work for arbitrary-sized datasets.
for path in files:
# Only one dataframe is in memory at a time...
df = pd.read_parquet(path)
- # ... plus a small Series `counts`, which is updated.
- counts = counts.add(df['name'].value_counts(), fill_value=0)
+ # ... plus a small Series ``counts``, which is updated.
+ counts = counts.add(df["name"].value_counts(), fill_value=0)
counts.astype(int)
Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
@@ -231,7 +230,7 @@ different library that implements these out-of-core algorithms for you.
Use other libraries
-------------------
-Pandas is just one library offering a DataFrame API. Because of its popularity,
+pandas is just one library offering a DataFrame API. Because of its popularity,
pandas' API has become something of a standard that other libraries implement.
The pandas documentation maintains a list of libraries implementing a DataFrame API
in :ref:`our ecosystem page `.
@@ -260,7 +259,7 @@ Inspecting the ``ddf`` object, we see a few things
* There are new attributes like ``.npartitions`` and ``.divisions``
The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many **Pandas** DataFrames. A single method call on a
+DataFrame is made up of many pandas DataFrames. A single method call on a
Dask DataFrame ends up making many pandas method calls, and Dask knows how to
coordinate everything to get the result.
@@ -278,8 +277,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
.. ipython:: python
ddf
- ddf['name']
- ddf['name'].value_counts()
+ ddf["name"]
+ ddf["name"].value_counts()
Each of these calls is instant because the result isn't being computed yet.
We're just building up a list of computations to do when someone needs the
@@ -291,7 +290,7 @@ To get the actual result you can call ``.compute()``.
.. ipython:: python
- %time ddf['name'].value_counts().compute()
+ %time ddf["name"].value_counts().compute()
At that point, you get back the same thing you'd get with pandas, in this case
a concrete pandas Series with the count of each ``name``.
@@ -324,7 +323,7 @@ a familiar groupby aggregation.
.. ipython:: python
- %time ddf.groupby('name')[['x', 'y']].mean().compute().head()
+ %time ddf.groupby("name")[["x", "y"]].mean().compute().head()
The grouping and aggregation are done out-of-core and in parallel.
@@ -336,8 +335,8 @@ we need to supply the divisions manually.
.. ipython:: python
N = 12
- starts = [f'20{i:>02d}-01-01' for i in range(N)]
- ends = [f'20{i:>02d}-12-13' for i in range(N)]
+ starts = [f"20{i:>02d}-01-01" for i in range(N)]
+ ends = [f"20{i:>02d}-12-13" for i in range(N)]
divisions = tuple(pd.to_datetime(starts)) + (pd.Timestamp(ends[-1]),)
ddf.divisions = divisions
@@ -347,9 +346,9 @@ Now we can do things like fast random access with ``.loc``.
.. ipython:: python
- ddf.loc['2002-01-01 12:01':'2002-01-01 12:05'].compute()
+ ddf.loc["2002-01-01 12:01":"2002-01-01 12:05"].compute()
-Dask knows to just look in the 3rd partition for selecting values in `2002`. It
+Dask knows to just look in the 3rd partition for selecting values in 2002. It
doesn't need to look at any other data.
Many workflows involve a large amount of data and processing it in a way that
@@ -362,7 +361,7 @@ out of memory. At that point it's just a regular pandas object.
:okwarning:
@savefig dask_resample.png
- ddf[['x', 'y']].resample("1D").mean().cumsum().compute().plot()
+ ddf[["x", "y"]].resample("1D").mean().cumsum().compute().plot()
These Dask examples have all been done using multiple processes on a single
machine. Dask can be `deployed on a cluster
diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst
index ca8e9a2f313f6..3156e3088d860 100644
--- a/doc/source/user_guide/sparse.rst
+++ b/doc/source/user_guide/sparse.rst
@@ -6,7 +6,7 @@
Sparse data structures
**********************
-Pandas provides data structures for efficiently storing sparse data.
+pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical "mostly 0" sense. Rather, you can view these
objects as being "compressed" where any data matching a specific value (``NaN`` / missing value, though any value
can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
@@ -87,14 +87,15 @@ The :attr:`SparseArray.dtype` property stores two pieces of information
sparr.dtype
-A :class:`SparseDtype` may be constructed by passing each of these
+A :class:`SparseDtype` may be constructed by passing only a dtype
.. ipython:: python
pd.SparseDtype(np.dtype('datetime64[ns]'))
-The default fill value for a given NumPy dtype is the "missing" value for that dtype,
-though it may be overridden.
+in which case a default fill value will be used (for NumPy dtypes this is often the
+"missing" value for that dtype). To override this default an explicit fill value may be
+passed instead
.. ipython:: python
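    # illustrative sketch: an explicit fill value overrides the default
    pd.SparseDtype(np.dtype("datetime64[ns]"), fill_value=pd.Timestamp("2017-01-01"))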
@@ -115,7 +116,7 @@ Sparse accessor
.. versionadded:: 0.24.0
-Pandas provides a ``.sparse`` accessor, similar to ``.str`` for string data, ``.cat``
+pandas provides a ``.sparse`` accessor, similar to ``.str`` for string data, ``.cat``
for categorical data, and ``.dt`` for datetime-like data. This namespace provides
attributes and methods that are specific to sparse data.
@@ -302,14 +303,17 @@ The method requires a ``MultiIndex`` with two or more levels.
.. ipython:: python
s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
- s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
- (1, 2, 'a', 1),
- (1, 1, 'b', 0),
- (1, 1, 'b', 1),
- (2, 1, 'b', 0),
- (2, 1, 'b', 1)],
- names=['A', 'B', 'C', 'D'])
- s
+ s.index = pd.MultiIndex.from_tuples(
+ [
+ (1, 2, "a", 0),
+ (1, 2, "a", 1),
+ (1, 1, "b", 0),
+ (1, 1, "b", 1),
+ (2, 1, "b", 0),
+ (2, 1, "b", 1),
+ ],
+ names=["A", "B", "C", "D"],
+ )
ss = s.astype('Sparse')
ss
@@ -317,9 +321,10 @@ In the example below, we transform the ``Series`` to a sparse representation of
.. ipython:: python
- A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
- column_levels=['C', 'D'],
- sort_labels=True)
+ A, rows, columns = ss.sparse.to_coo(
+ row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
+ )
+
A
A.todense()
@@ -330,9 +335,9 @@ Specifying different row and column labels (and not sorting them) yields a diffe
.. ipython:: python
- A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B', 'C'],
- column_levels=['D'],
- sort_labels=False)
+ A, rows, columns = ss.sparse.to_coo(
+ row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
+ )
A
A.todense()
@@ -344,8 +349,7 @@ A convenience method :meth:`Series.sparse.from_coo` is implemented for creating
.. ipython:: python
from scipy import sparse
- A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
- shape=(3, 4))
+ A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
A
A.todense()
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb
index fd8dda4fe365e..12dd72f761408 100644
--- a/doc/source/user_guide/style.ipynb
+++ b/doc/source/user_guide/style.ipynb
@@ -141,7 +141,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In this case, the cell's style depends only on it's own value.\n",
+ "In this case, the cell's style depends only on its own value.\n",
"That means we should use the `Styler.applymap` method which works elementwise."
]
},
@@ -793,7 +793,7 @@
"source": [
"The next option you have are \"table styles\".\n",
"These are styles that apply to the table as a whole, but don't look at the data.\n",
- "Certain sytlings, including pseudo-selectors like `:hover` can only be used this way."
+ "Certain stylings, including pseudo-selectors like `:hover` can only be used this way."
]
},
{
diff --git a/doc/source/user_guide/text.rst b/doc/source/user_guide/text.rst
index 3408b98b3179d..2ada09117273d 100644
--- a/doc/source/user_guide/text.rst
+++ b/doc/source/user_guide/text.rst
@@ -46,20 +46,20 @@ infer a list of strings to
.. ipython:: python
- pd.Series(['a', 'b', 'c'])
+ pd.Series(["a", "b", "c"])
To explicitly request ``string`` dtype, specify the ``dtype``
.. ipython:: python
- pd.Series(['a', 'b', 'c'], dtype="string")
- pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
+ pd.Series(["a", "b", "c"], dtype="string")
+ pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Or ``astype`` after the ``Series`` or ``DataFrame`` is created
.. ipython:: python
- s = pd.Series(['a', 'b', 'c'])
+ s = pd.Series(["a", "b", "c"])
s
s.astype("string")
@@ -71,7 +71,7 @@ it will be converted to ``string`` dtype:
.. ipython:: python
- s = pd.Series(['a', 2, np.nan], dtype="string")
+ s = pd.Series(["a", 2, np.nan], dtype="string")
s
type(s[1])
@@ -147,15 +147,16 @@ the equivalent (scalar) built-in string methods:
.. ipython:: python
- s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
- dtype="string")
+ s = pd.Series(
+ ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
+ )
s.str.lower()
s.str.upper()
s.str.len()
.. ipython:: python
- idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
+ idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
idx.str.strip()
idx.str.lstrip()
idx.str.rstrip()
@@ -166,8 +167,9 @@ leading or trailing whitespace:
.. ipython:: python
- df = pd.DataFrame(np.random.randn(3, 2),
- columns=[' Column A ', ' Column B '], index=range(3))
+ df = pd.DataFrame(
+ np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
+ )
df
Since ``df.columns`` is an Index object, we can use the ``.str`` accessor
@@ -183,7 +185,7 @@ and replacing any remaining whitespaces with underscores:
.. ipython:: python
- df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
+ df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df
.. note::
@@ -221,21 +223,21 @@ Methods like ``split`` return a Series of lists:
.. ipython:: python
- s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
- s2.str.split('_')
+ s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
+ s2.str.split("_")
Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
.. ipython:: python
- s2.str.split('_').str.get(1)
- s2.str.split('_').str[1]
+ s2.str.split("_").str.get(1)
+ s2.str.split("_").str[1]
It is easy to expand this to return a DataFrame using ``expand``.
.. ipython:: python
- s2.str.split('_', expand=True)
+ s2.str.split("_", expand=True)
When the original ``Series`` has :class:`StringDtype`, the output columns will all
be :class:`StringDtype` as well.
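One way to check this (a sketch reusing the ``s2`` defined above) is to inspect
the dtypes of the expanded frame:
.. code-block:: python
    # each expanded column keeps the StringDtype
    s2.str.split("_", expand=True).dtypes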
@@ -244,45 +246,43 @@ It is also possible to limit the number of splits:
.. ipython:: python
- s2.str.split('_', expand=True, n=1)
+ s2.str.split("_", expand=True, n=1)
``rsplit`` is similar to ``split`` except it works in the reverse direction,
i.e., from the end of the string to the beginning of the string:
.. ipython:: python
- s2.str.rsplit('_', expand=True, n=1)
+ s2.str.rsplit("_", expand=True, n=1)
``replace`` by default replaces `regular expressions
`__:
.. ipython:: python
- s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
- '', np.nan, 'CABA', 'dog', 'cat'],
- dtype="string")
+ s3 = pd.Series(
+ ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"], dtype="string"
+ )
s3
- s3.str.replace('^.a|dog', 'XX-XX ', case=False)
+ s3.str.replace("^.a|dog", "XX-XX ", case=False)
Some caution must be taken when dealing with regular expressions! For example, the
following code will cause trouble because of the regular expression meaning of
-`$`:
+``$``:
.. ipython:: python
# Consider the following badly formatted financial data
- dollars = pd.Series(['12', '-$10', '$10,000'], dtype="string")
+ dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
# This does what you'd naively expect:
- dollars.str.replace('$', '')
+ dollars.str.replace("$", "")
# But this doesn't:
- dollars.str.replace('-$', '-')
+ dollars.str.replace("-$", "-")
# We need to escape the special character (for >1 len patterns)
- dollars.str.replace(r'-\$', '-')
-
-.. versionadded:: 0.23.0
+ dollars.str.replace(r"-\$", "-")
If you do want literal replacement of a string (equivalent to
:meth:`str.replace`), you can set the optional ``regex`` parameter to
@@ -292,8 +292,8 @@ and ``repl`` must be strings:
.. ipython:: python
# These lines are equivalent
- dollars.str.replace(r'-\$', '-')
- dollars.str.replace('-$', '-', regex=False)
+ dollars.str.replace(r"-\$", "-")
+ dollars.str.replace("-$", "-", regex=False)
The ``replace`` method can also take a callable as replacement. It is called
on every ``pat`` using :func:`re.sub`. The callable should expect one
@@ -302,22 +302,24 @@ positional argument (a regex object) and return a string.
.. ipython:: python
# Reverse every lowercase alphabetic word
- pat = r'[a-z]+'
+ pat = r"[a-z]+"
+
def repl(m):
return m.group(0)[::-1]
- pd.Series(['foo 123', 'bar baz', np.nan],
- dtype="string").str.replace(pat, repl)
+
+ pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(pat, repl)
# Using regex groups
pat = r"(?P\w+) (?P\w+) (?P\w+)"
+
def repl(m):
- return m.group('two').swapcase()
+ return m.group("two").swapcase()
- pd.Series(['Foo Bar Baz', np.nan],
- dtype="string").str.replace(pat, repl)
+
+ pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(pat, repl)
The ``replace`` method also accepts a compiled regular expression object
from :func:`re.compile` as a pattern. All flags should be included in the
@@ -326,8 +328,9 @@ compiled regular expression object.
.. ipython:: python
import re
- regex_pat = re.compile(r'^.a|dog', flags=re.IGNORECASE)
- s3.str.replace(regex_pat, 'XX-XX ')
+
+ regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
+ s3.str.replace(regex_pat, "XX-XX ")
Including a ``flags`` argument when calling ``replace`` with a compiled
regular expression object will raise a ``ValueError``.
@@ -354,8 +357,8 @@ The content of a ``Series`` (or ``Index``) can be concatenated:
.. ipython:: python
- s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
- s.str.cat(sep=',')
+ s = pd.Series(["a", "b", "c", "d"], dtype="string")
+ s.str.cat(sep=",")
If not specified, the keyword ``sep`` for the separator defaults to the empty string, ``sep=''``:
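For instance, with the ``s`` defined above, a minimal sketch is:
.. code-block:: python
    # all elements joined with the default empty separator -> "abcd"
    s.str.cat()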
@@ -367,9 +370,9 @@ By default, missing values are ignored. Using ``na_rep``, they can be given a re
.. ipython:: python
- t = pd.Series(['a', 'b', np.nan, 'd'], dtype="string")
- t.str.cat(sep=',')
- t.str.cat(sep=',', na_rep='-')
+ t = pd.Series(["a", "b", np.nan, "d"], dtype="string")
+ t.str.cat(sep=",")
+ t.str.cat(sep=",", na_rep="-")
Concatenating a Series and something list-like into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -378,20 +381,18 @@ The first argument to :meth:`~Series.str.cat` can be a list-like object, provide
.. ipython:: python
- s.str.cat(['A', 'B', 'C', 'D'])
+ s.str.cat(["A", "B", "C", "D"])
Missing values on either side will result in missing values in the result as well, *unless* ``na_rep`` is specified:
.. ipython:: python
s.str.cat(t)
- s.str.cat(t, na_rep='-')
+ s.str.cat(t, na_rep="-")
Concatenating a Series and something array-like into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. versionadded:: 0.23.0
-
The parameter ``others`` can also be two-dimensional. In this case, the number of rows must match the length of the calling ``Series`` (or ``Index``).
.. ipython:: python
@@ -399,25 +400,22 @@ The parameter ``others`` can also be two-dimensional. In this case, the number o
d = pd.concat([t, s], axis=1)
s
d
- s.str.cat(d, na_rep='-')
+ s.str.cat(d, na_rep="-")
Concatenating a Series and an indexed object into a Series, with alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. versionadded:: 0.23.0
-
For concatenation with a ``Series`` or ``DataFrame``, it is possible to align the indexes before concatenation by setting
the ``join`` keyword.
.. ipython:: python
:okwarning:
- u = pd.Series(['b', 'd', 'a', 'c'], index=[1, 3, 0, 2],
- dtype="string")
+ u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")
s
u
s.str.cat(u)
- s.str.cat(u, join='left')
+ s.str.cat(u, join="left")
.. warning::
@@ -429,12 +427,11 @@ In particular, alignment also means that the different lengths do not need to co
.. ipython:: python
- v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4],
- dtype="string")
+ v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")
s
v
- s.str.cat(v, join='left', na_rep='-')
- s.str.cat(v, join='outer', na_rep='-')
+ s.str.cat(v, join="left", na_rep="-")
+ s.str.cat(v, join="outer", na_rep="-")
The same alignment can be used when ``others`` is a ``DataFrame``:
@@ -443,7 +440,7 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
f = d.loc[[3, 2, 1, 0], :]
s
f
- s.str.cat(f, join='left', na_rep='-')
+ s.str.cat(f, join="left", na_rep="-")
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -455,7 +452,7 @@ can be combined in a list-like container (including iterators, ``dict``-views, e
s
u
- s.str.cat([u, u.to_numpy()], join='left')
+ s.str.cat([u, u.to_numpy()], join="left")
All elements without an index (e.g. ``np.ndarray``) within the passed list-like must match in length to the calling ``Series`` (or ``Index``),
but ``Series`` and ``Index`` may have arbitrary length (as long as alignment is not disabled with ``join=None``):
@@ -463,7 +460,7 @@ but ``Series`` and ``Index`` may have arbitrary length (as long as alignment is
.. ipython:: python
v
- s.str.cat([v, u, u.to_numpy()], join='outer', na_rep='-')
+ s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
If using ``join='right'`` on a list-like of ``others`` that contains different indexes,
the union of these indexes will be used as the basis for the final concatenation:
@@ -472,7 +469,7 @@ the union of these indexes will be used as the basis for the final concatenation
u.loc[[3]]
v.loc[[-1, 0]]
- s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join='right', na_rep='-')
+ s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Indexing with ``.str``
----------------------
@@ -485,9 +482,9 @@ of the string, the result will be a ``NaN``.
.. ipython:: python
- s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
- 'CABA', 'dog', 'cat'],
- dtype="string")
+ s = pd.Series(
+ ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
+ )
s.str[0]
s.str[1]
@@ -518,8 +515,7 @@ DataFrame with one column per group.
.. ipython:: python
- pd.Series(['a1', 'b2', 'c3'],
- dtype="string").str.extract(r'([ab])(\d)', expand=False)
+ pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"([ab])(\d)", expand=False)
Elements that do not match return a row filled with ``NaN``. Thus, a
Series of messy strings can be "converted" into a like-indexed Series
@@ -532,16 +528,15 @@ Named groups like
.. ipython:: python
- pd.Series(['a1', 'b2', 'c3'],
- dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
- expand=False)
+ pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
+ r"(?P<letter>[ab])(?P<digit>\d)", expand=False
+ )
and optional groups like
.. ipython:: python
- pd.Series(['a1', 'b2', '3'],
- dtype="string").str.extract(r'([ab])?(\d)', expand=False)
+ pd.Series(["a1", "b2", "3"], dtype="string").str.extract(r"([ab])?(\d)", expand=False)
can also be used. Note that any capture group names in the regular
expression will be used for column names; otherwise capture group
@@ -552,23 +547,20 @@ with one column if ``expand=True``.
.. ipython:: python
- pd.Series(['a1', 'b2', 'c3'],
- dtype="string").str.extract(r'[ab](\d)', expand=True)
+ pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
It returns a Series if ``expand=False``.
.. ipython:: python
- pd.Series(['a1', 'b2', 'c3'],
- dtype="string").str.extract(r'[ab](\d)', expand=False)
+ pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``.
.. ipython:: python
- s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"],
- dtype="string")
+ s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
s
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
@@ -613,10 +605,9 @@ Unlike ``extract`` (which returns only the first match),
.. ipython:: python
- s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
- dtype="string")
+ s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
s
- two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
+ two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"
s.str.extract(two_groups, expand=True)
the ``extractall`` method returns every match. The result of
@@ -632,7 +623,7 @@ When each subject string in the Series has exactly one match,
.. ipython:: python
- s = pd.Series(['a3', 'b3', 'c2'], dtype="string")
+ s = pd.Series(["a3", "b3", "c2"], dtype="string")
s
then ``extractall(pat).xs(0, level='match')`` gives the same result as
@@ -663,23 +654,20 @@ You can check whether elements contain a pattern:
.. ipython:: python
- pattern = r'[0-9][a-z]'
- pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
- dtype="string").str.contains(pattern)
+ pattern = r"[0-9][a-z]"
+ pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string").str.contains(pattern)
Or whether elements match a pattern:
.. ipython:: python
- pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
- dtype="string").str.match(pattern)
+ pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string").str.match(pattern)
.. versionadded:: 1.1.0
.. ipython:: python
- pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
- dtype="string").str.fullmatch(pattern)
+ pd.Series(["1", "2", "3a", "3b", "03c", "4dx"], dtype="string").str.fullmatch(pattern)
.. note::
@@ -701,9 +689,10 @@ True or False:
.. ipython:: python
- s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
- dtype="string")
- s4.str.contains('A', na=False)
+ s4 = pd.Series(
+ ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
+ )
+ s4.str.contains("A", na=False)
.. _text.indicator:
@@ -715,15 +704,15 @@ For example if they are separated by a ``'|'``:
.. ipython:: python
- s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string")
- s.str.get_dummies(sep='|')
+ s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")
+ s.str.get_dummies(sep="|")
String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.
.. ipython:: python
- idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])
- idx.str.get_dummies(sep='|')
+ idx = pd.Index(["a", "a|b", np.nan, "a|c"])
+ idx.str.get_dummies(sep="|")
See also :func:`~pandas.get_dummies`.
diff --git a/doc/source/user_guide/timedeltas.rst b/doc/source/user_guide/timedeltas.rst
index 3439a0a4c13c7..cb265d34229dd 100644
--- a/doc/source/user_guide/timedeltas.rst
+++ b/doc/source/user_guide/timedeltas.rst
@@ -18,44 +18,40 @@ parsing, and attributes.
Parsing
-------
-You can construct a ``Timedelta`` scalar through various arguments:
+You can construct a ``Timedelta`` scalar through various arguments, including `ISO 8601 Duration`_ strings.
.. ipython:: python
import datetime
# strings
- pd.Timedelta('1 days')
- pd.Timedelta('1 days 00:00:00')
- pd.Timedelta('1 days 2 hours')
- pd.Timedelta('-1 days 2 min 3us')
+ pd.Timedelta("1 days")
+ pd.Timedelta("1 days 00:00:00")
+ pd.Timedelta("1 days 2 hours")
+ pd.Timedelta("-1 days 2 min 3us")
# like datetime.timedelta
# note: these MUST be specified as keyword arguments
pd.Timedelta(days=1, seconds=1)
# integers with a unit
- pd.Timedelta(1, unit='d')
+ pd.Timedelta(1, unit="d")
# from a datetime.timedelta/np.timedelta64
pd.Timedelta(datetime.timedelta(days=1, seconds=1))
- pd.Timedelta(np.timedelta64(1, 'ms'))
+ pd.Timedelta(np.timedelta64(1, "ms"))
# negative Timedeltas have this string repr
# to be more consistent with datetime.timedelta conventions
- pd.Timedelta('-1us')
+ pd.Timedelta("-1us")
# a NaT
- pd.Timedelta('nan')
- pd.Timedelta('nat')
+ pd.Timedelta("nan")
+ pd.Timedelta("nat")
# ISO 8601 Duration strings
- pd.Timedelta('P0DT0H1M0S')
- pd.Timedelta('P0DT0H0M0.000000123S')
-
-.. versionadded:: 0.23.0
-
- Added constructor for `ISO 8601 Duration`_ strings
+ pd.Timedelta("P0DT0H1M0S")
+ pd.Timedelta("P0DT0H0M0.000000123S")
:ref:`DateOffsets <timeseries.offsets>` (``Day, Hour, Minute, Second, Milli, Micro, Nano``) can also be used in construction.
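
For quick reference (an editorial sketch, not part of the original changeset), a
single tick-based offset passed to the constructor yields the equivalent
``Timedelta``:

.. code-block:: python

    # constructing a Timedelta from DateOffset (Tick) objects -- illustrative only
    pd.Timedelta(pd.offsets.Hour(2))     # Timedelta('0 days 02:00:00')
    pd.Timedelta(pd.offsets.Second(30))  # Timedelta('0 days 00:00:30')
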
@@ -67,8 +63,9 @@ Further, operations among the scalars yield another scalar ``Timedelta``.
.. ipython:: python
- pd.Timedelta(pd.offsets.Day(2)) + pd.Timedelta(pd.offsets.Second(2)) +\
- pd.Timedelta('00:00:00.000123')
+ pd.Timedelta(pd.offsets.Day(2)) + pd.Timedelta(pd.offsets.Second(2)) + pd.Timedelta(
+ "00:00:00.000123"
+ )
to_timedelta
~~~~~~~~~~~~
@@ -82,28 +79,28 @@ You can parse a single string to a Timedelta:
.. ipython:: python
- pd.to_timedelta('1 days 06:05:01.00003')
- pd.to_timedelta('15.5us')
+ pd.to_timedelta("1 days 06:05:01.00003")
+ pd.to_timedelta("15.5us")
or a list/array of strings:
.. ipython:: python
- pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
+ pd.to_timedelta(["1 days 06:05:01.00003", "15.5us", "nan"])
The ``unit`` keyword argument specifies the unit of the Timedelta:
.. ipython:: python
- pd.to_timedelta(np.arange(5), unit='s')
- pd.to_timedelta(np.arange(5), unit='d')
+ pd.to_timedelta(np.arange(5), unit="s")
+ pd.to_timedelta(np.arange(5), unit="d")
.. _timedeltas.limitations:
Timedelta limitations
~~~~~~~~~~~~~~~~~~~~~
-Pandas represents ``Timedeltas`` in nanosecond resolution using
+pandas represents ``Timedeltas`` in nanosecond resolution using
64 bit integers. As such, the 64 bit integer limits determine
the ``Timedelta`` limits.
@@ -122,11 +119,11 @@ subtraction operations on ``datetime64[ns]`` Series, or ``Timestamps``.
.. ipython:: python
- s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
+ s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
- df = pd.DataFrame({'A': s, 'B': td})
+ df = pd.DataFrame({"A": s, "B": td})
df
- df['C'] = df['A'] + df['B']
+ df["C"] = df["A"] + df["B"]
df
df.dtypes
@@ -169,10 +166,10 @@ Operands can also appear in a reversed order (a singular object operated with a
.. ipython:: python
- A = s - pd.Timestamp('20120101') - pd.Timedelta('00:05:05')
- B = s - pd.Series(pd.date_range('2012-1-2', periods=3, freq='D'))
+ A = s - pd.Timestamp("20120101") - pd.Timedelta("00:05:05")
+ B = s - pd.Series(pd.date_range("2012-1-2", periods=3, freq="D"))
- df = pd.DataFrame({'A': A, 'B': B})
+ df = pd.DataFrame({"A": A, "B": B})
df
df.min()
@@ -196,17 +193,17 @@ You can fillna on timedeltas, passing a timedelta to get a particular value.
.. ipython:: python
y.fillna(pd.Timedelta(0))
- y.fillna(pd.Timedelta(10, unit='s'))
- y.fillna(pd.Timedelta('-1 days, 00:00:05'))
+ y.fillna(pd.Timedelta(10, unit="s"))
+ y.fillna(pd.Timedelta("-1 days, 00:00:05"))
You can also negate, multiply and use ``abs`` on ``Timedeltas``:
.. ipython:: python
- td1 = pd.Timedelta('-1 days 2 hours 3 seconds')
+ td1 = pd.Timedelta("-1 days 2 hours 3 seconds")
td1
-1 * td1
- - td1
+ -td1
abs(td1)
.. _timedeltas.timedeltas_reductions:
@@ -219,12 +216,13 @@ Numeric reduction operation for ``timedelta64[ns]`` will return ``Timedelta`` ob
.. ipython:: python
- y2 = pd.Series(pd.to_timedelta(['-1 days +00:00:05', 'nat',
- '-1 days +00:00:05', '1 days']))
+ y2 = pd.Series(
+ pd.to_timedelta(["-1 days +00:00:05", "nat", "-1 days +00:00:05", "1 days"])
+ )
y2
y2.mean()
y2.median()
- y2.quantile(.1)
+ y2.quantile(0.1)
y2.sum()
.. _timedeltas.timedeltas_convert:
@@ -238,8 +236,8 @@ Note that division by the NumPy scalar is true division, while astyping is equiv
.. ipython:: python
- december = pd.Series(pd.date_range('20121201', periods=4))
- january = pd.Series(pd.date_range('20130101', periods=4))
+ december = pd.Series(pd.date_range("20121201", periods=4))
+ january = pd.Series(pd.date_range("20130101", periods=4))
td = january - december
td[2] += datetime.timedelta(minutes=5, seconds=3)
@@ -247,15 +245,15 @@ Note that division by the NumPy scalar is true division, while astyping is equiv
td
# to days
- td / np.timedelta64(1, 'D')
- td.astype('timedelta64[D]')
+ td / np.timedelta64(1, "D")
+ td.astype("timedelta64[D]")
# to seconds
- td / np.timedelta64(1, 's')
- td.astype('timedelta64[s]')
+ td / np.timedelta64(1, "s")
+ td.astype("timedelta64[s]")
# to months (these are constant months)
- td / np.timedelta64(1, 'M')
+ td / np.timedelta64(1, "M")
Dividing or multiplying a ``timedelta64[ns]`` Series by an integer or integer Series
yields another ``timedelta64[ns]`` dtype Series.
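
For instance (an editorial sketch reusing the ``td`` Series defined above),
scaling by an integer keeps the ``timedelta64[ns]`` dtype:

.. code-block:: python

    # multiplying or dividing by an integer preserves the timedelta64[ns] dtype
    (td * 2).dtype   # dtype('<m8[ns]')
    (td / 2).dtype   # dtype('<m8[ns]')
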
@@ -309,7 +307,7 @@ You can access the value of the fields for a scalar ``Timedelta`` directly.
.. ipython:: python
- tds = pd.Timedelta('31 days 5 min 3 sec')
+ tds = pd.Timedelta("31 days 5 min 3 sec")
tds.days
tds.seconds
(-tds).seconds
@@ -329,9 +327,9 @@ You can convert a ``Timedelta`` to an `ISO 8601 Duration`_ string with the
.. ipython:: python
- pd.Timedelta(days=6, minutes=50, seconds=3,
- milliseconds=10, microseconds=10,
- nanoseconds=12).isoformat()
+ pd.Timedelta(
+ days=6, minutes=50, seconds=3, milliseconds=10, microseconds=10, nanoseconds=12
+ ).isoformat()
.. _ISO 8601 Duration: https://en.wikipedia.org/wiki/ISO_8601#Durations
@@ -348,15 +346,21 @@ or ``np.timedelta64`` objects. Passing ``np.nan/pd.NaT/nat`` will represent miss
.. ipython:: python
- pd.TimedeltaIndex(['1 days', '1 days, 00:00:05', np.timedelta64(2, 'D'),
- datetime.timedelta(days=2, seconds=2)])
+ pd.TimedeltaIndex(
+ [
+ "1 days",
+ "1 days, 00:00:05",
+ np.timedelta64(2, "D"),
+ datetime.timedelta(days=2, seconds=2),
+ ]
+ )
The string 'infer' can be passed in order to set the frequency of the index as the
inferred frequency upon creation:
.. ipython:: python
- pd.TimedeltaIndex(['0 days', '10 days', '20 days'], freq='infer')
+ pd.TimedeltaIndex(["0 days", "10 days", "20 days"], freq="infer")
Generating ranges of time deltas
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -367,27 +371,25 @@ calendar day:
.. ipython:: python
- pd.timedelta_range(start='1 days', periods=5)
+ pd.timedelta_range(start="1 days", periods=5)
Various combinations of ``start``, ``end``, and ``periods`` can be used with
``timedelta_range``:
.. ipython:: python
- pd.timedelta_range(start='1 days', end='5 days')
+ pd.timedelta_range(start="1 days", end="5 days")
- pd.timedelta_range(end='10 days', periods=4)
+ pd.timedelta_range(end="10 days", periods=4)
The ``freq`` parameter can be passed a variety of :ref:`frequency aliases <timeseries.offset_aliases>`:
.. ipython:: python
- pd.timedelta_range(start='1 days', end='2 days', freq='30T')
-
- pd.timedelta_range(start='1 days', periods=5, freq='2D5H')
+ pd.timedelta_range(start="1 days", end="2 days", freq="30T")
+ pd.timedelta_range(start="1 days", periods=5, freq="2D5H")
-.. versionadded:: 0.23.0
Specifying ``start``, ``end``, and ``periods`` will generate a range of evenly spaced
timedeltas from ``start`` to ``end`` inclusively, with ``periods`` number of elements
@@ -395,9 +397,9 @@ in the resulting ``TimedeltaIndex``:
.. ipython:: python
- pd.timedelta_range('0 days', '4 days', periods=5)
+ pd.timedelta_range("0 days", "4 days", periods=5)
- pd.timedelta_range('0 days', '4 days', periods=10)
+ pd.timedelta_range("0 days", "4 days", periods=10)
Using the TimedeltaIndex
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -407,23 +409,22 @@ Similarly to other of the datetime-like indices, ``DatetimeIndex`` and ``PeriodI
.. ipython:: python
- s = pd.Series(np.arange(100),
- index=pd.timedelta_range('1 days', periods=100, freq='h'))
+ s = pd.Series(np.arange(100), index=pd.timedelta_range("1 days", periods=100, freq="h"))
s
Selections work similarly, with coercion on string-likes and slices:
.. ipython:: python
- s['1 day':'2 day']
- s['1 day 01:00:00']
- s[pd.Timedelta('1 day 1h')]
+ s["1 day":"2 day"]
+ s["1 day 01:00:00"]
+ s[pd.Timedelta("1 day 1h")]
Furthermore, you can use partial string selection and the range will be inferred:
.. ipython:: python
- s['1 day':'1 day 5 hours']
+ s["1 day":"1 day 5 hours"]
Operations
~~~~~~~~~~
@@ -432,9 +433,9 @@ Finally, the combination of ``TimedeltaIndex`` with ``DatetimeIndex`` allow cert
.. ipython:: python
- tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days'])
+ tdi = pd.TimedeltaIndex(["1 days", pd.NaT, "2 days"])
tdi.to_list()
- dti = pd.date_range('20130101', periods=3)
+ dti = pd.date_range("20130101", periods=3)
dti.to_list()
(dti + tdi).to_list()
(dti - tdi).to_list()
@@ -446,22 +447,22 @@ Similarly to frequency conversion on a ``Series`` above, you can convert these i
.. ipython:: python
- tdi / np.timedelta64(1, 's')
- tdi.astype('timedelta64[s]')
+ tdi / np.timedelta64(1, "s")
+ tdi.astype("timedelta64[s]")
Scalar type ops work as well. These can potentially return a *different* type of index.
.. ipython:: python
# adding a timedelta and a date -> datelike
- tdi + pd.Timestamp('20130101')
+ tdi + pd.Timestamp("20130101")
# subtraction of a date and a timedelta -> datelike
# note that trying to subtract a date from a Timedelta will raise an exception
- (pd.Timestamp('20130101') - tdi).to_list()
+ (pd.Timestamp("20130101") - tdi).to_list()
# timedelta + timedelta -> timedelta
- tdi + pd.Timedelta('10 days')
+ tdi + pd.Timedelta("10 days")
# division can result in a Timedelta if the divisor is an integer
tdi / 2
@@ -478,4 +479,4 @@ Similar to :ref:`timeseries resampling `, we can resample
.. ipython:: python
- s.resample('D').mean()
+ s.resample("D").mean()
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
index 5351c3ee6b624..be2c67521dc5d 100644
--- a/doc/source/user_guide/timeseries.rst
+++ b/doc/source/user_guide/timeseries.rst
@@ -19,42 +19,43 @@ Parsing time series information from various sources and formats
import datetime
- dti = pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01'),
- datetime.datetime(2018, 1, 1)])
+ dti = pd.to_datetime(
+ ["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
+ )
dti
Generate sequences of fixed-frequency dates and time spans
.. ipython:: python
- dti = pd.date_range('2018-01-01', periods=3, freq='H')
+ dti = pd.date_range("2018-01-01", periods=3, freq="H")
dti
Manipulating and converting date times with timezone information
.. ipython:: python
- dti = dti.tz_localize('UTC')
+ dti = dti.tz_localize("UTC")
dti
- dti.tz_convert('US/Pacific')
+ dti.tz_convert("US/Pacific")
Resampling or converting a time series to a particular frequency
.. ipython:: python
- idx = pd.date_range('2018-01-01', periods=5, freq='H')
+ idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.Series(range(len(idx)), index=idx)
ts
- ts.resample('2H').mean()
+ ts.resample("2H").mean()
Performing date and time arithmetic with absolute or relative time increments
.. ipython:: python
- friday = pd.Timestamp('2018-01-05')
+ friday = pd.Timestamp("2018-01-05")
friday.day_name()
# Add 1 day
- saturday = friday + pd.Timedelta('1 day')
+ saturday = friday + pd.Timedelta("1 day")
saturday.day_name()
# Add 1 business day (Friday --> Monday)
monday = friday + pd.offsets.BDay()
@@ -90,13 +91,13 @@ so manipulations can be performed with respect to the time element.
.. ipython:: python
- pd.Series(range(3), index=pd.date_range('2000', freq='D', periods=3))
+ pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
However, :class:`Series` and :class:`DataFrame` can also directly support the time component as data itself.
.. ipython:: python
- pd.Series(pd.date_range('2000', freq='D', periods=3))
+ pd.Series(pd.date_range("2000", freq="D", periods=3))
:class:`Series` and :class:`DataFrame` have extended data type support and functionality for ``datetime``, ``timedelta``
and ``Period`` data when passed into those constructors. ``DateOffset``
@@ -104,9 +105,9 @@ data however will be stored as ``object`` data.
.. ipython:: python
- pd.Series(pd.period_range('1/1/2011', freq='M', periods=3))
+ pd.Series(pd.period_range("1/1/2011", freq="M", periods=3))
pd.Series([pd.DateOffset(1), pd.DateOffset(2)])
- pd.Series(pd.date_range('1/1/2011', freq='M', periods=3))
+ pd.Series(pd.date_range("1/1/2011", freq="M", periods=3))
Lastly, pandas represents null date times, time deltas, and time spans as ``NaT`` which
is useful for representing missing or null date like values and behaves similar
@@ -132,7 +133,7 @@ time.
.. ipython:: python
pd.Timestamp(datetime.datetime(2012, 5, 1))
- pd.Timestamp('2012-05-01')
+ pd.Timestamp("2012-05-01")
pd.Timestamp(2012, 5, 1)
However, in many cases it is more natural to associate things like change
@@ -143,9 +144,9 @@ For example:
.. ipython:: python
- pd.Period('2011-01')
+ pd.Period("2011-01")
- pd.Period('2012-05', freq='D')
+ pd.Period("2012-05", freq="D")
:class:`Timestamp` and :class:`Period` can serve as an index. Lists of
``Timestamp`` and ``Period`` are automatically coerced to :class:`DatetimeIndex`
@@ -153,9 +154,11 @@ and :class:`PeriodIndex` respectively.
.. ipython:: python
- dates = [pd.Timestamp('2012-05-01'),
- pd.Timestamp('2012-05-02'),
- pd.Timestamp('2012-05-03')]
+ dates = [
+ pd.Timestamp("2012-05-01"),
+ pd.Timestamp("2012-05-02"),
+ pd.Timestamp("2012-05-03"),
+ ]
ts = pd.Series(np.random.randn(3), dates)
type(ts.index)
@@ -163,7 +166,7 @@ and :class:`PeriodIndex` respectively.
ts
- periods = [pd.Period('2012-01'), pd.Period('2012-02'), pd.Period('2012-03')]
+ periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]
ts = pd.Series(np.random.randn(3), periods)
@@ -193,18 +196,18 @@ is converted to a ``DatetimeIndex``:
.. ipython:: python
- pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
+ pd.to_datetime(pd.Series(["Jul 31, 2009", "2010-01-10", None]))
- pd.to_datetime(['2005/11/23', '2010.12.31'])
+ pd.to_datetime(["2005/11/23", "2010.12.31"])
If you use dates which start with the day first (i.e. European style),
you can pass the ``dayfirst`` flag:
.. ipython:: python
- pd.to_datetime(['04-01-2012 10:00'], dayfirst=True)
+ pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)
- pd.to_datetime(['14-01-2012', '01-14-2012'], dayfirst=True)
+ pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)
.. warning::
@@ -218,22 +221,24 @@ options like ``dayfirst`` or ``format``, so use ``to_datetime`` if these are req
.. ipython:: python
- pd.to_datetime('2010/11/12')
+ pd.to_datetime("2010/11/12")
- pd.Timestamp('2010/11/12')
+ pd.Timestamp("2010/11/12")
You can also use the ``DatetimeIndex`` constructor directly:
.. ipython:: python
- pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'])
+ pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"])
The string 'infer' can be passed in order to set the frequency of the index as the
inferred frequency upon creation:
.. ipython:: python
- pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], freq='infer')
+ pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"], freq="infer")
+
+.. _timeseries.converting.format:
Providing a format argument
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -243,9 +248,9 @@ This could also potentially speed up the conversion considerably.
.. ipython:: python
- pd.to_datetime('2010/11/12', format='%Y/%m/%d')
+ pd.to_datetime("2010/11/12", format="%Y/%m/%d")
- pd.to_datetime('12-11-2010 00:00', format='%d-%m-%Y %H:%M')
+ pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
For more information on the choices available when specifying the ``format``
option, see the Python `datetime documentation`_.
@@ -259,10 +264,9 @@ You can also pass a ``DataFrame`` of integer or string columns to assemble into
.. ipython:: python
- df = pd.DataFrame({'year': [2015, 2016],
- 'month': [2, 3],
- 'day': [4, 5],
- 'hour': [2, 3]})
+ df = pd.DataFrame(
+ {"year": [2015, 2016], "month": [2, 3], "day": [4, 5], "hour": [2, 3]}
+ )
pd.to_datetime(df)
@@ -270,7 +274,7 @@ You can pass only the columns that you need to assemble.
.. ipython:: python
- pd.to_datetime(df[['year', 'month', 'day']])
+ pd.to_datetime(df[["year", "month", "day"]])
``pd.to_datetime`` looks for standard designations of the datetime component in the column names, including:
@@ -280,24 +284,24 @@ You can pass only the columns that you need to assemble.
Invalid data
~~~~~~~~~~~~
-The default behavior, ``errors='raise'``, is to raise when unparseable:
+The default behavior, ``errors='raise'``, is to raise when unparsable:
.. code-block:: ipython
In [2]: pd.to_datetime(['2009/07/31', 'asd'], errors='raise')
ValueError: Unknown string format
-Pass ``errors='ignore'`` to return the original input when unparseable:
+Pass ``errors='ignore'`` to return the original input when unparsable:
.. ipython:: python
- pd.to_datetime(['2009/07/31', 'asd'], errors='ignore')
+ pd.to_datetime(["2009/07/31", "asd"], errors="ignore")
-Pass ``errors='coerce'`` to convert unparseable data to ``NaT`` (not a time):
+Pass ``errors='coerce'`` to convert unparsable data to ``NaT`` (not a time):
.. ipython:: python
- pd.to_datetime(['2009/07/31', 'asd'], errors='coerce')
+ pd.to_datetime(["2009/07/31", "asd"], errors="coerce")
.. _timeseries.converting.epoch:
@@ -313,23 +317,30 @@ which can be specified. These are computed from the starting point specified by
.. ipython:: python
- pd.to_datetime([1349720105, 1349806505, 1349892905,
- 1349979305, 1350065705], unit='s')
+ pd.to_datetime([1349720105, 1349806505, 1349892905, 1349979305, 1350065705], unit="s")
+
+ pd.to_datetime(
+ [1349720105100, 1349720105200, 1349720105300, 1349720105400, 1349720105500],
+ unit="ms",
+ )
+
+.. note::
+
+ The ``unit`` parameter does not use the same strings as the ``format`` parameter
+ that was discussed :ref:`above <timeseries.converting.format>`. The
+ available units are listed on the documentation for :func:`pandas.to_datetime`.
- pd.to_datetime([1349720105100, 1349720105200, 1349720105300,
- 1349720105400, 1349720105500], unit='ms')
+.. versionchanged:: 1.0.0
Constructing a :class:`Timestamp` or :class:`DatetimeIndex` with an epoch timestamp
-with the ``tz`` argument specified will currently localize the epoch timestamps to UTC
-first then convert the result to the specified time zone. However, this behavior
-is :ref:`deprecated `, and if you have
-epochs in wall time in another timezone, it is recommended to read the epochs
+with the ``tz`` argument specified will raise a ValueError. If you have
+epochs in wall time in another timezone, you can read the epochs
as timezone-naive timestamps and then localize to the appropriate timezone:
.. ipython:: python
- pd.Timestamp(1262347200000000000).tz_localize('US/Pacific')
- pd.DatetimeIndex([1262347200000000000]).tz_localize('US/Pacific')
+ pd.Timestamp(1262347200000000000).tz_localize("US/Pacific")
+ pd.DatetimeIndex([1262347200000000000]).tz_localize("US/Pacific")
.. note::
@@ -345,8 +356,8 @@ as timezone-naive timestamps and then localize to the appropriate timezone:
.. ipython:: python
- pd.to_datetime([1490195805.433, 1490195805.433502912], unit='s')
- pd.to_datetime(1490195805433502912, unit='ns')
+ pd.to_datetime([1490195805.433, 1490195805.433502912], unit="s")
+ pd.to_datetime(1490195805433502912, unit="ns")
.. seealso::
@@ -361,7 +372,7 @@ To invert the operation from above, namely, to convert from a ``Timestamp`` to a
.. ipython:: python
- stamps = pd.date_range('2012-10-08 18:15:05', periods=4, freq='D')
+ stamps = pd.date_range("2012-10-08 18:15:05", periods=4, freq="D")
stamps
We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by the
@@ -369,7 +380,7 @@ We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by
.. ipython:: python
- (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
+ (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
.. _timeseries.origin:
@@ -381,14 +392,14 @@ of a ``DatetimeIndex``. For example, to use 1960-01-01 as the starting date:
.. ipython:: python
- pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
+ pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
The default is set at ``origin='unix'``, which defaults to ``1970-01-01 00:00:00``.
Commonly called 'unix epoch' or POSIX time.
.. ipython:: python
- pd.to_datetime([1, 2, 3], unit='D')
+ pd.to_datetime([1, 2, 3], unit="D")
.. _timeseries.daterange:
@@ -400,9 +411,11 @@ To generate an index with timestamps, you can use either the ``DatetimeIndex`` o
.. ipython:: python
- dates = [datetime.datetime(2012, 5, 1),
- datetime.datetime(2012, 5, 2),
- datetime.datetime(2012, 5, 3)]
+ dates = [
+ datetime.datetime(2012, 5, 1),
+ datetime.datetime(2012, 5, 2),
+ datetime.datetime(2012, 5, 3),
+ ]
# Note the frequency information
index = pd.DatetimeIndex(dates)
@@ -434,9 +447,9 @@ variety of :ref:`frequency aliases `:
.. ipython:: python
- pd.date_range(start, periods=1000, freq='M')
+ pd.date_range(start, periods=1000, freq="M")
- pd.bdate_range(start, periods=250, freq='BQS')
+ pd.bdate_range(start, periods=250, freq="BQS")
``date_range`` and ``bdate_range`` make it easy to generate a range of dates
using various combinations of parameters like ``start``, ``end``, ``periods``,
@@ -445,25 +458,23 @@ of those specified will not be generated:
.. ipython:: python
- pd.date_range(start, end, freq='BM')
+ pd.date_range(start, end, freq="BM")
- pd.date_range(start, end, freq='W')
+ pd.date_range(start, end, freq="W")
pd.bdate_range(end=end, periods=20)
pd.bdate_range(start=start, periods=20)
-.. versionadded:: 0.23.0
-
Specifying ``start``, ``end``, and ``periods`` will generate a range of evenly spaced
dates from ``start`` to ``end`` inclusively, with ``periods`` number of elements in the
resulting ``DatetimeIndex``:
.. ipython:: python
- pd.date_range('2018-01-01', '2018-01-05', periods=5)
+ pd.date_range("2018-01-01", "2018-01-05", periods=5)
- pd.date_range('2018-01-01', '2018-01-05', periods=10)
+ pd.date_range("2018-01-01", "2018-01-05", periods=10)
.. _timeseries.custom-freq-ranges:
@@ -476,13 +487,13 @@ used if a custom frequency string is passed.
.. ipython:: python
- weekmask = 'Mon Wed Fri'
+ weekmask = "Mon Wed Fri"
holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
- pd.bdate_range(start, end, freq='C', weekmask=weekmask, holidays=holidays)
+ pd.bdate_range(start, end, freq="C", weekmask=weekmask, holidays=holidays)
- pd.bdate_range(start, end, freq='CBMS', weekmask=weekmask)
+ pd.bdate_range(start, end, freq="CBMS", weekmask=weekmask)
.. seealso::
@@ -516,7 +527,7 @@ The ``DatetimeIndex`` class contains many time series related optimizations:
* A large range of dates for various offsets are pre-computed and cached
under the hood in order to make generating subsequent date ranges very fast
(just have to grab a slice).
-* Fast shifting using the ``shift`` and ``tshift`` method on pandas objects.
+* Fast shifting using the ``shift`` method on pandas objects.
* Unioning of overlapping ``DatetimeIndex`` objects with the same frequency is
very fast (important for fast data alignment).
* Quick access to date fields via properties such as ``year``, ``month``, etc.
@@ -539,7 +550,7 @@ intelligent functionality like selection, slicing, etc.
.. ipython:: python
- rng = pd.date_range(start, end, freq='BM')
+ rng = pd.date_range(start, end, freq="BM")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.index
ts[:5].index
@@ -554,133 +565,150 @@ Dates and strings that parse to timestamps can be passed as indexing parameters:
.. ipython:: python
- ts['1/31/2011']
+ ts["1/31/2011"]
ts[datetime.datetime(2011, 12, 25):]
- ts['10/31/2011':'12/31/2011']
+ ts["10/31/2011":"12/31/2011"]
To provide convenience for accessing longer time series, you can also pass in
the year or year and month as strings:
.. ipython:: python
- ts['2011']
+ ts["2011"]
- ts['2011-6']
+ ts["2011-6"]
This type of slicing will work on a ``DataFrame`` with a ``DatetimeIndex`` as well. Since the
partial string selection is a form of label slicing, the endpoints **will be** included. This
would include matching times on an included date:
+.. warning::
+
+ Indexing ``DataFrame`` rows with strings is deprecated in pandas 1.2.0 and will be removed in a future version. Use ``frame.loc[dtstring]`` instead.
+
.. ipython:: python
+ :okwarning:
- dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
- index=pd.date_range('20130101', periods=100000, freq='T'))
+ dft = pd.DataFrame(
+ np.random.randn(100000, 1),
+ columns=["A"],
+ index=pd.date_range("20130101", periods=100000, freq="T"),
+ )
dft
- dft['2013']
+ dft["2013"]
This starts on the very first time in the month, and includes the last date and
time for the month:
.. ipython:: python
+ :okwarning:
- dft['2013-1':'2013-2']
+ dft["2013-1":"2013-2"]
This specifies a stop time **that includes all of the times on the last day**:
.. ipython:: python
+ :okwarning:
- dft['2013-1':'2013-2-28']
+ dft["2013-1":"2013-2-28"]
This specifies an **exact** stop time (and is not the same as the above):
.. ipython:: python
+ :okwarning:
- dft['2013-1':'2013-2-28 00:00:00']
+ dft["2013-1":"2013-2-28 00:00:00"]
We are stopping on the included end-point as it is part of the index:
.. ipython:: python
+ :okwarning:
- dft['2013-1-15':'2013-1-15 12:30:00']
+ dft["2013-1-15":"2013-1-15 12:30:00"]
``DatetimeIndex`` partial string indexing also works on a ``DataFrame`` with a ``MultiIndex``:
.. ipython:: python
- dft2 = pd.DataFrame(np.random.randn(20, 1),
- columns=['A'],
- index=pd.MultiIndex.from_product(
- [pd.date_range('20130101', periods=10, freq='12H'),
- ['a', 'b']]))
+ dft2 = pd.DataFrame(
+ np.random.randn(20, 1),
+ columns=["A"],
+ index=pd.MultiIndex.from_product(
+ [pd.date_range("20130101", periods=10, freq="12H"), ["a", "b"]]
+ ),
+ )
dft2
- dft2.loc['2013-01-05']
+ dft2.loc["2013-01-05"]
idx = pd.IndexSlice
dft2 = dft2.swaplevel(0, 1).sort_index()
- dft2.loc[idx[:, '2013-01-05'], :]
+ dft2.loc[idx[:, "2013-01-05"], :]
.. versionadded:: 0.25.0
Slicing with string indexing also honors UTC offset.
.. ipython:: python
+ :okwarning:
- df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
+ df = pd.DataFrame([0], index=pd.DatetimeIndex(["2019-01-01"], tz="US/Pacific"))
df
- df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
+ df["2019-01-01 12:00:00+04:00":"2019-01-01 13:00:00+04:00"]
.. _timeseries.slice_vs_exact_match:
Slice vs. exact match
~~~~~~~~~~~~~~~~~~~~~
-.. versionchanged:: 0.20.0
-
The same string used as an indexing parameter can be treated either as a slice or as an exact match depending on the resolution of the index. If the string is less accurate than the index, it will be treated as a slice, otherwise as an exact match.
Consider a ``Series`` object with a minute resolution index:
.. ipython:: python
- series_minute = pd.Series([1, 2, 3],
- pd.DatetimeIndex(['2011-12-31 23:59:00',
- '2012-01-01 00:00:00',
- '2012-01-01 00:02:00']))
+ series_minute = pd.Series(
+ [1, 2, 3],
+ pd.DatetimeIndex(
+ ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
+ ),
+ )
series_minute.index.resolution
A timestamp string less accurate than a minute gives a ``Series`` object.
.. ipython:: python
- series_minute['2011-12-31 23']
+ series_minute["2011-12-31 23"]
A timestamp string with minute resolution (or more accurate) gives a scalar instead, i.e. it is not cast to a slice.
.. ipython:: python
- series_minute['2011-12-31 23:59']
- series_minute['2011-12-31 23:59:00']
+ series_minute["2011-12-31 23:59"]
+ series_minute["2011-12-31 23:59:00"]
If the index resolution is second, then a minute-accurate timestamp gives a
``Series``.
.. ipython:: python
- series_second = pd.Series([1, 2, 3],
- pd.DatetimeIndex(['2011-12-31 23:59:59',
- '2012-01-01 00:00:00',
- '2012-01-01 00:00:01']))
+ series_second = pd.Series(
+ [1, 2, 3],
+ pd.DatetimeIndex(
+ ["2011-12-31 23:59:59", "2012-01-01 00:00:00", "2012-01-01 00:00:01"]
+ ),
+ )
series_second.index.resolution
- series_second['2011-12-31 23:59']
+ series_second["2011-12-31 23:59"]
If the timestamp string is treated as a slice, it can be used to index ``DataFrame`` with ``[]`` as well.
.. ipython:: python
+ :okwarning:
- dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
- index=series_minute.index)
- dft_minute['2011-12-31 23']
+ dft_minute = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=series_minute.index)
+ dft_minute["2011-12-31 23"]
.. warning::
@@ -691,16 +719,17 @@ If the timestamp string is treated as a slice, it can be used to index ``DataFra
.. ipython:: python
- dft_minute.loc['2011-12-31 23:59']
+ dft_minute.loc["2011-12-31 23:59"]
Note also that ``DatetimeIndex`` resolution cannot be less precise than day.
.. ipython:: python
- series_monthly = pd.Series([1, 2, 3],
- pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
+ series_monthly = pd.Series(
+ [1, 2, 3], pd.DatetimeIndex(["2011-12", "2012-01", "2012-02"])
+ )
series_monthly.index.resolution
- series_monthly['2011-12'] # returns Series
+ series_monthly["2011-12"] # returns Series
Exact indexing
@@ -712,14 +741,15 @@ These ``Timestamp`` and ``datetime`` objects have exact ``hours, minutes,`` and
.. ipython:: python
- dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
+ dft[datetime.datetime(2013, 1, 1): datetime.datetime(2013, 2, 28)]
With no defaults.
.. ipython:: python
- dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
- datetime.datetime(2013, 2, 28, 10, 12, 0)]
+ dft[
+ datetime.datetime(2013, 1, 1, 10, 12, 0): datetime.datetime(2013, 2, 28, 10, 12, 0)
+ ]
Truncating & fancy indexing
@@ -732,11 +762,11 @@ partially matching dates:
.. ipython:: python
- rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')
+ rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)
- ts2.truncate(before='2011-11', after='2011-12')
- ts2['2011-11':'2011-12']
+ ts2.truncate(before="2011-11", after="2011-12")
+ ts2["2011-11":"2011-12"]
Even complicated fancy indexing that breaks the ``DatetimeIndex`` frequency
regularity will result in a ``DatetimeIndex``, although frequency is lost:
@@ -792,7 +822,7 @@ You may obtain the year, week and day components of the ISO year from the ISO 86
.. ipython:: python
- idx = pd.date_range(start='2019-12-29', freq='D', periods=4)
+ idx = pd.date_range(start="2019-12-29", freq="D", periods=4)
idx.isocalendar()
idx.to_series().dt.isocalendar()
@@ -822,12 +852,12 @@ arithmetic operator (``+``) or the ``apply`` method can be used to perform the s
.. ipython:: python
# This particular day contains a day light savings time transition
- ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
+ ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
# Respects absolute time
ts + pd.Timedelta(days=1)
# Respects calendar time
ts + pd.DateOffset(days=1)
- friday = pd.Timestamp('2018-01-05')
+ friday = pd.Timestamp("2018-01-05")
friday.day_name()
# Add 2 business days (Friday --> Tuesday)
two_business_days = 2 * pd.offsets.BDay()
@@ -885,10 +915,10 @@ business offsets operate on the weekdays.
.. ipython:: python
- ts = pd.Timestamp('2018-01-06 00:00:00')
+ ts = pd.Timestamp("2018-01-06 00:00:00")
ts.day_name()
# BusinessHour's valid offset dates are Monday through Friday
- offset = pd.offsets.BusinessHour(start='09:00')
+ offset = pd.offsets.BusinessHour(start="09:00")
# Bring the date to the closest offset date (Monday)
offset.rollforward(ts)
# Date is brought to the closest offset date first and then the hour is added
@@ -901,12 +931,12 @@ in the operation).
.. ipython:: python
- ts = pd.Timestamp('2014-01-01 09:00')
+ ts = pd.Timestamp("2014-01-01 09:00")
day = pd.offsets.Day()
day.apply(ts)
day.apply(ts).normalize()
- ts = pd.Timestamp('2014-01-01 22:00')
+ ts = pd.Timestamp("2014-01-01 22:00")
hour = pd.offsets.Hour()
hour.apply(ts)
hour.apply(ts).normalize()
@@ -959,7 +989,7 @@ apply the offset to each element.
.. ipython:: python
- rng = pd.date_range('2012-01-01', '2012-01-03')
+ rng = pd.date_range("2012-01-01", "2012-01-03")
s = pd.Series(rng)
rng
rng + pd.DateOffset(months=2)
@@ -974,7 +1004,7 @@ used exactly like a ``Timedelta`` - see the
.. ipython:: python
s - pd.offsets.Day(2)
- td = s - pd.Series(pd.date_range('2011-12-29', '2011-12-31'))
+ td = s - pd.Series(pd.date_range("2011-12-29", "2011-12-31"))
td
td + pd.offsets.Minute(15)
@@ -1001,16 +1031,13 @@ As an interesting example, let's look at Egypt where a Friday-Saturday weekend i
.. ipython:: python
- weekmask_egypt = 'Sun Mon Tue Wed Thu'
+ weekmask_egypt = "Sun Mon Tue Wed Thu"
# They also observe International Workers' Day so let's
# add that for a couple of years
- holidays = ['2012-05-01',
- datetime.datetime(2013, 5, 1),
- np.datetime64('2014-05-01')]
- bday_egypt = pd.offsets.CustomBusinessDay(holidays=holidays,
- weekmask=weekmask_egypt)
+ holidays = ["2012-05-01", datetime.datetime(2013, 5, 1), np.datetime64("2014-05-01")]
+ bday_egypt = pd.offsets.CustomBusinessDay(holidays=holidays, weekmask=weekmask_egypt)
dt = datetime.datetime(2013, 4, 30)
dt + 2 * bday_egypt
@@ -1020,8 +1047,7 @@ Let's map to the weekday names:
dts = pd.date_range(dt, periods=5, freq=bday_egypt)
- pd.Series(dts.weekday, dts).map(
- pd.Series('Mon Tue Wed Thu Fri Sat Sun'.split()))
+ pd.Series(dts.weekday, dts).map(pd.Series("Mon Tue Wed Thu Fri Sat Sun".split()))
Holiday calendars can be used to provide the list of holidays. See the
:ref:`holiday calendar <timeseries.holiday>` section for more information.
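
As a brief illustration (an editorial sketch; the US federal calendar is only one
example of a calendar shipped with pandas), a calendar object can be passed in
place of an explicit holiday list:

.. code-block:: python

    from pandas.tseries.holiday import USFederalHolidayCalendar

    # the calendar supplies the holiday list for the custom offset
    bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
    # Friday 2014-01-17 + 1 business day skips MLK Day (Monday) -> Tuesday
    datetime.datetime(2014, 1, 17) + bday_us
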
@@ -1043,15 +1069,14 @@ in the usual way.
.. ipython:: python
- bmth_us = pd.offsets.CustomBusinessMonthBegin(
- calendar=USFederalHolidayCalendar())
+ bmth_us = pd.offsets.CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
# Skip new years
dt = datetime.datetime(2013, 12, 17)
dt + bmth_us
# Define date index with custom offset
- pd.date_range(start='20100101', end='20120101', freq=bmth_us)
+ pd.date_range(start="20100101", end="20120101", freq=bmth_us)
.. note::
@@ -1082,23 +1107,23 @@ hours are added to the next business day.
bh
# 2014-08-01 is Friday
- pd.Timestamp('2014-08-01 10:00').weekday()
- pd.Timestamp('2014-08-01 10:00') + bh
+ pd.Timestamp("2014-08-01 10:00").weekday()
+ pd.Timestamp("2014-08-01 10:00") + bh
# Below example is the same as: pd.Timestamp('2014-08-01 09:00') + bh
- pd.Timestamp('2014-08-01 08:00') + bh
+ pd.Timestamp("2014-08-01 08:00") + bh
# If the result is on the end time, move to the next business day
- pd.Timestamp('2014-08-01 16:00') + bh
+ pd.Timestamp("2014-08-01 16:00") + bh
# Remaining business hours are added to the next day
- pd.Timestamp('2014-08-01 16:30') + bh
+ pd.Timestamp("2014-08-01 16:30") + bh
# Adding 2 business hours
- pd.Timestamp('2014-08-01 10:00') + pd.offsets.BusinessHour(2)
+ pd.Timestamp("2014-08-01 10:00") + pd.offsets.BusinessHour(2)
# Subtracting 3 business hours
- pd.Timestamp('2014-08-01 10:00') + pd.offsets.BusinessHour(-3)
+ pd.Timestamp("2014-08-01 10:00") + pd.offsets.BusinessHour(-3)
You can also specify ``start`` and ``end`` times by keyword. The argument must
be a ``str`` with an ``hour:minute`` representation or a ``datetime.time``
@@ -1107,12 +1132,12 @@ results in ``ValueError``.
.. ipython:: python
- bh = pd.offsets.BusinessHour(start='11:00', end=datetime.time(20, 0))
+ bh = pd.offsets.BusinessHour(start="11:00", end=datetime.time(20, 0))
bh
- pd.Timestamp('2014-08-01 13:00') + bh
- pd.Timestamp('2014-08-01 09:00') + bh
- pd.Timestamp('2014-08-01 18:00') + bh
+ pd.Timestamp("2014-08-01 13:00") + bh
+ pd.Timestamp("2014-08-01 09:00") + bh
+ pd.Timestamp("2014-08-01 18:00") + bh
Passing a ``start`` time later than ``end`` represents a midnight business hour.
In this case, business hours exceed midnight and overlap into the next day.
@@ -1120,19 +1145,19 @@ Valid business hours are distinguished by whether it started from valid ``Busine
.. ipython:: python
- bh = pd.offsets.BusinessHour(start='17:00', end='09:00')
+ bh = pd.offsets.BusinessHour(start="17:00", end="09:00")
bh
- pd.Timestamp('2014-08-01 17:00') + bh
- pd.Timestamp('2014-08-01 23:00') + bh
+ pd.Timestamp("2014-08-01 17:00") + bh
+ pd.Timestamp("2014-08-01 23:00") + bh
# Although 2014-08-02 is Saturday,
# it is valid because it starts from 08-01 (Friday).
- pd.Timestamp('2014-08-02 04:00') + bh
+ pd.Timestamp("2014-08-02 04:00") + bh
# Although 2014-08-04 is Monday,
# it is out of business hours because it starts from 08-03 (Sunday).
- pd.Timestamp('2014-08-04 04:00') + bh
+ pd.Timestamp("2014-08-04 04:00") + bh
Applying ``BusinessHour.rollforward`` and ``rollback`` to times outside business hours results in
the next business hour start or previous day's end. Different from other offsets, ``BusinessHour.rollforward``
@@ -1145,19 +1170,19 @@ under the default business hours (9:00 - 17:00), there is no gap (0 minutes) bet
.. ipython:: python
# This adjusts a Timestamp to business hour edge
- pd.offsets.BusinessHour().rollback(pd.Timestamp('2014-08-02 15:00'))
- pd.offsets.BusinessHour().rollforward(pd.Timestamp('2014-08-02 15:00'))
+ pd.offsets.BusinessHour().rollback(pd.Timestamp("2014-08-02 15:00"))
+ pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02 15:00"))
# It is the same as BusinessHour().apply(pd.Timestamp('2014-08-01 17:00')).
# And it is the same as BusinessHour().apply(pd.Timestamp('2014-08-04 09:00'))
- pd.offsets.BusinessHour().apply(pd.Timestamp('2014-08-02 15:00'))
+ pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02 15:00"))
# BusinessDay results (for reference)
- pd.offsets.BusinessHour().rollforward(pd.Timestamp('2014-08-02'))
+ pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02"))
# It is the same as BusinessDay().apply(pd.Timestamp('2014-08-01'))
# The result is the same as rollforward because BusinessDay never overlaps.
- pd.offsets.BusinessHour().apply(pd.Timestamp('2014-08-02'))
+ pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02"))
``BusinessHour`` regards Saturday and Sunday as holidays. To use arbitrary
holidays, you can use ``CustomBusinessHour`` offset, as explained in the
@@ -1175,6 +1200,7 @@ as ``BusinessHour`` except that it skips specified custom holidays.
.. ipython:: python
from pandas.tseries.holiday import USFederalHolidayCalendar
+
bhour_us = pd.offsets.CustomBusinessHour(calendar=USFederalHolidayCalendar())
# Friday before MLK Day
dt = datetime.datetime(2014, 1, 17, 15)
@@ -1188,8 +1214,7 @@ You can use keyword arguments supported by either ``BusinessHour`` and ``CustomB
.. ipython:: python
- bhour_mon = pd.offsets.CustomBusinessHour(start='10:00',
- weekmask='Tue Wed Thu Fri')
+ bhour_mon = pd.offsets.CustomBusinessHour(start="10:00", weekmask="Tue Wed Thu Fri")
# Monday is skipped because it's a holiday, business hour starts from 10:00
dt + bhour_mon * 2
@@ -1242,7 +1267,7 @@ most functions:
.. ipython:: python
- pd.date_range(start, periods=5, freq='B')
+ pd.date_range(start, periods=5, freq="B")
pd.date_range(start, periods=5, freq=pd.offsets.BDay())
@@ -1250,9 +1275,9 @@ You can combine together day and intraday offsets:
.. ipython:: python
- pd.date_range(start, periods=10, freq='2h20min')
+ pd.date_range(start, periods=10, freq="2h20min")
- pd.date_range(start, periods=10, freq='1D10U')
+ pd.date_range(start, periods=10, freq="1D10U")
Anchored offsets
~~~~~~~~~~~~~~~~
@@ -1311,39 +1336,39 @@ anchor point, and moved ``|n|-1`` additional steps forwards or backwards.
.. ipython:: python
- pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=1)
- pd.Timestamp('2014-01-02') + pd.offsets.MonthEnd(n=1)
+ pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=1)
+ pd.Timestamp("2014-01-02") + pd.offsets.MonthEnd(n=1)
- pd.Timestamp('2014-01-02') - pd.offsets.MonthBegin(n=1)
- pd.Timestamp('2014-01-02') - pd.offsets.MonthEnd(n=1)
+ pd.Timestamp("2014-01-02") - pd.offsets.MonthBegin(n=1)
+ pd.Timestamp("2014-01-02") - pd.offsets.MonthEnd(n=1)
- pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=4)
- pd.Timestamp('2014-01-02') - pd.offsets.MonthBegin(n=4)
+ pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=4)
+ pd.Timestamp("2014-01-02") - pd.offsets.MonthBegin(n=4)
If the given date *is* on an anchor point, it is moved ``|n|`` points forwards
or backwards.
.. ipython:: python
- pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=1)
- pd.Timestamp('2014-01-31') + pd.offsets.MonthEnd(n=1)
+ pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=1)
+ pd.Timestamp("2014-01-31") + pd.offsets.MonthEnd(n=1)
- pd.Timestamp('2014-01-01') - pd.offsets.MonthBegin(n=1)
- pd.Timestamp('2014-01-31') - pd.offsets.MonthEnd(n=1)
+ pd.Timestamp("2014-01-01") - pd.offsets.MonthBegin(n=1)
+ pd.Timestamp("2014-01-31") - pd.offsets.MonthEnd(n=1)
- pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=4)
- pd.Timestamp('2014-01-31') - pd.offsets.MonthBegin(n=4)
+ pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=4)
+ pd.Timestamp("2014-01-31") - pd.offsets.MonthBegin(n=4)
For the case when ``n=0``, the date is not moved if on an anchor point, otherwise
it is rolled forward to the next anchor point.
.. ipython:: python
- pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=0)
- pd.Timestamp('2014-01-02') + pd.offsets.MonthEnd(n=0)
+ pd.Timestamp("2014-01-02") + pd.offsets.MonthBegin(n=0)
+ pd.Timestamp("2014-01-02") + pd.offsets.MonthEnd(n=0)
- pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=0)
- pd.Timestamp('2014-01-31') + pd.offsets.MonthEnd(n=0)
+ pd.Timestamp("2014-01-01") + pd.offsets.MonthBegin(n=0)
+ pd.Timestamp("2014-01-31") + pd.offsets.MonthEnd(n=0)
.. _timeseries.holiday:
@@ -1379,14 +1404,22 @@ An example of how holidays and holiday calendars are defined:
.. ipython:: python
- from pandas.tseries.holiday import Holiday, USMemorialDay,\
- AbstractHolidayCalendar, nearest_workday, MO
+ from pandas.tseries.holiday import (
+ Holiday,
+ USMemorialDay,
+ AbstractHolidayCalendar,
+ nearest_workday,
+ MO,
+ )
+
+
class ExampleCalendar(AbstractHolidayCalendar):
rules = [
USMemorialDay,
- Holiday('July 4th', month=7, day=4, observance=nearest_workday),
- Holiday('Columbus Day', month=10, day=1,
- offset=pd.DateOffset(weekday=MO(2)))]
+ Holiday("July 4th", month=7, day=4, observance=nearest_workday),
+ Holiday("Columbus Day", month=10, day=1, offset=pd.DateOffset(weekday=MO(2))),
+ ]
+
cal = ExampleCalendar()
cal.holidays(datetime.datetime(2012, 1, 1), datetime.datetime(2012, 12, 31))
@@ -1402,8 +1435,9 @@ or ``Timestamp`` objects.
.. ipython:: python
- pd.date_range(start='7/1/2012', end='7/10/2012',
- freq=pd.offsets.CDay(calendar=cal)).to_pydatetime()
+ pd.date_range(
+ start="7/1/2012", end="7/10/2012", freq=pd.offsets.CDay(calendar=cal)
+ ).to_pydatetime()
offset = pd.offsets.CustomBusinessDay(calendar=cal)
datetime.datetime(2012, 5, 25) + offset
datetime.datetime(2012, 7, 3) + offset
@@ -1435,11 +1469,11 @@ or calendars with additional rules.
.. ipython:: python
- from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory,\
- USLaborDay
- cal = get_calendar('ExampleCalendar')
+ from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, USLaborDay
+
+ cal = get_calendar("ExampleCalendar")
cal.rules
- new_cal = HolidayCalendarFactory('NewExampleCalendar', cal, USLaborDay)
+ new_cal = HolidayCalendarFactory("NewExampleCalendar", cal, USLaborDay)
new_cal.rules
.. _timeseries.advanced_datetime:
@@ -1462,23 +1496,19 @@ the pandas objects.
The ``shift`` method accepts an ``freq`` argument which can accept a
``DateOffset`` class or other ``timedelta``-like object or also an
-:ref:`offset alias `:
-
-.. ipython:: python
-
- ts.shift(5, freq=pd.offsets.BDay())
- ts.shift(5, freq='BM')
+:ref:`offset alias <timeseries.offset_aliases>`.
-Rather than changing the alignment of the data and the index, ``DataFrame`` and
-``Series`` objects also have a :meth:`~Series.tshift` convenience method that
-changes all the dates in the index by a specified number of offsets:
+When ``freq`` is specified, the ``shift`` method changes all the dates in the index
+rather than changing the alignment of the data and the index:
.. ipython:: python
- ts.tshift(5, freq='D')
+ ts.shift(5, freq="D")
+ ts.shift(5, freq=pd.offsets.BDay())
+ ts.shift(5, freq="BM")
-Note that with ``tshift``, the leading entry is no longer NaN because the data
-is not being realigned.
+Note that when ``freq`` is specified, the leading entry is no longer NaN
+because the data is not being realigned.
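
A short side-by-side sketch (editorial, not part of the changeset) of the two
behaviours:

.. code-block:: python

    ts.shift(5)            # data realigned: the leading entries become NaN
    ts.shift(5, freq="D")  # only the index moves 5 days; no NaN introduced
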
Frequency conversion
~~~~~~~~~~~~~~~~~~~~
@@ -1490,7 +1520,7 @@ calls ``reindex``.
.. ipython:: python
- dr = pd.date_range('1/1/2010', periods=3, freq=3 * pd.offsets.BDay())
+ dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())
ts = pd.Series(np.random.randn(3), index=dr)
ts
ts.asfreq(pd.offsets.BDay())
@@ -1500,7 +1530,7 @@ method for any gaps that may appear after the frequency conversion.
.. ipython:: python
- ts.asfreq(pd.offsets.BDay(), method='pad')
+ ts.asfreq(pd.offsets.BDay(), method="pad")
Filling forward / backward
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1519,7 +1549,7 @@ Converting to Python datetimes
Resampling
----------
-Pandas has a simple, powerful, and efficient functionality for performing
+pandas has a simple, powerful, and efficient functionality for performing
resampling operations during frequency conversion (e.g., converting secondly
data into 5-minutely data). This is extremely common in, but not limited to,
financial applications.
@@ -1541,11 +1571,11 @@ Basics
.. ipython:: python
- rng = pd.date_range('1/1/2012', periods=100, freq='S')
+ rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
- ts.resample('5Min').sum()
+ ts.resample("5Min").sum()
The ``resample`` function is very flexible and allows you to specify many
different parameters to control the frequency conversion and resampling
@@ -1557,11 +1587,11 @@ a method of the returned object, including ``sum``, ``mean``, ``std``, ``sem``,
.. ipython:: python
- ts.resample('5Min').mean()
+ ts.resample("5Min").mean()
- ts.resample('5Min').ohlc()
+ ts.resample("5Min").ohlc()
- ts.resample('5Min').max()
+ ts.resample("5Min").max()
For downsampling, ``closed`` can be set to 'left' or 'right' to specify which
@@ -1569,9 +1599,9 @@ end of the interval is closed:
.. ipython:: python
- ts.resample('5Min', closed='right').mean()
+ ts.resample("5Min", closed="right").mean()
- ts.resample('5Min', closed='left').mean()
+ ts.resample("5Min", closed="left").mean()
Parameters like ``label`` are used to manipulate the resulting labels.
``label`` specifies whether the result is labeled with the beginning or
@@ -1579,9 +1609,9 @@ the end of the interval.
.. ipython:: python
- ts.resample('5Min').mean() # by default label='left'
+ ts.resample("5Min").mean() # by default label='left'
- ts.resample('5Min', label='left').mean()
+ ts.resample("5Min", label="left").mean()
.. warning::
@@ -1595,12 +1625,12 @@ the end of the interval.
.. ipython:: python
- s = pd.date_range('2000-01-01', '2000-01-05').to_series()
+ s = pd.date_range("2000-01-01", "2000-01-05").to_series()
s.iloc[2] = pd.NaT
s.dt.day_name()
# default: label='left', closed='left'
- s.resample('B').last().dt.day_name()
+ s.resample("B").last().dt.day_name()
Notice how the value for Sunday got pulled back to the previous Friday.
To get the behavior where the value for Sunday is pushed to Monday, use
@@ -1608,7 +1638,7 @@ the end of the interval.
.. ipython:: python
- s.resample('B', label='right', closed='right').last().dt.day_name()
+ s.resample("B", label="right", closed="right").last().dt.day_name()
The ``axis`` parameter can be set to 0 or 1 and allows you to resample the
specified axis for a ``DataFrame``.
@@ -1631,11 +1661,11 @@ For upsampling, you can specify a way to upsample and the ``limit`` parameter to
# from secondly to every 250 milliseconds
- ts[:2].resample('250L').asfreq()
+ ts[:2].resample("250L").asfreq()
- ts[:2].resample('250L').ffill()
+ ts[:2].resample("250L").ffill()
- ts[:2].resample('250L').ffill(limit=2)
+ ts[:2].resample("250L").ffill(limit=2)
Sparse resampling
~~~~~~~~~~~~~~~~~
@@ -1651,14 +1681,14 @@ resample only the groups that are not all ``NaN``.
.. ipython:: python
- rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
+ rng = pd.date_range("2014-1-1", periods=100, freq="D") + pd.Timedelta("1s")
ts = pd.Series(range(100), index=rng)
If we want to resample to the full range of the series:
.. ipython:: python
- ts.resample('3T').sum()
+ ts.resample("3T").sum()
We can instead only resample those groups where we have points as follows:
@@ -1667,12 +1697,14 @@ We can instead only resample those groups where we have points as follows:
from functools import partial
from pandas.tseries.frequencies import to_offset
+
def round(t, freq):
# round a Timestamp to a specified freq
freq = to_offset(freq)
return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
- ts.groupby(partial(round, freq='3T')).sum()
+
+ ts.groupby(partial(round, freq="3T")).sum()
.. _timeseries.aggregate:
@@ -1686,25 +1718,27 @@ Resampling a ``DataFrame``, the default will be to act on all columns with the s
.. ipython:: python
- df = pd.DataFrame(np.random.randn(1000, 3),
- index=pd.date_range('1/1/2012', freq='S', periods=1000),
- columns=['A', 'B', 'C'])
- r = df.resample('3T')
+ df = pd.DataFrame(
+ np.random.randn(1000, 3),
+ index=pd.date_range("1/1/2012", freq="S", periods=1000),
+ columns=["A", "B", "C"],
+ )
+ r = df.resample("3T")
r.mean()
We can select a specific column or columns using standard getitem.
.. ipython:: python
- r['A'].mean()
+ r["A"].mean()
- r[['A', 'B']].mean()
+ r[["A", "B"]].mean()
You can pass a list or dict of functions to do aggregation with, outputting a ``DataFrame``:
.. ipython:: python
- r['A'].agg([np.sum, np.mean, np.std])
+ r["A"].agg([np.sum, np.mean, np.std])
On a resampled ``DataFrame``, you can pass a list of functions to apply to each
column, which produces an aggregated result with a hierarchical index:
@@ -1719,21 +1753,20 @@ columns of a ``DataFrame``:
.. ipython:: python
:okexcept:
- r.agg({'A': np.sum,
- 'B': lambda x: np.std(x, ddof=1)})
+ r.agg({"A": np.sum, "B": lambda x: np.std(x, ddof=1)})
The function names can also be strings. In order for a string to be valid it
must be implemented on the resampled object:
.. ipython:: python
- r.agg({'A': 'sum', 'B': 'std'})
+ r.agg({"A": "sum", "B": "std"})
Furthermore, you can also specify multiple aggregation functions for each column separately.
.. ipython:: python
- r.agg({'A': ['sum', 'std'], 'B': ['mean', 'std']})
+ r.agg({"A": ["sum", "std"], "B": ["mean", "std"]})
If a ``DataFrame`` does not have a datetimelike index, but instead you want
@@ -1742,14 +1775,15 @@ to resample based on datetimelike column in the frame, it can passed to the
.. ipython:: python
- df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
- 'a': np.arange(5)},
- index=pd.MultiIndex.from_arrays([
- [1, 2, 3, 4, 5],
- pd.date_range('2015-01-01', freq='W', periods=5)],
- names=['v', 'd']))
+ df = pd.DataFrame(
+ {"date": pd.date_range("2015-01-01", freq="W", periods=5), "a": np.arange(5)},
+ index=pd.MultiIndex.from_arrays(
+ [[1, 2, 3, 4, 5], pd.date_range("2015-01-01", freq="W", periods=5)],
+ names=["v", "d"],
+ ),
+ )
df
- df.resample('M', on='date').sum()
+ df.resample("M", on="date").sum()
Similarly, if you instead want to resample by a datetimelike
level of ``MultiIndex``, its name or location can be passed to the
@@ -1757,7 +1791,7 @@ level of ``MultiIndex``, its name or location can be passed to the
.. ipython:: python
- df.resample('M', level='d').sum()
+ df.resample("M", level="d").sum()
.. _timeseries.iterating-label:
@@ -1771,14 +1805,18 @@ natural and functions similarly to :py:func:`itertools.groupby`:
small = pd.Series(
range(6),
- index=pd.to_datetime(['2017-01-01T00:00:00',
- '2017-01-01T00:30:00',
- '2017-01-01T00:31:00',
- '2017-01-01T01:00:00',
- '2017-01-01T03:00:00',
- '2017-01-01T03:05:00'])
+ index=pd.to_datetime(
+ [
+ "2017-01-01T00:00:00",
+ "2017-01-01T00:30:00",
+ "2017-01-01T00:31:00",
+ "2017-01-01T01:00:00",
+ "2017-01-01T03:00:00",
+ "2017-01-01T03:05:00",
+ ]
+ ),
)
- resampled = small.resample('H')
+ resampled = small.resample("H")
for name, group in resampled:
print("Group: ", name)
@@ -1789,20 +1827,20 @@ See :ref:`groupby.iterating-label` or :class:`Resampler.__iter__` for more.
.. _timeseries.adjust-the-start-of-the-bins:
-Use `origin` or `offset` to adjust the start of the bins
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Use ``origin`` or ``offset`` to adjust the start of the bins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. versionadded:: 1.1.0
-The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like `30D`) or that divide a day evenly (like `90s` or `1min`). This can create inconsistencies with some frequencies that do not meet this criteria. To change this behavior you can specify a fixed Timestamp with the argument ``origin``.
+The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like ``30D``) or that divide a day evenly (like ``90s`` or ``1min``). This can create inconsistencies with some frequencies that do not meet this criterion. To change this behavior you can specify a fixed Timestamp with the argument ``origin``.
For example:
.. ipython:: python
- start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
- middle = '2000-10-02 00:00:00'
- rng = pd.date_range(start, end, freq='7min')
+ start, end = "2000-10-01 23:30:00", "2000-10-02 00:30:00"
+ middle = "2000-10-02 00:00:00"
+ rng = pd.date_range(start, end, freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
ts
@@ -1810,32 +1848,32 @@ Here we can see that, when using ``origin`` with its default value (``'start_day
.. ipython:: python
- ts.resample('17min', origin='start_day').sum()
- ts[middle:end].resample('17min', origin='start_day').sum()
+ ts.resample("17min", origin="start_day").sum()
+ ts[middle:end].resample("17min", origin="start_day").sum()
Here we can see that, when setting ``origin`` to ``'epoch'``, the results after ``'2000-10-02 00:00:00'`` are identical regardless of the start of the time series:
.. ipython:: python
- ts.resample('17min', origin='epoch').sum()
- ts[middle:end].resample('17min', origin='epoch').sum()
+ ts.resample("17min", origin="epoch").sum()
+ ts[middle:end].resample("17min", origin="epoch").sum()
If needed you can use a custom timestamp for ``origin``:
.. ipython:: python
- ts.resample('17min', origin='2001-01-01').sum()
- ts[middle:end].resample('17min', origin=pd.Timestamp('2001-01-01')).sum()
+ ts.resample("17min", origin="2001-01-01").sum()
+ ts[middle:end].resample("17min", origin=pd.Timestamp("2001-01-01")).sum()
If needed you can just adjust the bins with an ``offset`` Timedelta that would be added to the default ``origin``.
Those two examples are equivalent for this time series:
.. ipython:: python
- ts.resample('17min', origin='start').sum()
- ts.resample('17min', offset='23h30min').sum()
+ ts.resample("17min", origin="start").sum()
+ ts.resample("17min", offset="23h30min").sum()
Note the use of ``'start'`` for ``origin`` in the last example. In that case, ``origin`` will be set to the first value of the time series.
@@ -1858,37 +1896,37 @@ Because ``freq`` represents a span of ``Period``, it cannot be negative like "-3
.. ipython:: python
- pd.Period('2012', freq='A-DEC')
+ pd.Period("2012", freq="A-DEC")
- pd.Period('2012-1-1', freq='D')
+ pd.Period("2012-1-1", freq="D")
- pd.Period('2012-1-1 19:00', freq='H')
+ pd.Period("2012-1-1 19:00", freq="H")
- pd.Period('2012-1-1 19:00', freq='5H')
+ pd.Period("2012-1-1 19:00", freq="5H")
Adding and subtracting integers from periods shifts the period by its own
frequency. Arithmetic is not allowed between ``Period`` with different ``freq`` (span).
.. ipython:: python
- p = pd.Period('2012', freq='A-DEC')
+ p = pd.Period("2012", freq="A-DEC")
p + 1
p - 3
- p = pd.Period('2012-01', freq='2M')
+ p = pd.Period("2012-01", freq="2M")
p + 2
p - 1
@okexcept
- p == pd.Period('2012-01', freq='3M')
+ p == pd.Period("2012-01", freq="3M")
If ``Period`` freq is daily or higher (``D``, ``H``, ``T``, ``S``, ``L``, ``U``, ``N``), ``offsets`` and ``timedelta``-like can be added if the result can have the same freq. Otherwise, ``ValueError`` will be raised.
.. ipython:: python
- p = pd.Period('2014-07-01 09:00', freq='H')
+ p = pd.Period("2014-07-01 09:00", freq="H")
p + pd.offsets.Hour(2)
p + datetime.timedelta(minutes=120)
- p + np.timedelta64(7200, 's')
+ p + np.timedelta64(7200, "s")
.. code-block:: ipython
@@ -1901,7 +1939,7 @@ If ``Period`` has other frequencies, only the same ``offsets`` can be added. Oth
.. ipython:: python
- p = pd.Period('2014-07', freq='M')
+ p = pd.Period("2014-07", freq="M")
p + pd.offsets.MonthEnd(3)
.. code-block:: ipython
@@ -1916,7 +1954,7 @@ return the number of frequency units between them:
.. ipython:: python
- pd.Period('2012', freq='A-DEC') - pd.Period('2002', freq='A-DEC')
+ pd.Period("2012", freq="A-DEC") - pd.Period("2002", freq="A-DEC")
PeriodIndex and period_range
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1925,21 +1963,21 @@ which can be constructed using the ``period_range`` convenience function:
.. ipython:: python
- prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
+ prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")
prng
The ``PeriodIndex`` constructor can also be used directly:
.. ipython:: python
- pd.PeriodIndex(['2011-1', '2011-2', '2011-3'], freq='M')
+ pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Passing multiplied frequency outputs a sequence of ``Period`` which
has multiplied span.
.. ipython:: python
- pd.period_range(start='2014-01', freq='3M', periods=4)
+ pd.period_range(start="2014-01", freq="3M", periods=4)
If ``start`` or ``end`` are ``Period`` objects, they will be used as anchor
endpoints for a ``PeriodIndex`` with frequency matching that of the
@@ -1947,8 +1985,9 @@ endpoints for a ``PeriodIndex`` with frequency matching that of the
.. ipython:: python
- pd.period_range(start=pd.Period('2017Q1', freq='Q'),
- end=pd.Period('2017Q2', freq='Q'), freq='M')
+ pd.period_range(
+ start=pd.Period("2017Q1", freq="Q"), end=pd.Period("2017Q2", freq="Q"), freq="M"
+ )
Just like ``DatetimeIndex``, a ``PeriodIndex`` can also be used to index pandas
objects:
@@ -1962,11 +2001,11 @@ objects:
.. ipython:: python
- idx = pd.period_range('2014-07-01 09:00', periods=5, freq='H')
+ idx = pd.period_range("2014-07-01 09:00", periods=5, freq="H")
idx
idx + pd.offsets.Hour(2)
- idx = pd.period_range('2014-07', periods=5, freq='M')
+ idx = pd.period_range("2014-07", periods=5, freq="M")
idx
idx + pd.offsets.MonthEnd(3)
@@ -1985,7 +2024,7 @@ The ``period`` dtype holds the ``freq`` attribute and is represented with
.. ipython:: python
- pi = pd.period_range('2016-01-01', periods=3, freq='M')
+ pi = pd.period_range("2016-01-01", periods=3, freq="M")
pi
pi.dtype
@@ -1996,15 +2035,15 @@ The ``period`` dtype can be used in ``.astype(...)``. It allows one to change th
.. ipython:: python
# change monthly freq to daily freq
- pi.astype('period[D]')
+ pi.astype("period[D]")
# convert to DatetimeIndex
- pi.astype('datetime64[ns]')
+ pi.astype("datetime64[ns]")
# convert to PeriodIndex
- dti = pd.date_range('2011-01-01', freq='M', periods=3)
+ dti = pd.date_range("2011-01-01", freq="M", periods=3)
dti
- dti.astype('period[M]')
+ dti.astype("period[M]")
PeriodIndex partial string indexing
@@ -2018,31 +2057,32 @@ You can pass in dates and strings to ``Series`` and ``DataFrame`` with ``PeriodI
.. ipython:: python
- ps['2011-01']
+ ps["2011-01"]
ps[datetime.datetime(2011, 12, 25):]
- ps['10/31/2011':'12/31/2011']
+ ps["10/31/2011":"12/31/2011"]
Passing a string representing a lower frequency than the ``PeriodIndex`` returns partially sliced data.
.. ipython:: python
+ :okwarning:
- ps['2011']
+ ps["2011"]
- dfp = pd.DataFrame(np.random.randn(600, 1),
- columns=['A'],
- index=pd.period_range('2013-01-01 9:00',
- periods=600,
- freq='T'))
+ dfp = pd.DataFrame(
+ np.random.randn(600, 1),
+ columns=["A"],
+ index=pd.period_range("2013-01-01 9:00", periods=600, freq="T"),
+ )
dfp
- dfp['2013-01-01 10H']
+ dfp["2013-01-01 10H"]
As with ``DatetimeIndex``, the endpoints will be included in the result. The example below slices data starting from 10:00 to 11:59.
.. ipython:: python
- dfp['2013-01-01 10H':'2013-01-01 11H']
+ dfp["2013-01-01 10H":"2013-01-01 11H"]
Frequency conversion and resampling with PeriodIndex
@@ -2052,7 +2092,7 @@ method. Let's start with the fiscal year 2011, ending in December:
.. ipython:: python
- p = pd.Period('2011', freq='A-DEC')
+ p = pd.Period("2011", freq="A-DEC")
p
We can convert it to a monthly frequency. Using the ``how`` parameter, we can
@@ -2060,16 +2100,16 @@ specify whether to return the starting or ending month:
.. ipython:: python
- p.asfreq('M', how='start')
+ p.asfreq("M", how="start")
- p.asfreq('M', how='end')
+ p.asfreq("M", how="end")
The shorthands 's' and 'e' are provided for convenience:
.. ipython:: python
- p.asfreq('M', 's')
- p.asfreq('M', 'e')
+ p.asfreq("M", "s")
+ p.asfreq("M", "e")
Converting to a "super-period" (e.g., annual frequency is a super-period of
quarterly frequency) automatically returns the super-period that includes the
@@ -2077,9 +2117,9 @@ input period:
.. ipython:: python
- p = pd.Period('2011-12', freq='M')
+ p = pd.Period("2011-12", freq="M")
- p.asfreq('A-NOV')
+ p.asfreq("A-NOV")
Note that since we converted to an annual frequency that ends the year in
November, the monthly period of December 2011 is actually in the 2012 A-NOV
@@ -2098,21 +2138,21 @@ frequencies ``Q-JAN`` through ``Q-DEC``.
.. ipython:: python
- p = pd.Period('2012Q1', freq='Q-DEC')
+ p = pd.Period("2012Q1", freq="Q-DEC")
- p.asfreq('D', 's')
+ p.asfreq("D", "s")
- p.asfreq('D', 'e')
+ p.asfreq("D", "e")
``Q-MAR`` defines fiscal year end in March:
.. ipython:: python
- p = pd.Period('2011Q4', freq='Q-MAR')
+ p = pd.Period("2011Q4", freq="Q-MAR")
- p.asfreq('D', 's')
+ p.asfreq("D", "s")
- p.asfreq('D', 'e')
+ p.asfreq("D", "e")
.. _timeseries.interchange:
@@ -2124,7 +2164,7 @@ and vice-versa using ``to_timestamp``:
.. ipython:: python
- rng = pd.date_range('1/1/2012', periods=5, freq='M')
+ rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
@@ -2141,7 +2181,7 @@ end of the period:
.. ipython:: python
- ps.to_timestamp('D', how='s')
+ ps.to_timestamp("D", how="s")
Converting between period and timestamp enables some convenient arithmetic
functions to be used. In the following example, we convert a quarterly
@@ -2150,11 +2190,11 @@ the quarter end:
.. ipython:: python
- prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
+ prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
- ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
+ ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()
@@ -2168,7 +2208,7 @@ then you can use a ``PeriodIndex`` and/or ``Series`` of ``Periods`` to do comput
.. ipython:: python
- span = pd.period_range('1215-01-01', '1381-01-01', freq='D')
+ span = pd.period_range("1215-01-01", "1381-01-01", freq="D")
span
To convert from an ``int64`` based YYYYMMDD representation.
@@ -2178,9 +2218,10 @@ To convert from an ``int64`` based YYYYMMDD representation.
s = pd.Series([20121231, 20141130, 99991231])
s
+
def conv(x):
- return pd.Period(year=x // 10000, month=x // 100 % 100,
- day=x % 100, freq='D')
+ return pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq="D")
+
s.apply(conv)
s.apply(conv)[2]
@@ -2198,7 +2239,7 @@ Time zone handling
------------------
pandas provides rich support for working with timestamps in different time
-zones using the ``pytz`` and ``dateutil`` libraries or class:`datetime.timezone`
+zones using the ``pytz`` and ``dateutil`` libraries or :class:`datetime.timezone`
objects from the standard library.
@@ -2209,7 +2250,7 @@ By default, pandas objects are time zone unaware:
.. ipython:: python
- rng = pd.date_range('3/6/2012 00:00', periods=15, freq='D')
+ rng = pd.date_range("3/6/2012 00:00", periods=15, freq="D")
rng.tz is None
To localize these dates to a time zone (assign a particular time zone to a naive date),
@@ -2229,18 +2270,16 @@ To return ``dateutil`` time zone objects, append ``dateutil/`` before the string
import dateutil
# pytz
- rng_pytz = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
- tz='Europe/London')
+ rng_pytz = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz="Europe/London")
rng_pytz.tz
# dateutil
- rng_dateutil = pd.date_range('3/6/2012 00:00', periods=3, freq='D')
- rng_dateutil = rng_dateutil.tz_localize('dateutil/Europe/London')
+ rng_dateutil = pd.date_range("3/6/2012 00:00", periods=3, freq="D")
+ rng_dateutil = rng_dateutil.tz_localize("dateutil/Europe/London")
rng_dateutil.tz
# dateutil - utc special case
- rng_utc = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
- tz=dateutil.tz.tzutc())
+ rng_utc = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz=dateutil.tz.tzutc())
rng_utc.tz
.. versionadded:: 0.25.0
@@ -2248,8 +2287,7 @@ To return ``dateutil`` time zone objects, append ``dateutil/`` before the string
.. ipython:: python
# datetime.timezone
- rng_utc = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
- tz=datetime.timezone.utc)
+ rng_utc = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz=datetime.timezone.utc)
rng_utc.tz
Note that the ``UTC`` time zone is a special case in ``dateutil`` and should be constructed explicitly
@@ -2261,15 +2299,14 @@ zones objects explicitly first.
import pytz
# pytz
- tz_pytz = pytz.timezone('Europe/London')
- rng_pytz = pd.date_range('3/6/2012 00:00', periods=3, freq='D')
+ tz_pytz = pytz.timezone("Europe/London")
+ rng_pytz = pd.date_range("3/6/2012 00:00", periods=3, freq="D")
rng_pytz = rng_pytz.tz_localize(tz_pytz)
rng_pytz.tz == tz_pytz
# dateutil
- tz_dateutil = dateutil.tz.gettz('Europe/London')
- rng_dateutil = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
- tz=tz_dateutil)
+ tz_dateutil = dateutil.tz.gettz("Europe/London")
+ rng_dateutil = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz=tz_dateutil)
rng_dateutil.tz == tz_dateutil
To convert a time zone aware pandas object from one time zone to another,
@@ -2277,7 +2314,7 @@ you can use the ``tz_convert`` method.
.. ipython:: python
- rng_pytz.tz_convert('US/Eastern')
+ rng_pytz.tz_convert("US/Eastern")
.. note::
@@ -2289,9 +2326,9 @@ you can use the ``tz_convert`` method.
.. ipython:: python
- dti = pd.date_range('2019-01-01', periods=3, freq='D', tz='US/Pacific')
+ dti = pd.date_range("2019-01-01", periods=3, freq="D", tz="US/Pacific")
dti.tz
- ts = pd.Timestamp('2019-01-01', tz='US/Pacific')
+ ts = pd.Timestamp("2019-01-01", tz="US/Pacific")
ts.tz
.. warning::
@@ -2315,23 +2352,28 @@ you can use the ``tz_convert`` method.
Instead, the datetime needs to be localized using the ``localize`` method
on the ``pytz`` time zone object.
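
A minimal sketch of that approach, assuming ``pytz`` is available:

.. code-block:: python

    import datetime

    import pytz

    tz = pytz.timezone("Europe/London")
    # localize the naive datetime with pytz instead of passing tzinfo directly
    dt_aware = tz.localize(datetime.datetime(2019, 10, 27, 1, 30))
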
+.. warning::
+
+ Be aware that for times in the future, correct conversion between time zones
+ (and UTC) cannot be guaranteed by any time zone library because a timezone's
+ offset from UTC may be changed by the respective government.
+
.. warning::
If you are using dates beyond 2038-01-18, due to current deficiencies
in the underlying libraries caused by the year 2038 problem, daylight saving time (DST) adjustments
to timezone aware dates will not be applied. If and when the underlying libraries are fixed,
- the DST transitions will be applied. It should be noted though, that time zone data for far future time zones
- are likely to be inaccurate, as they are simple extrapolations of the current set of (regularly revised) rules.
+ the DST transitions will be applied.
For example, for two dates that are in British Summer Time (and so would normally be GMT+1), both of the following asserts evaluate as true:
.. ipython:: python
- d_2037 = '2037-03-31T010101'
- d_2038 = '2038-03-31T010101'
- DST = 'Europe/London'
- assert pd.Timestamp(d_2037, tz=DST) != pd.Timestamp(d_2037, tz='GMT')
- assert pd.Timestamp(d_2038, tz=DST) == pd.Timestamp(d_2038, tz='GMT')
+ d_2037 = "2037-03-31T010101"
+ d_2038 = "2038-03-31T010101"
+ DST = "Europe/London"
+ assert pd.Timestamp(d_2037, tz=DST) != pd.Timestamp(d_2037, tz="GMT")
+ assert pd.Timestamp(d_2038, tz=DST) == pd.Timestamp(d_2038, tz="GMT")
Under the hood, all timestamps are stored in UTC. Values from a time zone aware
:class:`DatetimeIndex` or :class:`Timestamp` will have their fields (day, hour, minute, etc.)
@@ -2340,8 +2382,8 @@ still considered to be equal even if they are in different time zones:
.. ipython:: python
- rng_eastern = rng_utc.tz_convert('US/Eastern')
- rng_berlin = rng_utc.tz_convert('Europe/Berlin')
+ rng_eastern = rng_utc.tz_convert("US/Eastern")
+ rng_berlin = rng_utc.tz_convert("Europe/Berlin")
rng_eastern[2]
rng_berlin[2]
@@ -2352,9 +2394,9 @@ Operations between :class:`Series` in different time zones will yield UTC
.. ipython:: python
- ts_utc = pd.Series(range(3), pd.date_range('20130101', periods=3, tz='UTC'))
- eastern = ts_utc.tz_convert('US/Eastern')
- berlin = ts_utc.tz_convert('Europe/Berlin')
+ ts_utc = pd.Series(range(3), pd.date_range("20130101", periods=3, tz="UTC"))
+ eastern = ts_utc.tz_convert("US/Eastern")
+ berlin = ts_utc.tz_convert("Europe/Berlin")
result = eastern + berlin
result
result.index
@@ -2365,14 +2407,13 @@ To remove time zone information, use ``tz_localize(None)`` or ``tz_convert(None)
.. ipython:: python
- didx = pd.date_range(start='2014-08-01 09:00', freq='H',
- periods=3, tz='US/Eastern')
+ didx = pd.date_range(start="2014-08-01 09:00", freq="H", periods=3, tz="US/Eastern")
didx
didx.tz_localize(None)
didx.tz_convert(None)
# tz_convert(None) is identical to tz_convert('UTC').tz_localize(None)
- didx.tz_convert('UTC').tz_localize(None)
+ didx.tz_convert("UTC").tz_localize(None)
.. _timeseries.fold:
@@ -2398,10 +2439,12 @@ control over how they are handled.
.. ipython:: python
- pd.Timestamp(datetime.datetime(2019, 10, 27, 1, 30, 0, 0),
- tz='dateutil/Europe/London', fold=0)
- pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
- tz='dateutil/Europe/London', fold=1)
+ pd.Timestamp(
+ datetime.datetime(2019, 10, 27, 1, 30, 0, 0), tz="dateutil/Europe/London", fold=0
+ )
+ pd.Timestamp(
+ year=2019, month=10, day=27, hour=1, minute=30, tz="dateutil/Europe/London", fold=1
+ )
.. _timeseries.timezone_ambiguous:
@@ -2419,8 +2462,9 @@ twice within one day ("clocks fall back"). The following options are available:
.. ipython:: python
- rng_hourly = pd.DatetimeIndex(['11/06/2011 00:00', '11/06/2011 01:00',
- '11/06/2011 01:00', '11/06/2011 02:00'])
+ rng_hourly = pd.DatetimeIndex(
+ ["11/06/2011 00:00", "11/06/2011 01:00", "11/06/2011 01:00", "11/06/2011 02:00"]
+ )
This will fail as there are ambiguous times (``'11/06/2011 01:00'``)
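
As a sketch of that default behavior:

.. code-block:: python

    # raises pytz.exceptions.AmbiguousTimeError because 01:00 occurs twice
    rng_hourly.tz_localize("US/Eastern")
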
@@ -2433,9 +2477,9 @@ Handle these ambiguous times by specifying the following.
.. ipython:: python
- rng_hourly.tz_localize('US/Eastern', ambiguous='infer')
- rng_hourly.tz_localize('US/Eastern', ambiguous='NaT')
- rng_hourly.tz_localize('US/Eastern', ambiguous=[True, True, False, False])
+ rng_hourly.tz_localize("US/Eastern", ambiguous="infer")
+ rng_hourly.tz_localize("US/Eastern", ambiguous="NaT")
+ rng_hourly.tz_localize("US/Eastern", ambiguous=[True, True, False, False])
.. _timeseries.timezone_nonexistent:
@@ -2454,7 +2498,7 @@ can be controlled by the ``nonexistent`` argument. The following options are ava
.. ipython:: python
- dti = pd.date_range(start='2015-03-29 02:30:00', periods=3, freq='H')
+ dti = pd.date_range(start="2015-03-29 02:30:00", periods=3, freq="H")
# 2:30 is a nonexistent time
Localization of nonexistent times will raise an error by default.
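
A short sketch of that default behavior:

.. code-block:: python

    # raises pytz.exceptions.NonExistentTimeError because 02:30 does not exist
    # on the day the clocks spring forward
    dti.tz_localize("Europe/Warsaw")
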
@@ -2469,10 +2513,10 @@ Transform nonexistent times to ``NaT`` or shift the times.
.. ipython:: python
dti
- dti.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
- dti.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
- dti.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta(1, unit='H'))
- dti.tz_localize('Europe/Warsaw', nonexistent='NaT')
+ dti.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
+ dti.tz_localize("Europe/Warsaw", nonexistent="shift_backward")
+ dti.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta(1, unit="H"))
+ dti.tz_localize("Europe/Warsaw", nonexistent="NaT")
.. _timeseries.timezone_series:
@@ -2485,7 +2529,7 @@ represented with a dtype of ``datetime64[ns]``.
.. ipython:: python
- s_naive = pd.Series(pd.date_range('20130101', periods=3))
+ s_naive = pd.Series(pd.date_range("20130101", periods=3))
s_naive
A :class:`Series` with time zone **aware** values is
@@ -2493,7 +2537,7 @@ represented with a dtype of ``datetime64[ns, tz]`` where ``tz`` is the time zone
.. ipython:: python
- s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))
+ s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))
s_aware
Both of these :class:`Series` time zone information
@@ -2503,7 +2547,7 @@ For example, to localize and convert a naive stamp to time zone aware.
.. ipython:: python
- s_naive.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
+ s_naive.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")
Time zone information can also be manipulated using the ``astype`` method.
This method can localize and convert time zone naive timestamps or
@@ -2512,13 +2556,13 @@ convert time zone aware timestamps.
.. ipython:: python
# localize and convert a naive time zone
- s_naive.astype('datetime64[ns, US/Eastern]')
+ s_naive.astype("datetime64[ns, US/Eastern]")
# make an aware tz naive
- s_aware.astype('datetime64[ns]')
+ s_aware.astype("datetime64[ns]")
# convert to a new time zone
- s_aware.astype('datetime64[ns, CET]')
+ s_aware.astype("datetime64[ns, CET]")
.. note::
@@ -2544,4 +2588,4 @@ convert time zone aware timestamps.
.. ipython:: python
- s_aware.to_numpy(dtype='datetime64[ns]')
+ s_aware.to_numpy(dtype="datetime64[ns]")
diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst
index 814627043cfc8..a6c3d9814b03d 100644
--- a/doc/source/user_guide/visualization.rst
+++ b/doc/source/user_guide/visualization.rst
@@ -11,7 +11,8 @@ We use the standard convention for referencing the matplotlib API:
.. ipython:: python
import matplotlib.pyplot as plt
- plt.close('all')
+
+ plt.close("all")
We provide the basics in pandas to easily create decent looking plots.
See the :ref:`ecosystem ` section for visualization
@@ -39,8 +40,7 @@ The ``plot`` method on Series and DataFrame is just a simple wrapper around
.. ipython:: python
- ts = pd.Series(np.random.randn(1000),
- index=pd.date_range('1/1/2000', periods=1000))
+ ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
@savefig series_plot_basic.png
@@ -54,36 +54,35 @@ On DataFrame, :meth:`~DataFrame.plot` is a convenience to plot all of the column
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
np.random.seed(123456)
.. ipython:: python
- df = pd.DataFrame(np.random.randn(1000, 4),
- index=ts.index, columns=list('ABCD'))
+ df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
df = df.cumsum()
plt.figure();
@savefig frame_plot_basic.png
- df.plot();
+ df.plot()
-You can plot one column versus another using the `x` and `y` keywords in
+You can plot one column versus another using the ``x`` and ``y`` keywords in
:meth:`~DataFrame.plot`:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
np.random.seed(123456)
.. ipython:: python
- df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
- df3['A'] = pd.Series(list(range(len(df))))
+ df3 = pd.DataFrame(np.random.randn(1000, 2), columns=["B", "C"]).cumsum()
+ df3["A"] = pd.Series(list(range(len(df))))
@savefig df_plot_xy.png
- df3.plot(x='A', y='B')
+ df3.plot(x="A", y="B")
.. note::
@@ -93,7 +92,7 @@ You can plot one column versus another using the `x` and `y` keywords in
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.other:
@@ -120,7 +119,7 @@ For example, a bar plot can be created the following way:
plt.figure();
@savefig bar_plot_ex.png
- df.iloc[5].plot(kind='bar');
+ df.iloc[5].plot(kind="bar")
You can also create these other plots using the methods ``DataFrame.plot.<kind>`` instead of providing the ``kind`` keyword argument. This makes it easier to discover plot methods and the specific arguments they use:
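
For example, with a hypothetical ``df``, the two spellings are equivalent:

.. code-block:: python

    df = pd.DataFrame(np.random.randn(10, 2), columns=["a", "b"])

    df.plot(kind="line")  # keyword form
    df.plot.line()        # accessor form, easier to discover via tab completion
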
@@ -164,7 +163,7 @@ For labeled, non-time series data, you may wish to produce a bar plot:
@savefig bar_plot_ex.png
df.iloc[5].plot.bar()
- plt.axhline(0, color='k');
+ plt.axhline(0, color="k")
Calling a DataFrame's :meth:`plot.bar() ` method produces a multiple
bar plot:
@@ -172,42 +171,42 @@ bar plot:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
np.random.seed(123456)
.. ipython:: python
- df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
+ df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
@savefig bar_plot_multi_ex.png
- df2.plot.bar();
+ df2.plot.bar()
To produce a stacked bar plot, pass ``stacked=True``:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
.. ipython:: python
@savefig bar_plot_stacked_ex.png
- df2.plot.bar(stacked=True);
+ df2.plot.bar(stacked=True)
To get horizontal bar plots, use the ``barh`` method:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
.. ipython:: python
@savefig barh_plot_stacked_ex.png
- df2.plot.barh(stacked=True);
+ df2.plot.barh(stacked=True)
.. _visualization.hist:
@@ -218,8 +217,14 @@ Histograms can be drawn by using the :meth:`DataFrame.plot.hist` and :meth:`Seri
.. ipython:: python
- df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
- 'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
+ df4 = pd.DataFrame(
+ {
+ "a": np.random.randn(1000) + 1,
+ "b": np.random.randn(1000),
+ "c": np.random.randn(1000) - 1,
+ },
+ columns=["a", "b", "c"],
+ )
plt.figure();
@@ -230,7 +235,7 @@ Histograms can be drawn by using the :meth:`DataFrame.plot.hist` and :meth:`Seri
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
A histogram can be stacked using ``stacked=True``. Bin size can be changed
using the ``bins`` keyword.
@@ -245,7 +250,7 @@ using the ``bins`` keyword.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
You can pass other keywords supported by matplotlib ``hist``. For example,
horizontal and cumulative histograms can be drawn by
@@ -256,12 +261,12 @@ horizontal and cumulative histograms can be drawn by
plt.figure();
@savefig hist_new_kwargs.png
- df4['a'].plot.hist(orientation='horizontal', cumulative=True)
+ df4["a"].plot.hist(orientation="horizontal", cumulative=True)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
See the :meth:`hist ` method and the
`matplotlib hist documentation `__ for more.
@@ -274,12 +279,12 @@ The existing interface ``DataFrame.hist`` to plot histogram still can be used.
plt.figure();
@savefig hist_plot_ex.png
- df['A'].diff().hist()
+ df["A"].diff().hist()
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
:meth:`DataFrame.hist` plots the histograms of the columns on multiple
subplots:
@@ -289,7 +294,7 @@ subplots:
plt.figure()
@savefig frame_hist_ex.png
- df.diff().hist(color='k', alpha=0.5, bins=50)
+ df.diff().hist(color="k", alpha=0.5, bins=50)
The ``by`` keyword can be specified to plot grouped histograms:
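
An illustrative sketch with synthetic data:

.. code-block:: python

    data = pd.Series(np.random.randn(1000))
    # one histogram per group label, drawn on separate subplots
    data.hist(by=np.random.randint(0, 4, 1000), figsize=(6, 4))
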
@@ -297,7 +302,7 @@ The ``by`` keyword can be specified to plot grouped histograms:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
np.random.seed(123456)
@@ -323,12 +328,12 @@ a uniform random variable on [0,1).
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
np.random.seed(123456)
.. ipython:: python
- df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
+ df = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
@savefig box_plot_new.png
df.plot.box()
@@ -348,16 +353,20 @@ more complicated colorization, you can get each drawn artists by passing
.. ipython:: python
- color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange',
- 'medians': 'DarkBlue', 'caps': 'Gray'}
+ color = {
+ "boxes": "DarkGreen",
+ "whiskers": "DarkOrange",
+ "medians": "DarkBlue",
+ "caps": "Gray",
+ }
@savefig box_new_colorize.png
- df.plot.box(color=color, sym='r+')
+ df.plot.box(color=color, sym="r+")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Also, you can pass other keywords supported by matplotlib ``boxplot``.
For example, horizontal and custom-positioned boxplots can be drawn by
@@ -378,7 +387,7 @@ The existing interface ``DataFrame.boxplot`` to plot boxplot still can be used.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
np.random.seed(123456)
.. ipython:: python
@@ -396,19 +405,19 @@ groupings. For instance,
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
np.random.seed(123456)
.. ipython:: python
:okwarning:
- df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])
- df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
+ df = pd.DataFrame(np.random.rand(10, 2), columns=["Col1", "Col2"])
+ df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
- plt.figure();
+ plt.figure()
@savefig box_plot_ex2.png
- bp = df.boxplot(by='X')
+ bp = df.boxplot(by="X")
You can also pass a subset of columns to plot, as well as group by multiple
columns:
@@ -416,25 +425,25 @@ columns:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
np.random.seed(123456)
.. ipython:: python
:okwarning:
- df = pd.DataFrame(np.random.rand(10, 3), columns=['Col1', 'Col2', 'Col3'])
- df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
- df['Y'] = pd.Series(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'])
+ df = pd.DataFrame(np.random.rand(10, 3), columns=["Col1", "Col2", "Col3"])
+ df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
+ df["Y"] = pd.Series(["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"])
plt.figure();
@savefig box_plot_ex3.png
- bp = df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])
+ bp = df.boxplot(column=["Col1", "Col2"], by=["X", "Y"])
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.box.return:
@@ -443,9 +452,8 @@ Faceting, created by ``DataFrame.boxplot`` with the ``by``
keyword, will affect the output type as well:
================ ======= ==========================
-``return_type=`` Faceted Output type
----------------- ------- --------------------------
-
+``return_type`` Faceted Output type
+================ ======= ==========================
``None`` No axes
``None`` Yes 2-D ndarray of axes
``'axes'`` No axes
@@ -463,16 +471,16 @@ keyword, will affect the output type as well:
np.random.seed(1234)
df_box = pd.DataFrame(np.random.randn(50, 2))
- df_box['g'] = np.random.choice(['A', 'B'], size=50)
- df_box.loc[df_box['g'] == 'B', 1] += 3
+ df_box["g"] = np.random.choice(["A", "B"], size=50)
+ df_box.loc[df_box["g"] == "B", 1] += 3
@savefig boxplot_groupby.png
- bp = df_box.boxplot(by='g')
+ bp = df_box.boxplot(by="g")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
The subplots above are split by the numeric columns first, then the value of
the ``g`` column. Below, the subplots are first split by the value of ``g``,
@@ -482,12 +490,12 @@ then by the numeric columns.
:okwarning:
@savefig groupby_boxplot_vis.png
- bp = df_box.groupby('g').boxplot()
+ bp = df_box.groupby("g").boxplot()
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.area_plot:
@@ -497,7 +505,7 @@ Area plot
You can create area plots with :meth:`Series.plot.area` and :meth:`DataFrame.plot.area`.
Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values.
-When input data contains `NaN`, it will be automatically filled by 0. If you want to drop or fill by different values, use :func:`dataframe.dropna` or :func:`dataframe.fillna` before calling `plot`.
+When input data contains ``NaN``, it will be automatically filled with 0. If you want to drop or fill with different values, use :func:`dataframe.dropna` or :func:`dataframe.fillna` before calling ``plot``.
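
For instance, with a hypothetical ``df_with_nan``:

.. code-block:: python

    # choose explicitly how missing values are handled before plotting
    df_with_nan.fillna(0).plot.area()  # treat NaN as zero
    df_with_nan.dropna().plot.area()   # or drop incomplete rows
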
.. ipython:: python
:suppress:
@@ -507,23 +515,23 @@ When input data contains `NaN`, it will be automatically filled by 0. If you wan
.. ipython:: python
- df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
+ df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
@savefig area_plot_stacked.png
- df.plot.area();
+ df.plot.area()
To produce an unstacked plot, pass ``stacked=False``. Alpha value is set to 0.5 unless otherwise specified:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
.. ipython:: python
@savefig area_plot_unstacked.png
- df.plot.area(stacked=False);
+ df.plot.area(stacked=False)
.. _visualization.scatter:
@@ -538,29 +546,29 @@ These can be specified by the ``x`` and ``y`` keywords.
:suppress:
np.random.seed(123456)
- plt.close('all')
+ plt.close("all")
plt.figure()
.. ipython:: python
- df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
+ df = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])
@savefig scatter_plot.png
- df.plot.scatter(x='a', y='b');
+ df.plot.scatter(x="a", y="b")
To plot multiple column groups on a single axes, repeat the ``plot`` method specifying the target ``ax``.
It is recommended to specify ``color`` and ``label`` keywords to distinguish each group.
.. ipython:: python
- ax = df.plot.scatter(x='a', y='b', color='DarkBlue', label='Group 1');
+ ax = df.plot.scatter(x="a", y="b", color="DarkBlue", label="Group 1")
@savefig scatter_plot_repeated.png
- df.plot.scatter(x='c', y='d', color='DarkGreen', label='Group 2', ax=ax);
+ df.plot.scatter(x="c", y="d", color="DarkGreen", label="Group 2", ax=ax)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
The keyword ``c`` may be given as the name of a column to provide colors for
each point:
@@ -568,13 +576,13 @@ each point:
.. ipython:: python
@savefig scatter_plot_colored.png
- df.plot.scatter(x='a', y='b', c='c', s=50);
+ df.plot.scatter(x="a", y="b", c="c", s=50)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
You can pass other keywords supported by matplotlib
:meth:`scatter `. The example below shows a
@@ -583,12 +591,12 @@ bubble chart using a column of the ``DataFrame`` as the bubble size.
.. ipython:: python
@savefig scatter_plot_bubble.png
- df.plot.scatter(x='a', y='b', s=df['c'] * 200);
+ df.plot.scatter(x="a", y="b", s=df["c"] * 200)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
See the :meth:`scatter ` method and the
`matplotlib scatter documentation `__ for more.
@@ -610,11 +618,11 @@ too dense to plot each point individually.
.. ipython:: python
- df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
- df['b'] = df['b'] + np.arange(1000)
+ df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
+ df["b"] = df["b"] + np.arange(1000)
@savefig hexbin_plot.png
- df.plot.hexbin(x='a', y='b', gridsize=25)
+ df.plot.hexbin(x="a", y="b", gridsize=25)
A useful keyword argument is ``gridsize``; it controls the number of hexagons
@@ -632,23 +640,23 @@ given by column ``z``. The bins are aggregated with NumPy's ``max`` function.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
np.random.seed(123456)
.. ipython:: python
- df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
- df['b'] = df['b'] = df['b'] + np.arange(1000)
- df['z'] = np.random.uniform(0, 3, 1000)
+ df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
+ df["b"] = df["b"] = df["b"] + np.arange(1000)
+ df["z"] = np.random.uniform(0, 3, 1000)
@savefig hexbin_plot_agg.png
- df.plot.hexbin(x='a', y='b', C='z', reduce_C_function=np.max, gridsize=25)
+ df.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
See the :meth:`hexbin ` method and the
`matplotlib hexbin documentation `__ for more.
@@ -669,9 +677,9 @@ A ``ValueError`` will be raised if there are any negative values in your data.
plt.figure()
.. ipython:: python
+ :okwarning:
- series = pd.Series(3 * np.random.rand(4),
- index=['a', 'b', 'c', 'd'], name='series')
+ series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")
@savefig series_pie_plot.png
series.plot.pie(figsize=(6, 6))
@@ -679,7 +687,7 @@ A ``ValueError`` will be raised if there are any negative values in your data.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
For pie plots it's best to use square figures, i.e. a figure aspect ratio 1.
You can create the figure with equal width and height, or force the aspect ratio
@@ -700,8 +708,9 @@ drawn in each pie plots by default; specify ``legend=False`` to hide it.
.. ipython:: python
- df = pd.DataFrame(3 * np.random.rand(4, 2),
- index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
+ df = pd.DataFrame(
+ 3 * np.random.rand(4, 2), index=["a", "b", "c", "d"], columns=["x", "y"]
+ )
@savefig df_pie_plot.png
df.plot.pie(subplots=True, figsize=(8, 4))
@@ -709,7 +718,7 @@ drawn in each pie plots by default; specify ``legend=False`` to hide it.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
You can use the ``labels`` and ``colors`` keywords to specify the labels and colors of each wedge.
@@ -731,20 +740,26 @@ Also, other keywords supported by :func:`matplotlib.pyplot.pie` can be used.
.. ipython:: python
@savefig series_pie_plot_options.png
- series.plot.pie(labels=['AA', 'BB', 'CC', 'DD'], colors=['r', 'g', 'b', 'c'],
- autopct='%.2f', fontsize=20, figsize=(6, 6))
+ series.plot.pie(
+ labels=["AA", "BB", "CC", "DD"],
+ colors=["r", "g", "b", "c"],
+ autopct="%.2f",
+ fontsize=20,
+ figsize=(6, 6),
+ )
If you pass values whose sum is less than 1.0, matplotlib draws a semicircle.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
plt.figure()
.. ipython:: python
+ :okwarning:
- series = pd.Series([0.1] * 4, index=['a', 'b', 'c', 'd'], name='series2')
+ series = pd.Series([0.1] * 4, index=["a", "b", "c", "d"], name="series2")
@savefig series_pie_plot_semi.png
series.plot.pie(figsize=(6, 6))
@@ -754,14 +769,14 @@ See the `matplotlib pie documentation `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
from pandas.plotting import andrews_curves
- data = pd.read_csv('data/iris.data')
+ data = pd.read_csv("data/iris.data")
plt.figure()
@savefig andrews_curves.png
- andrews_curves(data, 'Name')
+ andrews_curves(data, "Name")
.. _visualization.parallel_coordinates:
@@ -895,17 +911,17 @@ represents one data point. Points that tend to cluster will appear closer togeth
from pandas.plotting import parallel_coordinates
- data = pd.read_csv('data/iris.data')
+ data = pd.read_csv("data/iris.data")
plt.figure()
@savefig parallel_coordinates.png
- parallel_coordinates(data, 'Name')
+ parallel_coordinates(data, "Name")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.lag:
@@ -938,7 +954,7 @@ be passed, and when ``lag=1`` the plot is essentially ``data[:-1]`` vs.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.autocorrelation:
@@ -975,7 +991,7 @@ autocorrelation plots.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.bootstrap:
@@ -1000,12 +1016,12 @@ are what constitutes the bootstrap plot.
data = pd.Series(np.random.rand(1000))
@savefig bootstrap_plot.png
- bootstrap_plot(data, size=50, samples=500, color='grey')
+ bootstrap_plot(data, size=50, samples=500, color="grey")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.radviz:
@@ -1025,23 +1041,23 @@ be colored differently.
See the R package `Radviz `__
for more information.
-**Note**: The "Iris" dataset is available `here `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
from pandas.plotting import radviz
- data = pd.read_csv('data/iris.data')
+ data = pd.read_csv("data/iris.data")
plt.figure()
@savefig radviz.png
- radviz(data, 'Name')
+ radviz(data, "Name")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.formatting:
@@ -1070,14 +1086,14 @@ layout and formatting of the returned plot:
plt.figure();
@savefig series_plot_basic2.png
- ts.plot(style='k--', label='Series');
+ ts.plot(style="k--", label="Series")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
-For each kind of plot (e.g. `line`, `bar`, `scatter`) any additional arguments
+For each kind of plot (e.g. ``line``, ``bar``, ``scatter``) any additional arguments
keywords are passed along to the corresponding matplotlib function
(:meth:`ax.plot() `,
:meth:`ax.bar() `,
@@ -1097,8 +1113,7 @@ shown by default.
.. ipython:: python
- df = pd.DataFrame(np.random.randn(1000, 4),
- index=ts.index, columns=list('ABCD'))
+ df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
df = df.cumsum()
@savefig frame_plot_basic_noleg.png
@@ -1107,7 +1122,35 @@ shown by default.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
+
+
+Controlling the labels
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 1.1.0
+
+You may set the ``xlabel`` and ``ylabel`` arguments to give the plot custom labels
+for the x and y axes. By default, pandas will use the index name as the xlabel, while
+leaving the ylabel empty.
+
+.. ipython:: python
+ :suppress:
+
+ plt.figure()
+
+.. ipython:: python
+
+ df.plot()
+
+ @savefig plot_xlabel_ylabel.png
+ df.plot(xlabel="new x", ylabel="new y")
+
+.. ipython:: python
+ :suppress:
+
+ plt.close("all")
+
Scales
~~~~~~
@@ -1122,8 +1165,7 @@ You may pass ``logy`` to get a log-scale Y axis.
.. ipython:: python
- ts = pd.Series(np.random.randn(1000),
- index=pd.date_range('1/1/2000', periods=1000))
+ ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = np.exp(ts.cumsum())
@savefig series_plot_logy.png
@@ -1132,7 +1174,7 @@ You may pass ``logy`` to get a log-scale Y axis.
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
See also the ``logx`` and ``loglog`` keyword arguments.
@@ -1148,15 +1190,15 @@ To plot data on a secondary y-axis, use the ``secondary_y`` keyword:
.. ipython:: python
- df['A'].plot()
+ df["A"].plot()
@savefig series_plot_secondary_y.png
- df['B'].plot(secondary_y=True, style='g')
+ df["B"].plot(secondary_y=True, style="g")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
To plot some columns in a ``DataFrame``, give the column names to the ``secondary_y``
keyword:
@@ -1164,15 +1206,15 @@ keyword:
.. ipython:: python
plt.figure()
- ax = df.plot(secondary_y=['A', 'B'])
- ax.set_ylabel('CD scale')
+ ax = df.plot(secondary_y=["A", "B"])
+ ax.set_ylabel("CD scale")
@savefig frame_plot_secondary_y.png
- ax.right_ax.set_ylabel('AB scale')
+ ax.right_ax.set_ylabel("AB scale")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Note that the columns plotted on the secondary y-axis are automatically marked
with "(right)" in the legend. To turn off the automatic marking, use the
@@ -1183,12 +1225,12 @@ with "(right)" in the legend. To turn off the automatic marking, use the
plt.figure()
@savefig frame_plot_secondary_y_no_right.png
- df.plot(secondary_y=['A', 'B'], mark_right=False)
+ df.plot(secondary_y=["A", "B"], mark_right=False)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _plotting.formatters:
@@ -1197,7 +1239,7 @@ Custom formatters for timeseries plots
.. versionchanged:: 1.0.0
-Pandas provides custom formatters for timeseries plots. These change the
+pandas provides custom formatters for timeseries plots. These change the
formatting of the axis labels for dates and times. By default,
the custom formatters are applied only to plots created by pandas with
:meth:`DataFrame.plot` or :meth:`Series.plot`. To have them apply to all
@@ -1220,12 +1262,12 @@ Here is the default behavior, notice how the x-axis tick labeling is performed:
plt.figure()
@savefig ser_plot_suppress.png
- df['A'].plot()
+ df["A"].plot()
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Using the ``x_compat`` parameter, you can suppress this behavior:
@@ -1234,30 +1276,30 @@ Using the ``x_compat`` parameter, you can suppress this behavior:
plt.figure()
@savefig ser_plot_suppress_parm.png
- df['A'].plot(x_compat=True)
+ df["A"].plot(x_compat=True)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
If you have more than one plot that needs to be suppressed, the ``use`` method
-in ``pandas.plotting.plot_params`` can be used in a `with statement`:
+in ``pandas.plotting.plot_params`` can be used in a ``with`` statement:
.. ipython:: python
plt.figure()
@savefig ser_plot_suppress_context.png
- with pd.plotting.plot_params.use('x_compat', True):
- df['A'].plot(color='r')
- df['B'].plot(color='g')
- df['C'].plot(color='b')
+ with pd.plotting.plot_params.use("x_compat", True):
+ df["A"].plot(color="r")
+ df["B"].plot(color="g")
+ df["C"].plot(color="b")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Automatic date tick adjustment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1278,12 +1320,12 @@ with the ``subplots`` keyword:
.. ipython:: python
@savefig frame_plot_subplots.png
- df.plot(subplots=True, figsize=(6, 6));
+ df.plot(subplots=True, figsize=(6, 6))
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Using layout and targeting multiple axes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1301,23 +1343,23 @@ or columns needed, given the other.
.. ipython:: python
@savefig frame_plot_subplots_layout.png
- df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), sharex=False);
+ df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), sharex=False)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
The above example is identical to using:
.. ipython:: python
- df.plot(subplots=True, layout=(2, -1), figsize=(6, 6), sharex=False);
+ df.plot(subplots=True, layout=(2, -1), figsize=(6, 6), sharex=False)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
The required number of columns (3) is inferred from the number of series to plot
and the given number of rows (2).
@@ -1332,20 +1374,19 @@ otherwise you will see a warning.
.. ipython:: python
- fig, axes = plt.subplots(4, 4, figsize=(6, 6))
+ fig, axes = plt.subplots(4, 4, figsize=(9, 9))
plt.subplots_adjust(wspace=0.5, hspace=0.5)
target1 = [axes[0][0], axes[1][1], axes[2][2], axes[3][3]]
target2 = [axes[3][0], axes[2][1], axes[1][2], axes[0][3]]
- df.plot(subplots=True, ax=target1, legend=False, sharex=False, sharey=False);
+ df.plot(subplots=True, ax=target1, legend=False, sharex=False, sharey=False)
@savefig frame_plot_subplots_multi_ax.png
- (-df).plot(subplots=True, ax=target2, legend=False,
- sharex=False, sharey=False);
+ (-df).plot(subplots=True, ax=target2, legend=False, sharex=False, sharey=False)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Another option is passing an ``ax`` argument to :meth:`Series.plot` to plot on a particular axis:
@@ -1353,36 +1394,35 @@ Another option is passing an ``ax`` argument to :meth:`Series.plot` to plot on a
:suppress:
np.random.seed(123456)
- ts = pd.Series(np.random.randn(1000),
- index=pd.date_range('1/1/2000', periods=1000))
+ ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
- df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
- columns=list('ABCD'))
+ df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
df = df.cumsum()
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. ipython:: python
fig, axes = plt.subplots(nrows=2, ncols=2)
- df['A'].plot(ax=axes[0, 0]);
- axes[0, 0].set_title('A');
- df['B'].plot(ax=axes[0, 1]);
- axes[0, 1].set_title('B');
- df['C'].plot(ax=axes[1, 0]);
- axes[1, 0].set_title('C');
- df['D'].plot(ax=axes[1, 1]);
+ plt.subplots_adjust(wspace=0.2, hspace=0.5)
+ df["A"].plot(ax=axes[0, 0])
+ axes[0, 0].set_title("A")
+ df["B"].plot(ax=axes[0, 1])
+ axes[0, 1].set_title("B")
+ df["C"].plot(ax=axes[1, 0])
+ axes[1, 0].set_title("C")
+ df["D"].plot(ax=axes[1, 1])
@savefig series_plot_multi.png
- axes[1, 1].set_title('D');
+ axes[1, 1].set_title("D")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.errorbars:
@@ -1397,24 +1437,28 @@ Horizontal and vertical error bars can be supplied to the ``xerr`` and ``yerr``
* As a ``str`` indicating which of the columns of plotting :class:`DataFrame` contain the error values.
* As raw values (``list``, ``tuple``, or ``np.ndarray``). Must be the same length as the plotting :class:`DataFrame`/:class:`Series`.
-Asymmetrical error bars are also supported, however raw error values must be provided in this case. For a ``M`` length :class:`Series`, a ``Mx2`` array should be provided indicating lower and upper (or left and right) errors. For a ``MxN`` :class:`DataFrame`, asymmetrical errors should be in a ``Mx2xN`` array.
+Asymmetrical error bars are also supported; however, raw error values must be provided in this case. For an ``N``-length :class:`Series`, a ``2xN`` array should be provided indicating lower and upper (or left and right) errors. For an ``MxN`` :class:`DataFrame`, asymmetrical errors should be in an ``Mx2xN`` array.
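
A small sketch for a :class:`Series` with asymmetrical errors:

.. code-block:: python

    s = pd.Series([1, 2, 3, 4])
    # row 0: lower errors, row 1: upper errors (shape 2xN)
    asymmetric_err = np.array([[0.3, 0.1, 0.2, 0.4], [0.1, 0.4, 0.3, 0.2]])
    s.plot.bar(yerr=asymmetric_err)
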
Here is an example of one way to easily plot group means with standard deviations from the raw data.
.. ipython:: python
# Generate the data
- ix3 = pd.MultiIndex.from_arrays([
- ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
- ['foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar']],
- names=['letter', 'word'])
-
- df3 = pd.DataFrame({'data1': [3, 2, 4, 3, 2, 4, 3, 2],
- 'data2': [6, 5, 7, 5, 4, 5, 6, 5]}, index=ix3)
+ ix3 = pd.MultiIndex.from_arrays(
+ [
+ ["a", "a", "a", "a", "b", "b", "b", "b"],
+ ["foo", "foo", "bar", "bar", "foo", "foo", "bar", "bar"],
+ ],
+ names=["letter", "word"],
+ )
+
+ df3 = pd.DataFrame(
+ {"data1": [3, 2, 4, 3, 2, 4, 3, 2], "data2": [6, 5, 7, 5, 4, 5, 6, 5]}, index=ix3
+ )
# Group by index labels and take the means and standard deviations
# for each group
- gp3 = df3.groupby(level=('letter', 'word'))
+ gp3 = df3.groupby(level=("letter", "word"))
means = gp3.mean()
errors = gp3.std()
means
@@ -1423,12 +1467,12 @@ Here is an example of one way to easily plot group means with standard deviation
# Plot
fig, ax = plt.subplots()
@savefig errorbar_example.png
- means.plot.bar(yerr=errors, ax=ax, capsize=4)
+ means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
.. _visualization.table:
@@ -1444,9 +1488,9 @@ Plotting with matplotlib table is now supported in :meth:`DataFrame.plot` and :
.. ipython:: python
- fig, ax = plt.subplots(1, 1)
- df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
- ax.get_xaxis().set_visible(False) # Hide Ticks
+ fig, ax = plt.subplots(1, 1, figsize=(7, 6.5))
+ df = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
+ ax.xaxis.tick_top() # Display x-axis ticks on top.
@savefig line_plot_table_true.png
df.plot(table=True, ax=ax)
@@ -1454,7 +1498,7 @@ Plotting with matplotlib table is now supported in :meth:`DataFrame.plot` and :
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Also, you can pass a different :class:`DataFrame` or :class:`Series` to the
``table`` keyword. The data will be drawn as displayed in print method
@@ -1463,15 +1507,16 @@ as seen in the example below.
.. ipython:: python
- fig, ax = plt.subplots(1, 1)
- ax.get_xaxis().set_visible(False) # Hide Ticks
+ fig, ax = plt.subplots(1, 1, figsize=(7, 6.75))
+ ax.xaxis.tick_top() # Display x-axis ticks on top.
+
@savefig line_plot_table_data.png
df.plot(table=np.round(df.T, 2), ax=ax)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
There also exists a helper function ``pandas.plotting.table``, which creates a
table from :class:`DataFrame` or :class:`Series`, and adds it to an
@@ -1481,10 +1526,10 @@ matplotlib `table `__ for more.
@@ -1529,12 +1574,12 @@ To use the cubehelix colormap, we can pass ``colormap='cubehelix'``.
plt.figure()
@savefig cubehelix.png
- df.plot(colormap='cubehelix')
+ df.plot(colormap="cubehelix")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Alternatively, we can pass the colormap itself:
@@ -1550,7 +1595,7 @@ Alternatively, we can pass the colormap itself:
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Colormaps can also be used in other plot types, like bar charts:
@@ -1567,12 +1612,12 @@ Colormaps can also be used other plot types, like bar charts:
plt.figure()
@savefig greens.png
- dd.plot.bar(colormap='Greens')
+ dd.plot.bar(colormap="Greens")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Parallel coordinates charts:
@@ -1581,12 +1626,12 @@ Parallel coordinates charts:
plt.figure()
@savefig parallel_gist_rainbow.png
- parallel_coordinates(data, 'Name', colormap='gist_rainbow')
+ parallel_coordinates(data, "Name", colormap="gist_rainbow")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Andrews curves charts:
@@ -1595,12 +1640,12 @@ Andrews curves charts:
plt.figure()
@savefig andrews_curve_winter.png
- andrews_curves(data, 'Name', colormap='winter')
+ andrews_curves(data, "Name", colormap="winter")
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Plotting directly with matplotlib
---------------------------------
@@ -1624,23 +1669,24 @@ when plotting a large number of points.
.. ipython:: python
- price = pd.Series(np.random.randn(150).cumsum(),
- index=pd.date_range('2000-1-1', periods=150, freq='B'))
+ price = pd.Series(
+ np.random.randn(150).cumsum(),
+ index=pd.date_range("2000-1-1", periods=150, freq="B"),
+ )
ma = price.rolling(20).mean()
mstd = price.rolling(20).std()
plt.figure()
- plt.plot(price.index, price, 'k')
- plt.plot(ma.index, ma, 'b')
+ plt.plot(price.index, price, "k")
+ plt.plot(ma.index, ma, "b")
@savefig bollinger.png
- plt.fill_between(mstd.index, ma - 2 * mstd, ma + 2 * mstd,
- color='b', alpha=0.2)
+ plt.fill_between(mstd.index, ma - 2 * mstd, ma + 2 * mstd, color="b", alpha=0.2)
.. ipython:: python
:suppress:
- plt.close('all')
+ plt.close("all")
Plotting backends
-----------------
@@ -1654,21 +1700,21 @@ function. For example:
.. code-block:: python
- >>> Series([1, 2, 3]).plot(backend='backend.module')
+ >>> Series([1, 2, 3]).plot(backend="backend.module")
Alternatively, you can set this option globally so you don't need to specify
the keyword in each ``plot`` call. For example:
.. code-block:: python
- >>> pd.set_option('plotting.backend', 'backend.module')
+ >>> pd.set_option("plotting.backend", "backend.module")
>>> pd.Series([1, 2, 3]).plot()
Or:
.. code-block:: python
- >>> pd.options.plotting.backend = 'backend.module'
+ >>> pd.options.plotting.backend = "backend.module"
>>> pd.Series([1, 2, 3]).plot()
This would be more or less equivalent to:
diff --git a/doc/source/whatsnew/index.rst b/doc/source/whatsnew/index.rst
index ad5bb5a5b2d72..848121f822383 100644
--- a/doc/source/whatsnew/index.rst
+++ b/doc/source/whatsnew/index.rst
@@ -10,12 +10,24 @@ This is the list of changes to pandas between each release. For full details,
see the `commit logs `_. For install and
upgrade instructions, see :ref:`install`.
+Version 1.2
+-----------
+
+.. toctree::
+ :maxdepth: 2
+
+ v1.2.0
+
Version 1.1
-----------
.. toctree::
:maxdepth: 2
+ v1.1.4
+ v1.1.3
+ v1.1.2
+ v1.1.1
v1.1.0
Version 1.0
diff --git a/doc/source/whatsnew/v0.10.0.rst b/doc/source/whatsnew/v0.10.0.rst
index 443250592a4a7..aa2749c85a232 100644
--- a/doc/source/whatsnew/v0.10.0.rst
+++ b/doc/source/whatsnew/v0.10.0.rst
@@ -49,8 +49,8 @@ talking about:
:okwarning:
import pandas as pd
- df = pd.DataFrame(np.random.randn(6, 4),
- index=pd.date_range('1/1/2000', periods=6))
+
+ df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("1/1/2000", periods=6))
df
# deprecated now
df - df[0]
@@ -184,12 +184,14 @@ labeled the aggregated group with the end of the interval: the next day).
import io
- data = ('a,b,c\n'
- '1,Yes,2\n'
- '3,No,4')
+ data = """
+ a,b,c
+ 1,Yes,2
+ 3,No,4
+ """
print(data)
pd.read_csv(io.StringIO(data), header=None)
- pd.read_csv(io.StringIO(data), header=None, prefix='X')
+ pd.read_csv(io.StringIO(data), header=None, prefix="X")
- Values like ``'Yes'`` and ``'No'`` are not interpreted as boolean by default,
though this can be controlled by new ``true_values`` and ``false_values``
@@ -199,7 +201,7 @@ labeled the aggregated group with the end of the interval: the next day).
print(data)
pd.read_csv(io.StringIO(data))
- pd.read_csv(io.StringIO(data), true_values=['Yes'], false_values=['No'])
+ pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])
- The file parsers will not recognize non-string values arising from a
converter function as NA if passed in the ``na_values`` argument. It's better
@@ -210,10 +212,10 @@ labeled the aggregated group with the end of the interval: the next day).
.. ipython:: python
- s = pd.Series([np.nan, 1., 2., np.nan, 4])
+ s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4])
s
s.fillna(0)
- s.fillna(method='pad')
+ s.fillna(method="pad")
Convenience methods ``ffill`` and ``bfill`` have been added:
@@ -229,7 +231,8 @@ Convenience methods ``ffill`` and ``bfill`` have been added:
.. ipython:: python
def f(x):
- return pd.Series([x, x**2], index=['x', 'x^2'])
+ return pd.Series([x, x ** 2], index=["x", "x^2"])
+
s = pd.Series(np.random.rand(5))
s
@@ -272,20 +275,20 @@ The old behavior of printing out summary information can be achieved via the
.. ipython:: python
- pd.set_option('expand_frame_repr', False)
+ pd.set_option("expand_frame_repr", False)
wide_frame
.. ipython:: python
:suppress:
- pd.reset_option('expand_frame_repr')
+ pd.reset_option("expand_frame_repr")
The width of each line can be changed via 'line_width' (80 by default):
.. code-block:: python
- pd.set_option('line_width', 40)
+ pd.set_option("line_width", 40)
wide_frame
diff --git a/doc/source/whatsnew/v0.10.1.rst b/doc/source/whatsnew/v0.10.1.rst
index 1e9eafd2700e9..d71a0d5ca68cd 100644
--- a/doc/source/whatsnew/v0.10.1.rst
+++ b/doc/source/whatsnew/v0.10.1.rst
@@ -45,29 +45,31 @@ You may need to upgrade your existing data files. Please visit the
import os
- os.remove('store.h5')
+ os.remove("store.h5")
You can designate (and index) certain columns that you want to be able to
perform queries on a table, by passing a list to ``data_columns``
.. ipython:: python
- store = pd.HDFStore('store.h5')
- df = pd.DataFrame(np.random.randn(8, 3),
- index=pd.date_range('1/1/2000', periods=8),
- columns=['A', 'B', 'C'])
- df['string'] = 'foo'
- df.loc[df.index[4:6], 'string'] = np.nan
- df.loc[df.index[7:9], 'string'] = 'bar'
- df['string2'] = 'cool'
+ store = pd.HDFStore("store.h5")
+ df = pd.DataFrame(
+ np.random.randn(8, 3),
+ index=pd.date_range("1/1/2000", periods=8),
+ columns=["A", "B", "C"],
+ )
+ df["string"] = "foo"
+ df.loc[df.index[4:6], "string"] = np.nan
+ df.loc[df.index[7:9], "string"] = "bar"
+ df["string2"] = "cool"
df
# on-disk operations
- store.append('df', df, data_columns=['B', 'C', 'string', 'string2'])
- store.select('df', "B>0 and string=='foo'")
+ store.append("df", df, data_columns=["B", "C", "string", "string2"])
+ store.select("df", "B>0 and string=='foo'")
# this is in-memory version of this type of selection
- df[(df.B > 0) & (df.string == 'foo')]
+ df[(df.B > 0) & (df.string == "foo")]
Retrieving unique values in an indexable or data column.
@@ -75,19 +77,19 @@ Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
- store.unique('df', 'index')
- store.unique('df', 'string')
+ store.unique("df", "index")
+ store.unique("df", "string")
You can now store ``datetime64`` in data columns
.. ipython:: python
df_mixed = df.copy()
- df_mixed['datetime64'] = pd.Timestamp('20010102')
- df_mixed.loc[df_mixed.index[3:4], ['A', 'B']] = np.nan
+ df_mixed["datetime64"] = pd.Timestamp("20010102")
+ df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan
- store.append('df_mixed', df_mixed)
- df_mixed1 = store.select('df_mixed')
+ store.append("df_mixed", df_mixed)
+ df_mixed1 = store.select("df_mixed")
df_mixed1
df_mixed1.dtypes.value_counts()
@@ -97,7 +99,7 @@ columns, this is equivalent to passing a
.. ipython:: python
- store.select('df', columns=['A', 'B'])
+ store.select("df", columns=["A", "B"])
``HDFStore`` now serializes MultiIndex dataframes when appending tables.
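A hedged illustration of the idea (the frame and store key below are invented for this sketch, reusing the ``store`` opened above):

.. code-block:: python

    # build a small MultiIndex-ed frame and round-trip it through the table format
    mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])
    df_mi = pd.DataFrame(np.random.randn(4, 2), index=mi, columns=["X", "Y"])

    store.append("df_mi", df_mi)  # the MultiIndex is preserved on append
    store.select("df_mi")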
@@ -160,37 +162,39 @@ combined result, by using ``where`` on a selector table.
.. ipython:: python
- df_mt = pd.DataFrame(np.random.randn(8, 6),
- index=pd.date_range('1/1/2000', periods=8),
- columns=['A', 'B', 'C', 'D', 'E', 'F'])
- df_mt['foo'] = 'bar'
+ df_mt = pd.DataFrame(
+ np.random.randn(8, 6),
+ index=pd.date_range("1/1/2000", periods=8),
+ columns=["A", "B", "C", "D", "E", "F"],
+ )
+ df_mt["foo"] = "bar"
# you can also create the tables individually
- store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
- df_mt, selector='df1_mt')
+ store.append_to_multiple(
+ {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
+ )
store
# individual tables were created
- store.select('df1_mt')
- store.select('df2_mt')
+ store.select("df1_mt")
+ store.select("df2_mt")
# as a multiple
- store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
- selector='df1_mt')
+ store.select_as_multiple(["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt")
.. ipython:: python
:suppress:
store.close()
- os.remove('store.h5')
+ os.remove("store.h5")
**Enhancements**
- ``HDFStore`` now can read native PyTables table format tables
- You can pass ``nan_rep = 'my_nan_rep'`` to append, to change the default nan
- representation on disk (which converts to/from `np.nan`), this defaults to
- `nan`.
+ representation on disk (which converts to/from ``np.nan``); this defaults to
+ ``nan``.
- You can pass ``index`` to ``append``. This defaults to ``True``. This will
automagically create indices on the *indexables* and *data columns* of the
@@ -224,7 +228,7 @@ combined result, by using ``where`` on a selector table.
- Function to reset Google Analytics token store so users can recover from
improperly setup client secrets (:issue:`2687`).
- Fixed groupby bug resulting in segfault when passing in MultiIndex (:issue:`2706`)
-- Fixed bug where passing a Series with datetime64 values into `to_datetime`
+- Fixed bug where passing a Series with datetime64 values into ``to_datetime``
results in bogus output values (:issue:`2699`)
- Fixed bug in ``pattern in HDFStore`` expressions when pattern is not a valid
regex (:issue:`2694`)
@@ -240,7 +244,7 @@ combined result, by using ``where`` on a selector table.
- Fixed C file parser behavior when the file has more columns than data
(:issue:`2668`)
- Fixed file reader bug that misaligned columns with data in the presence of an
- implicit column and a specified `usecols` value
+ implicit column and a specified ``usecols`` value
- DataFrames with numerical or datetime indices are now sorted prior to
plotting (:issue:`2609`)
- Fixed DataFrame.from_records error when passed columns, index, but empty
diff --git a/doc/source/whatsnew/v0.11.0.rst b/doc/source/whatsnew/v0.11.0.rst
index 6c13a125a4e54..a69d1ad1dec3b 100644
--- a/doc/source/whatsnew/v0.11.0.rst
+++ b/doc/source/whatsnew/v0.11.0.rst
@@ -24,7 +24,7 @@ Selection choices
~~~~~~~~~~~~~~~~~
Starting in 0.11.0, object selection has had a number of user-requested additions in
-order to support more explicit location based indexing. Pandas now supports
+order to support more explicit location based indexing. pandas now supports
three types of multi-axis indexing.
- ``.loc`` is strictly label based, will raise ``KeyError`` when the items are not found, allowed inputs are:
@@ -367,6 +367,7 @@ Enhancements
- You can now select with a string from a DataFrame with a datelike index, in a similar way to a Series (:issue:`3070`)
.. ipython:: python
+ :okwarning:
idx = pd.date_range("2001-10-1", periods=5, freq='M')
ts = pd.Series(np.random.rand(len(idx)), index=idx)
@@ -424,13 +425,13 @@ Enhancements
- Cursor coordinate information is now displayed in time-series plots.
- - added option `display.max_seq_items` to control the number of
+ - added option ``display.max_seq_items`` to control the number of
elements printed per sequence pprinting it. (:issue:`2979`)
- - added option `display.chop_threshold` to control display of small numerical
+ - added option ``display.chop_threshold`` to control display of small numerical
values. (:issue:`2739`)
- - added option `display.max_info_rows` to prevent verbose_info from being
+ - added option ``display.max_info_rows`` to prevent verbose_info from being
calculated for frames above 1M rows (configurable). (:issue:`2807`, :issue:`2918`)
- value_counts() now accepts a "normalize" argument, for normalized
@@ -439,7 +440,7 @@ Enhancements
- DataFrame.from_records now accepts not only dicts but any instance of
the collections.Mapping ABC.
- - added option `display.mpl_style` providing a sleeker visual style
+ - added option ``display.mpl_style`` providing a sleeker visual style
for plots. Based on https://gist.github.com/huyng/816622 (:issue:`3075`).
- Treat boolean values as integers (values 1 and 0) for numeric
diff --git a/doc/source/whatsnew/v0.12.0.rst b/doc/source/whatsnew/v0.12.0.rst
index 9971ae22822f6..4de76510c6bc1 100644
--- a/doc/source/whatsnew/v0.12.0.rst
+++ b/doc/source/whatsnew/v0.12.0.rst
@@ -47,7 +47,7 @@ API changes
.. ipython:: python
- p = pd.DataFrame({'first': [4, 5, 8], 'second': [0, 0, 3]})
+ p = pd.DataFrame({"first": [4, 5, 8], "second": [0, 0, 3]})
p % 0
p % p
p / p
@@ -95,8 +95,8 @@ API changes
.. ipython:: python
- df = pd.DataFrame(range(5), index=list('ABCDE'), columns=['a'])
- mask = (df.a % 2 == 0)
+ df = pd.DataFrame(range(5), index=list("ABCDE"), columns=["a"])
+ mask = df.a % 2 == 0
mask
# this is what you should use
@@ -141,21 +141,24 @@ API changes
.. code-block:: python
from pandas.io.parsers import ExcelFile
- xls = ExcelFile('path_to_file.xls')
- xls.parse('Sheet1', index_col=None, na_values=['NA'])
+
+ xls = ExcelFile("path_to_file.xls")
+ xls.parse("Sheet1", index_col=None, na_values=["NA"])
With
.. code-block:: python
import pandas as pd
- pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
+
+ pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"])
- added top-level function ``read_sql`` that is equivalent to the following
.. code-block:: python
from pandas.io.sql import read_frame
+
read_frame(...)
- ``DataFrame.to_html`` and ``DataFrame.to_latex`` now accept a path for
@@ -200,7 +203,7 @@ IO enhancements
.. ipython:: python
:okwarning:
- df = pd.DataFrame({'a': range(3), 'b': list('abc')})
+ df = pd.DataFrame({"a": range(3), "b": list("abc")})
print(df)
html = df.to_html()
alist = pd.read_html(html, index_col=0)
@@ -248,16 +251,18 @@ IO enhancements
.. ipython:: python
from pandas._testing import makeCustomDataframe as mkdf
+
df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)
- df.to_csv('mi.csv')
- print(open('mi.csv').read())
- pd.read_csv('mi.csv', header=[0, 1, 2, 3], index_col=[0, 1])
+ df.to_csv("mi.csv")
+ print(open("mi.csv").read())
+ pd.read_csv("mi.csv", header=[0, 1, 2, 3], index_col=[0, 1])
.. ipython:: python
:suppress:
import os
- os.remove('mi.csv')
+
+ os.remove("mi.csv")
- Support for ``HDFStore`` (via ``PyTables 3.0.0``) on Python3
@@ -304,8 +309,8 @@ Other enhancements
.. ipython:: python
- df = pd.DataFrame({'a': list('ab..'), 'b': [1, 2, 3, 4]})
- df.replace(regex=r'\s*\.\s*', value=np.nan)
+ df = pd.DataFrame({"a": list("ab.."), "b": [1, 2, 3, 4]})
+ df.replace(regex=r"\s*\.\s*", value=np.nan)
to replace all occurrences of the string ``'.'`` with zero or more
instances of surrounding white space with ``NaN``.
@@ -314,7 +319,7 @@ Other enhancements
.. ipython:: python
- df.replace('.', np.nan)
+ df.replace(".", np.nan)
to replace all occurrences of the string ``'.'`` with ``NaN``.
@@ -359,8 +364,8 @@ Other enhancements
.. ipython:: python
- dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
- dff.groupby('B').filter(lambda x: len(x) > 2)
+ dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
+ dff.groupby("B").filter(lambda x: len(x) > 2)
Alternatively, instead of dropping the offending groups, we can return
like-indexed objects where the groups that do not pass the filter are
@@ -368,7 +373,7 @@ Other enhancements
.. ipython:: python
- dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
+ dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
- Series and DataFrame hist methods now take a ``figsize`` argument (:issue:`3834`)
@@ -397,17 +402,18 @@ Experimental features
from pandas.tseries.offsets import CustomBusinessDay
from datetime import datetime
+
# As an interesting example, let's look at Egypt where
# a Friday-Saturday weekend is observed.
- weekmask_egypt = 'Sun Mon Tue Wed Thu'
+ weekmask_egypt = "Sun Mon Tue Wed Thu"
# They also observe International Workers' Day so let's
# add that for a couple of years
- holidays = ['2012-05-01', datetime(2013, 5, 1), np.datetime64('2014-05-01')]
+ holidays = ["2012-05-01", datetime(2013, 5, 1), np.datetime64("2014-05-01")]
bday_egypt = CustomBusinessDay(holidays=holidays, weekmask=weekmask_egypt)
dt = datetime(2013, 4, 30)
print(dt + 2 * bday_egypt)
dts = pd.date_range(dt, periods=5, freq=bday_egypt)
- print(pd.Series(dts.weekday, dts).map(pd.Series('Mon Tue Wed Thu Fri Sat Sun'.split())))
+ print(pd.Series(dts.weekday, dts).map(pd.Series("Mon Tue Wed Thu Fri Sat Sun".split())))
Bug fixes
~~~~~~~~~
@@ -430,14 +436,14 @@ Bug fixes
.. ipython:: python
:okwarning:
- strs = 'go', 'bow', 'joe', 'slow'
+ strs = "go", "bow", "joe", "slow"
ds = pd.Series(strs)
for s in ds.str:
print(s)
s
- s.dropna().values.item() == 'w'
+ s.dropna().values.item() == "w"
The last element yielded by the iterator will be a ``Series`` containing
the last element of the longest string in the ``Series`` with all other
diff --git a/doc/source/whatsnew/v0.13.0.rst b/doc/source/whatsnew/v0.13.0.rst
index 5a904d6c85c61..3c6b70fb21383 100644
--- a/doc/source/whatsnew/v0.13.0.rst
+++ b/doc/source/whatsnew/v0.13.0.rst
@@ -214,7 +214,7 @@ These were announced changes in 0.12 or prior that are taking effect as of 0.13.
- Remove deprecated ``read_clipboard/to_clipboard/ExcelFile/ExcelWriter`` from ``pandas.io.parsers`` (:issue:`3717`)
These are available as functions in the main pandas namespace (e.g. ``pd.read_clipboard``)
- default for ``tupleize_cols`` is now ``False`` for both ``to_csv`` and ``read_csv``. Fair warning in 0.12 (:issue:`3604`)
-- default for `display.max_seq_len` is now 100 rather than `None`. This activates
+- default for ``display.max_seq_len`` is now 100 rather than ``None``. This activates
truncated display ("...") of long sequences in various places. (:issue:`3391`)
Deprecations
@@ -498,7 +498,7 @@ Enhancements
- ``to_dict`` now takes ``records`` as a possible out type. Returns an array
of column-keyed dictionaries. (:issue:`4936`)
-- ``NaN`` handing in get_dummies (:issue:`4446`) with `dummy_na`
+- ``NaN`` handling in get_dummies (:issue:`4446`) with ``dummy_na``
.. ipython:: python
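    # hedged, assumed illustration of ``dummy_na`` (values invented for this sketch)
    pd.get_dummies([1, 2, np.nan])
    pd.get_dummies([1, 2, np.nan], dummy_na=True)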
@@ -668,7 +668,7 @@ Enhancements
- ``Series`` now supports a ``to_frame`` method to convert it to a single-column DataFrame (:issue:`5164`)
-- All R datasets listed here http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html can now be loaded into Pandas objects
+- All R datasets listed here http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html can now be loaded into pandas objects
.. code-block:: python
@@ -1071,7 +1071,7 @@ Bug fixes
as the docstring says (:issue:`4362`).
- ``as_index`` is no longer ignored when doing groupby apply (:issue:`4648`,
:issue:`3417`)
-- JSON NaT handling fixed, NaTs are now serialized to `null` (:issue:`4498`)
+- JSON NaT handling fixed, NaTs are now serialized to ``null`` (:issue:`4498`)
- Fixed JSON handling of escapable characters in JSON object keys
(:issue:`4593`)
- Fixed passing ``keep_default_na=False`` when ``na_values=None``
@@ -1188,7 +1188,7 @@ Bug fixes
single column and passing a list for ``ascending``, the argument for
``ascending`` was being interpreted as ``True`` (:issue:`4839`,
:issue:`4846`)
-- Fixed ``Panel.tshift`` not working. Added `freq` support to ``Panel.shift``
+- Fixed ``Panel.tshift`` not working. Added ``freq`` support to ``Panel.shift``
(:issue:`4853`)
- Fix an issue in TextFileReader w/ Python engine (i.e. PythonParser)
with thousands != "," (:issue:`4596`)
diff --git a/doc/source/whatsnew/v0.13.1.rst b/doc/source/whatsnew/v0.13.1.rst
index 6fe010be8fb2d..1215786b4cccc 100644
--- a/doc/source/whatsnew/v0.13.1.rst
+++ b/doc/source/whatsnew/v0.13.1.rst
@@ -31,16 +31,16 @@ Highlights include:
.. ipython:: python
- df = pd.DataFrame({'A': np.array(['foo', 'bar', 'bah', 'foo', 'bar'])})
- df['A'].iloc[0] = np.nan
+ df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})
+ df["A"].iloc[0] = np.nan
df
The recommended way to do this type of assignment is:
.. ipython:: python
- df = pd.DataFrame({'A': np.array(['foo', 'bar', 'bah', 'foo', 'bar'])})
- df.loc[0, 'A'] = np.nan
+ df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})
+ df.loc[0, "A"] = np.nan
df
Output formatting enhancements
@@ -52,24 +52,27 @@ Output formatting enhancements
.. ipython:: python
- max_info_rows = pd.get_option('max_info_rows')
+ max_info_rows = pd.get_option("max_info_rows")
- df = pd.DataFrame({'A': np.random.randn(10),
- 'B': np.random.randn(10),
- 'C': pd.date_range('20130101', periods=10)
- })
+ df = pd.DataFrame(
+ {
+ "A": np.random.randn(10),
+ "B": np.random.randn(10),
+ "C": pd.date_range("20130101", periods=10),
+ }
+ )
df.iloc[3:6, [0, 2]] = np.nan
.. ipython:: python
# set to not display the null counts
- pd.set_option('max_info_rows', 0)
+ pd.set_option("max_info_rows", 0)
df.info()
.. ipython:: python
# this is the default (same as in 0.13.0)
- pd.set_option('max_info_rows', max_info_rows)
+ pd.set_option("max_info_rows", max_info_rows)
df.info()
- Add ``show_dimensions`` display option for the new DataFrame repr to control whether the dimensions print.
@@ -77,10 +80,10 @@ Output formatting enhancements
.. ipython:: python
df = pd.DataFrame([[1, 2], [3, 4]])
- pd.set_option('show_dimensions', False)
+ pd.set_option("show_dimensions", False)
df
- pd.set_option('show_dimensions', True)
+ pd.set_option("show_dimensions", True)
df
- The ``ArrayFormatter`` for ``datetime`` and ``timedelta64`` now intelligently
@@ -98,10 +101,9 @@ Output formatting enhancements
.. ipython:: python
- df = pd.DataFrame([pd.Timestamp('20010101'),
- pd.Timestamp('20040601')], columns=['age'])
- df['today'] = pd.Timestamp('20130419')
- df['diff'] = df['today'] - df['age']
+ df = pd.DataFrame([pd.Timestamp("20010101"), pd.Timestamp("20040601")], columns=["age"])
+ df["today"] = pd.Timestamp("20130419")
+ df["diff"] = df["today"] - df["age"]
df
API changes
@@ -115,8 +117,8 @@ API changes
.. ipython:: python
- s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
- s.str.get_dummies(sep='|')
+ s = pd.Series(["a", "a|b", np.nan, "a|c"])
+ s.str.get_dummies(sep="|")
- Added the ``NDFrame.equals()`` method to compare if two NDFrames are
equal, with equal axes, dtypes, and values. Added the
@@ -126,8 +128,8 @@ API changes
.. code-block:: python
- df = pd.DataFrame({'col': ['foo', 0, np.nan]})
- df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])
+ df = pd.DataFrame({"col": ["foo", 0, np.nan]})
+ df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
df.equals(df2)
df.equals(df2.sort_index())
@@ -204,8 +206,7 @@ Enhancements
.. code-block:: python
# Try to infer the format for the index column
- df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
- infer_datetime_format=True)
+ df = pd.read_csv("foo.csv", index_col=0, parse_dates=True, infer_datetime_format=True)
- ``date_format`` and ``datetime_format`` keywords can now be specified when writing to ``excel``
files (:issue:`4133`)
@@ -215,10 +216,10 @@ Enhancements
.. ipython:: python
- shades = ['light', 'dark']
- colors = ['red', 'green', 'blue']
+ shades = ["light", "dark"]
+ colors = ["red", "green", "blue"]
- pd.MultiIndex.from_product([shades, colors], names=['shade', 'color'])
+ pd.MultiIndex.from_product([shades, colors], names=["shade", "color"])
- Panel :meth:`~pandas.Panel.apply` will work on non-ufuncs. See :ref:`the docs`.
@@ -379,7 +380,7 @@ Performance improvements for 0.13.1
- Series datetime/timedelta binary operations (:issue:`5801`)
- DataFrame ``count/dropna`` for ``axis=1``
-- Series.str.contains now has a `regex=False` keyword which can be faster for plain (non-regex) string patterns. (:issue:`5879`)
+- Series.str.contains now has a ``regex=False`` keyword which can be faster for plain (non-regex) string patterns. (:issue:`5879`)
- Series.str.extract (:issue:`5944`)
- ``dtypes/ftypes`` methods (:issue:`5968`)
- indexing with object dtypes (:issue:`5968`)
@@ -399,7 +400,7 @@ Bug fixes
- Bug in ``io.wb.get_countries`` not including all countries (:issue:`6008`)
- Bug in Series replace with timestamp dict (:issue:`5797`)
-- read_csv/read_table now respects the `prefix` kwarg (:issue:`5732`).
+- read_csv/read_table now respects the ``prefix`` kwarg (:issue:`5732`).
- Bug in selection with missing values via ``.ix`` from a duplicate indexed DataFrame failing (:issue:`5835`)
- Fix issue of boolean comparison on empty DataFrames (:issue:`5808`)
- Bug in isnull handling ``NaT`` in an object array (:issue:`5443`)
diff --git a/doc/source/whatsnew/v0.14.0.rst b/doc/source/whatsnew/v0.14.0.rst
index 847a42b3a7643..421ef81427210 100644
--- a/doc/source/whatsnew/v0.14.0.rst
+++ b/doc/source/whatsnew/v0.14.0.rst
@@ -82,7 +82,7 @@ API changes
- The :meth:`DataFrame.interpolate` keyword ``downcast`` default has been changed from ``infer`` to
``None``. This is to preserve the original dtype unless explicitly requested otherwise (:issue:`6290`).
-- When converting a dataframe to HTML it used to return `Empty DataFrame`. This special case has
+- When converting a dataframe to HTML it used to return ``Empty DataFrame``. This special case has
been removed, instead a header with the column names is returned (:issue:`6062`).
- ``Series`` and ``Index`` now internally share more common operations, e.g. ``factorize(),nunique(),value_counts()`` are
now supported on ``Index`` types as well. The ``Series.weekday`` property is removed
@@ -291,12 +291,12 @@ Display changes
- Regression in the display of a MultiIndexed Series with ``display.max_rows`` is less than the
length of the series (:issue:`7101`)
- Fixed a bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the
- `large_repr` set to 'info' (:issue:`7105`)
-- The `verbose` keyword in ``DataFrame.info()``, which controls whether to shorten the ``info``
+ ``large_repr`` set to 'info' (:issue:`7105`)
+- The ``verbose`` keyword in ``DataFrame.info()``, which controls whether to shorten the ``info``
representation, is now ``None`` by default. This will follow the global setting in
``display.max_info_columns``. The global setting can be overridden with ``verbose=True`` or
``verbose=False``.
-- Fixed a bug with the `info` repr not honoring the `display.max_info_columns` setting (:issue:`6939`)
+- Fixed a bug with the ``info`` repr not honoring the ``display.max_info_columns`` setting (:issue:`6939`)
- Offset/freq info now in Timestamp __repr__ (:issue:`4553`)
.. _whatsnew_0140.parsing:
@@ -603,11 +603,11 @@ Plotting
- Following keywords are now acceptable for :meth:`DataFrame.plot` with ``kind='bar'`` and ``kind='barh'``:
- - `width`: Specify the bar width. In previous versions, static value 0.5 was passed to matplotlib and it cannot be overwritten. (:issue:`6604`)
- - `align`: Specify the bar alignment. Default is `center` (different from matplotlib). In previous versions, pandas passes `align='edge'` to matplotlib and adjust the location to `center` by itself, and it results `align` keyword is not applied as expected. (:issue:`4525`)
- - `position`: Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1(right/top-end). Default is 0.5 (center). (:issue:`6604`)
+ - ``width``: Specify the bar width. In previous versions, the static value 0.5 was passed to matplotlib and could not be overwritten. (:issue:`6604`)
+ - ``align``: Specify the bar alignment. Default is ``center`` (different from matplotlib). In previous versions, pandas passed ``align='edge'`` to matplotlib and adjusted the location to ``center`` itself, so the ``align`` keyword was not applied as expected. (:issue:`4525`)
+ - ``position``: Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1(right/top-end). Default is 0.5 (center). (:issue:`6604`)
- Because of the default `align` value changes, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0 ...). This is intended to make bar plot be located on the same coordinates as line plot. However, bar plot may differs unexpectedly when you manually adjust the bar location or drawing area, such as using `set_xlim`, `set_ylim`, etc. In this cases, please modify your script to meet with new coordinates.
+ Because of the default ``align`` value change, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0 ...). This is intended to make bar plots use the same coordinates as line plots. However, bar plots may differ unexpectedly when you manually adjust the bar location or drawing area, such as using ``set_xlim``, ``set_ylim``, etc. In these cases, please modify your script to use the new coordinates.
- The :func:`parallel_coordinates` function now takes argument ``color``
instead of ``colors``. A ``FutureWarning`` is raised to alert that
@@ -618,7 +618,7 @@ Plotting
raised if the old ``data`` argument is used by name. (:issue:`6956`)
- :meth:`DataFrame.boxplot` now supports ``layout`` keyword (:issue:`6769`)
-- :meth:`DataFrame.boxplot` has a new keyword argument, `return_type`. It accepts ``'dict'``,
+- :meth:`DataFrame.boxplot` has a new keyword argument, ``return_type``. It accepts ``'dict'``,
``'axes'``, or ``'both'``, in which case a namedtuple with the matplotlib
axes and a dict of matplotlib Lines is returned.
@@ -721,8 +721,8 @@ Deprecations
- The following ``io.sql`` functions have been deprecated: ``tquery``, ``uquery``, ``read_frame``, ``frame_query``, ``write_frame``.
-- The `percentile_width` keyword argument in :meth:`~DataFrame.describe` has been deprecated.
- Use the `percentiles` keyword instead, which takes a list of percentiles to display. The
+- The ``percentile_width`` keyword argument in :meth:`~DataFrame.describe` has been deprecated.
+ Use the ``percentiles`` keyword instead, which takes a list of percentiles to display. The
default output is unchanged.
- The default return type of :func:`boxplot` will change from a dict to a matplotlib Axes
@@ -851,7 +851,7 @@ Enhancements
- Arrays of strings can be wrapped to a specified width (``str.wrap``) (:issue:`6999`)
- Add :meth:`~Series.nsmallest` and :meth:`Series.nlargest` methods to Series, See :ref:`the docs ` (:issue:`3960`)
-- `PeriodIndex` fully supports partial string indexing like `DatetimeIndex` (:issue:`7043`)
+- ``PeriodIndex`` fully supports partial string indexing like ``DatetimeIndex`` (:issue:`7043`)
.. ipython:: python
@@ -868,7 +868,7 @@ Enhancements
- ``Series.rank()`` now has a percentage rank option (:issue:`5971`)
- ``Series.rank()`` and ``DataFrame.rank()`` now accept ``method='dense'`` for ranks without gaps (:issue:`6514`)
- Support passing ``encoding`` with xlwt (:issue:`3710`)
-- Refactor Block classes removing `Block.items` attributes to avoid duplication
+- Refactor Block classes removing ``Block.items`` attributes to avoid duplication
in item handling (:issue:`6745`, :issue:`6988`).
- Testing statements updated to use specialized asserts (:issue:`6175`)
@@ -1063,10 +1063,10 @@ Bug fixes
- Bug in ``MultiIndex.get_level_values`` doesn't preserve ``DatetimeIndex`` and ``PeriodIndex`` attributes (:issue:`7092`)
- Bug in ``Groupby`` doesn't preserve ``tz`` (:issue:`3950`)
- Bug in ``PeriodIndex`` partial string slicing (:issue:`6716`)
-- Bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the `large_repr` set to 'info'
+- Bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the ``large_repr`` set to 'info'
(:issue:`7105`)
- Bug in ``DatetimeIndex`` specifying ``freq`` raises ``ValueError`` when passed value is too short (:issue:`7098`)
-- Fixed a bug with the `info` repr not honoring the `display.max_info_columns` setting (:issue:`6939`)
+- Fixed a bug with the ``info`` repr not honoring the ``display.max_info_columns`` setting (:issue:`6939`)
- Bug ``PeriodIndex`` string slicing with out of bounds values (:issue:`5407`)
- Fixed a memory error in the hashtable implementation/factorizer on resizing of large tables (:issue:`7157`)
- Bug in ``isnull`` when applied to 0-dimensional object arrays (:issue:`7176`)
diff --git a/doc/source/whatsnew/v0.14.1.rst b/doc/source/whatsnew/v0.14.1.rst
index 3dfc4272681df..78fd182ea86c3 100644
--- a/doc/source/whatsnew/v0.14.1.rst
+++ b/doc/source/whatsnew/v0.14.1.rst
@@ -68,7 +68,8 @@ API changes
:suppress:
import pandas.tseries.offsets as offsets
- d = pd.Timestamp('2014-01-01 09:00')
+
+ d = pd.Timestamp("2014-01-01 09:00")
.. ipython:: python
@@ -100,15 +101,15 @@ Enhancements
import pandas.tseries.offsets as offsets
day = offsets.Day()
- day.apply(pd.Timestamp('2014-01-01 09:00'))
+ day.apply(pd.Timestamp("2014-01-01 09:00"))
day = offsets.Day(normalize=True)
- day.apply(pd.Timestamp('2014-01-01 09:00'))
+ day.apply(pd.Timestamp("2014-01-01 09:00"))
- ``PeriodIndex`` is represented as the same format as ``DatetimeIndex`` (:issue:`7601`)
- ``StringMethods`` now work on empty Series (:issue:`7242`)
- The file parsers ``read_csv`` and ``read_table`` now ignore line comments provided by
- the parameter `comment`, which accepts only a single character for the C reader.
+ the parameter ``comment``, which accepts only a single character for the C reader.
In particular, they allow for comments before file data begins (:issue:`2685`)
- Add ``NotImplementedError`` for simultaneous use of ``chunksize`` and ``nrows``
for read_csv() (:issue:`6774`).
@@ -123,15 +124,14 @@ Enhancements
.. ipython:: python
- rng = pd.date_range('3/6/2012 00:00', periods=10, freq='D',
- tz='dateutil/Europe/London')
+ rng = pd.date_range("3/6/2012 00:00", periods=10, freq="D", tz="dateutil/Europe/London")
rng.tz
See :ref:`the docs `.
- Implemented ``sem`` (standard error of the mean) operation for ``Series``,
``DataFrame``, ``Panel``, and ``Groupby`` (:issue:`6897`)
-- Add ``nlargest`` and ``nsmallest`` to the ``Series`` ``groupby`` whitelist,
+- Add ``nlargest`` and ``nsmallest`` to the ``Series`` ``groupby`` allowlist,
which means you can now use these methods on a ``SeriesGroupBy`` object
(:issue:`7053`).
- All offsets ``apply``, ``rollforward`` and ``rollback`` can now handle ``np.datetime64``, previously results in ``ApplyTypeError`` (:issue:`7452`)
@@ -150,7 +150,7 @@ Performance
- Improvements in Series.transform for significant performance gains (:issue:`6496`)
- Improvements in DataFrame.transform with ufuncs and built-in grouper functions for significant performance gains (:issue:`7383`)
- Regression in groupby aggregation of datetime64 dtypes (:issue:`7555`)
-- Improvements in `MultiIndex.from_product` for large iterables (:issue:`7627`)
+- Improvements in ``MultiIndex.from_product`` for large iterables (:issue:`7627`)
.. _whatsnew_0141.experimental:
@@ -217,7 +217,7 @@ Bug fixes
- Bug in ``.loc`` with a list of indexers on a single-multi index level (that is not nested) (:issue:`7349`)
- Bug in ``Series.map`` when mapping a dict with tuple keys of different lengths (:issue:`7333`)
- Bug fixed so that all ``StringMethods`` now work on empty Series (:issue:`7242`)
-- Fix delegation of `read_sql` to `read_sql_query` when query does not contain 'select' (:issue:`7324`).
+- Fix delegation of ``read_sql`` to ``read_sql_query`` when query does not contain 'select' (:issue:`7324`).
- Bug where a string column name assignment to a ``DataFrame`` with a
``Float64Index`` raised a ``TypeError`` during a call to ``np.isnan``
(:issue:`7366`).
@@ -269,7 +269,7 @@ Bug fixes
- Bug in ``pandas.core.strings.str_contains`` does not properly match in a case insensitive fashion when ``regex=False`` and ``case=False`` (:issue:`7505`)
- Bug in ``expanding_cov``, ``expanding_corr``, ``rolling_cov``, and ``rolling_corr`` for two arguments with mismatched index (:issue:`7512`)
- Bug in ``to_sql`` taking the boolean column as text column (:issue:`7678`)
-- Bug in grouped `hist` doesn't handle `rot` kw and `sharex` kw properly (:issue:`7234`)
+- Bug in grouped ``hist`` doesn't handle ``rot`` kw and ``sharex`` kw properly (:issue:`7234`)
- Bug in ``.loc`` performing fallback integer indexing with ``object`` dtype indices (:issue:`7496`)
- Bug (regression) in ``PeriodIndex`` constructor when passed ``Series`` objects (:issue:`7701`).
diff --git a/doc/source/whatsnew/v0.15.0.rst b/doc/source/whatsnew/v0.15.0.rst
index b80ed7446f805..1f054930b3709 100644
--- a/doc/source/whatsnew/v0.15.0.rst
+++ b/doc/source/whatsnew/v0.15.0.rst
@@ -61,7 +61,7 @@ New features
Categoricals in Series/DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new
+:class:`~pandas.Categorical` can now be included in ``Series`` and ``DataFrames`` and gained new
methods to manipulate them. Thanks to Jan Schulz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`,
:issue:`8075`, :issue:`8076`, :issue:`8143`, :issue:`8453`, :issue:`8518`).
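A minimal sketch of the idea (column name and values invented for illustration):

.. code-block:: python

    df = pd.DataFrame({"grade": ["a", "b", "b", "a", "e"]})
    df["grade"] = df["grade"].astype("category")
    df["grade"].cat.categories    # Index(['a', 'b', 'e'], dtype='object')
    df["grade"].cat.rename_categories(["very good", "good", "very bad"])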
@@ -808,7 +808,7 @@ Other notable API changes:
.. _whatsnew_0150.blanklines:
-- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as
+- Made both the C-based and Python engines for ``read_csv`` and ``read_table`` ignore empty lines in input as well as
white space-filled lines, as long as ``sep`` is not white space. This is an API change
that can be controlled by the keyword parameter ``skip_blank_lines``. See :ref:`the docs ` (:issue:`4466`)
@@ -830,7 +830,7 @@ Other notable API changes:
Previously this would have yielded a column of ``datetime64`` dtype, but without timezone info.
- The behaviour of assigning a column to an existing dataframe as `df['a'] = i`
+ The behaviour of assigning a column to an existing dataframe as ``df['a'] = i``
remains unchanged (this already returned an ``object`` column with a timezone).
- When passing multiple levels to :meth:`~pandas.DataFrame.stack()`, it will now raise a ``ValueError`` when the
@@ -894,7 +894,7 @@ a transparent change with only very limited API implications (:issue:`5080`, :is
- you may need to unpickle pandas version < 0.15.0 pickles using ``pd.read_pickle`` rather than ``pickle.load``. See :ref:`pickle docs `
- when plotting with a ``PeriodIndex``, the matplotlib internal axes will now be arrays of ``Period`` rather than a ``PeriodIndex`` (this is similar to how a ``DatetimeIndex`` passes arrays of ``datetimes`` now)
- MultiIndexes will now raise similarly to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`).
-- When plotting a DatetimeIndex directly with matplotlib's `plot` function,
+- When plotting a DatetimeIndex directly with matplotlib's ``plot`` function,
the axis labels will no longer be formatted as dates but as integers (the
internal representation of a ``datetime64``). **UPDATE** This is fixed
in 0.15.1, see :ref:`here `.
diff --git a/doc/source/whatsnew/v0.15.1.rst b/doc/source/whatsnew/v0.15.1.rst
index f9c17058dc3ee..a1d4f9d14a905 100644
--- a/doc/source/whatsnew/v0.15.1.rst
+++ b/doc/source/whatsnew/v0.15.1.rst
@@ -23,7 +23,7 @@ API changes
.. ipython:: python
- s = pd.Series(pd.date_range('20130101', periods=5, freq='D'))
+ s = pd.Series(pd.date_range("20130101", periods=5, freq="D"))
s.iloc[2] = np.nan
s
@@ -52,8 +52,7 @@ API changes
.. ipython:: python
np.random.seed(2718281)
- df = pd.DataFrame(np.random.randint(0, 100, (10, 2)),
- columns=['jim', 'joe'])
+ df = pd.DataFrame(np.random.randint(0, 100, (10, 2)), columns=["jim", "joe"])
df.head()
ts = pd.Series(5 * np.random.randint(0, 3, 10))
@@ -80,9 +79,9 @@ API changes
.. ipython:: python
- df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)})
+ df = pd.DataFrame({"jim": range(5), "joe": range(5, 10)})
df
- gr = df.groupby(df['jim'] < 2)
+ gr = df.groupby(df["jim"] < 2)
previous behavior (excludes 1st column from output):
@@ -106,7 +105,7 @@ API changes
.. ipython:: python
- s = pd.Series(['a', 'b', 'c', 'd'], [4, 3, 2, 1])
+ s = pd.Series(["a", "b", "c", "d"], [4, 3, 2, 1])
s
previous behavior:
@@ -208,6 +207,7 @@ Enhancements
.. ipython:: python
from collections import deque
+
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame([4, 5, 6])
@@ -228,8 +228,9 @@ Enhancements
.. ipython:: python
- dfi = pd.DataFrame(1, index=pd.MultiIndex.from_product([['a'],
- range(1000)]), columns=['A'])
+ dfi = pd.DataFrame(
+ 1, index=pd.MultiIndex.from_product([["a"], range(1000)]), columns=["A"]
+ )
previous behavior:
@@ -249,7 +250,7 @@ Enhancements
dfi.memory_usage(index=True)
-- Added Index properties `is_monotonic_increasing` and `is_monotonic_decreasing` (:issue:`8680`).
+- Added Index properties ``is_monotonic_increasing`` and ``is_monotonic_decreasing`` (:issue:`8680`).
- Added option to select columns when importing Stata files (:issue:`7935`)
@@ -305,7 +306,7 @@ Bug fixes
- Fixed a bug where plotting a column ``y`` and specifying a label would mutate the index name of the original DataFrame (:issue:`8494`)
- Fix regression in plotting of a DatetimeIndex directly with matplotlib (:issue:`8614`).
- Bug in ``date_range`` where partially-specified dates would incorporate current date (:issue:`6961`)
-- Bug in Setting by indexer to a scalar value with a mixed-dtype `Panel4d` was failing (:issue:`8702`)
+- Bug in Setting by indexer to a scalar value with a mixed-dtype ``Panel4d`` was failing (:issue:`8702`)
- Bug where ``DataReader``'s would fail if one of the symbols passed was invalid. Now returns data for valid symbols and np.nan for invalid (:issue:`8494`)
- Bug in ``get_quote_yahoo`` that wouldn't allow non-float return values (:issue:`5229`).
diff --git a/doc/source/whatsnew/v0.15.2.rst b/doc/source/whatsnew/v0.15.2.rst
index a4eabb97471de..95ca925f18692 100644
--- a/doc/source/whatsnew/v0.15.2.rst
+++ b/doc/source/whatsnew/v0.15.2.rst
@@ -137,7 +137,7 @@ Enhancements
- Added ability to export Categorical data to Stata (:issue:`8633`). See :ref:`here ` for limitations of categorical variables exported to Stata data files.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here ` for more information on importing categorical variables from Stata data files.
- Added ability to export Categorical data to/from HDF5 (:issue:`7621`). Queries work the same as if it was an object array. However, the ``category`` dtyped data is stored in a more efficient manner. See :ref:`here ` for an example and caveats w.r.t. prior versions of pandas.
-- Added support for ``searchsorted()`` on `Categorical` class (:issue:`8420`).
+- Added support for ``searchsorted()`` on ``Categorical`` class (:issue:`8420`).
Other enhancements:
@@ -171,7 +171,7 @@ Other enhancements:
3 False True False True
4 True True True True
-- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
+- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on ``Timestamp`` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See `here `__.
- ``Timedelta`` arithmetic returns ``NotImplemented`` in unknown cases, allowing extensions by custom classes (:issue:`8813`).
- ``Timedelta`` now supports arithmetic with ``numpy.ndarray`` objects of the appropriate dtype (numpy 1.8 or newer only) (:issue:`8884`).
@@ -241,7 +241,7 @@ Bug fixes
- Bug in ``MultiIndex`` where ``__contains__`` returns wrong result if index is not lexically sorted or unique (:issue:`7724`)
- BUG CSV: fix problem with trailing white space in skipped rows, (:issue:`8679`), (:issue:`8661`), (:issue:`8983`)
- Regression in ``Timestamp`` does not parse 'Z' zone designator for UTC (:issue:`8771`)
-- Bug in `StataWriter` the produces writes strings with 244 characters irrespective of actual size (:issue:`8969`)
+- Bug in ``StataWriter`` that writes strings with 244 characters irrespective of actual size (:issue:`8969`)
- Fixed ValueError raised by cummin/cummax when datetime64 Series contains NaT. (:issue:`8965`)
- Bug in DataReader returns object dtype if there are missing values (:issue:`8980`)
- Bug in plotting if sharex was enabled and index was a timeseries, would show labels on multiple axes (:issue:`3964`).
diff --git a/doc/source/whatsnew/v0.16.0.rst b/doc/source/whatsnew/v0.16.0.rst
index 4ad533e68e275..8d0d6854cbf85 100644
--- a/doc/source/whatsnew/v0.16.0.rst
+++ b/doc/source/whatsnew/v0.16.0.rst
@@ -89,7 +89,7 @@ See the :ref:`documentation ` for more. (:issue:`922
Interaction with scipy.sparse
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Added :meth:`SparseSeries.to_coo` and :meth:`SparseSeries.from_coo` methods (:issue:`8048`) for converting to and from ``scipy.sparse.coo_matrix`` instances (see :ref:`here `). For example, given a SparseSeries with MultiIndex we can convert to a `scipy.sparse.coo_matrix` by specifying the row and column labels as index levels:
+Added :meth:`SparseSeries.to_coo` and :meth:`SparseSeries.from_coo` methods (:issue:`8048`) for converting to and from ``scipy.sparse.coo_matrix`` instances (see :ref:`here `). For example, given a SparseSeries with MultiIndex we can convert to a ``scipy.sparse.coo_matrix`` by specifying the row and column labels as index levels:
.. code-block:: python
@@ -630,7 +630,7 @@ Bug fixes
- Bug in ``Series.value_counts`` with excluding ``NaN`` for categorical type ``Series`` with ``dropna=True`` (:issue:`9443`)
- Fixed missing numeric_only option for ``DataFrame.std/var/sem`` (:issue:`9201`)
- Support constructing ``Panel`` or ``Panel4D`` with scalar data (:issue:`8285`)
-- ``Series`` text representation disconnected from `max_rows`/`max_columns` (:issue:`7508`).
+- ``Series`` text representation disconnected from ``max_rows``/``max_columns`` (:issue:`7508`).
diff --git a/doc/source/whatsnew/v0.16.1.rst b/doc/source/whatsnew/v0.16.1.rst
index 8dcac4c1044be..39767684c01d0 100644
--- a/doc/source/whatsnew/v0.16.1.rst
+++ b/doc/source/whatsnew/v0.16.1.rst
@@ -209,9 +209,8 @@ when sampling from rows.
.. ipython:: python
- df = pd.DataFrame({'col1': [9, 8, 7, 6],
- 'weight_column': [0.5, 0.4, 0.1, 0]})
- df.sample(n=3, weights='weight_column')
+ df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
+ df.sample(n=3, weights="weight_column")
.. _whatsnew_0161.enhancements.string:
@@ -229,20 +228,20 @@ enhancements make string operations easier and more consistent with standard pyt
.. ipython:: python
- idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
+ idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
idx.str.strip()
- One special case for the `.str` accessor on ``Index`` is that if a string method returns ``bool``, the ``.str`` accessor
+ One special case for the ``.str`` accessor on ``Index`` is that if a string method returns ``bool``, the ``.str`` accessor
will return a ``np.array`` instead of a boolean ``Index`` (:issue:`8875`). This enables the following expression
to work naturally:
.. ipython:: python
- idx = pd.Index(['a1', 'a2', 'b1', 'b2'])
+ idx = pd.Index(["a1", "a2", "b1", "b2"])
s = pd.Series(range(4), index=idx)
s
- idx.str.startswith('a')
- s[s.index.str.startswith('a')]
+ idx.str.startswith("a")
+ s[s.index.str.startswith("a")]
- The following new methods are accessible via ``.str`` accessor to apply the function to each value. (:issue:`9766`, :issue:`9773`, :issue:`10031`, :issue:`10045`, :issue:`10052`)
@@ -257,21 +256,21 @@ enhancements make string operations easier and more consistent with standard pyt
.. ipython:: python
- s = pd.Series(['a,b', 'a,c', 'b,c'])
+ s = pd.Series(["a,b", "a,c", "b,c"])
# return Series
- s.str.split(',')
+ s.str.split(",")
# return DataFrame
- s.str.split(',', expand=True)
+ s.str.split(",", expand=True)
- idx = pd.Index(['a,b', 'a,c', 'b,c'])
+ idx = pd.Index(["a,b", "a,c", "b,c"])
# return Index
- idx.str.split(',')
+ idx.str.split(",")
# return MultiIndex
- idx.str.split(',', expand=True)
+ idx.str.split(",", expand=True)
- Improved ``extract`` and ``get_dummies`` methods for ``Index.str`` (:issue:`9980`)
@@ -286,9 +285,9 @@ Other enhancements
.. ipython:: python
- pd.Timestamp('2014-08-01 09:00') + pd.tseries.offsets.BusinessHour()
- pd.Timestamp('2014-08-01 07:00') + pd.tseries.offsets.BusinessHour()
- pd.Timestamp('2014-08-01 16:30') + pd.tseries.offsets.BusinessHour()
+ pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
+ pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
+ pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
- ``DataFrame.diff`` now takes an ``axis`` parameter that determines the direction of differencing (:issue:`9727`)
@@ -300,8 +299,8 @@ Other enhancements
.. ipython:: python
- df = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])
- df.drop(['A', 'X'], axis=1, errors='ignore')
+ df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])
+ df.drop(["A", "X"], axis=1, errors="ignore")
- Add support for separating years and quarters using dashes, for
example 2014-Q1. (:issue:`9688`)
@@ -310,7 +309,7 @@ Other enhancements
- ``get_dummies`` function now accepts ``sparse`` keyword. If set to ``True``, the return ``DataFrame`` is sparse, e.g. ``SparseDataFrame``. (:issue:`8823`)
- ``Period`` now accepts ``datetime64`` as value input. (:issue:`9054`)
-- Allow timedelta string conversion when leading zero is missing from time definition, ie `0:00:00` vs `00:00:00`. (:issue:`9570`)
+- Allow timedelta string conversion when leading zero is missing from time definition, i.e. ``0:00:00`` vs ``00:00:00``. (:issue:`9570`)
- Allow ``Panel.shift`` with ``axis='items'`` (:issue:`9890`)
- Trying to write an excel file now raises ``NotImplementedError`` if the ``DataFrame`` has a ``MultiIndex`` instead of writing a broken Excel file. (:issue:`9794`)
@@ -329,11 +328,11 @@ Other enhancements
API changes
~~~~~~~~~~~
-- When passing in an ax to ``df.plot( ..., ax=ax)``, the `sharex` kwarg will now default to `False`.
+- When passing in an ax to ``df.plot( ..., ax=ax)``, the ``sharex`` kwarg will now default to ``False``.
The result is that the visibility of xlabels and xticklabels will no longer be changed. You
have to do that yourself for the right axes in your figure or set ``sharex=True`` explicitly
(but this changes the visibility for all axes in the figure, not only the one which is passed in!).
- If pandas creates the subplots itself (e.g. no passed in `ax` kwarg), then the
+ If pandas creates the subplots itself (e.g. when no ``ax`` kwarg is passed in), then the
default is still ``sharex=True`` and the visibility changes are applied.
- :meth:`~pandas.DataFrame.assign` now inserts new columns in alphabetical order. Previously
@@ -382,19 +381,16 @@ New behavior
.. ipython:: python
- pd.set_option('display.width', 80)
- pd.Index(range(4), name='foo')
- pd.Index(range(30), name='foo')
- pd.Index(range(104), name='foo')
- pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'],
- ordered=True, name='foobar')
- pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'] * 10,
- ordered=True, name='foobar')
- pd.CategoricalIndex(['a', 'bb', 'ccc', 'dddd'] * 100,
- ordered=True, name='foobar')
- pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
- pd.date_range('20130101', periods=25, freq='D')
- pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
+ pd.set_option("display.width", 80)
+ pd.Index(range(4), name="foo")
+ pd.Index(range(30), name="foo")
+ pd.Index(range(104), name="foo")
+ pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
+ pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
+ pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
+ pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
+ pd.date_range("20130101", periods=25, freq="D")
+ pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
.. _whatsnew_0161.performance:
@@ -442,7 +438,7 @@ Bug fixes
- Bug in ``read_csv`` and ``read_table`` when using ``skip_rows`` parameter if blank lines are present. (:issue:`9832`)
- Bug in ``read_csv()`` interprets ``index_col=True`` as ``1`` (:issue:`9798`)
- Bug in index equality comparisons using ``==`` failing on Index/MultiIndex type incompatibility (:issue:`9785`)
-- Bug in which ``SparseDataFrame`` could not take `nan` as a column name (:issue:`8822`)
+- Bug in which ``SparseDataFrame`` could not take ``nan`` as a column name (:issue:`8822`)
- Bug in ``to_msgpack`` and ``read_msgpack`` zlib and blosc compression support (:issue:`9783`)
- Bug ``GroupBy.size`` doesn't attach index name properly if grouped by ``TimeGrouper`` (:issue:`9925`)
- Bug causing an exception in slice assignments because ``length_of_indexer`` returns wrong results (:issue:`9995`)
diff --git a/doc/source/whatsnew/v0.16.2.rst b/doc/source/whatsnew/v0.16.2.rst
index a3c34db09f555..bb2aa166419b4 100644
--- a/doc/source/whatsnew/v0.16.2.rst
+++ b/doc/source/whatsnew/v0.16.2.rst
@@ -48,9 +48,10 @@ This can be rewritten as
.. code-block:: python
- (df.pipe(h) # noqa F821
- .pipe(g, arg1=1) # noqa F821
- .pipe(f, arg2=2, arg3=3) # noqa F821
+ (
+ df.pipe(h) # noqa F821
+ .pipe(g, arg1=1) # noqa F821
+ .pipe(f, arg2=2, arg3=3) # noqa F821
)
Now both the code and the logic flow from top to bottom. Keyword arguments are next to
@@ -64,15 +65,16 @@ of ``(function, keyword)`` indicating where the DataFrame should flow. For examp
import statsmodels.formula.api as sm
- bb = pd.read_csv('data/baseball.csv', index_col='id')
+ bb = pd.read_csv("data/baseball.csv", index_col="id")
# sm.ols takes (formula, data)
- (bb.query('h > 0')
- .assign(ln_h=lambda df: np.log(df.h))
- .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
- .fit()
- .summary()
- )
+ (
+ bb.query("h > 0")
+ .assign(ln_h=lambda df: np.log(df.h))
+ .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
+ .fit()
+ .summary()
+ )
The pipe method is inspired by unix pipes, which stream text through
processes. More recently dplyr_ and magrittr_ have introduced the
@@ -89,7 +91,7 @@ See the :ref:`documentation ` for more. (:issue:`10129`)
Other enhancements
^^^^^^^^^^^^^^^^^^
-- Added `rsplit` to Index/Series StringMethods (:issue:`10303`)
+- Added ``rsplit`` to Index/Series StringMethods (:issue:`10303`)
- Removed the hard-coded size limits on the ``DataFrame`` HTML representation
in the IPython notebook, and leave this to IPython itself (only for IPython
diff --git a/doc/source/whatsnew/v0.17.0.rst b/doc/source/whatsnew/v0.17.0.rst
index 11c252192be6b..1658f877f5523 100644
--- a/doc/source/whatsnew/v0.17.0.rst
+++ b/doc/source/whatsnew/v0.17.0.rst
@@ -40,7 +40,7 @@ Highlights include:
- Plotting methods are now available as attributes of the ``.plot`` accessor, see :ref:`here `
- The sorting API has been revamped to remove some long-time inconsistencies, see :ref:`here `
- Support for a ``datetime64[ns]`` with timezones as a first-class dtype, see :ref:`here `
-- The default for ``to_datetime`` will now be to ``raise`` when presented with unparseable formats,
+- The default for ``to_datetime`` will now be to ``raise`` when presented with unparsable formats,
previously this would return the original input. Also, date parse
functions now return consistent results. See :ref:`here `
- The default for ``dropna`` in ``HDFStore`` has changed to ``False``, to store by default all rows even
@@ -80,9 +80,13 @@ The new implementation allows for having a single-timezone across all rows, with
.. ipython:: python
- df = pd.DataFrame({'A': pd.date_range('20130101', periods=3),
- 'B': pd.date_range('20130101', periods=3, tz='US/Eastern'),
- 'C': pd.date_range('20130101', periods=3, tz='CET')})
+ df = pd.DataFrame(
+ {
+ "A": pd.date_range("20130101", periods=3),
+ "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
+ "C": pd.date_range("20130101", periods=3, tz="CET"),
+ }
+ )
df
df.dtypes
@@ -95,8 +99,8 @@ This uses a new-dtype representation as well, that is very similar in look-and-f
.. ipython:: python
- df['B'].dtype
- type(df['B'].dtype)
+ df["B"].dtype
+ type(df["B"].dtype)
.. note::
@@ -119,8 +123,8 @@ This uses a new-dtype representation as well, that is very similar in look-and-f
.. ipython:: python
- pd.date_range('20130101', periods=3, tz='US/Eastern')
- pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
+ pd.date_range("20130101", periods=3, tz="US/Eastern")
+ pd.date_range("20130101", periods=3, tz="US/Eastern").dtype
.. _whatsnew_0170.gil:
@@ -138,9 +142,10 @@ as well as the ``.sum()`` operation.
N = 1000000
ngroups = 10
- df = DataFrame({'key': np.random.randint(0, ngroups, size=N),
- 'data': np.random.randn(N)})
- df.groupby('key')['data'].sum()
+ df = DataFrame(
+ {"key": np.random.randint(0, ngroups, size=N), "data": np.random.randn(N)}
+ )
+ df.groupby("key")["data"].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT_), or for performing multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask_ library.
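As a rough, hedged sketch (not taken from the original example), two grouped sums submitted to worker threads can overlap where the aggregation releases the GIL, reusing ``df`` and ``N`` from the snippet above:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    # split the frame and aggregate each half on its own thread
    halves = [df.iloc[: N // 2], df.iloc[N // 2:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        parts = list(pool.map(lambda part: part.groupby("key")["data"].sum(), halves))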
@@ -189,16 +194,16 @@ We are now supporting a ``Series.dt.strftime`` method for datetime-likes to gene
.. ipython:: python
# DatetimeIndex
- s = pd.Series(pd.date_range('20130101', periods=4))
+ s = pd.Series(pd.date_range("20130101", periods=4))
s
- s.dt.strftime('%Y/%m/%d')
+ s.dt.strftime("%Y/%m/%d")
.. ipython:: python
# PeriodIndex
- s = pd.Series(pd.period_range('20130101', periods=4))
+ s = pd.Series(pd.period_range("20130101", periods=4))
s
- s.dt.strftime('%Y/%m/%d')
+ s.dt.strftime("%Y/%m/%d")
The string format follows the Python standard library, and details can be found `here `_
@@ -210,7 +215,7 @@ Series.dt.total_seconds
.. ipython:: python
# TimedeltaIndex
- s = pd.Series(pd.timedelta_range('1 minutes', periods=4))
+ s = pd.Series(pd.timedelta_range("1 minutes", periods=4))
s
s.dt.total_seconds()
@@ -225,18 +230,18 @@ A multiplied freq represents a span of corresponding length. The example below c
.. ipython:: python
- p = pd.Period('2015-08-01', freq='3D')
+ p = pd.Period("2015-08-01", freq="3D")
p
p + 1
p - 2
p.to_timestamp()
- p.to_timestamp(how='E')
+ p.to_timestamp(how="E")
You can use the multiplied freq in ``PeriodIndex`` and ``period_range``.
.. ipython:: python
- idx = pd.period_range('2015-08-01', periods=4, freq='2D')
+ idx = pd.period_range("2015-08-01", periods=4, freq="2D")
idx
idx + 1
@@ -249,14 +254,14 @@ Support for SAS XPORT files
.. code-block:: python
- df = pd.read_sas('sas_xport.xpt')
+ df = pd.read_sas("sas_xport.xpt")
It is also possible to obtain an iterator and read an XPORT file
incrementally.
.. code-block:: python
- for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
+ for df in pd.read_sas("sas_xport.xpt", chunksize=10000):
do_something(df)
See the :ref:`docs ` for more details.
@@ -270,12 +275,12 @@ Support for math functions in .eval()
.. code-block:: python
- df = pd.DataFrame({'a': np.random.randn(10)})
+ df = pd.DataFrame({"a": np.random.randn(10)})
df.eval("b = sin(a)")
-The support math functions are `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
-`sqrt`, `sinh`, `cosh`, `tanh`, `arcsin`, `arccos`, `arctan`, `arccosh`,
-`arcsinh`, `arctanh`, `abs` and `arctan2`.
+The supported math functions are ``sin``, ``cos``, ``exp``, ``log``, ``expm1``, ``log1p``,
+``sqrt``, ``sinh``, ``cosh``, ``tanh``, ``arcsin``, ``arccos``, ``arctan``, ``arccosh``,
+``arcsinh``, ``arctanh``, ``abs`` and ``arctan2``.
These functions map to the intrinsics for the ``NumExpr`` engine. For the Python
engine, they are mapped to ``NumPy`` calls.
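A hedged illustration of selecting the engine explicitly (the first call assumes ``numexpr`` is installed):

.. code-block:: python

    df = pd.DataFrame({"a": np.random.randn(10)})
    df.eval("sin(a)", engine="numexpr")  # evaluated by the NumExpr intrinsic
    df.eval("sin(a)", engine="python")   # mapped to np.sin by the Python engine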
@@ -292,23 +297,26 @@ See the :ref:`documentation ` for more details.
.. ipython:: python
- df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
- columns=pd.MultiIndex.from_product(
- [['foo', 'bar'], ['a', 'b']], names=['col1', 'col2']),
- index=pd.MultiIndex.from_product([['j'], ['l', 'k']],
- names=['i1', 'i2']))
+ df = pd.DataFrame(
+ [[1, 2, 3, 4], [5, 6, 7, 8]],
+ columns=pd.MultiIndex.from_product(
+ [["foo", "bar"], ["a", "b"]], names=["col1", "col2"]
+ ),
+ index=pd.MultiIndex.from_product([["j"], ["l", "k"]], names=["i1", "i2"]),
+ )
df
- df.to_excel('test.xlsx')
+ df.to_excel("test.xlsx")
- df = pd.read_excel('test.xlsx', header=[0, 1], index_col=[0, 1])
+ df = pd.read_excel("test.xlsx", header=[0, 1], index_col=[0, 1])
df
.. ipython:: python
:suppress:
import os
- os.remove('test.xlsx')
+
+ os.remove("test.xlsx")
Previously, it was necessary to specify the ``has_index_names`` argument in ``read_excel``,
if the serialized data had index names. For version 0.17.0 the output format of ``to_excel``
@@ -354,14 +362,14 @@ Some East Asian countries use Unicode characters its width is corresponding to 2
.. ipython:: python
- df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})
+ df = pd.DataFrame({u"国籍": ["UK", u"日本"], u"名前": ["Alice", u"しのぶ"]})
df;
.. image:: ../_static/option_unicode01.png
.. ipython:: python
- pd.set_option('display.unicode.east_asian_width', True)
+ pd.set_option("display.unicode.east_asian_width", True)
df;
.. image:: ../_static/option_unicode02.png
@@ -371,7 +379,7 @@ For further details, see :ref:`here `
.. ipython:: python
:suppress:
- pd.set_option('display.unicode.east_asian_width', False)
+ pd.set_option("display.unicode.east_asian_width", False)
.. _whatsnew_0170.enhancements.other:
@@ -391,9 +399,9 @@ Other enhancements
.. ipython:: python
- df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
- df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
- pd.merge(df1, df2, on='col1', how='outer', indicator=True)
+ df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
+ df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
+ pd.merge(df1, df2, on="col1", how="outer", indicator=True)
For more, see the :ref:`updated docs `
@@ -407,7 +415,7 @@ Other enhancements
.. ipython:: python
- foo = pd.Series([1, 2], name='foo')
+ foo = pd.Series([1, 2], name="foo")
bar = pd.Series([1, 2])
baz = pd.Series([4, 5])
@@ -434,46 +442,43 @@ Other enhancements
.. ipython:: python
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])
- ser.interpolate(limit=1, limit_direction='both')
+ ser.interpolate(limit=1, limit_direction="both")
- Added a ``DataFrame.round`` method to round the values to a variable number of decimal places (:issue:`10568`).
.. ipython:: python
- df = pd.DataFrame(np.random.random([3, 3]),
- columns=['A', 'B', 'C'],
- index=['first', 'second', 'third'])
+ df = pd.DataFrame(
+ np.random.random([3, 3]),
+ columns=["A", "B", "C"],
+ index=["first", "second", "third"],
+ )
df
df.round(2)
- df.round({'A': 0, 'C': 2})
+ df.round({"A": 0, "C": 2})
- ``drop_duplicates`` and ``duplicated`` now accept a ``keep`` keyword to target first, last, and all duplicates. The ``take_last`` keyword is deprecated, see :ref:`here ` (:issue:`6511`, :issue:`8505`)
.. ipython:: python
- s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D'])
+ s = pd.Series(["A", "B", "C", "A", "B", "D"])
s.drop_duplicates()
- s.drop_duplicates(keep='last')
+ s.drop_duplicates(keep="last")
s.drop_duplicates(keep=False)
- Reindex now has a ``tolerance`` argument that allows for finer control of :ref:`basics.limits_on_reindex_fill` (:issue:`10411`):
.. ipython:: python
- df = pd.DataFrame({'x': range(5),
- 't': pd.date_range('2000-01-01', periods=5)})
- df.reindex([0.1, 1.9, 3.5],
- method='nearest',
- tolerance=0.2)
+ df = pd.DataFrame({"x": range(5), "t": pd.date_range("2000-01-01", periods=5)})
+ df.reindex([0.1, 1.9, 3.5], method="nearest", tolerance=0.2)
When used on a ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``, ``tolerance`` will be coerced into a ``Timedelta`` if possible. This allows you to specify tolerance with a string:
.. ipython:: python
- df = df.set_index('t')
- df.reindex(pd.to_datetime(['1999-12-31']),
- method='nearest',
- tolerance='1 day')
+ df = df.set_index("t")
+ df.reindex(pd.to_datetime(["1999-12-31"]), method="nearest", tolerance="1 day")
``tolerance`` is also exposed by the lower level ``Index.get_indexer`` and ``Index.get_loc`` methods.
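For instance (an illustrative sketch, not taken from the original):

.. code-block:: python

    idx = pd.Index([1.0, 2.0, 3.0])
    # nearest label within 0.2 of the requested key
    idx.get_loc(2.1, method="nearest", tolerance=0.2)
    # -1 marks targets with no label inside the tolerance
    idx.get_indexer([0.9, 2.5], method="nearest", tolerance=0.2)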
@@ -519,7 +524,7 @@ Other enhancements
- ``DataFrame.apply`` will return a Series of dicts if the passed function returns a dict and ``reduce=True`` (:issue:`8735`).
-- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
+- Allow passing ``kwargs`` to the interpolation methods (:issue:`10378`).
- Improved error message when concatenating an empty iterable of ``DataFrame`` objects (:issue:`9157`)
@@ -627,13 +632,13 @@ Of course you can coerce this as well.
.. ipython:: python
- pd.to_datetime(['2009-07-31', 'asd'], errors='coerce')
+ pd.to_datetime(["2009-07-31", "asd"], errors="coerce")
To keep the previous behavior, you can use ``errors='ignore'``:
.. ipython:: python
- pd.to_datetime(['2009-07-31', 'asd'], errors='ignore')
+ pd.to_datetime(["2009-07-31", "asd"], errors="ignore")
Furthermore, ``pd.to_timedelta`` has gained a similar API, of ``errors='raise'|'ignore'|'coerce'``, and the ``coerce`` keyword
has been deprecated in favor of ``errors='coerce'``.
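For example (a small sketch mirroring the ``to_datetime`` calls above):

.. code-block:: python

    # unparseable values become NaT instead of raising
    pd.to_timedelta(["1 days", "foo"], errors="coerce")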
@@ -667,9 +672,9 @@ New behavior:
.. ipython:: python
- pd.Timestamp('2012Q2')
- pd.Timestamp('2014')
- pd.DatetimeIndex(['2012Q2', '2014'])
+ pd.Timestamp("2012Q2")
+ pd.Timestamp("2014")
+ pd.DatetimeIndex(["2012Q2", "2014"])
.. note::
@@ -678,6 +683,7 @@ New behavior:
.. ipython:: python
import pandas.tseries.offsets as offsets
+
pd.Timestamp.now()
pd.Timestamp.now() + offsets.DateOffset(years=1)
@@ -762,7 +768,7 @@ Usually you simply want to know which values are null.
.. warning::
You generally will want to use ``isnull/notnull`` for these types of comparisons, as ``isnull/notnull`` tells you which elements are null. One has to be
- mindful that ``nan's`` don't compare equal, but ``None's`` do. Note that Pandas/numpy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``.
+ mindful that ``nan's`` don't compare equal, but ``None's`` do. Note that pandas/numpy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``.
.. ipython:: python
@@ -780,8 +786,7 @@ Previous behavior:
.. ipython:: python
- df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
- 'col2': [1, np.nan, np.nan]})
+ df_with_missing = pd.DataFrame({"col1": [0, np.nan, 2], "col2": [1, np.nan, np.nan]})
df_with_missing
@@ -806,18 +811,16 @@ New behavior:
.. ipython:: python
- df_with_missing.to_hdf('file.h5',
- 'df_with_missing',
- format='table',
- mode='w')
+ df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")
- pd.read_hdf('file.h5', 'df_with_missing')
+ pd.read_hdf("file.h5", "df_with_missing")
.. ipython:: python
:suppress:
import os
- os.remove('file.h5')
+
+ os.remove("file.h5")
See the :ref:`docs ` for more details.
@@ -848,8 +851,8 @@ regular formatting as well as scientific notation, similar to how numpy's ``prec
.. ipython:: python
- pd.set_option('display.precision', 2)
- pd.DataFrame({'x': [123.456789]})
+ pd.set_option("display.precision", 2)
+ pd.DataFrame({"x": [123.456789]})
To preserve output behavior with prior versions the default value of ``display.precision`` has been reduced to ``6``
from ``7``.
@@ -857,7 +860,7 @@ from ``7``.
.. ipython:: python
:suppress:
- pd.set_option('display.precision', 6)
+ pd.set_option("display.precision", 6)
.. _whatsnew_0170.api_breaking.categorical_unique:
@@ -871,14 +874,11 @@ Changes to ``Categorical.unique``
.. ipython:: python
- cat = pd.Categorical(['C', 'A', 'B', 'C'],
- categories=['A', 'B', 'C'],
- ordered=True)
+ cat = pd.Categorical(["C", "A", "B", "C"], categories=["A", "B", "C"], ordered=True)
cat
cat.unique()
- cat = pd.Categorical(['C', 'A', 'B', 'C'],
- categories=['A', 'B', 'C'])
+ cat = pd.Categorical(["C", "A", "B", "C"], categories=["A", "B", "C"])
cat
cat.unique()
@@ -909,7 +909,7 @@ Other API changes
- The metadata properties of subclasses of pandas objects will now be serialized (:issue:`10553`).
- ``groupby`` using ``Categorical`` follows the same rule as ``Categorical.unique`` described above (:issue:`10508`)
- Previously, constructing a ``DataFrame`` with an array of ``complex64`` dtype meant the corresponding column
- was automatically promoted to the ``complex128`` dtype. Pandas will now preserve the itemsize of the input for complex data (:issue:`10952`)
+ was automatically promoted to the ``complex128`` dtype. pandas will now preserve the itemsize of the input for complex data (:issue:`10952`)
- Some numeric reduction operators would return ``ValueError``, rather than ``TypeError``, on object types that include strings and numbers (:issue:`11131`)
- Passing currently unsupported ``chunksize`` argument to ``read_excel`` or ``ExcelFile.parse`` will now raise ``NotImplementedError`` (:issue:`8011`)
- Allow an ``ExcelFile`` object to be passed into ``read_excel`` (:issue:`11198`)
@@ -980,9 +980,11 @@ Removal of prior version deprecations/changes
.. ipython:: python
np.random.seed(1234)
- df = pd.DataFrame(np.random.randn(5, 2),
- columns=list('AB'),
- index=pd.date_range('2013-01-01', periods=5))
+ df = pd.DataFrame(
+ np.random.randn(5, 2),
+ columns=list("AB"),
+ index=pd.date_range("2013-01-01", periods=5),
+ )
df
Previously
@@ -1005,7 +1007,7 @@ Removal of prior version deprecations/changes
.. ipython:: python
- df.add(df.A, axis='index')
+ df.add(df.A, axis="index")
- Remove ``table`` keyword in ``HDFStore.put/append``, in favor of using ``format=`` (:issue:`4645`)
diff --git a/doc/source/whatsnew/v0.17.1.rst b/doc/source/whatsnew/v0.17.1.rst
index 5d15a01aee5a0..6b0a28ec47568 100644
--- a/doc/source/whatsnew/v0.17.1.rst
+++ b/doc/source/whatsnew/v0.17.1.rst
@@ -52,8 +52,8 @@ Here's a quick example:
.. ipython:: python
np.random.seed(123)
- df = pd.DataFrame(np.random.randn(10, 5), columns=list('abcde'))
- html = df.style.background_gradient(cmap='viridis', low=.5)
+ df = pd.DataFrame(np.random.randn(10, 5), columns=list("abcde"))
+ html = df.style.background_gradient(cmap="viridis", low=0.5)
We can render the HTML to get the following table.
@@ -80,14 +80,14 @@ Enhancements
.. ipython:: python
- df = pd.DataFrame({'A': ['foo'] * 1000}) # noqa: F821
- df['B'] = df['A'].astype('category')
+ df = pd.DataFrame({"A": ["foo"] * 1000}) # noqa: F821
+ df["B"] = df["A"].astype("category")
# shows the '+' as we have object dtypes
df.info()
# we have an accurate memory assessment (but this can be expensive to compute)
- df.info(memory_usage='deep')
+ df.info(memory_usage="deep")
- ``Index`` now has a ``fillna`` method (:issue:`10089`)
@@ -99,11 +99,11 @@ Enhancements
.. ipython:: python
- s = pd.Series(list('aabb')).astype('category')
+ s = pd.Series(list("aabb")).astype("category")
s
s.str.contains("a")
- date = pd.Series(pd.date_range('1/1/2015', periods=5)).astype('category')
+ date = pd.Series(pd.date_range("1/1/2015", periods=5)).astype("category")
date
date.dt.day
diff --git a/doc/source/whatsnew/v0.18.0.rst b/doc/source/whatsnew/v0.18.0.rst
index fbe24675ddfe2..ef5242b0e33c8 100644
--- a/doc/source/whatsnew/v0.18.0.rst
+++ b/doc/source/whatsnew/v0.18.0.rst
@@ -290,7 +290,7 @@ A new, friendlier ``ValueError`` is added to protect against the mistake of supp
.. code-block:: ipython
In [2]: pd.Series(['a', 'b', np.nan, 'c']).str.cat(' ')
- ValueError: Did you mean to supply a `sep` keyword?
+ ValueError: Did you mean to supply a ``sep`` keyword?
.. _whatsnew_0180.enhancements.rounding:
diff --git a/doc/source/whatsnew/v0.18.1.rst b/doc/source/whatsnew/v0.18.1.rst
index 13ed6bc38163b..3db00f686d62c 100644
--- a/doc/source/whatsnew/v0.18.1.rst
+++ b/doc/source/whatsnew/v0.18.1.rst
@@ -42,6 +42,7 @@ see :ref:`Custom Business Hour ` (:issue:`11514`)
from pandas.tseries.offsets import CustomBusinessHour
from pandas.tseries.holiday import USFederalHolidayCalendar
+
bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())
Friday before MLK Day
@@ -49,6 +50,7 @@ Friday before MLK Day
.. ipython:: python
import datetime
+
dt = datetime.datetime(2014, 1, 17, 15)
dt + bhour_us
@@ -72,41 +74,42 @@ Previously you would have to do this to get a rolling window mean per-group:
.. ipython:: python
- df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
- 'B': np.arange(40)})
+ df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
df
.. ipython:: python
- df.groupby('A').apply(lambda x: x.rolling(4).B.mean())
+ df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
Now you can do:
.. ipython:: python
- df.groupby('A').rolling(4).B.mean()
+ df.groupby("A").rolling(4).B.mean()
For ``.resample(..)`` type of operations, previously you would have to:
.. ipython:: python
- df = pd.DataFrame({'date': pd.date_range(start='2016-01-01',
- periods=4,
- freq='W'),
- 'group': [1, 1, 2, 2],
- 'val': [5, 6, 7, 8]}).set_index('date')
+ df = pd.DataFrame(
+ {
+ "date": pd.date_range(start="2016-01-01", periods=4, freq="W"),
+ "group": [1, 1, 2, 2],
+ "val": [5, 6, 7, 8],
+ }
+ ).set_index("date")
df
.. ipython:: python
- df.groupby('group').apply(lambda x: x.resample('1D').ffill())
+ df.groupby("group").apply(lambda x: x.resample("1D").ffill())
Now you can do:
.. ipython:: python
- df.groupby('group').resample('1D').ffill()
+ df.groupby("group").resample("1D").ffill()
.. _whatsnew_0181.enhancements.method_chain:
@@ -129,9 +132,7 @@ arguments.
.. ipython:: python
- df = pd.DataFrame({'A': [1, 2, 3],
- 'B': [4, 5, 6],
- 'C': [7, 8, 9]})
+ df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.where(lambda x: x > 4, lambda x: x + 10)
Methods ``.loc[]``, ``.iloc[]``, ``.ix[]``
@@ -146,7 +147,7 @@ can return a valid boolean indexer or anything which is valid for these indexer'
df.loc[lambda x: x.A >= 2, lambda x: x.sum() > 10]
# callable returns list of labels
- df.loc[lambda x: [1, 2], lambda x: ['A', 'B']]
+ df.loc[lambda x: [1, 2], lambda x: ["A", "B"]]
Indexing with ``[]``
""""""""""""""""""""
@@ -157,17 +158,15 @@ class and index type.
.. ipython:: python
- df[lambda x: 'A']
+ df[lambda x: "A"]
Using these methods / indexers, you can chain data selection operations
without using a temporary variable.
.. ipython:: python
- bb = pd.read_csv('data/baseball.csv', index_col='id')
- (bb.groupby(['year', 'team'])
- .sum()
- .loc[lambda df: df.r > 100])
+ bb = pd.read_csv("data/baseball.csv", index_col="id")
+ (bb.groupby(["year", "team"]).sum().loc[lambda df: df.r > 100])
.. _whatsnew_0181.partial_string_indexing:
@@ -180,13 +179,13 @@ Partial string indexing now matches on ``DateTimeIndex`` when part of a ``MultiI
dft2 = pd.DataFrame(
np.random.randn(20, 1),
- columns=['A'],
- index=pd.MultiIndex.from_product([pd.date_range('20130101',
- periods=10,
- freq='12H'),
- ['a', 'b']]))
+ columns=["A"],
+ index=pd.MultiIndex.from_product(
+ [pd.date_range("20130101", periods=10, freq="12H"), ["a", "b"]]
+ ),
+ )
dft2
- dft2.loc['2013-01-05']
+ dft2.loc["2013-01-05"]
On other levels
@@ -195,7 +194,7 @@ On other levels
idx = pd.IndexSlice
dft2 = dft2.swaplevel(0, 1).sort_index()
dft2
- dft2.loc[idx[:, '2013-01-05'], :]
+ dft2.loc[idx[:, "2013-01-05"], :]
.. _whatsnew_0181.enhancements.assembling:
@@ -206,10 +205,9 @@ Assembling datetimes
.. ipython:: python
- df = pd.DataFrame({'year': [2015, 2016],
- 'month': [2, 3],
- 'day': [4, 5],
- 'hour': [2, 3]})
+ df = pd.DataFrame(
+ {"year": [2015, 2016], "month": [2, 3], "day": [4, 5], "hour": [2, 3]}
+ )
df
Assembling using the passed frame.
@@ -222,7 +220,7 @@ You can pass only the columns that you need to assemble.
.. ipython:: python
- pd.to_datetime(df[['year', 'month', 'day']])
+ pd.to_datetime(df[["year", "month", "day"]])
.. _whatsnew_0181.other:
@@ -243,7 +241,7 @@ Other enhancements
.. ipython:: python
- idx = pd.Index([1., 2., 3., 4.], dtype='float')
+ idx = pd.Index([1.0, 2.0, 3.0, 4.0], dtype="float")
# default, allow_fill=True, fill_value=None
idx.take([2, -1])
@@ -253,8 +251,8 @@ Other enhancements
.. ipython:: python
- idx = pd.Index(['a|b', 'a|c', 'b|c'])
- idx.str.get_dummies('|')
+ idx = pd.Index(["a|b", "a|c", "b|c"])
+ idx.str.get_dummies("|")
- ``pd.crosstab()`` has gained a ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in the updated docs :ref:`here `.
@@ -313,8 +311,7 @@ The index in ``.groupby(..).nth()`` output is now more consistent when the ``as_
.. ipython:: python
- df = pd.DataFrame({'A': ['a', 'b', 'a'],
- 'B': [1, 2, 3]})
+ df = pd.DataFrame({"A": ["a", "b", "a"], "B": [1, 2, 3]})
df
Previous behavior:
@@ -337,16 +334,16 @@ New behavior:
.. ipython:: python
- df.groupby('A', as_index=True)['B'].nth(0)
- df.groupby('A', as_index=False)['B'].nth(0)
+ df.groupby("A", as_index=True)["B"].nth(0)
+ df.groupby("A", as_index=False)["B"].nth(0)
Furthermore, previously, a ``.groupby`` would always sort, regardless of whether ``sort=False`` was passed with ``.nth()``.
.. ipython:: python
np.random.seed(1234)
- df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])
- df['c'] = np.random.randint(0, 4, 100)
+ df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])
+ df["c"] = np.random.randint(0, 4, 100)
Previous behavior:
@@ -374,8 +371,8 @@ New behavior:
.. ipython:: python
- df.groupby('c', sort=True).nth(1)
- df.groupby('c', sort=False).nth(1)
+ df.groupby("c", sort=True).nth(1)
+ df.groupby("c", sort=False).nth(1)
.. _whatsnew_0181.numpy_compatibility:
@@ -421,8 +418,9 @@ Using ``apply`` on resampling groupby operations (using a ``pd.TimeGrouper``) no
.. ipython:: python
- df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
- 'value': [10, 13]})
+ df = pd.DataFrame(
+ {"date": pd.to_datetime(["10/10/2000", "11/10/2000"]), "value": [10, 13]}
+ )
df
Previous behavior:
diff --git a/doc/source/whatsnew/v0.19.0.rst b/doc/source/whatsnew/v0.19.0.rst
index 6e8c4273a0550..08ccc1565125f 100644
--- a/doc/source/whatsnew/v0.19.0.rst
+++ b/doc/source/whatsnew/v0.19.0.rst
@@ -49,10 +49,8 @@ except that we match on nearest key rather than equal keys.
.. ipython:: python
- left = pd.DataFrame({'a': [1, 5, 10],
- 'left_val': ['a', 'b', 'c']})
- right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
- 'right_val': [1, 2, 3, 6, 7]})
+ left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
+ right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
left
right
@@ -62,13 +60,13 @@ recent value otherwise.
.. ipython:: python
- pd.merge_asof(left, right, on='a')
+ pd.merge_asof(left, right, on="a")
We can also match rows ONLY with prior data, and not an exact match.
.. ipython:: python
- pd.merge_asof(left, right, on='a', allow_exact_matches=False)
+ pd.merge_asof(left, right, on="a", allow_exact_matches=False)
In a typical time-series example, we have ``trades`` and ``quotes`` and we want to ``asof-join`` them.
@@ -76,36 +74,44 @@ This also illustrates using the ``by`` parameter to group data before merging.
.. ipython:: python
- trades = pd.DataFrame({
- 'time': pd.to_datetime(['20160525 13:30:00.023',
- '20160525 13:30:00.038',
- '20160525 13:30:00.048',
- '20160525 13:30:00.048',
- '20160525 13:30:00.048']),
- 'ticker': ['MSFT', 'MSFT',
- 'GOOG', 'GOOG', 'AAPL'],
- 'price': [51.95, 51.95,
- 720.77, 720.92, 98.00],
- 'quantity': [75, 155,
- 100, 100, 100]},
- columns=['time', 'ticker', 'price', 'quantity'])
-
- quotes = pd.DataFrame({
- 'time': pd.to_datetime(['20160525 13:30:00.023',
- '20160525 13:30:00.023',
- '20160525 13:30:00.030',
- '20160525 13:30:00.041',
- '20160525 13:30:00.048',
- '20160525 13:30:00.049',
- '20160525 13:30:00.072',
- '20160525 13:30:00.075']),
- 'ticker': ['GOOG', 'MSFT', 'MSFT', 'MSFT',
- 'GOOG', 'AAPL', 'GOOG', 'MSFT'],
- 'bid': [720.50, 51.95, 51.97, 51.99,
- 720.50, 97.99, 720.50, 52.01],
- 'ask': [720.93, 51.96, 51.98, 52.00,
- 720.93, 98.01, 720.88, 52.03]},
- columns=['time', 'ticker', 'bid', 'ask'])
+ trades = pd.DataFrame(
+ {
+ "time": pd.to_datetime(
+ [
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.038",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.048",
+ ]
+ ),
+ "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
+ "price": [51.95, 51.95, 720.77, 720.92, 98.00],
+ "quantity": [75, 155, 100, 100, 100],
+ },
+ columns=["time", "ticker", "price", "quantity"],
+ )
+
+ quotes = pd.DataFrame(
+ {
+ "time": pd.to_datetime(
+ [
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.023",
+ "20160525 13:30:00.030",
+ "20160525 13:30:00.041",
+ "20160525 13:30:00.048",
+ "20160525 13:30:00.049",
+ "20160525 13:30:00.072",
+ "20160525 13:30:00.075",
+ ]
+ ),
+ "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
+ "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
+ "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
+ },
+ columns=["time", "ticker", "bid", "ask"],
+ )
.. ipython:: python
@@ -118,9 +124,7 @@ that forward filling happens automatically taking the most recent non-NaN value.
.. ipython:: python
- pd.merge_asof(trades, quotes,
- on='time',
- by='ticker')
+ pd.merge_asof(trades, quotes, on="time", by="ticker")
This returns a merged DataFrame with the entries in the same order as the original left
passed DataFrame (``trades`` in this case), with the fields of the ``quotes`` merged.
@@ -135,9 +139,10 @@ See the full documentation :ref:`here `.
.. ipython:: python
- dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
- index=pd.date_range('20130101 09:00:00',
- periods=5, freq='s'))
+ dft = pd.DataFrame(
+ {"B": [0, 1, 2, np.nan, 4]},
+ index=pd.date_range("20130101 09:00:00", periods=5, freq="s"),
+ )
dft
This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.
@@ -151,20 +156,26 @@ Specifying an offset allows a more intuitive specification of the rolling freque
.. ipython:: python
- dft.rolling('2s').sum()
+ dft.rolling("2s").sum()
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.
.. ipython:: python
- dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
- index=pd.Index([pd.Timestamp('20130101 09:00:00'),
- pd.Timestamp('20130101 09:00:02'),
- pd.Timestamp('20130101 09:00:03'),
- pd.Timestamp('20130101 09:00:05'),
- pd.Timestamp('20130101 09:00:06')],
- name='foo'))
+ dft = pd.DataFrame(
+ {"B": [0, 1, 2, np.nan, 4]},
+ index=pd.Index(
+ [
+ pd.Timestamp("20130101 09:00:00"),
+ pd.Timestamp("20130101 09:00:02"),
+ pd.Timestamp("20130101 09:00:03"),
+ pd.Timestamp("20130101 09:00:05"),
+ pd.Timestamp("20130101 09:00:06"),
+ ],
+ name="foo",
+ ),
+ )
dft
dft.rolling(2).sum()
@@ -173,7 +184,7 @@ Using the time-specification generates variable windows for this sparse data.
.. ipython:: python
- dft.rolling('2s').sum()
+ dft.rolling("2s").sum()
Furthermore, we now allow an optional ``on`` parameter to specify a column (rather than the
default of the index) in a DataFrame.
@@ -182,7 +193,7 @@ default of the index) in a DataFrame.
dft = dft.reset_index()
dft
- dft.rolling('2s', on='foo').sum()
+ dft.rolling("2s", on="foo").sum()
.. _whatsnew_0190.enhancements.read_csv_dupe_col_names_support:
@@ -199,8 +210,8 @@ they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :is
.. ipython:: python
- data = '0,1,2\n3,4,5'
- names = ['a', 'b', 'a']
+ data = "0,1,2\n3,4,5"
+ names = ["a", "b", "a"]
**Previous behavior**:
@@ -235,17 +246,22 @@ converting to ``Categorical`` after parsing. See the io :ref:`docs here ` (:issue:`10008`, :issue:`13156`)
@@ -388,7 +404,7 @@ Google BigQuery enhancements
Fine-grained NumPy errstate
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Previous versions of pandas would permanently silence numpy's ufunc error handling when ``pandas`` was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as ``NaN`` s. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the ``numpy.errstate`` context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas code base. (:issue:`13109`, :issue:`13145`)
+Previous versions of pandas would permanently silence numpy's ufunc error handling when ``pandas`` was imported. pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as ``NaN`` s. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the ``numpy.errstate`` context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas code base. (:issue:`13109`, :issue:`13145`)
After upgrading pandas, you may see *new* ``RuntimeWarnings`` being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use `numpy.errstate `__ around the source of the ``RuntimeWarning`` to control how these conditions are handled.
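A minimal sketch of that pattern (illustrative only):

.. code-block:: python

    import numpy as np

    with np.errstate(invalid="ignore"):
        # the invalid-value RuntimeWarning from the negative input is suppressed
        np.sqrt(np.array([-1.0, 4.0]))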
@@ -415,7 +431,7 @@ The ``pd.get_dummies`` function now returns dummy-encoded columns as small integ
.. ipython:: python
- pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
+ pd.get_dummies(["a", "b", "a", "c"]).dtypes
.. _whatsnew_0190.enhancements.to_numeric_downcast:
@@ -427,9 +443,9 @@ Downcast values to smallest possible dtype in ``to_numeric``
.. ipython:: python
- s = ['1', 2, 3]
- pd.to_numeric(s, downcast='unsigned')
- pd.to_numeric(s, downcast='integer')
+ s = ["1", 2, 3]
+ pd.to_numeric(s, downcast="unsigned")
+ pd.to_numeric(s, downcast="integer")
.. _whatsnew_0190.dev_api:
@@ -447,7 +463,8 @@ The following are now part of this API:
import pprint
from pandas.api import types
- funcs = [f for f in dir(types) if not f.startswith('_')]
+
+ funcs = [f for f in dir(types) if not f.startswith("_")]
pprint.pprint(funcs)
.. note::
@@ -472,16 +489,16 @@ Other enhancements
.. ipython:: python
- df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
- 'a': np.arange(5)},
- index=pd.MultiIndex.from_arrays([[1, 2, 3, 4, 5],
- pd.date_range('2015-01-01',
- freq='W',
- periods=5)
- ], names=['v', 'd']))
+ df = pd.DataFrame(
+ {"date": pd.date_range("2015-01-01", freq="W", periods=5), "a": np.arange(5)},
+ index=pd.MultiIndex.from_arrays(
+ [[1, 2, 3, 4, 5], pd.date_range("2015-01-01", freq="W", periods=5)],
+ names=["v", "d"],
+ ),
+ )
df
- df.resample('M', on='date').sum()
- df.resample('M', level='d').sum()
+ df.resample("M", on="date").sum()
+ df.resample("M", level="d").sum()
- The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials `__. See the docs for more details (:issue:`13577`).
- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raising a ``NonExistentTimeError`` (:issue:`13057`)
@@ -507,10 +524,9 @@ Other enhancements
.. ipython:: python
- df = pd.DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]},
- index=['row1', 'row2'])
+ df = pd.DataFrame({"A": [2, 7], "B": [3, 5], "C": [4, 8]}, index=["row1", "row2"])
df
- df.sort_values(by='row2', axis=1)
+ df.sort_values(by="row2", axis=1)
- Added documentation to :ref:`I/O` regarding the perils of reading in columns with mixed dtypes and how to handle it (:issue:`13746`)
- :meth:`~DataFrame.to_html` now has a ``border`` argument to control the value in the opening ``<table>`` tag. The default is the value of the ``html.border`` option, which defaults to 1. This also affects the notebook HTML repr, but since Jupyter's CSS includes a border-width attribute, the visual effect is the same. (:issue:`11563`).
@@ -583,12 +599,12 @@ Arithmetic operators align both ``index`` (no changes).
.. ipython:: python
- s1 = pd.Series([1, 2, 3], index=list('ABC'))
- s2 = pd.Series([2, 2, 2], index=list('ABD'))
+ s1 = pd.Series([1, 2, 3], index=list("ABC"))
+ s2 = pd.Series([2, 2, 2], index=list("ABD"))
s1 + s2
- df1 = pd.DataFrame([1, 2, 3], index=list('ABC'))
- df2 = pd.DataFrame([2, 2, 2], index=list('ABD'))
+ df1 = pd.DataFrame([1, 2, 3], index=list("ABC"))
+ df2 = pd.DataFrame([2, 2, 2], index=list("ABD"))
df1 + df2
Comparison operators
@@ -661,8 +677,8 @@ Logical operators align both ``.index`` of left and right hand side.
.. ipython:: python
- s1 = pd.Series([True, False, True], index=list('ABC'))
- s2 = pd.Series([True, True, True], index=list('ABD'))
+ s1 = pd.Series([True, False, True], index=list("ABC"))
+ s2 = pd.Series([True, True, True], index=list("ABD"))
s1 & s2
.. note::
@@ -679,8 +695,8 @@ Logical operators align both ``.index`` of left and right hand side.
.. ipython:: python
- df1 = pd.DataFrame([True, False, True], index=list('ABC'))
- df2 = pd.DataFrame([True, True, True], index=list('ABD'))
+ df1 = pd.DataFrame([True, False, True], index=list("ABC"))
+ df2 = pd.DataFrame([True, True, True], index=list("ABD"))
df1 & df2
Flexible comparison methods
@@ -691,8 +707,8 @@ which has the different ``index``.
.. ipython:: python
- s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
- s2 = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
+ s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
+ s2 = pd.Series([2, 2, 2], index=["b", "c", "d"])
s1.eq(s2)
s1.ge(s2)
@@ -749,7 +765,7 @@ This will now convert integers/floats with the default unit of ``ns``.
.. ipython:: python
- pd.to_datetime([1, 'foo'], errors='coerce')
+ pd.to_datetime([1, "foo"], errors="coerce")
Bug fixes related to ``.to_datetime()``:
@@ -768,9 +784,9 @@ Merging will now preserve the dtype of the join keys (:issue:`8596`)
.. ipython:: python
- df1 = pd.DataFrame({'key': [1], 'v1': [10]})
+ df1 = pd.DataFrame({"key": [1], "v1": [10]})
df1
- df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})
+ df2 = pd.DataFrame({"key": [1, 2], "v1": [20, 30]})
df2
**Previous behavior**:
@@ -796,16 +812,16 @@ We are able to preserve the join keys
.. ipython:: python
- pd.merge(df1, df2, how='outer')
- pd.merge(df1, df2, how='outer').dtypes
+ pd.merge(df1, df2, how="outer")
+ pd.merge(df1, df2, how="outer").dtypes
Of course, if missing values are introduced, then the
resulting dtype will be upcast, which is unchanged from previous versions.
.. ipython:: python
- pd.merge(df1, df2, how='outer', on='key')
- pd.merge(df1, df2, how='outer', on='key').dtypes
+ pd.merge(df1, df2, how="outer", on="key")
+ pd.merge(df1, df2, how="outer", on="key").dtypes
.. _whatsnew_0190.api.describe:
@@ -889,7 +905,7 @@ As a consequence of this change, ``PeriodIndex`` no longer has an integer dtype:
.. ipython:: python
- pi = pd.PeriodIndex(['2016-08-01'], freq='D')
+ pi = pd.PeriodIndex(["2016-08-01"], freq="D")
pi
pd.api.types.is_integer_dtype(pi)
pd.api.types.is_period_dtype(pi)
@@ -916,7 +932,7 @@ These result in ``pd.NaT`` without providing ``freq`` option.
.. ipython:: python
- pd.Period('NaT')
+ pd.Period("NaT")
pd.Period(None)
@@ -955,7 +971,7 @@ of integers (:issue:`13988`).
.. ipython:: python
- pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')
+ pi = pd.PeriodIndex(["2011-01", "2011-02"], freq="M")
pi.values
@@ -985,7 +1001,7 @@ Previous behavior:
.. ipython:: python
- pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
+ pd.Index(["a", "b"]) + pd.Index(["a", "c"])
Note that numeric Index objects already performed element-wise operations.
For example, the behavior of adding two integer Indexes is unchanged.
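For example (an illustrative sketch):

.. code-block:: python

    pd.Index([1, 2, 3]) + pd.Index([10, 20, 30])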
@@ -1011,8 +1027,10 @@ DatetimeIndex objects resulting in a TimedeltaIndex:
.. ipython:: python
- (pd.DatetimeIndex(['2016-01-01', '2016-01-02'])
- - pd.DatetimeIndex(['2016-01-02', '2016-01-03']))
+ (
+ pd.DatetimeIndex(["2016-01-01", "2016-01-02"])
+ - pd.DatetimeIndex(["2016-01-02", "2016-01-03"])
+ )
.. _whatsnew_0190.api.difference:
@@ -1073,8 +1091,7 @@ Previously, most ``Index`` classes returned ``np.ndarray``, and ``DatetimeIndex`
.. ipython:: python
pd.Index([1, 2, 3]).unique()
- pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'],
- tz='Asia/Tokyo').unique()
+ pd.DatetimeIndex(["2011-01-01", "2011-01-02", "2011-01-03"], tz="Asia/Tokyo").unique()
.. _whatsnew_0190.api.multiindex:
@@ -1086,8 +1103,8 @@ in ``MultiIndex`` levels (:issue:`13743`, :issue:`13854`).
.. ipython:: python
- cat = pd.Categorical(['a', 'b'], categories=list("bac"))
- lvl1 = ['foo', 'bar']
+ cat = pd.Categorical(["a", "b"], categories=list("bac"))
+ lvl1 = ["foo", "bar"]
midx = pd.MultiIndex.from_arrays([cat, lvl1])
midx
@@ -1113,9 +1130,9 @@ As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes
.. ipython:: python
- df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})
- df_grouped = df.groupby(by=['A', 'C']).first()
- df_set_idx = df.set_index(['A', 'C'])
+ df = pd.DataFrame({"A": [0, 1], "B": [10, 11], "C": cat})
+ df_grouped = df.groupby(by=["A", "C"]).first()
+ df_set_idx = df.set_index(["A", "C"])
**Previous behavior**:
@@ -1163,7 +1180,7 @@ the result of calling :func:`read_csv` without the ``chunksize=`` argument
.. ipython:: python
- data = 'A,B\n0,1\n2,3\n4,5\n6,7'
+ data = "A,B\n0,1\n2,3\n4,5\n6,7"
**Previous behavior**:
@@ -1248,7 +1265,7 @@ Operators now preserve dtypes
.. code-block:: python
- s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)
+ s = pd.SparseSeries([1.0, 0.0, 2.0, 0.0], fill_value=0)
s
s.astype(np.int64)
@@ -1372,7 +1389,7 @@ Deprecations
- ``Timestamp.offset`` property (and named arg in the constructor) has been deprecated in favor of ``freq`` (:issue:`12160`)
- ``pd.tseries.util.pivot_annual`` is deprecated. Use ``pivot_table`` as alternative, an example is :ref:`here ` (:issue:`736`)
- ``pd.tseries.util.isleapyear`` has been deprecated and will be removed in a subsequent release. Datetime-likes now have a ``.is_leap_year`` property (:issue:`13727`)
-- ``Panel4D`` and ``PanelND`` constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data are with the `xarray package `__. Pandas provides a :meth:`~Panel4D.to_xarray` method to automate this conversion (:issue:`13564`).
+- ``Panel4D`` and ``PanelND`` constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data is with the `xarray package `__. pandas provides a :meth:`~Panel4D.to_xarray` method to automate this conversion (:issue:`13564`).
- ``pandas.tseries.frequencies.get_standard_freq`` is deprecated. Use ``pandas.tseries.frequencies.to_offset(freq).rule_code`` instead (:issue:`13874`)
- ``pandas.tseries.frequencies.to_offset``'s ``freqstr`` keyword is deprecated in favor of ``freq`` (:issue:`13874`)
- ``Categorical.from_array`` has been deprecated and will be removed in a future version (:issue:`13854`)
diff --git a/doc/source/whatsnew/v0.19.1.rst b/doc/source/whatsnew/v0.19.1.rst
index 9e6b884e08587..6ff3fb6900a99 100644
--- a/doc/source/whatsnew/v0.19.1.rst
+++ b/doc/source/whatsnew/v0.19.1.rst
@@ -8,7 +8,7 @@ Version 0.19.1 (November 3, 2016)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a minor bug-fix release from 0.19.0 and includes some small regression fixes,
@@ -29,7 +29,7 @@ Performance improvements
- Fixed performance regression in ``Series.asof(where)`` when ``where`` is a scalar (:issue:`14461`)
- Improved performance in ``DataFrame.asof(where)`` when ``where`` is a scalar (:issue:`14461`)
- Improved performance in ``.to_json()`` when ``lines=True`` (:issue:`14408`)
-- Improved performance in certain types of `loc` indexing with a MultiIndex (:issue:`14551`).
+- Improved performance in certain types of ``loc`` indexing with a MultiIndex (:issue:`14551`).
.. _whatsnew_0191.bug_fixes:
diff --git a/doc/source/whatsnew/v0.19.2.rst b/doc/source/whatsnew/v0.19.2.rst
index 924c95f21ceff..bba89d78be869 100644
--- a/doc/source/whatsnew/v0.19.2.rst
+++ b/doc/source/whatsnew/v0.19.2.rst
@@ -8,7 +8,7 @@ Version 0.19.2 (December 24, 2016)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a minor bug-fix release in the 0.19.x series and includes some small regression fixes,
diff --git a/doc/source/whatsnew/v0.20.0.rst b/doc/source/whatsnew/v0.20.0.rst
index 09980b52b6b3a..a9e57f0039735 100644
--- a/doc/source/whatsnew/v0.20.0.rst
+++ b/doc/source/whatsnew/v0.20.0.rst
@@ -26,7 +26,7 @@ Highlights include:
.. warning::
- Pandas has changed the internal structure and layout of the code base.
+ pandas has changed the internal structure and layout of the code base.
This can affect imports that are not from the top-level ``pandas.*`` namespace, please see the changes :ref:`here `.
Check the :ref:`API Changes ` and :ref:`deprecations ` before updating.
@@ -243,7 +243,7 @@ The default is to infer the compression type from the extension (``compression='
UInt64 support improved
^^^^^^^^^^^^^^^^^^^^^^^
-Pandas has significantly improved support for operations involving unsigned,
+pandas has significantly improved support for operations involving unsigned,
or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937`)
@@ -333,7 +333,7 @@ You must enable this by setting the ``display.html.table_schema`` option to ``Tr
SciPy sparse matrix from/to SparseDataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances.
+pandas now supports creating sparse dataframes directly from ``scipy.sparse.spmatrix`` instances.
See the :ref:`documentation ` for more information. (:issue:`4343`)
All sparse formats are supported, but matrices that are not in :mod:`COOrdinate ` format will be converted, copying data as needed.
@@ -1201,7 +1201,7 @@ Modules privacy has changed
Some formerly public python/c/c++/cython extension modules have been moved and/or renamed. These are all removed from the public API.
Furthermore, the ``pandas.core``, ``pandas.compat``, and ``pandas.util`` top-level modules are now considered to be PRIVATE.
-If indicated, a deprecation warning will be issued if you reference theses modules. (:issue:`12588`)
+If indicated, a deprecation warning will be issued if you reference these modules. (:issue:`12588`)
.. csv-table::
:header: "Previous Location", "New Location", "Deprecated"
@@ -1355,7 +1355,7 @@ Deprecate Panel
^^^^^^^^^^^^^^^
``Panel`` is deprecated and will be removed in a future version. The recommended way to represent 3-D data is
-with a ``MultiIndex`` on a ``DataFrame`` via the :meth:`~Panel.to_frame` or with the `xarray package `__. Pandas
+with a ``MultiIndex`` on a ``DataFrame`` via the :meth:`~Panel.to_frame` or with the `xarray package `__. pandas
provides a :meth:`~Panel.to_xarray` method to automate this conversion (:issue:`13563`).
.. code-block:: ipython
diff --git a/doc/source/whatsnew/v0.20.2.rst b/doc/source/whatsnew/v0.20.2.rst
index 7f84c6b3f17bd..430a39d2d2e97 100644
--- a/doc/source/whatsnew/v0.20.2.rst
+++ b/doc/source/whatsnew/v0.20.2.rst
@@ -8,7 +8,7 @@ Version 0.20.2 (June 4, 2017)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes,
diff --git a/doc/source/whatsnew/v0.20.3.rst b/doc/source/whatsnew/v0.20.3.rst
index 888d0048ca9f3..ff28f6830783e 100644
--- a/doc/source/whatsnew/v0.20.3.rst
+++ b/doc/source/whatsnew/v0.20.3.rst
@@ -8,7 +8,7 @@ Version 0.20.3 (July 7, 2017)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes
diff --git a/doc/source/whatsnew/v0.21.0.rst b/doc/source/whatsnew/v0.21.0.rst
index 926bcaa21ac3a..6035b89aa8643 100644
--- a/doc/source/whatsnew/v0.21.0.rst
+++ b/doc/source/whatsnew/v0.21.0.rst
@@ -900,13 +900,13 @@ New behavior:
No automatic Matplotlib converters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas no longer registers our ``date``, ``time``, ``datetime``,
+pandas no longer registers our ``date``, ``time``, ``datetime``,
``datetime64``, and ``Period`` converters with matplotlib when pandas is
imported. Matplotlib plot methods (``plt.plot``, ``ax.plot``, ...) will not
nicely format the x-axis for ``DatetimeIndex`` or ``PeriodIndex`` values. You
must explicitly register these methods:
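A sketch of the explicit registration, assuming a pandas version that exposes ``pandas.plotting.register_matplotlib_converters`` (present in later releases):

.. code-block:: python

    from pandas.plotting import register_matplotlib_converters

    # assumes a pandas version that provides this helper
    register_matplotlib_converters()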
-Pandas built-in ``Series.plot`` and ``DataFrame.plot`` *will* register these
+pandas built-in ``Series.plot`` and ``DataFrame.plot`` *will* register these
converters on first-use (:issue:`17710`).
.. note::
diff --git a/doc/source/whatsnew/v0.21.1.rst b/doc/source/whatsnew/v0.21.1.rst
index f930dfac869cd..090a988d6406a 100644
--- a/doc/source/whatsnew/v0.21.1.rst
+++ b/doc/source/whatsnew/v0.21.1.rst
@@ -8,7 +8,7 @@ Version 0.21.1 (December 12, 2017)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a minor bug-fix release in the 0.21.x series and includes some small regression fixes,
@@ -34,7 +34,7 @@ Highlights include:
Restore Matplotlib datetime converter registration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Pandas implements some matplotlib converters for nicely formatting the axis
+pandas implements some matplotlib converters for nicely formatting the axis
labels on plots with ``datetime`` or ``Period`` values. Prior to pandas 0.21.0,
these were implicitly registered with matplotlib, as a side effect of ``import
pandas``.
diff --git a/doc/source/whatsnew/v0.22.0.rst b/doc/source/whatsnew/v0.22.0.rst
index 75949a90d09a6..ec9769c22e76b 100644
--- a/doc/source/whatsnew/v0.22.0.rst
+++ b/doc/source/whatsnew/v0.22.0.rst
@@ -1,14 +1,14 @@
.. _whatsnew_0220:
-v0.22.0 (December 29, 2017)
----------------------------
+Version 0.22.0 (December 29, 2017)
+----------------------------------
{{ header }}
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
This is a major release from 0.21.1 and includes a single, API-breaking change.
@@ -20,7 +20,7 @@ release note (singular!).
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
+pandas 0.22.0 changes the handling of empty and all-*NA* sums and products. The
summary is that
* The sum of an empty or all-*NA* ``Series`` is now ``0``
@@ -96,7 +96,7 @@ returning ``1`` instead.
These changes affect :meth:`DataFrame.sum` and :meth:`DataFrame.prod` as well.
Finally, a few less obvious places in pandas are affected by this change.
-Grouping by a categorical
+Grouping by a Categorical
^^^^^^^^^^^^^^^^^^^^^^^^^
Grouping by a ``Categorical`` and summing now returns ``0`` instead of
@@ -119,7 +119,7 @@ instead of ``NaN``.
.. ipython:: python
- grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
+ grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
pd.Series([1, 2]).groupby(grouper).sum()
To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
@@ -159,15 +159,14 @@ sum and ``1`` for product.
.. ipython:: python
- s = pd.Series([1, 1, np.nan, np.nan],
- index=pd.date_range('2017', periods=4))
- s.resample('2d').sum()
+ s = pd.Series([1, 1, np.nan, np.nan], index=pd.date_range("2017", periods=4))
+ s.resample("2d").sum()
To restore the 0.21 behavior of returning ``NaN``, use ``min_count>=1``.
.. ipython:: python
- s.resample('2d').sum(min_count=1)
+ s.resample("2d").sum(min_count=1)
In particular, upsampling and taking the sum or product is affected, as
upsampling introduces missing values even if the original series was
@@ -190,7 +189,7 @@ entirely valid.
.. ipython:: python
- idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
+ idx = pd.DatetimeIndex(["2017-01-01", "2017-01-02"])
pd.Series([1, 2], index=idx).resample("12H").sum()
Once again, the ``min_count`` keyword is available to restore the 0.21 behavior.
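For instance, reusing ``idx`` from the example above (an illustrative sketch):

.. code-block:: python

    # require at least one valid value per bin to avoid the new 0 default
    pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)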
diff --git a/doc/source/whatsnew/v0.23.0.rst b/doc/source/whatsnew/v0.23.0.rst
index b9e1b5060d1da..f4caea9d363eb 100644
--- a/doc/source/whatsnew/v0.23.0.rst
+++ b/doc/source/whatsnew/v0.23.0.rst
@@ -64,7 +64,7 @@ A ``DataFrame`` can now be written to and subsequently read back via JSON while
new_df
new_df.dtypes
-Please note that the string `index` is not supported with the round trip format, as it is used by default in ``write_json`` to indicate a missing index name.
+Please note that the string ``index`` is not supported with the round trip format, as it is used by default in ``write_json`` to indicate a missing index name.
.. ipython:: python
:okwarning:
@@ -86,8 +86,8 @@ Please note that the string `index` is not supported with the round trip format,
.. _whatsnew_0230.enhancements.assign_dependent:
-``.assign()`` accepts dependent arguments
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Method ``.assign()`` accepts dependent arguments
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`DataFrame.assign` now accepts dependent keyword arguments for Python 3.6 and later (see also `PEP 468
`_). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
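A small sketch of the new behaviour (not taken from the original entry):

.. code-block:: python

    df = pd.DataFrame({"A": [1, 2, 3]})
    # C can refer to B because keyword order is preserved on Python 3.6+
    df.assign(B=df.A * 2, C=lambda x: x.A + x.B)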
@@ -189,7 +189,7 @@ resetting indexes. See the :ref:`Sorting by Indexes and Values
Extending pandas with custom types (experimental)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas now supports storing array-like objects that aren't necessarily 1-D NumPy
+pandas now supports storing array-like objects that aren't necessarily 1-D NumPy
arrays as columns in a DataFrame or values in a Series. This allows third-party
libraries to implement extensions to NumPy's types, similar to how pandas
implemented categoricals, datetimes with timezones, periods, and intervals.
@@ -244,7 +244,7 @@ documentation. If you build an extension array, publicize it on our
.. _whatsnew_0230.enhancements.categorical_grouping:
-New ``observed`` keyword for excluding unobserved categories in ``groupby``
+New ``observed`` keyword for excluding unobserved categories in ``GroupBy``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Grouping by a categorical includes the unobserved categories in the output.
@@ -360,8 +360,8 @@ Fill all consecutive outside values in both directions
.. _whatsnew_0210.enhancements.get_dummies_dtype:
-``get_dummies`` now supports ``dtype`` argument
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Function ``get_dummies`` now supports ``dtype`` argument
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtype for the new columns. The default remains ``uint8``. (:issue:`18330`)
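For example (an illustrative sketch):

.. code-block:: python

    pd.get_dummies(pd.Series(["a", "b", "a"]), dtype=bool).dtypes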
@@ -388,8 +388,8 @@ See the :ref:`documentation here `. (:issue:`19365`)
.. _whatsnew_0230.enhancements.ran_inf:
-``.rank()`` handles ``inf`` values when ``NaN`` are present
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Method ``.rank()`` handles ``inf`` values when ``NaN`` are present
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In previous versions, ``.rank()`` would assign ``inf`` elements ``NaN`` as their ranks. Now ranks are calculated properly. (:issue:`6945`)
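For instance (a sketch, not from the original entry):

.. code-block:: python

    s = pd.Series([-np.inf, 0, 1, np.nan, np.inf])
    s.rank()  # -inf and inf now receive the lowest and highest ranks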
@@ -457,7 +457,7 @@ These bugs were squashed:
Previously, :meth:`Series.str.cat` did not -- in contrast to most of ``pandas`` -- align :class:`Series` on their index before concatenation (see :issue:`18657`).
The method has now gained a keyword ``join`` to control the manner of alignment, see examples below and :ref:`here `.
-In v.0.23 `join` will default to None (meaning no alignment), but this default will change to ``'left'`` in a future version of pandas.
+In v.0.23 ``join`` will default to None (meaning no alignment), but this default will change to ``'left'`` in a future version of pandas.
.. ipython:: python
:okwarning:
@@ -553,7 +553,7 @@ Other enhancements
- :class:`~pandas.tseries.offsets.WeekOfMonth` constructor now supports ``n=0`` (:issue:`20517`).
- :class:`DataFrame` and :class:`Series` now support the matrix multiplication (``@``) operator (:issue:`10259`) for Python >= 3.5
- Updated :meth:`DataFrame.to_gbq` and :meth:`pandas.read_gbq` signature and documentation to reflect changes from
- the Pandas-GBQ library version 0.4.0. Adds intersphinx mapping to Pandas-GBQ
+ the pandas-gbq library version 0.4.0. Adds intersphinx mapping to pandas-gbq
library. (:issue:`20564`)
- Added new writer for exporting Stata dta files in version 117, ``StataWriter117``. This format supports exporting strings with lengths up to 2,000,000 characters (:issue:`16450`)
- :func:`to_hdf` and :func:`read_hdf` now accept an ``errors`` keyword argument to control encoding error handling (:issue:`20835`)
@@ -587,13 +587,13 @@ If installed, we now require:
.. _whatsnew_0230.api_breaking.dict_insertion_order:
-Instantiation from dicts preserves dict insertion order for python 3.6+
+Instantiation from dicts preserves dict insertion order for Python 3.6+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Until Python 3.6, dicts in Python had no formally defined ordering. For Python
version 3.6 and later, dicts are ordered by insertion order, see
`PEP 468 `_.
-Pandas will use the dict's insertion order, when creating a ``Series`` or
+pandas will use the dict's insertion order, when creating a ``Series`` or
``DataFrame`` from a dict and you're using Python version 3.6 or
higher. (:issue:`19884`)
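A short sketch (assuming Python 3.6+):

.. code-block:: python

    # index order follows the dict's insertion order, not a sorted order
    pd.Series({"b": 1, "a": 2, "c": 3})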
@@ -643,7 +643,7 @@ Deprecate Panel
^^^^^^^^^^^^^^^
``Panel`` was deprecated in the 0.20.x release, showing as a ``DeprecationWarning``. Using ``Panel`` will now show a ``FutureWarning``. The recommended way to represent 3-D data is
-with a ``MultiIndex`` on a ``DataFrame`` via the :meth:`~Panel.to_frame` or with the `xarray package `__. Pandas
+with a ``MultiIndex`` on a ``DataFrame`` via the :meth:`~Panel.to_frame` or with the `xarray package `__. pandas
provides a :meth:`~Panel.to_xarray` method to automate this conversion (:issue:`13563`, :issue:`18324`).
.. code-block:: ipython
@@ -836,7 +836,7 @@ Build changes
Index division by zero fills correctly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Division operations on ``Index`` and subclasses will now fill division of positive numbers by zero with ``np.inf``, division of negative numbers by zero with ``-np.inf`` and `0 / 0` with ``np.nan``. This matches existing ``Series`` behavior. (:issue:`19322`, :issue:`19347`)
+Division operations on ``Index`` and subclasses will now fill division of positive numbers by zero with ``np.inf``, division of negative numbers by zero with ``-np.inf`` and ``0 / 0`` with ``np.nan``. This matches existing ``Series`` behavior. (:issue:`19322`, :issue:`19347`)
Previous behavior:
@@ -884,7 +884,7 @@ Extraction of matching patterns from strings
By default, extracting matching patterns from strings with :func:`str.extract` used to return a
``Series`` if a single group was being extracted (a ``DataFrame`` if more than one group was
-extracted). As of Pandas 0.23.0 :func:`str.extract` always returns a ``DataFrame``, unless
+extracted). As of pandas 0.23.0 :func:`str.extract` always returns a ``DataFrame``, unless
``expand`` is set to ``False``. Finally, ``None`` was an accepted value for
the ``expand`` parameter (which was equivalent to ``False``), but now raises a ``ValueError``. (:issue:`11386`)
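For example (an illustrative sketch):

.. code-block:: python

    s = pd.Series(["a1", "b2"])
    s.str.extract(r"(\d)", expand=True)   # DataFrame, the default
    s.str.extract(r"(\d)", expand=False)  # Series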
@@ -974,7 +974,7 @@ automatically so that the printed data frame fits within the current terminal
width (``pd.options.display.max_columns=0``) (:issue:`17023`). If Python runs
as a Jupyter kernel (such as the Jupyter QtConsole or a Jupyter notebook, as
well as in many IDEs), this value cannot be inferred automatically and is thus
-set to `20` as in previous versions. In a terminal, this results in a much
+set to ``20`` as in previous versions. In a terminal, this results in a much
nicer output:
.. image:: ../_static/print_df_new.png
@@ -998,7 +998,7 @@ Datetimelike API changes
- Addition and subtraction of ``NaN`` from a :class:`Series` with ``dtype='timedelta64[ns]'`` will raise a ``TypeError`` instead of treating the ``NaN`` as ``NaT`` (:issue:`19274`)
- ``NaT`` division with :class:`datetime.timedelta` will now return ``NaN`` instead of raising (:issue:`17876`)
- Operations between a :class:`Series` with ``dtype='datetime64[ns]'`` and a :class:`PeriodIndex` will correctly raise ``TypeError`` (:issue:`18850`)
-- Subtraction of :class:`Series` with timezone-aware ``dtype='datetime64[ns]'`` with mis-matched timezones will raise ``TypeError`` instead of ``ValueError`` (:issue:`18817`)
+- Subtraction of :class:`Series` with timezone-aware ``dtype='datetime64[ns]'`` with mismatched timezones will raise ``TypeError`` instead of ``ValueError`` (:issue:`18817`)
- :class:`Timestamp` will no longer silently ignore unused or invalid ``tz`` or ``tzinfo`` keyword arguments (:issue:`17690`)
- :class:`Timestamp` will no longer silently ignore invalid ``freq`` arguments (:issue:`5168`)
- :class:`CacheableOffset` and :class:`WeekDay` are no longer available in the ``pandas.tseries.offsets`` module (:issue:`17830`)
@@ -1011,7 +1011,7 @@ Datetimelike API changes
- Restricted ``DateOffset`` keyword arguments. Previously, ``DateOffset`` subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted. (:issue:`17176`, :issue:`18226`).
- :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`)
- For :class:`DatetimeIndex` and :class:`TimedeltaIndex` with ``freq=None``, addition or subtraction of integer-dtyped array or ``Index`` will raise ``NullFrequencyError`` instead of ``TypeError`` (:issue:`19895`)
-- :class:`Timestamp` constructor now accepts a `nanosecond` keyword or positional argument (:issue:`18898`)
+- :class:`Timestamp` constructor now accepts a ``nanosecond`` keyword or positional argument (:issue:`18898`)
- :class:`DatetimeIndex` will now raise an ``AttributeError`` when the ``tz`` attribute is set after instantiation (:issue:`3746`)
- :class:`DatetimeIndex` with a ``pytz`` timezone will now return a consistent ``pytz`` timezone (:issue:`18595`)
@@ -1049,7 +1049,7 @@ Other API changes
- :class:`DateOffset` objects render more simply, e.g. ```` instead of ```` (:issue:`19403`)
- ``Categorical.fillna`` now validates its ``value`` and ``method`` keyword arguments. It now raises when both or none are specified, matching the behavior of :meth:`Series.fillna` (:issue:`19682`)
- ``pd.to_datetime('today')`` now returns a datetime, consistent with ``pd.Timestamp('today')``; previously ``pd.to_datetime('today')`` returned a ``.normalized()`` datetime (:issue:`19935`)
-- :func:`Series.str.replace` now takes an optional `regex` keyword which, when set to ``False``, uses literal string replacement rather than regex replacement (:issue:`16808`)
+- :func:`Series.str.replace` now takes an optional ``regex`` keyword which, when set to ``False``, uses literal string replacement rather than regex replacement (:issue:`16808`)
- :func:`DatetimeIndex.strftime` and :func:`PeriodIndex.strftime` now return an ``Index`` instead of a numpy array to be consistent with similar accessors (:issue:`20127`)
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
@@ -1175,7 +1175,7 @@ Performance improvements
Documentation changes
~~~~~~~~~~~~~~~~~~~~~
-Thanks to all of the contributors who participated in the Pandas Documentation
+Thanks to all of the contributors who participated in the pandas Documentation
Sprint, which took place on March 10th. We had about 500 participants from over
30 locations across the world. You should notice that many of the
:ref:`API docstrings ` have greatly improved.
@@ -1234,7 +1234,7 @@ Categorical
- Bug in ``Categorical.__iter__`` not converting to Python types (:issue:`19909`)
- Bug in :func:`pandas.factorize` returning the unique codes for the ``uniques``. This now returns a ``Categorical`` with the same dtype as the input (:issue:`19721`)
- Bug in :func:`pandas.factorize` including an item for missing values in the ``uniques`` return value (:issue:`19721`)
-- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in `indices` as missing value markers, rather than the last element of the Series (:issue:`20664`)
+- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in ``indices`` as missing value markers, rather than the last element of the Series (:issue:`20664`)
Datetimelike
^^^^^^^^^^^^
@@ -1273,7 +1273,7 @@ Timedelta
- Bug in :func:`Period.asfreq` where periods near ``datetime(1, 1, 1)`` could be converted incorrectly (:issue:`19643`, :issue:`19834`)
- Bug in :func:`Timedelta.total_seconds()` causing precision errors, for example ``Timedelta('30S').total_seconds()==30.000000000000004`` (:issue:`19458`)
- Bug in :func:`Timedelta.__rmod__` where operating with a ``numpy.timedelta64`` returned a ``timedelta64`` object instead of a ``Timedelta`` (:issue:`19820`)
-- Multiplication of :class:`TimedeltaIndex` by ``TimedeltaIndex`` will now raise ``TypeError`` instead of raising ``ValueError`` in cases of length mis-match (:issue:`19333`)
+- Multiplication of :class:`TimedeltaIndex` by ``TimedeltaIndex`` will now raise ``TypeError`` instead of raising ``ValueError`` in cases of length mismatch (:issue:`19333`)
- Bug in indexing a :class:`TimedeltaIndex` with a ``np.timedelta64`` object which was raising a ``TypeError`` (:issue:`20393`)
@@ -1316,7 +1316,7 @@ Numeric
Strings
^^^^^^^
-- Bug in :func:`Series.str.get` with a dictionary in the values and the index not in the keys, raising `KeyError` (:issue:`20671`)
+- Bug in :func:`Series.str.get` with a dictionary in the values and the index not in the keys, raising ``KeyError`` (:issue:`20671`)
Indexing
@@ -1365,11 +1365,11 @@ MultiIndex
- Bug in indexing where nested indexers having only numpy arrays are handled incorrectly (:issue:`19686`)
-I/O
-^^^
+IO
+^^
- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
-- :meth:`DataFrame.to_html` now has an option to add an id to the leading `<table>` tag (:issue:`8496`)
+- :meth:`DataFrame.to_html` now has an option to add an id to the leading ``<table>`` tag (:issue:`8496`)
- Bug in :func:`read_msgpack` when a non-existent file is passed in Python 2 (:issue:`15296`)
- Bug in :func:`read_csv` where a ``MultiIndex`` with duplicate columns was not being mangled appropriately (:issue:`18062`)
- Bug in :func:`read_csv` where missing values were not being handled properly when ``keep_default_na=False`` with dictionary ``na_values`` (:issue:`19227`)
@@ -1378,7 +1378,7 @@ I/O
- Bug in :func:`DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (:issue:`18667`)
- Bug in :func:`DataFrame.to_latex()` where a ``NaN`` in a ``MultiIndex`` would cause an ``IndexError`` or incorrect output (:issue:`14249`)
- Bug in :func:`DataFrame.to_latex()` where a non-string index-level name would result in an ``AttributeError`` (:issue:`19981`)
-- Bug in :func:`DataFrame.to_latex()` where the combination of an index name and the `index_names=False` option would result in incorrect output (:issue:`18326`)
+- Bug in :func:`DataFrame.to_latex()` where the combination of an index name and the ``index_names=False`` option would result in incorrect output (:issue:`18326`)
- Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
- Bug in :func:`DataFrame.to_latex()` where missing space characters caused wrong escaping and produced invalid LaTeX in some cases (:issue:`20859`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
@@ -1403,7 +1403,7 @@ Plotting
- :func:`DataFrame.plot` now supports multiple columns to the ``y`` argument (:issue:`19699`)
-Groupby/resample/rolling
+GroupBy/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
- Bug when grouping by a single column and aggregating with a class like ``list`` or ``tuple`` (:issue:`18079`)
@@ -1412,7 +1412,7 @@ Groupby/resample/rolling
- Bug in :func:`DataFrame.groupby` where tuples were interpreted as lists of keys rather than as keys (:issue:`17979`, :issue:`18249`)
- Bug in :func:`DataFrame.groupby` where aggregation by ``first``/``last``/``min``/``max`` was causing timestamps to lose precision (:issue:`19526`)
- Bug in :func:`DataFrame.transform` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`)
-- Bug in :func:`DataFrame.groupby` passing the `on=` kwarg, and subsequently using ``.apply()`` (:issue:`17813`)
+- Bug in :func:`DataFrame.groupby` passing the ``on=`` kwarg, and subsequently using ``.apply()`` (:issue:`17813`)
- Bug in :func:`DataFrame.resample().aggregate ` not raising a ``KeyError`` when aggregating a non-existent column (:issue:`16766`, :issue:`19566`)
- Bug in :func:`DataFrameGroupBy.cumsum` and :func:`DataFrameGroupBy.cumprod` when ``skipna`` was passed (:issue:`19806`)
- Bug in :func:`DataFrame.resample` that dropped timezone information (:issue:`13238`)
diff --git a/doc/source/whatsnew/v0.23.1.rst b/doc/source/whatsnew/v0.23.1.rst
index 03b7d9db6bc63..b51368c87f991 100644
--- a/doc/source/whatsnew/v0.23.1.rst
+++ b/doc/source/whatsnew/v0.23.1.rst
@@ -74,10 +74,10 @@ In addition, ordering comparisons will raise a ``TypeError`` in the future.
a tz-aware time instead of tz-naive (:issue:`21267`) and :attr:`DatetimeIndex.date`
returned incorrect date when the input date has a non-UTC timezone (:issue:`21230`).
- Fixed regression in :meth:`pandas.io.json.json_normalize` when called with ``None`` values
- in nested levels in JSON, and to not drop keys with value as `None` (:issue:`21158`, :issue:`21356`).
+ in nested levels in JSON, and to not drop keys with value as ``None`` (:issue:`21158`, :issue:`21356`).
- Bug in :meth:`~DataFrame.to_csv` causing an encoding error when compression and encoding are specified (:issue:`21241`, :issue:`21118`)
- Bug preventing pandas from being importable with -OO optimization (:issue:`21071`)
-- Bug in :meth:`Categorical.fillna` incorrectly raising a ``TypeError`` when `value` the individual categories are iterable and `value` is an iterable (:issue:`21097`, :issue:`19788`)
+- Bug in :meth:`Categorical.fillna` incorrectly raising a ``TypeError`` when the individual categories are iterable and ``value`` is an iterable (:issue:`21097`, :issue:`19788`)
- Fixed regression in constructors coercing NA values like ``None`` to strings when passing ``dtype=str`` (:issue:`21083`)
- Regression in :func:`pivot_table` where an ordered ``Categorical`` with missing
values for the pivot's ``index`` would give a mis-aligned result (:issue:`21133`)
@@ -106,7 +106,7 @@ Bug fixes
**Data-type specific**
-- Bug in :meth:`Series.str.replace()` where the method throws `TypeError` on Python 3.5.2 (:issue:`21078`)
+- Bug in :meth:`Series.str.replace()` where the method throws ``TypeError`` on Python 3.5.2 (:issue:`21078`)
- Bug in :class:`Timedelta` where passing a float with a unit would prematurely round the float precision (:issue:`14156`)
- Bug in :func:`pandas.testing.assert_index_equal` which raised ``AssertionError`` incorrectly, when comparing two :class:`CategoricalIndex` objects with param ``check_categorical=False`` (:issue:`19776`)
diff --git a/doc/source/whatsnew/v0.23.2.rst b/doc/source/whatsnew/v0.23.2.rst
index 9f24092d1d4ae..99650e8291d3d 100644
--- a/doc/source/whatsnew/v0.23.2.rst
+++ b/doc/source/whatsnew/v0.23.2.rst
@@ -11,7 +11,7 @@ and bug fixes. We recommend that all users upgrade to this version.
.. note::
- Pandas 0.23.2 is first pandas release that's compatible with
+ pandas 0.23.2 is the first pandas release that's compatible with
Python 3.7 (:issue:`20552`)
.. warning::
diff --git a/doc/source/whatsnew/v0.24.0.rst b/doc/source/whatsnew/v0.24.0.rst
index 45399792baecf..9ef50045d5b5e 100644
--- a/doc/source/whatsnew/v0.24.0.rst
+++ b/doc/source/whatsnew/v0.24.0.rst
@@ -38,7 +38,7 @@ Enhancements
Optional integer NA support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas has gained the ability to hold integer dtypes with missing values. This long requested feature is enabled through the use of :ref:`extension types `.
+pandas has gained the ability to hold integer dtypes with missing values. This long requested feature is enabled through the use of :ref:`extension types `.
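As a minimal sketch of what this enables (the values and the ``"Int64"`` dtype string below are illustrative only):

.. code-block:: python

    import pandas as pd

    # A nullable integer Series: the missing value does not force a cast to float.
    s = pd.Series([1, 2, None], dtype="Int64")
    print(s.dtype)   # Int64
    print(s.isna())  # only the last element is missing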
.. note::
@@ -277,8 +277,8 @@ For earlier versions this can be done using the following.
.. _whatsnew_0240.enhancements.read_html:
-``read_html`` Enhancements
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+Function ``read_html`` enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`read_html` previously ignored ``colspan`` and ``rowspan`` attributes.
Now it understands them, treating them as sequences of cells with the same
@@ -376,7 +376,7 @@ Other enhancements
- :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame.
See the :ref:`section on writing HTML ` in the IO docs for example usage. (:issue:`2679`)
- :func:`pandas.read_csv` now supports pandas extension types as an argument to ``dtype``, allowing the user to use pandas extension types when reading CSVs. (:issue:`23228`)
-- The :meth:`~DataFrame.shift` method now accepts `fill_value` as an argument, allowing the user to specify a value which will be used instead of NA/NaT in the empty periods. (:issue:`15486`)
+- The :meth:`~DataFrame.shift` method now accepts ``fill_value`` as an argument, allowing the user to specify a value which will be used instead of NA/NaT in the empty periods. (:issue:`15486`)
- :func:`to_datetime` now supports the ``%Z`` and ``%z`` directive when passed into ``format`` (:issue:`13486`)
- :func:`Series.mode` and :func:`DataFrame.mode` now support the ``dropna`` parameter which can be used to specify whether ``NaN``/``NaT`` values should be considered (:issue:`17534`)
- :func:`DataFrame.to_csv` and :func:`Series.to_csv` now support the ``compression`` keyword when a file handle is passed. (:issue:`21227`)
@@ -384,7 +384,7 @@ Other enhancements
- :meth:`Series.droplevel` and :meth:`DataFrame.droplevel` are now implemented (:issue:`20342`)
- Added support for reading from/writing to Google Cloud Storage via the ``gcsfs`` library (:issue:`19454`, :issue:`23094`)
- :func:`DataFrame.to_gbq` and :func:`read_gbq` signature and documentation updated to
- reflect changes from the `Pandas-GBQ library version 0.8.0
+ reflect changes from the `pandas-gbq library version 0.8.0
`__.
Adds a ``credentials`` argument, which enables the use of any kind of
`google-auth credentials
@@ -419,7 +419,7 @@ Other enhancements
- :meth:`Index.difference`, :meth:`Index.intersection`, :meth:`Index.union`, and :meth:`Index.symmetric_difference` now have an optional ``sort`` parameter to control whether the results should be sorted if possible (:issue:`17839`, :issue:`24471`)
- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)
- :meth:`MultiIndex.to_flat_index` has been added to flatten multiple levels into a single-level :class:`Index` object.
-- :meth:`DataFrame.to_stata` and :class:`pandas.io.stata.StataWriter117` can write mixed sting columns to Stata strl format (:issue:`23633`)
+- :meth:`DataFrame.to_stata` and :class:`pandas.io.stata.StataWriter117` can write mixed string columns to Stata strl format (:issue:`23633`)
- :meth:`DataFrame.between_time` and :meth:`DataFrame.at_time` have gained the ``axis`` parameter (:issue:`8839`)
- :meth:`DataFrame.to_records` now accepts ``index_dtypes`` and ``column_dtypes`` parameters to allow different data types in stored column and index records (:issue:`18146`)
- :class:`IntervalIndex` has gained the :attr:`~IntervalIndex.is_overlapping` attribute to indicate if the ``IntervalIndex`` contains any overlapping intervals (:issue:`23309`)
@@ -432,7 +432,7 @@ Other enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Pandas 0.24.0 includes a number of API breaking changes.
+pandas 0.24.0 includes a number of API breaking changes.
.. _whatsnew_0240.api_breaking.deps:
@@ -474,8 +474,8 @@ and replaced it with references to ``pyarrow`` (:issue:`21639` and :issue:`23053
.. _whatsnew_0240.api_breaking.csv_line_terminator:
-`os.linesep` is used for ``line_terminator`` of ``DataFrame.to_csv``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``os.linesep`` is used for ``line_terminator`` of ``DataFrame.to_csv``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'``
for the default line terminator (:issue:`20353`).
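A rough illustration of the new default (the file names and frame contents are hypothetical; ``line_terminator`` is the 0.24-era keyword name):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"int": [1, 2, 3], "str": ["a", "b", "c"]})

    # With no line_terminator, the platform's os.linesep is used
    # ("\r\n" on Windows, "\n" elsewhere).
    df.to_csv("default.csv")

    # Pass line_terminator explicitly to keep the previous "\n" behavior.
    df.to_csv("unix.csv", line_terminator="\n")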
@@ -510,7 +510,7 @@ even when ``'\n'`` was passed in ``line_terminator``.
*New behavior* on Windows:
-Passing ``line_terminator`` explicitly, set thes ``line terminator`` to that character.
+Passing ``line_terminator`` explicitly sets the line terminator to that character.
.. code-block:: ipython
@@ -556,8 +556,8 @@ You must pass in the ``line_terminator`` explicitly, even in this case.
.. _whatsnew_0240.bug_fixes.nan_with_str_dtype:
-Proper handling of `np.NaN` in a string data-typed column with the Python engine
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Proper handling of ``np.NaN`` in a string data-typed column with the Python engine
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There was a bug in :func:`read_excel` and :func:`read_csv` with the Python
engine, where missing values turned to ``'nan'`` with ``dtype=str`` and
@@ -1198,7 +1198,7 @@ Other API changes
- :meth:`DataFrame.set_index` now gives a better (and less frequent) KeyError, raises a ``ValueError`` for incorrect types,
and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
-- :class:`DateOffset` attribute `_cacheable` and method `_should_cache` have been removed (:issue:`23118`)
+- :class:`DateOffset` attribute ``_cacheable`` and method ``_should_cache`` have been removed (:issue:`23118`)
- :meth:`Series.searchsorted`, when supplied a scalar value to search for, now returns a scalar instead of an array (:issue:`23801`).
- :meth:`Categorical.searchsorted`, when supplied a scalar value to search for, now returns a scalar instead of an array (:issue:`23466`).
- :meth:`Categorical.searchsorted` now raises a ``KeyError`` rather than a ``ValueError``, if a searched-for key is not found in its categories (:issue:`23466`).
@@ -1217,7 +1217,7 @@ Extension type changes
**Equality and hashability**
-Pandas now requires that extension dtypes be hashable (i.e. the respective
+pandas now requires that extension dtypes be hashable (i.e. the respective
``ExtensionDtype`` objects; hashability is not a requirement for the values
of the corresponding ``ExtensionArray``). The base class implements
a default ``__eq__`` and ``__hash__``. If you have a parametrized dtype, you should
@@ -1317,7 +1317,7 @@ Deprecations
- Timezone converting a tz-aware ``datetime.datetime`` or :class:`Timestamp` with :class:`Timestamp` and the ``tz`` argument is now deprecated. Instead, use :meth:`Timestamp.tz_convert` (:issue:`23579`)
- :func:`pandas.api.types.is_period` is deprecated in favor of ``pandas.api.types.is_period_dtype`` (:issue:`23917`)
- :func:`pandas.api.types.is_datetimetz` is deprecated in favor of ``pandas.api.types.is_datetime64tz`` (:issue:`23917`)
-- Creating a :class:`TimedeltaIndex`, :class:`DatetimeIndex`, or :class:`PeriodIndex` by passing range arguments `start`, `end`, and `periods` is deprecated in favor of :func:`timedelta_range`, :func:`date_range`, or :func:`period_range` (:issue:`23919`)
+- Creating a :class:`TimedeltaIndex`, :class:`DatetimeIndex`, or :class:`PeriodIndex` by passing range arguments ``start``, ``end``, and ``periods`` is deprecated in favor of :func:`timedelta_range`, :func:`date_range`, or :func:`period_range` (:issue:`23919`)
- Passing a string alias like ``'datetime64[ns, UTC]'`` as the ``unit`` parameter to :class:`DatetimeTZDtype` is deprecated. Use :class:`DatetimeTZDtype.construct_from_string` instead (:issue:`23990`).
- The ``skipna`` parameter of :meth:`~pandas.api.types.infer_dtype` will switch to ``True`` by default in a future version of pandas (:issue:`17066`, :issue:`24050`)
- In :meth:`Series.where` with Categorical data, providing an ``other`` that is not present in the categories is deprecated. Convert the categorical to a different dtype or add the ``other`` to the categories first (:issue:`24077`).
@@ -1371,7 +1371,7 @@ the object's ``freq`` attribute (:issue:`21939`, :issue:`23878`).
.. _whatsnew_0240.deprecations.integer_tz:
-Passing integer data and a timezone to datetimeindex
+Passing integer data and a timezone to DatetimeIndex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The behavior of :class:`DatetimeIndex` when passed integer data and
@@ -1534,7 +1534,7 @@ Performance improvements
- Improved the performance of :func:`pandas.get_dummies` with ``sparse=True`` (:issue:`21997`)
- Improved performance of :func:`IndexEngine.get_indexer_non_unique` for sorted, non-unique indexes (:issue:`9466`)
- Improved performance of :func:`PeriodIndex.unique` (:issue:`23083`)
-- Improved performance of :func:`concat` for `Series` objects (:issue:`23404`)
+- Improved performance of :func:`concat` for ``Series`` objects (:issue:`23404`)
- Improved performance of :meth:`DatetimeIndex.normalize` and :meth:`Timestamp.normalize` for timezone naive or UTC datetimes (:issue:`23634`)
- Improved performance of :meth:`DatetimeIndex.tz_localize` and various ``DatetimeIndex`` attributes with dateutil UTC timezone (:issue:`23772`)
- Fixed a performance regression on Windows with Python 3.7 of :func:`read_csv` (:issue:`23516`)
@@ -1602,7 +1602,7 @@ Datetimelike
- Bug in :class:`DataFrame` when creating a new column from an ndarray of :class:`Timestamp` objects with timezones creating an object-dtype column, rather than datetime with timezone (:issue:`23932`)
- Bug in :class:`Timestamp` constructor which would drop the frequency of an input :class:`Timestamp` (:issue:`22311`)
- Bug in :class:`DatetimeIndex` where calling ``np.array(dtindex, dtype=object)`` would incorrectly return an array of ``long`` objects (:issue:`23524`)
-- Bug in :class:`Index` where passing a timezone-aware :class:`DatetimeIndex` and `dtype=object` would incorrectly raise a ``ValueError`` (:issue:`23524`)
+- Bug in :class:`Index` where passing a timezone-aware :class:`DatetimeIndex` and ``dtype=object`` would incorrectly raise a ``ValueError`` (:issue:`23524`)
- Bug in :class:`Index` where calling ``np.array(dtindex, dtype=object)`` on a timezone-naive :class:`DatetimeIndex` would return an array of ``datetime`` objects instead of :class:`Timestamp` objects, potentially losing nanosecond portions of the timestamps (:issue:`23524`)
- Bug in :class:`Categorical.__setitem__` not allowing setting with another ``Categorical`` when both are unordered and have the same categories, but in a different order (:issue:`24142`)
- Bug in :func:`date_range` where using dates with millisecond resolution or higher could return incorrect values or the wrong number of values in the index (:issue:`24110`)
@@ -1647,7 +1647,7 @@ Timezones
- Bug in :class:`Series` constructor which would coerce tz-aware and tz-naive :class:`Timestamp` to tz-aware (:issue:`13051`)
- Bug in :class:`Index` with ``datetime64[ns, tz]`` dtype that did not localize integer data correctly (:issue:`20964`)
- Bug in :class:`DatetimeIndex` where constructing with an integer and tz would not localize correctly (:issue:`12619`)
-- Fixed bug where :meth:`DataFrame.describe` and :meth:`Series.describe` on tz-aware datetimes did not show `first` and `last` result (:issue:`21328`)
+- Fixed bug where :meth:`DataFrame.describe` and :meth:`Series.describe` on tz-aware datetimes did not show ``first`` and ``last`` result (:issue:`21328`)
- Bug in :class:`DatetimeIndex` comparisons failing to raise ``TypeError`` when comparing timezone-aware ``DatetimeIndex`` against ``np.datetime64`` (:issue:`22074`)
- Bug in ``DataFrame`` assignment with a timezone-aware scalar (:issue:`19843`)
- Bug in :func:`DataFrame.asof` that raised a ``TypeError`` when attempting to compare tz-naive and tz-aware timestamps (:issue:`21194`)
@@ -1693,7 +1693,7 @@ Numeric
- :meth:`Series.agg` can now handle numpy NaN-aware methods like :func:`numpy.nansum` (:issue:`19629`)
- Bug in :meth:`Series.rank` and :meth:`DataFrame.rank` when ``pct=True`` and more than 2\ :sup:`24` rows are present resulted in percentages greater than 1.0 (:issue:`18271`)
- Calls such as :meth:`DataFrame.round` with a non-unique :meth:`CategoricalIndex` now return expected data. Previously, data would be improperly duplicated (:issue:`21809`).
-- Added ``log10``, `floor` and `ceil` to the list of supported functions in :meth:`DataFrame.eval` (:issue:`24139`, :issue:`24353`)
+- Added ``log10``, ``floor`` and ``ceil`` to the list of supported functions in :meth:`DataFrame.eval` (:issue:`24139`, :issue:`24353`)
- Logical operations ``&, |, ^`` between :class:`Series` and :class:`Index` will no longer raise ``ValueError`` (:issue:`22092`)
- Checking PEP 3141 numbers in :func:`~pandas.api.types.is_scalar` function returns ``True`` (:issue:`22903`)
- Reduction methods like :meth:`Series.sum` now accept the default value of ``keepdims=False`` when called from a NumPy ufunc, rather than raising a ``TypeError``. Full support for ``keepdims`` has not been implemented (:issue:`24356`).
@@ -1769,8 +1769,8 @@ MultiIndex
- :class:`MultiIndex` has gained :meth:`MultiIndex.from_frame`, which allows constructing a :class:`MultiIndex` object from a :class:`DataFrame` (:issue:`22420`)
- Fix ``TypeError`` in Python 3 when creating :class:`MultiIndex` in which some levels have mixed types, e.g. when some labels are tuples (:issue:`15457`)
-I/O
-^^^
+IO
+^^
- Bug in :func:`read_csv` in which a column specified with ``CategoricalDtype`` of boolean categories was not being correctly coerced from string values to booleans (:issue:`20498`)
- Bug in :func:`read_csv` in which unicode column names were not being properly recognized with Python 2.x (:issue:`13253`)
@@ -1827,7 +1827,7 @@ Plotting
- Bug in :func:`DataFrame.plot.bar` caused bars to use multiple colors instead of a single one (:issue:`20585`)
- Bug in validating color parameter caused extra color to be appended to the given color array. This happened to multiple plotting functions using matplotlib. (:issue:`20726`)
-Groupby/resample/rolling
+GroupBy/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
- Bug in :func:`pandas.core.window.Rolling.min` and :func:`pandas.core.window.Rolling.max` with ``closed='left'``, a datetime-like index and only one entry in the series leading to segfault (:issue:`24718`)
@@ -1859,7 +1859,7 @@ Reshaping
^^^^^^^^^
- Bug in :func:`pandas.concat` when joining resampled DataFrames with timezone aware index (:issue:`13783`)
-- Bug in :func:`pandas.concat` when joining only `Series` the `names` argument of `concat` is no longer ignored (:issue:`23490`)
+- Bug in :func:`pandas.concat`: when joining only ``Series``, the ``names`` argument of ``concat`` is no longer ignored (:issue:`23490`)
- Bug in :meth:`Series.combine_first` with ``datetime64[ns, tz]`` dtype which would return tz-naive result (:issue:`21469`)
- Bug in :meth:`Series.where` and :meth:`DataFrame.where` with ``datetime64[ns, tz]`` dtype (:issue:`21546`)
- Bug in :meth:`DataFrame.where` with an empty DataFrame and empty ``cond`` having non-bool dtype (:issue:`21947`)
@@ -1885,7 +1885,7 @@ Reshaping
- :meth:`DataFrame.nlargest` and :meth:`DataFrame.nsmallest` now return the correct number of values when ``keep != 'all'``, even when tied on the first columns (:issue:`22752`)
- Constructing a DataFrame with an index argument that wasn't already an instance of :class:`~pandas.core.Index` was broken (:issue:`22227`).
- Bug in :class:`DataFrame` that prevented list subclasses from being used in construction (:issue:`21226`)
-- Bug in :func:`DataFrame.unstack` and :func:`DataFrame.pivot_table` returning a missleading error message when the resulting DataFrame has more elements than int32 can handle. Now, the error message is improved, pointing towards the actual problem (:issue:`20601`)
+- Bug in :func:`DataFrame.unstack` and :func:`DataFrame.pivot_table` returning a misleading error message when the resulting DataFrame has more elements than int32 can handle. Now, the error message is improved, pointing towards the actual problem (:issue:`20601`)
- Bug in :func:`DataFrame.unstack` where a ``ValueError`` was raised when unstacking timezone aware values (:issue:`18338`)
- Bug in :func:`DataFrame.stack` where timezone aware values were converted to timezone naive values (:issue:`19420`)
- Bug in :func:`merge_asof` where a ``TypeError`` was raised when ``by_col`` were timezone aware values (:issue:`21184`)
@@ -1925,7 +1925,7 @@ Build changes
Other
^^^^^
-- Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before Pandas. (:issue:`24113`)
+- Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before pandas. (:issue:`24113`)
.. _whatsnew_0.24.0.contributors:
diff --git a/doc/source/whatsnew/v0.24.1.rst b/doc/source/whatsnew/v0.24.1.rst
index aead8c48eb9b7..1918a1e8caf6c 100644
--- a/doc/source/whatsnew/v0.24.1.rst
+++ b/doc/source/whatsnew/v0.24.1.rst
@@ -33,7 +33,7 @@ This change will allow ``sort=True`` to mean "always sort" in a future release.
The same change applies to :meth:`Index.difference` and :meth:`Index.symmetric_difference`, which
would not sort the result when the values could not be compared.
-The `sort` option for :meth:`Index.intersection` has changed in three ways.
+The ``sort`` option for :meth:`Index.intersection` has changed in three ways.
1. The default has changed from ``True`` to ``False``, to restore the
pandas 0.23.4 and earlier behavior of not sorting by default.
@@ -55,7 +55,7 @@ Fixed regressions
- Fixed regression in :class:`Index.intersection` incorrectly sorting the values by default (:issue:`24959`).
- Fixed regression in :func:`merge` when merging an empty ``DataFrame`` with multiple timezone-aware columns on one of the timezone-aware columns (:issue:`25014`).
- Fixed regression in :meth:`Series.rename_axis` and :meth:`DataFrame.rename_axis` where passing ``None`` failed to remove the axis name (:issue:`25034`)
-- Fixed regression in :func:`to_timedelta` with `box=False` incorrectly returning a ``datetime64`` object instead of a ``timedelta64`` object (:issue:`24961`)
+- Fixed regression in :func:`to_timedelta` with ``box=False`` incorrectly returning a ``datetime64`` object instead of a ``timedelta64`` object (:issue:`24961`)
- Fixed regression where custom hashable types could not be used as column keys in :meth:`DataFrame.set_index` (:issue:`24969`)
.. _whatsnew_0241.bug_fixes:
diff --git a/doc/source/whatsnew/v0.24.2.rst b/doc/source/whatsnew/v0.24.2.rst
index d1a893f99cff4..27e84bf0a7cd7 100644
--- a/doc/source/whatsnew/v0.24.2.rst
+++ b/doc/source/whatsnew/v0.24.2.rst
@@ -51,7 +51,6 @@ Bug fixes
- Bug where calling :meth:`Series.replace` on categorical data could return a ``Series`` with incorrect dimensions (:issue:`24971`)
-
--
**Reshaping**
diff --git a/doc/source/whatsnew/v0.25.0.rst b/doc/source/whatsnew/v0.25.0.rst
index 44558fd63ba15..43b42c5cb5648 100644
--- a/doc/source/whatsnew/v0.25.0.rst
+++ b/doc/source/whatsnew/v0.25.0.rst
@@ -14,7 +14,7 @@ What's new in 0.25.0 (July 18, 2019)
.. warning::
- `Panel` has been fully removed. For N-D labeled data structures, please
+ ``Panel`` has been fully removed. For N-D labeled data structures, please
use `xarray `_
.. warning::
@@ -36,7 +36,7 @@ Enhancements
Groupby aggregation with relabeling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas has added special groupby behavior, known as "named aggregation", for naming the
+pandas has added special groupby behavior, known as "named aggregation", for naming the
output columns when applying multiple aggregation functions to specific columns (:issue:`18366`, :issue:`26512`).
.. ipython:: python
@@ -53,7 +53,7 @@ output columns when applying multiple aggregation functions to specific columns
Pass the desired column names as the ``**kwargs`` to ``.agg``. The values of ``**kwargs``
should be tuples where the first element is the column selection, and the second element is the
-aggregation function to apply. Pandas provides the ``pandas.NamedAgg`` namedtuple to make it clearer
+aggregation function to apply. pandas provides the ``pandas.NamedAgg`` namedtuple to make it clearer
what the arguments to the function are, but plain tuples are accepted as well.
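A self-contained sketch of the pattern described above (the ``animals`` frame and its column names are made up for illustration):

.. code-block:: python

    import pandas as pd

    animals = pd.DataFrame(
        {"kind": ["cat", "dog", "cat", "dog"], "height": [9.1, 6.0, 9.5, 34.0]}
    )

    # Keyword names become the output column names; each value is a
    # (column, aggfunc) pair, spelled here with pandas.NamedAgg for clarity.
    result = animals.groupby("kind").agg(
        min_height=pd.NamedAgg(column="height", aggfunc="min"),
        max_height=pd.NamedAgg(column="height", aggfunc="max"),
    )
    print(result)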
.. ipython:: python
@@ -425,7 +425,7 @@ of ``object`` dtype. :attr:`Series.str` will now infer the dtype data *within* t
Categorical dtypes are preserved during groupby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Previously, columns that were categorical, but not the groupby key(s) would be converted to ``object`` dtype during groupby operations. Pandas now will preserve these dtypes. (:issue:`18502`)
+Previously, columns that were categorical, but not the groupby key(s) would be converted to ``object`` dtype during groupby operations. pandas now will preserve these dtypes. (:issue:`18502`)
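A quick sketch of the difference (the toy frame below is illustrative only):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame(
        {
            "payload": [-1, -2, -1, -2],
            "col": pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
        }
    )

    # "col" is not a groupby key; its categorical dtype is now preserved
    # in the result instead of being cast to object.
    print(df.groupby("payload").first().dtypes)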
.. ipython:: python
@@ -540,19 +540,19 @@ with :attr:`numpy.nan` in the case of an empty :class:`DataFrame` (:issue:`26397
.. ipython:: python
- df.describe()
+ df.describe()
``__str__`` methods now call ``__repr__`` rather than vice versa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Pandas has until now mostly defined string representations in a Pandas objects's
+pandas has until now mostly defined string representations in a pandas object's
``__str__``/``__unicode__``/``__bytes__`` methods, and called ``__str__`` from the ``__repr__``
method, if a specific ``__repr__`` method is not found. This is not needed for Python3.
-In Pandas 0.25, the string representations of Pandas objects are now generally
+In pandas 0.25, the string representations of pandas objects are now generally
defined in ``__repr__``, and calls to ``__str__`` in general now pass the call on to
the ``__repr__``, if a specific ``__str__`` method doesn't exist, as is standard for Python.
-This change is backward compatible for direct usage of Pandas, but if you subclass
-Pandas objects *and* give your subclasses specific ``__str__``/``__repr__`` methods,
+This change is backward compatible for direct usage of pandas, but if you subclass
+pandas objects *and* give your subclasses specific ``__str__``/``__repr__`` methods,
you may have to adjust your ``__str__``/``__repr__`` methods (:issue:`26495`).
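For subclasses, the adjustment typically amounts to defining the canonical text representation in ``__repr__`` and letting ``__str__`` fall back to it. A toy, purely illustrative subclass:

.. code-block:: python

    import pandas as pd

    class MySeries(pd.Series):
        # Define the representation in __repr__; under Python 3 semantics,
        # str(obj) falls back to __repr__ when no __str__ is defined.
        def __repr__(self):
            return f"MySeries(n={len(self)})"

    s = MySeries([1, 2, 3])
    print(repr(s))  # MySeries(n=3)
    print(str(s))   # also MySeries(n=3), via the __repr__ fallback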
.. _whatsnew_0250.api_breaking.interval_indexing:
@@ -881,7 +881,7 @@ Other API changes
- Bug in :meth:`DatetimeIndex.snap` which didn't preserve the ``name`` of the input :class:`Index` (:issue:`25575`)
- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`)
- The ``arg`` argument in :meth:`pandas.core.window._Window.aggregate` has been renamed to ``func`` (:issue:`26372`)
-- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`)
+- Most pandas classes had a ``__bytes__`` method, which was used for getting a Python 2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (:issue:`26447`)
- The ``.str``-accessor has been disabled for 1-level :class:`MultiIndex`, use :meth:`MultiIndex.to_flat_index` if necessary (:issue:`23679`)
- Removed support of gtk package for clipboards (:issue:`26563`)
- Using an unsupported version of Beautiful Soup 4 will now raise an ``ImportError`` instead of a ``ValueError`` (:issue:`27063`)
@@ -1167,7 +1167,7 @@ I/O
- Fixed bug in :func:`pandas.read_csv` where a BOM would result in incorrect parsing using engine='python' (:issue:`26545`)
- :func:`read_excel` now raises a ``ValueError`` when input is of type :class:`pandas.io.excel.ExcelFile` and ``engine`` param is passed since :class:`pandas.io.excel.ExcelFile` has an engine defined (:issue:`26566`)
- Bug while selecting from :class:`HDFStore` with ``where=''`` specified (:issue:`26610`).
-- Fixed bug in :func:`DataFrame.to_excel()` where custom objects (i.e. `PeriodIndex`) inside merged cells were not being converted into types safe for the Excel writer (:issue:`27006`)
+- Fixed bug in :func:`DataFrame.to_excel()` where custom objects (i.e. ``PeriodIndex``) inside merged cells were not being converted into types safe for the Excel writer (:issue:`27006`)
- Bug in :meth:`read_hdf` where reading a timezone aware :class:`DatetimeIndex` would raise a ``TypeError`` (:issue:`11926`)
- Bug in :meth:`to_msgpack` and :meth:`read_msgpack` which would raise a ``ValueError`` rather than a ``FileNotFoundError`` for an invalid path (:issue:`27160`)
- Fixed bug in :meth:`DataFrame.to_parquet` which would raise a ``ValueError`` when the dataframe had no columns (:issue:`27339`)
@@ -1206,7 +1206,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.agg` where incorrect results are returned for uint64 columns. (:issue:`26310`)
- Bug in :meth:`pandas.core.window.Rolling.median` and :meth:`pandas.core.window.Rolling.quantile` where MemoryError is raised with empty window (:issue:`26005`)
- Bug in :meth:`pandas.core.window.Rolling.median` and :meth:`pandas.core.window.Rolling.quantile` where incorrect results are returned with ``closed='left'`` and ``closed='neither'`` (:issue:`26005`)
-- Improved :class:`pandas.core.window.Rolling`, :class:`pandas.core.window.Window` and :class:`pandas.core.window.EWM` functions to exclude nuisance columns from results instead of raising errors and raise a ``DataError`` only if all columns are nuisance (:issue:`12537`)
+- Improved :class:`pandas.core.window.Rolling`, :class:`pandas.core.window.Window` and :class:`pandas.core.window.ExponentialMovingWindow` functions to exclude nuisance columns from results instead of raising errors and raise a ``DataError`` only if all columns are nuisance (:issue:`12537`)
- Bug in :meth:`pandas.core.window.Rolling.max` and :meth:`pandas.core.window.Rolling.min` where incorrect results are returned with an empty variable window (:issue:`26005`)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of :meth:`pandas.core.window.Window.aggregate` (:issue:`26597`)
@@ -1262,7 +1262,7 @@ Other
- Removed unused C functions from vendored UltraJSON implementation (:issue:`26198`)
- Allow :class:`Index` and :class:`RangeIndex` to be passed to numpy ``min`` and ``max`` functions (:issue:`26125`)
- Use actual class name in repr of empty objects of a ``Series`` subclass (:issue:`27001`).
-- Bug in :class:`DataFrame` where passing an object array of timezone-aware `datetime` objects would incorrectly raise ``ValueError`` (:issue:`13287`)
+- Bug in :class:`DataFrame` where passing an object array of timezone-aware ``datetime`` objects would incorrectly raise ``ValueError`` (:issue:`13287`)
.. _whatsnew_0.250.contributors:
diff --git a/doc/source/whatsnew/v0.25.1.rst b/doc/source/whatsnew/v0.25.1.rst
index 944021ca0fcae..8a16bab63f1bf 100644
--- a/doc/source/whatsnew/v0.25.1.rst
+++ b/doc/source/whatsnew/v0.25.1.rst
@@ -9,10 +9,10 @@ including other versions of pandas.
I/O and LZMA
~~~~~~~~~~~~
-Some users may unknowingly have an incomplete Python installation lacking the `lzma` module from the standard library. In this case, `import pandas` failed due to an `ImportError` (:issue:`27575`).
-Pandas will now warn, rather than raising an `ImportError` if the `lzma` module is not present. Any subsequent attempt to use `lzma` methods will raise a `RuntimeError`.
-A possible fix for the lack of the `lzma` module is to ensure you have the necessary libraries and then re-install Python.
-For example, on MacOS installing Python with `pyenv` may lead to an incomplete Python installation due to unmet system dependencies at compilation time (like `xz`). Compilation will succeed, but Python might fail at run time. The issue can be solved by installing the necessary dependencies and then re-installing Python.
+Some users may unknowingly have an incomplete Python installation lacking the ``lzma`` module from the standard library. In this case, ``import pandas`` failed due to an ``ImportError`` (:issue:`27575`).
+pandas will now warn, rather than raising an ``ImportError`` if the ``lzma`` module is not present. Any subsequent attempt to use ``lzma`` methods will raise a ``RuntimeError``.
+A possible fix for the lack of the ``lzma`` module is to ensure you have the necessary libraries and then re-install Python.
+For example, on MacOS installing Python with ``pyenv`` may lead to an incomplete Python installation due to unmet system dependencies at compilation time (like ``xz``). Compilation will succeed, but Python might fail at run time. The issue can be solved by installing the necessary dependencies and then re-installing Python.
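A quick way to check whether the current interpreter was built with ``lzma`` support (purely illustrative):

.. code-block:: python

    # If this import fails, the Python build is missing the xz/lzma headers;
    # install them (e.g. the xz package) and rebuild or reinstall Python.
    try:
        import lzma  # noqa: F401
        print("lzma is available")
    except ImportError:
        print("lzma is missing from this Python installation")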
.. _whatsnew_0251.bug_fixes:
@@ -52,7 +52,7 @@ Conversion
Interval
^^^^^^^^
-- Bug in :class:`IntervalIndex` where `dir(obj)` would raise ``ValueError`` (:issue:`27571`)
+- Bug in :class:`IntervalIndex` where ``dir(obj)`` would raise ``ValueError`` (:issue:`27571`)
Indexing
^^^^^^^^
@@ -89,13 +89,13 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where applying a timezone conversion lambda function would drop timezone information (:issue:`27496`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where ``observed=False`` was being ignored for Categorical groupers (:issue:`26385`)
- Bug in windowing over read-only arrays (:issue:`27766`)
-- Fixed segfault in `pandas.core.groupby.DataFrameGroupBy.quantile` when an invalid quantile was passed (:issue:`27470`)
+- Fixed segfault in ``pandas.core.groupby.DataFrameGroupBy.quantile`` when an invalid quantile was passed (:issue:`27470`)
Reshaping
^^^^^^^^^
- A ``KeyError`` is now raised if ``.unstack()`` is called on a :class:`Series` or :class:`DataFrame` with a flat :class:`Index` passing a name which is not the correct one (:issue:`18303`)
-- Bug :meth:`merge_asof` could not merge :class:`Timedelta` objects when passing `tolerance` kwarg (:issue:`27642`)
+- Bug where :meth:`merge_asof` could not merge :class:`Timedelta` objects when passing the ``tolerance`` kwarg (:issue:`27642`)
- Bug in :meth:`DataFrame.crosstab` where an error was raised when ``margins`` was set to ``True`` and ``normalize`` was not ``False`` (:issue:`27500`)
- :meth:`DataFrame.join` now suppresses the ``FutureWarning`` when the sort parameter is specified (:issue:`21952`)
- Bug in :meth:`DataFrame.join` raising with readonly arrays (:issue:`27943`)
diff --git a/doc/source/whatsnew/v0.25.2.rst b/doc/source/whatsnew/v0.25.2.rst
index c0c68ce4b1f44..a5ea8933762ab 100644
--- a/doc/source/whatsnew/v0.25.2.rst
+++ b/doc/source/whatsnew/v0.25.2.rst
@@ -8,7 +8,7 @@ including other versions of pandas.
.. note::
- Pandas 0.25.2 adds compatibility for Python 3.8 (:issue:`28147`).
+ pandas 0.25.2 adds compatibility for Python 3.8 (:issue:`28147`).
.. _whatsnew_0252.bug_fixes:
diff --git a/doc/source/whatsnew/v0.5.0.rst b/doc/source/whatsnew/v0.5.0.rst
index 7ccb141260f18..7447a10fa1d6b 100644
--- a/doc/source/whatsnew/v0.5.0.rst
+++ b/doc/source/whatsnew/v0.5.0.rst
@@ -9,7 +9,7 @@ Version 0.5.0 (October 24, 2011)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
New features
diff --git a/doc/source/whatsnew/v0.6.0.rst b/doc/source/whatsnew/v0.6.0.rst
index f984b9ad71b63..8ff688eaa91e7 100644
--- a/doc/source/whatsnew/v0.6.0.rst
+++ b/doc/source/whatsnew/v0.6.0.rst
@@ -8,7 +8,7 @@ Version 0.6.0 (November 25, 2011)
.. ipython:: python
:suppress:
- from pandas import * # noqa F401, F403
+ from pandas import * # noqa F401, F403
New features
@@ -52,7 +52,7 @@ New features
Performance enhancements
~~~~~~~~~~~~~~~~~~~~~~~~
- VBENCH Cythonized ``cache_readonly``, resulting in substantial micro-performance enhancements throughout the code base (:issue:`361`)
-- VBENCH Special Cython matrix iterator for applying arbitrary reduction operations with 3-5x better performance than `np.apply_along_axis` (:issue:`309`)
+- VBENCH Special Cython matrix iterator for applying arbitrary reduction operations with 3-5x better performance than ``np.apply_along_axis`` (:issue:`309`)
- VBENCH Improved performance of ``MultiIndex.from_tuples``
- VBENCH Special Cython matrix iterator for applying arbitrary reduction operations
- VBENCH + DOCUMENT Add ``raw`` option to ``DataFrame.apply`` for getting better performance when
diff --git a/doc/source/whatsnew/v0.6.1.rst b/doc/source/whatsnew/v0.6.1.rst
index 8eea0a07f1f79..8ee80fa2c44b1 100644
--- a/doc/source/whatsnew/v0.6.1.rst
+++ b/doc/source/whatsnew/v0.6.1.rst
@@ -16,12 +16,12 @@ New features
- Add PyQt table widget to sandbox (:issue:`435`)
- DataFrame.align can :ref:`accept Series arguments `
and an :ref:`axis option ` (:issue:`461`)
-- Implement new :ref:`SparseArray ` and `SparseList`
+- Implement new :ref:`SparseArray ` and ``SparseList``
data structures. SparseSeries now derives from SparseArray (:issue:`463`)
- :ref:`Better console printing options ` (:issue:`453`)
- Implement fast :ref:`data ranking ` for Series and
DataFrame, fast versions of scipy.stats.rankdata (:issue:`428`)
-- Implement `DataFrame.from_items` alternate
+- Implement ``DataFrame.from_items`` alternate
constructor (:issue:`444`)
- DataFrame.convert_objects method for :ref:`inferring better dtypes `
for object columns (:issue:`302`)
@@ -37,7 +37,7 @@ New features
Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
-- Improve memory usage of `DataFrame.describe` (do not copy data
+- Improve memory usage of ``DataFrame.describe`` (do not copy data
unnecessarily) (PR #425)
- Optimize scalar value lookups in the general case by 25% or more in Series
diff --git a/doc/source/whatsnew/v0.7.0.rst b/doc/source/whatsnew/v0.7.0.rst
index a193b8049e951..2fe686d8858a2 100644
--- a/doc/source/whatsnew/v0.7.0.rst
+++ b/doc/source/whatsnew/v0.7.0.rst
@@ -20,7 +20,7 @@ New features
``DataFrame.append`` (:issue:`468`, :issue:`479`, :issue:`273`)
- :ref:`Can |