diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 08488a33936f0..494d5308b284c 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -29,7 +29,7 @@ a default integer index:
     s = pd.Series([1, 3, 5, np.nan, 6, 8])
     s
 
-Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index
+Creating a :class:`DataFrame` by passing a NumPy array, with a datetime index using :func:`date_range`
 and labeled columns:
 
 .. ipython:: python
@@ -93,14 +93,15 @@ Viewing data
 
 See the :ref:`Basics section `.
 
-Here is how to view the top and bottom rows of the frame:
+Use :meth:`DataFrame.head` and :meth:`DataFrame.tail` to view the top and bottom rows of the frame
+respectively:
 
 .. ipython:: python
 
     df.head()
     df.tail(3)
 
-Display the index, columns:
+Display the :attr:`DataFrame.index` or :attr:`DataFrame.columns`:
 
 .. ipython:: python
@@ -116,7 +117,7 @@ while pandas DataFrames have one dtype per column**. When you call
 of the dtypes in the DataFrame. This may end up being ``object``, which requires
 casting every value to a Python object.
 
-For ``df``, our :class:`DataFrame` of all floating-point values,
+For ``df``, our :class:`DataFrame` of all floating-point values,
 :meth:`DataFrame.to_numpy` is fast and doesn't require copying data:
 
 .. ipython:: python
@@ -147,13 +148,13 @@ Transposing your data:
 
     df.T
 
-Sorting by an axis:
+:meth:`DataFrame.sort_index` sorts by an axis:
 
 .. ipython:: python
 
     df.sort_index(axis=1, ascending=False)
 
-Sorting by values:
+:meth:`DataFrame.sort_values` sorts by values:
 
 .. ipython:: python
@@ -166,8 +167,8 @@ Selection
 
    While standard Python / NumPy expressions for selecting and setting are
    intuitive and come in handy for interactive work, for production code, we
-   recommend the optimized pandas data access methods, ``.at``, ``.iat``,
-   ``.loc`` and ``.iloc``.
+   recommend the optimized pandas data access methods, :meth:`DataFrame.at`, :meth:`DataFrame.iat`,
+   :meth:`DataFrame.loc` and :meth:`DataFrame.iloc`.
 
 See the indexing documentation :ref:`Indexing and Selecting Data ` and
 :ref:`MultiIndex / Advanced Indexing `.
@@ -181,7 +182,7 @@ equivalent to ``df.A``:
 
     df["A"]
 
-Selecting via ``[]``, which slices the rows:
+Selecting via ``[]`` (``__getitem__``), which slices the rows:
 
 .. ipython:: python
@@ -191,7 +192,7 @@ Selecting via ``[]``, which slices the rows:
 Selection by label
 ~~~~~~~~~~~~~~~~~~
 
-See more in :ref:`Selection by Label `.
+See more in :ref:`Selection by Label ` using :meth:`DataFrame.loc` or :meth:`DataFrame.at`.
 
 For getting a cross section using a label:
@@ -232,7 +233,7 @@ For getting fast access to a scalar (equivalent to the prior method):
 Selection by position
 ~~~~~~~~~~~~~~~~~~~~~
 
-See more in :ref:`Selection by Position `.
+See more in :ref:`Selection by Position ` using :meth:`DataFrame.iloc` or :meth:`DataFrame.iat`.
 
 Select via the position of the passed integers:
@@ -361,19 +362,19 @@ returns a copy of the data:
 
     df1.loc[dates[0] : dates[1], "E"] = 1
     df1
 
-To drop any rows that have missing data:
+:meth:`DataFrame.dropna` drops any rows that have missing data:
 
 .. ipython:: python
 
     df1.dropna(how="any")
 
-Filling missing data:
+:meth:`DataFrame.fillna` fills missing data:
 
 .. ipython:: python
 
     df1.fillna(value=5)
 
-To get the boolean mask where values are ``nan``:
+:func:`isna` gets the boolean mask where values are ``nan``:
 
 .. ipython:: python
@@ -415,7 +416,7 @@ In addition, pandas automatically broadcasts along the specified dimension:
 Apply
 ~~~~~
 
-Applying functions to the data:
+:meth:`DataFrame.apply` applies a user-defined function to the data:
 
 .. ipython:: python
@@ -461,7 +462,7 @@ operations.
 
 See the :ref:`Merging section `.
 
-Concatenating pandas objects together with :func:`concat`:
+Concatenating pandas objects together along an axis with :func:`concat`:
 
 .. ipython:: python
@@ -482,7 +483,7 @@ Concatenating pandas objects together with :func:`concat`:
 Join
 ~~~~
 
-SQL style merges. See the :ref:`Database style joining ` section.
+:func:`merge` enables SQL-style join types along specific columns. See the :ref:`Database style joining ` section.
 
 .. ipython:: python
@@ -572,7 +573,7 @@ columns:
 
     stacked = df2.stack()
     stacked
 
-With a "stacked" DataFrame or Series (having a ``MultiIndex`` as the
+With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the
 ``index``), the inverse operation of :meth:`~DataFrame.stack` is
 :meth:`~DataFrame.unstack`, which by default unstacks the **last level**:
@@ -599,7 +600,7 @@ See the section on :ref:`Pivot Tables `.
     )
     df
 
-We can produce pivot tables from this data very easily:
+:func:`pivot_table` pivots a :class:`DataFrame` specifying the ``values``, ``index`` and ``columns``:
 
 .. ipython:: python
@@ -620,7 +621,7 @@ financial applications. See the :ref:`Time Series section `.
 
     ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
     ts.resample("5Min").sum()
 
-Time zone representation:
+:meth:`Series.tz_localize` localizes a time series to a time zone:
 
 .. ipython:: python
@@ -630,7 +631,7 @@ Time zone representation:
 
     ts_utc = ts.tz_localize("UTC")
     ts_utc
 
-Converting to another time zone:
+:meth:`Series.tz_convert` converts a timezone-aware time series to another time zone:
 
 .. ipython:: python
@@ -722,7 +723,7 @@ We use the standard convention for referencing the matplotlib API:
 
     plt.close("all")
 
-The :meth:`~plt.close` method is used to `close `__ a figure window:
+The ``plt.close`` method is used to `close `__ a figure window:
 
 .. ipython:: python
@@ -732,7 +733,7 @@ The :meth:`~plt.close` method is used to `close `__
 to show it or `matplotlib.pyplot.savefig `__ to write it to a file.
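The reshaping and time-zone hunks above can be sanity-checked with a small self-contained sketch (illustrative only, not part of the patch; the toy frame, dates, and values are invented):

```python
import pandas as pd

# Toy frame mirroring the shape of the section's pivot_table example.
df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "B": ["x", "y", "x", "y"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# pivot_table arranges ``values`` by ``index`` and ``columns``.
table = pd.pivot_table(df, values="C", index="A", columns="B")

# tz_localize attaches a time zone to a naive index;
# tz_convert translates an aware index to another zone.
ts = pd.Series([1, 2], index=pd.date_range("2012-03-06", periods=2, freq="D"))
ts_utc = ts.tz_localize("UTC")
ts_eastern = ts_utc.tz_convert("US/Eastern")
```

Converting only relabels the wall-clock times; the underlying instants (and the Series values) are unchanged.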
@@ -756,19 +757,19 @@ of the columns with labels:
 
     @savefig frame_plot_basic.png
     plt.legend(loc='best');
 
-Getting data in/out
--------------------
+Importing and exporting data
+----------------------------
 
 CSV
 ~~~
 
-:ref:`Writing to a csv file: `
+:ref:`Writing to a csv file: ` using :meth:`DataFrame.to_csv`
 
 .. ipython:: python
 
     df.to_csv("foo.csv")
 
-:ref:`Reading from a csv file: `
+:ref:`Reading from a csv file: ` using :func:`read_csv`
 
 .. ipython:: python
@@ -786,13 +787,13 @@ HDF5
 
 Reading and writing to :ref:`HDFStores `.
 
-Writing to a HDF5 Store:
+Writing to an HDF5 Store using :meth:`DataFrame.to_hdf`:
 
 .. ipython:: python
 
     df.to_hdf("foo.h5", "df")
 
-Reading from a HDF5 Store:
+Reading from an HDF5 Store using :func:`read_hdf`:
 
 .. ipython:: python
@@ -806,15 +807,15 @@ Reading from a HDF5 Store:
 Excel
 ~~~~~
 
-Reading and writing to :ref:`MS Excel `.
+Reading and writing to :ref:`Excel `.
 
-Writing to an excel file:
+Writing to an Excel file using :meth:`DataFrame.to_excel`:
 
 .. ipython:: python
 
     df.to_excel("foo.xlsx", sheet_name="Sheet1")
 
-Reading from an excel file:
+Reading from an Excel file using :func:`read_excel`:
 
 .. ipython:: python
@@ -828,16 +829,13 @@ Reading from an excel file:
 Gotchas
 -------
 
-If you are attempting to perform an operation you might see an exception like:
+If you are attempting to perform a boolean operation on a :class:`Series` or :class:`DataFrame`
+you might see an exception like:
 
-.. code-block:: python
-
-   >>> if pd.Series([False, True, False]):
-   ...     print("I was true")
-   Traceback
-      ...
-   ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+.. ipython:: python
+   :okexcept:
 
-See :ref:`Comparisons` for an explanation and what to do.
+    if pd.Series([False, True, False]):
+        print("I was true")
 
-See :ref:`Gotchas` as well.
+See :ref:`Comparisons` and :ref:`Gotchas` for an explanation and what to do.
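A minimal runnable sketch of the CSV round trip and the truth-value gotcha described above (the file name and toy data are invented for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# CSV round trip; index=False keeps the integer index out of the file.
path = os.path.join(tempfile.mkdtemp(), "foo.csv")
df.to_csv(path, index=False)
roundtripped = pd.read_csv(path)

# The gotcha: a Series has no single truth value, so bool(s) raises...
s = pd.Series([False, True, False])
try:
    ambiguous = bool(s)
except ValueError:
    ambiguous = None  # ...reduce explicitly with .any() / .all() instead.
any_true = bool(s.any())
```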
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index c78d972f33d65..7402fe10aeacb 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -7,10 +7,10 @@ Enhancing performance
 *********************
 
 In this part of the tutorial, we will investigate how to speed up certain
-functions operating on pandas ``DataFrames`` using three different techniques:
+functions operating on pandas :class:`DataFrame` using three different techniques:
 Cython, Numba and :func:`pandas.eval`. We will see a speed improvement of ~200
 when we use Cython and Numba on a test function operating row-wise on the
-``DataFrame``. Using :func:`pandas.eval` we will speed up a sum by an order of
+:class:`DataFrame`. Using :func:`pandas.eval` we will speed up a sum by an order of
 ~2.
 
 .. note::
@@ -44,7 +44,7 @@ faster than the pure Python solution.
 Pure Python
 ~~~~~~~~~~~
 
-We have a ``DataFrame`` to which we want to apply a function row-wise.
+We have a :class:`DataFrame` to which we want to apply a function row-wise.
 
 .. ipython:: python
@@ -73,12 +73,11 @@ Here's the function in pure Python:
             s += f(a + i * dx)
         return s * dx
 
-We achieve our result by using ``apply`` (row-wise):
+We achieve our result by using :meth:`DataFrame.apply` (row-wise):
 
-.. code-block:: ipython
+.. ipython:: python
 
-   In [7]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
-   10 loops, best of 3: 174 ms per loop
+    %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
 
 But clearly this isn't fast enough for us. Let's take a look and see where the
 time is spent during this operation (limited to the most time consuming
@@ -126,10 +125,9 @@ is here to distinguish between function versions):
   to be using bleeding edge IPython for paste to play well with cell
   magics.
 
-.. code-block:: ipython
+.. ipython:: python
 
-   In [4]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
-   10 loops, best of 3: 85.5 ms per loop
+    %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
 
 Already this has shaved a third off, not too bad for a simple copy and paste.
@@ -155,10 +153,9 @@ We get another huge improvement simply by providing type information:
    ...:     return s * dx
    ...:
 
-.. code-block:: ipython
+.. ipython:: python
 
-   In [4]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
-   10 loops, best of 3: 20.3 ms per loop
+    %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
 
 Now, we're talking! It's now over ten times faster than the original Python
 implementation, and we haven't *really* modified the code. Let's have another
@@ -173,7 +170,7 @@ look at what's eating up time:
 Using ndarray
 ~~~~~~~~~~~~~
 
-It's calling series... a lot! It's creating a Series from each row, and get-ting from both
+It's calling :class:`Series` a lot! It's creating a :class:`Series` from each row, and calling get from both
 the index and the series (three times for each row). Function calls are
 expensive in Python, so maybe we could minimize these by cythonizing the apply part.
@@ -216,10 +213,10 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array
 
 .. warning::
 
-   You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
+   You can **not pass** a :class:`Series` directly as a ``ndarray`` typed parameter
    to a Cython function. Instead pass the actual ``ndarray`` using the
    :meth:`Series.to_numpy`. The reason is that the Cython
-   definition is specific to an ndarray and not the passed ``Series``.
+   definition is specific to an ndarray and not the passed :class:`Series`.
 
   So, do not do this:
@@ -238,10 +235,9 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array
 Loops like this would be *extremely* slow in Python, but in Cython looping
 over NumPy arrays is *fast*.
 
-.. code-block:: ipython
+.. ipython:: python
 
-   In [4]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
-   1000 loops, best of 3: 1.25 ms per loop
+    %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
 
 We've gotten another big improvement. Let's check again where the time is spent:
@@ -267,33 +263,33 @@ advanced Cython techniques:
    ...: cimport cython
    ...: cimport numpy as np
    ...: import numpy as np
-   ...: cdef double f_typed(double x) except? -2:
+   ...: cdef np.float64_t f_typed(np.float64_t x) except? -2:
    ...:     return x * (x - 1)
-   ...: cpdef double integrate_f_typed(double a, double b, int N):
-   ...:     cdef int i
-   ...:     cdef double s, dx
-   ...:     s = 0
+   ...: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
+   ...:     cdef np.int64_t i
+   ...:     cdef np.float64_t s = 0.0, dx
    ...:     dx = (b - a) / N
    ...:     for i in range(N):
    ...:         s += f_typed(a + i * dx)
    ...:     return s * dx
    ...: @cython.boundscheck(False)
    ...: @cython.wraparound(False)
-   ...: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
-   ...:                                                 np.ndarray[double] col_b,
-   ...:                                                 np.ndarray[int] col_N):
-   ...:     cdef int i, n = len(col_N)
+   ...: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
+   ...:     np.ndarray[np.float64_t] col_a,
+   ...:     np.ndarray[np.float64_t] col_b,
+   ...:     np.ndarray[np.int64_t] col_N
+   ...: ):
+   ...:     cdef np.int64_t i, n = len(col_N)
    ...:     assert len(col_a) == len(col_b) == n
-   ...:     cdef np.ndarray[double] res = np.empty(n)
+   ...:     cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
    ...:     for i in range(n):
    ...:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    ...:     return res
    ...:
 
-.. code-block:: ipython
+.. ipython:: python
 
-   In [4]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
-   1000 loops, best of 3: 987 us per loop
+    %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
 
 Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
 for example) might cause a segfault because memory access isn't checked.
@@ -321,7 +317,7 @@ Numba supports compilation of Python to run on either CPU or GPU hardware and is
 Numba can be used in 2 ways with pandas:
 
 #. Specify the ``engine="numba"`` keyword in select pandas methods
-#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`Dataframe` (using ``to_numpy()``) into the function
+#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function
 
 pandas Numba Engine
 ~~~~~~~~~~~~~~~~~~~
@@ -595,8 +591,8 @@ Now let's do the same thing but with comparisons:
    of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
    operations in plain Python.
 
-The ``DataFrame.eval`` method
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The :meth:`DataFrame.eval` method
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In addition to the top level :func:`pandas.eval` function you can also
 evaluate an expression in the "context" of a :class:`~pandas.DataFrame`.
@@ -630,7 +626,7 @@ new column name or an existing column name, and it must be a valid Python
 identifier.
 
 The ``inplace`` keyword determines whether this assignment will performed
-on the original ``DataFrame`` or return a copy with the new column.
+on the original :class:`DataFrame` or return a copy with the new column.
 
 .. ipython:: python
@@ -640,7 +636,7 @@ on the original ``DataFrame`` or return a copy with the new column.
     df.eval("a = 1", inplace=True)
     df
 
-When ``inplace`` is set to ``False``, the default, a copy of the ``DataFrame`` with the
+When ``inplace`` is set to ``False``, the default, a copy of the :class:`DataFrame` with the
 new or modified columns is returned and the original frame is unchanged.
 
 .. ipython:: python
@@ -672,7 +668,7 @@ The equivalent in standard Python would be
 
     df["a"] = 1
     df
 
-The ``query`` method has a ``inplace`` keyword which determines
+The :meth:`DataFrame.query` method has an ``inplace`` keyword which determines
 whether the query modifies the original frame.
 
 .. ipython:: python
@@ -814,7 +810,7 @@ computation. The two lines are two different engines.
 
 .. image:: ../_static/eval-perf-small.png
 
-This plot was created using a ``DataFrame`` with 3 columns each containing
+This plot was created using a :class:`DataFrame` with 3 columns each containing
 floating point values generated using ``numpy.random.randn()``.
 
 Technical minutia regarding expression evaluation
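For reference, the pure-Python integrand from the Cython walkthrough can be exercised directly, alongside a NumPy-vectorized variant (an illustrative alternative to the compiled versions, not code from the patch; the toy frame is invented):

```python
import numpy as np
import pandas as pd


def f(x):
    # Integrand used throughout the section: x * (x - 1)
    return x * (x - 1)


def integrate_f(a, b, N):
    # Left Riemann sum, as in the section's pure-Python version.
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def integrate_f_numpy(a, b, N):
    # The same sum vectorized over an ndarray -- the idea the Cython
    # ndarray version exploits, without a compile step.
    dx = (b - a) / N
    x = a + np.arange(N) * dx
    return f(x).sum() * dx


# Row-wise application over a toy frame, mirroring df.apply in the section.
df = pd.DataFrame({"a": [0.0, 0.0], "b": [1.0, 2.0], "N": [1000, 2000]})
result = df.apply(lambda r: integrate_f(r["a"], r["b"], int(r["N"])), axis=1)
```

For f(x) = x * (x - 1), the exact integral over [0, 1] is -1/6, which the sum approaches as N grows.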
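The ``eval``/``query`` semantics discussed at the end can likewise be checked with a short sketch (toy frame, illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.eval with inplace=False (the default) returns a modified copy
# and leaves the original frame untouched.
with_sum = df.eval("c = a + b")

# DataFrame.query filters rows with an expression over column names;
# with the default inplace=False the original frame is again unchanged.
big = df.query("b > 10")
```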