Skip to content

DOC: ecosystem: Vaex, Pandas on Ray, alphabetization, pandas-datareader #20345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jul 8, 2018
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
131 changes: 85 additions & 46 deletions doc/source/ecosystem.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,13 @@ build powerful and more focused data tools.
The creation of libraries that complement pandas' functionality also allows pandas
development to remain focused around it's original requirements.

This is an in-exhaustive list of projects that build on pandas in order to provide
tools in the PyData space.
This is an inexhaustive list of projects that build on pandas in order to provide
tools in the PyData space. For a list of projects that depend on pandas,
see the
`libraries.io usage page for pandas <https://libraries.io/pypi/pandas/usage>`_
or `search pypi for pandas <https://pypi.org/search/?q=pandas>`_.

We'd like to make it easier for users to find these project, if you know of other
We'd like to make it easier for users to find these projects, if you know of other
substantial projects that you feel should be on this list, please let us know.


Expand Down Expand Up @@ -48,6 +51,17 @@ Featuretools is a Python library for automated feature engineering built on top
Visualization
-------------

`Altair <https://altair-viz.github.io/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Altair is a declarative statistical visualization library for Python.
With Altair, you can spend more time understanding your data and its
meaning. Altair's API is simple, friendly and consistent and built on
top of the powerful Vega-Lite JSON specification. This elegant
simplicity produces beautiful and effective visualizations with a
minimal amount of code. Altair works with Pandas DataFrames.


`Bokeh <http://bokeh.pydata.org>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -68,31 +82,22 @@ also goes beyond matplotlib and pandas with the option to perform statistical
estimation while plotting, aggregating across observations and visualizing the
fit of statistical models to emphasize patterns in a dataset.

`yhat/ggplot <https://github.com/yhat/ggplot>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`yhat/ggpy <https://github.com/yhat/ggpy>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hadley Wickham's `ggplot2 <http://ggplot2.org/>`__ is a foundational exploratory visualization package for the R language.
Based on `"The Grammar of Graphics" <http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html>`__ it
provides a powerful, declarative and extremely general way to generate bespoke plots of any kind of data.
It's really quite incredible. Various implementations to other languages are available,
but a faithful implementation for Python users has long been missing. Although still young
(as of Jan-2014), the `yhat/ggplot <https://github.com/yhat/ggplot>`__ project has been
(as of Jan-2014), the `yhat/ggpy <https://github.com/yhat/ggpy>`__ project has been
progressing quickly in that direction.

`Vincent <https://github.com/wrobstory/vincent>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `Vincent <https://github.com/wrobstory/vincent>`__ project leverages `Vega <https://github.com/trifacta/vega>`__
(that in turn, leverages `d3 <http://d3js.org/>`__) to create
plots. Although functional, as of Summer 2016 the Vincent project has not been updated
in over two years and is `unlikely to receive further updates <https://github.com/wrobstory/vincent#2015-08-12-update>`__.

`IPython Vega <https://github.com/vega/ipyvega>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like Vincent, the `IPython Vega <https://github.com/vega/ipyvega>`__ project leverages `Vega
<https://github.com/trifacta/vega>`__ to create plots, but primarily
targets the IPython Notebook environment.
`IPython Vega <https://github.com/vega/ipyvega>`__ leverages `Vega
<https://github.com/trifacta/vega>`__ to create plots within Jupyter Notebook.

`Plotly <https://plot.ly/python>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand All @@ -115,20 +120,28 @@ IDE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

IPython is an interactive command shell and distributed computing
environment.
IPython Notebook is a web application for creating IPython notebooks.
An IPython notebook is a JSON document containing an ordered list
environment. IPython tab completion works with Pandas methods and also
attributes like DataFrame columns.

`Jupyter Notebook / Jupyter Lab <https://jupyter.org>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jupyter Notebook is a web application for creating Jupyter notebooks.
A Jupyter notebook is a JSON document containing an ordered list
of input/output cells which can contain code, text, mathematics, plots
and rich media.
IPython notebooks can be converted to a number of open standard output formats
Jupyter notebooks can be converted to a number of open standard output formats
(HTML, HTML presentation slides, LaTeX, PDF, ReStructuredText, Markdown,
Python) through 'Download As' in the web interface and ``ipython nbconvert``
Python) through 'Download As' in the web interface and ``jupyter convert``
in a shell.

Pandas DataFrames implement ``_repr_html_`` methods
which are utilized by IPython Notebook for displaying
(abbreviated) HTML tables. (Note: HTML tables may or may not be
compatible with non-HTML IPython output formats.)
Pandas DataFrames implement ``_repr_html_``and ``_repr_latex`` methods
which are utilized by Jupyter Notebook for displaying
(abbreviated) HTML or LaTeX tables. LaTeX output is properly escaped.
(Note: HTML tables may or may not be
compatible with non-HTML Jupyter output formats.)

See :ref:`Options and Settings <options>` and :ref:`<options.available>`
for pandas ``display.`` settings.

`quantopian/qgrid <https://github.com/quantopian/qgrid>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand All @@ -144,11 +157,10 @@ editing, testing, debugging, and introspection features.
Spyder can now introspect and display Pandas DataFrames and show
both "column wise min/max and global min/max coloring."


.. _ecosystem.api:

API
-----
---

`pandas-datareader <https://github.com/pydata/pandas-datareader>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand All @@ -159,14 +171,22 @@ See more in the `pandas-datareader docs <https://pandas-datareader.readthedocs.

The following data feeds are available:

* Yahoo! Finance
* Google Finance
* FRED
* Fama/French
* World Bank
* OECD
* Eurostat
* EDGAR Index
* Google Finance
* Tiingo
* Morningstar
* IEX
* Robinhood
* Enigma
* Quandl
* FRED
* Fama/French
* World Bank
* OECD
* Eurostat
* TSP Fund Data
* Nasdaq Trader Symbol Definitions
* Stooq Index Data
* MOEX Data

`quandl/Python <https://github.com/quandl/Python>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -227,25 +247,24 @@ dimensional arrays, rather than the tabular data for which pandas excels.
Out-of-core
-------------

`Blaze <http://blaze.pydata.org/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Blaze provides a standard API for doing computations with various
in-memory and on-disk backends: NumPy, Pandas, SQLAlchemy, MongoDB, PyTables,
PySpark.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you are only putting them in alphabetical order, but I not sure to what extent the blaze library itself is still actively developed (in contrast to things that grew out of it, like dask)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @llllllllll is still doing some work on it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could remove blaze I think. Should add Ibis though:http://docs.ibis-project.org/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibis also 'only' does to/from in relation to pandas?
http://docs.ibis-project.org/impala.html#ingesting-data-from-pandas

Would we add ibis under 'Out of Core' or a new 'Similar API' section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I, and a few others, are still using blaze, but it is not very active. We are still using blaze at @quantopian in production, and it is still maintained. Given its current state, I am fine with either leaving it or removing it.


`Dask <https://dask.readthedocs.io/en/latest/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask is a flexible parallel computing library for analytics. Dask
provides a familiar ``DataFrame`` interface for out-of-core, parallel and distributed computing.

`Dask-ML <https://dask-ml.readthedocs.io/en/latest/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line lengths appear to be equal in my editor, and the output from rst2html.py doesn't show an error.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask-ML enables parallel and distributed machine learning using Dask alongside existing machine learning libraries like Scikit-Learn, XGBoost, and TensorFlow.


`Blaze <http://blaze.pydata.org/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Blaze provides a standard API for doing computations with various
in-memory and on-disk backends: NumPy, Pandas, SQLAlchemy, MongoDB, PyTables,
PySpark.

`Odo <http://odo.pydata.org>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -255,6 +274,26 @@ PyTables, h5py, and pymongo to move data between non pandas formats. Its graph
based approach is also extensible by end users for custom formats that may be
too specific for the core of odo.

`Ray <https://ray.readthedocs.io/en/latest/pandas_on_ray.html>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only a modification of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas.

.. code:: python

# import pandas as pd
import ray.dataframe as pd


`Vaex <https://docs.vaex.io/>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. Vaex is a python library for Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10\ :sup:`9`) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

* vaex.from_pandas
* vaex.to_pandas_df


.. _ecosystem.data_validation:

Data validation
Expand Down