diff --git a/web/pandas/community/blog/2019-user-survey.md b/web/pandas/community/blog/2019-user-survey.md new file mode 100644 index 0000000000000..73c426e7cbec9 --- /dev/null +++ b/web/pandas/community/blog/2019-user-survey.md @@ -0,0 +1,172 @@ +Title: 2019 pandas user survey +Date: 2019-08-22 + + + +# 2019 pandas user survey + +Pandas recently conducted a user survey to help guide future development. +Thanks to everyone who participated! This post presents the high-level results. + +This analysis and the raw data can be found [on GitHub](https://github.com/pandas-dev/pandas-user-surveys) and run on Binder + +[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/pandas-dev/pandas-user-surveys/master?filepath=2019.ipynb) + + +We had about 1250 responses over the 15 days we ran the survey in the summer of 2019. + +## About the Respondents + +There was a fair amount of representation across pandas experience and frequency of use, though the majority of respondents are on the more experienced side. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_4_0.png) + + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_5_0.png) + + +We included a few questions that were also asked in the [Python Developers Survey](https://www.jetbrains.com/research/python-developers-survey-2018/) so we could compare Pandas' population to Python's. + +90% of our respondents use Python as a primary language (compared with 84% from the PSF survey). + + + + + + Yes 90.67% + No 9.33% + Name: Is Python your main language?, dtype: object + + + +Windows users are well represented (see [Steve Dower's talk](https://www.youtube.com/watch?v=uoI57uMdDD4) on this topic). + + + + + + Linux 61.57% + Windows 60.21% + MacOS 42.75% + Name: What Operating Systems do you use?, dtype: object + + + +For environment isolation, [conda](https://conda.io/en/latest/) was the most popular. 
+ + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_13_0.png) + + +Most respondents are Python 3 only. + + + + + + 3 92.39% + 2 & 3 6.80% + 2 0.81% + Name: Python 2 or 3?, dtype: object + + + +## Pandas APIs + +It can be hard for open source projects to know what features are actually used. We asked a few questions to get an idea. + +CSV and Excel are (for better or worse) the most popular formats. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_18_0.png) + + +In preparation for a possible refactor of pandas internals, we wanted to get a sense for +how common wide (100s of columns or more) DataFrames are. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_20_0.png) + + +Pandas is slowly growing new extension types. Categoricals are the most popular, +and the nullable integer type is already almost as popular as datetime with timezone. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_22_0.png) + + +More and better examples seem to be a high-priority development item. +Pandas recently received a NumFOCUS grant to improve our documentation, +which we're using to write tutorial-style documentation, which should help +meet this need. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_24_0.png) + + +We also asked about specific, commonly-requested features. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_26_0.png) + + +Of these, the clear standout is "scaling" to large datasets. A couple of observations: + +1. Perhaps pandas' documentation should do a better job of promoting libraries that provide scalable dataframes (like [Dask](https://dask.org), [vaex](https://vaex.io), and [modin](https://modin.readthedocs.io/en/latest/)) +2. Memory efficiency (perhaps from a native string data type, fewer internal copies, etc.) is a valuable goal. + +After that, the next-most critical improvement is integer missing values. 
Those were actually added in [Pandas 0.24](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#optional-integer-na-support), but they're not the default, and there are still some incompatibilities with the rest of the pandas API. + +Pandas is a less conservative library than, say, NumPy. We're approaching 1.0, but on the way we've made many deprecations and some outright API breaking changes. Fortunately, most people are OK with the tradeoff. + + + + + + Yes 94.89% + No 5.11% + Name: Is Pandas stable enough for you?, dtype: object + + + +There's a perception (which is shared by many of the pandas maintainers) that the pandas API is too large. To measure that, we asked whether users thought that pandas' API was too large, too small, or just right. + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_31_0.png) + + +Finally, we asked for an overall satisfaction with the library, from 1 (very unsatisfied) to 5 (very satisfied). + + + +![png]({{ base_url }}/static/img/blog/2019-user-survey/2019_33_0.png) + + +Most people are very satisfied. The average response is 4.39. I look forward to tracking this number over time. + +If you're analyzing the raw data, be sure to share the results with us [@pandas_dev](https://twitter.com/pandas_dev). diff --git a/web/pandas/community/blog/extension-arrays.md b/web/pandas/community/blog/extension-arrays.md new file mode 100644 index 0000000000000..bc6179adfa719 --- /dev/null +++ b/web/pandas/community/blog/extension-arrays.md @@ -0,0 +1,218 @@ +Title: pandas extension arrays +Date: 2019-01-04 + +# pandas extension arrays + +Extensibility was a major theme in pandas development over the last couple of +releases. This post introduces the pandas extension array interface: the +motivation behind it and how it might affect you as a pandas user. Finally, we +look at how extension arrays may shape the future of pandas. + +Extension Arrays are just one of the changes in pandas 0.24.0. 
See the +[whatsnew][whatsnew] for a full changelog. + +## The Motivation + +Pandas is built on top of NumPy. You could roughly define a Series as a wrapper +around a NumPy array, and a DataFrame as a collection of Series with a shared +index. That's not entirely correct for several reasons, but I want to focus on +the "wrapper around a NumPy array" part. It'd be more correct to say "wrapper +around an array-like object". + +Pandas mostly uses NumPy's builtin data representation; we've restricted it in +places and extended it in others. For example, pandas' early users cared greatly +about timezone-aware datetimes, which NumPy doesn't support. So pandas +internally defined a `DatetimeTZ` dtype (which mimics a NumPy dtype), and +allowed you to use that dtype in `Index`, `Series`, and as a column in a +`DataFrame`. That dtype carried around the tzinfo, but wasn't itself a valid +NumPy dtype. + +As another example, consider `Categorical`. This actually composes *two* arrays: +one for the `categories` and one for the `codes`. But it can be stored in a +`DataFrame` like any other column. + +Each of these extension types pandas added is useful on its own, but carries a +high maintenance cost. Large sections of the codebase need to be aware of how to +handle a NumPy array or one of these other kinds of special arrays. This made +adding new extension types to pandas very difficult. + +Anaconda, Inc. had a client who regularly dealt with datasets with IP addresses. +They wondered if it made sense to add an [IPArray][IPArray] to pandas. In the +end, we didn't think it passed the cost-benefit test for inclusion in pandas +*itself*, but we were interested in defining an interface for third-party +extensions to pandas. Any object implementing this interface would be allowed in +pandas. I was able to write [cyberpandas][cyberpandas] outside of pandas, but it +feels like using any other dtype built into pandas. 
+ +## The Current State + +As of pandas 0.24.0, all of pandas' internal extension arrays (Categorical, +Datetime with Timezone, Period, Interval, and Sparse) are now built on top of +the ExtensionArray interface. Users shouldn't notice many changes. The main +thing you'll notice is that things are cast to `object` dtype in fewer places, +meaning your code will run faster and your types will be more stable. This +includes storing `Period` and `Interval` data in `Series` (which were previously +cast to object dtype). + +Additionally, we'll be able to add *new* extension arrays with relative ease. +For example, 0.24.0 (optionally) solved one of pandas' longest-standing pain +points: missing values casting integer-dtype values to float. + + +```python +>>> int_ser = pd.Series([1, 2], index=[0, 2]) +>>> int_ser +0 1 +2 2 +dtype: int64 + +>>> int_ser.reindex([0, 1, 2]) +0 1.0 +1 NaN +2 2.0 +dtype: float64 +``` + +With the new [IntegerArray][IntegerArray] and nullable integer dtypes, we can +natively represent integer data with missing values. + +```python +>>> int_ser = pd.Series([1, 2], index=[0, 2], dtype=pd.Int64Dtype()) +>>> int_ser +0 1 +2 2 +dtype: Int64 + +>>> int_ser.reindex([0, 1, 2]) +0 1 +1 NaN +2 2 +dtype: Int64 +``` + +One thing it does slightly change is how you should access the raw (unlabeled) +arrays stored inside a Series or Index, which is occasionally useful. Perhaps +the method you're calling only works with NumPy arrays, or perhaps you want to +disable automatic alignment. + +In the past, you'd hear things like "Use `.values` to extract the NumPy array +from a Series or DataFrame." If it were a good resource, they'd tell you that's +not *entirely* true, since there are some exceptions. I'd like to delve into +those exceptions. + +The fundamental problem with `.values` is that it serves two purposes: + +1. Extracting the array backing a Series, Index, or DataFrame +2. 
Converting the Series, Index, or DataFrame to a NumPy array + +As we saw above, the "array" backing a Series or Index might not be a NumPy +array; it may instead be an extension array (from pandas or a third-party +library). For example, consider `Categorical`, + +```python +>>> cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']) +>>> ser = pd.Series(cat) +>>> ser +0 a +1 b +2 a +dtype: category +Categories (3, object): [a, b, c] + +>>> ser.values +[a, b, a] +Categories (3, object): [a, b, c] +``` + +In this case `.values` is a Categorical, not a NumPy array. For period-dtype +data, `.values` returns a NumPy array of `Period` objects, which is expensive to +create. For timezone-aware data, `.values` converts to UTC and *drops* the +timezone info. These kinds of surprises (different types, or expensive or lossy +conversions) stem from trying to shoehorn these extension arrays into a NumPy +array. But the entire point of an extension array is for representing data NumPy +*can't* natively represent. + +To solve the `.values` problem, we've split its roles into two dedicated methods: + +1. Use `.array` to get a zero-copy reference to the underlying data +2. Use `.to_numpy()` to get a (potentially expensive, lossy) NumPy array of the + data. + +So with our Categorical example, + +```python +>>> ser.array +[a, b, a] +Categories (3, object): [a, b, c] + +>>> ser.to_numpy() +array(['a', 'b', 'a'], dtype=object) +``` + +To summarize: + +- `.array` will *always* be an ExtensionArray, and is always a zero-copy + reference back to the data. +- `.to_numpy()` is *always* a NumPy array, so you can reliably call + ndarray-specific methods on it. + +You shouldn't ever need `.values` anymore. + +## Possible Future Paths + +Extension Arrays open up quite a few exciting opportunities. Currently, pandas +represents string data using Python objects in a NumPy array, which is slow. 
+Libraries like [Apache Arrow][arrow] provide native support for variable-length +strings, and the [Fletcher][fletcher] library provides pandas extension arrays +for Arrow arrays. It will allow [GeoPandas][geopandas] to store geometry data +more efficiently. Pandas (or third-party libraries) will be able to support +nested data, data with units, geo data, GPU arrays. Keep an eye on the +[pandas ecosystem][eco] page, which will keep track of third-party extension +arrays. It's an exciting time for pandas development. + +## Other Thoughts + +I'd like to emphasize that this is an *interface*, and not a concrete array +implementation. We are *not* reimplementing NumPy here in pandas. Rather, this +is a way to take any array-like data structure (one or more NumPy arrays, an +Apache Arrow array, a CuPy array) and place it inside a DataFrame. I think +getting pandas out of the array business, and instead thinking about +higher-level tabular data things, is a healthy development for the project. + +This works perfectly with NumPy's [`__array_ufunc__`][ufunc] protocol and +[NEP-18][nep18]. You'll be able to use the familiar NumPy API on objects that +aren't backed by NumPy memory. + +## Upgrade + +These new goodies are all available in the recently released pandas 0.24. + +conda: + + conda install -c conda-forge pandas + +pip: + + pip install --upgrade pandas + +As always, we're happy to hear feedback on the [mailing list][ml], +[@pandas-dev][twitter], or [issue tracker][tracker]. + +Thanks to the many contributors, maintainers, and [institutional +partners][partners] involved in the pandas community. 
+ + +[IPArray]: https://github.com/pandas-dev/pandas/issues/18767 +[cyberpandas]: https://cyberpandas.readthedocs.io +[IntegerArray]: http://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.arrays.IntegerArray.html +[fletcher]: https://github.com/xhochy/fletcher +[arrow]: https://arrow.apache.org +[ufunc]: https://docs.scipy.org/doc/numpy-1.13.0/neps/ufunc-overrides.html +[nep18]: https://www.numpy.org/neps/nep-0018-array-function-protocol.html +[ml]: https://mail.python.org/mailman/listinfo/pandas-dev +[twitter]: https://twitter.com/pandas_dev +[tracker]: https://github.com/pandas-dev/pandas/issues +[partners]: https://github.com/pandas-dev/pandas-governance/blob/master/people.md +[eco]: http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#extension-data-types +[whatsnew]: http://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html +[geopandas]: https://github.com/geopandas/geopandas diff --git a/web/pandas/community/blog.html b/web/pandas/community/blog/index.html similarity index 100% rename from web/pandas/community/blog.html rename to web/pandas/community/blog/index.html diff --git a/web/pandas/community/blog/pandas-1.0.md b/web/pandas/community/blog/pandas-1.0.md new file mode 100644 index 0000000000000..b07c34a4ab6b5 --- /dev/null +++ b/web/pandas/community/blog/pandas-1.0.md @@ -0,0 +1,31 @@ +Title: pandas 1.0 +Date: 2020-01-29 + +# pandas 1.0 + +Today pandas celebrates its 1.0.0 release. In many ways this is just a normal release with a host of new features, performance improvements, and bug fixes, which are documented in our [release notes](https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html). But it’s also something a bit more — a milestone for the project beyond just the commits. We wanted to take some time to reflect on where we've been and where we're going. + +## Reflections + +The world of scientific Python has changed a lot since pandas was started. 
In 2011, [the ecosystem was fragmented](https://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/): a standard *rich* data structure for statistics and data science had yet to emerge. This echoes a similar story for NumPy, which consolidated array efforts that were [previously fragmented](https://numpy.org/old_array_packages.html). + +Over the subsequent years, pandas emerged as a *de facto* standard. It’s used by data scientists and analysts and as a data structure for other libraries to build on top of. StackOverflow [cited pandas](https://stackoverflow.blog/2017/09/14/python-growing-quickly/) as one of the reasons for Python being the fastest growing major programming language. + +![Growth of pandas](https://149351115.v2.pressablecdn.com/wp-content/uploads/2017/09/related_tags_over_time-1-1000x1000.png) + +Today, the ecosystem is in another phase of exploration. +Several new DataFrame implementations are cropping up to fill needs not met by pandas. +We're [working with those projects](https://datapythonista.me/blog/dataframe-summit-at-euroscipy.html) to establish shared standards and semantics for rich data structures. + +## Community and Project Health + +This release cycle is the first to involve any kind of grant funding for pandas. [Pandas received funding](https://chanzuckerberg.com/eoss/proposals/) as part of the CZI’s [*Essential Open Source Software for Science*](https://medium.com/@cziscience/the-invisible-foundations-of-biomedicine-4ab7f8d4f5dd) [program](https://medium.com/@cziscience/the-invisible-foundations-of-biomedicine-4ab7f8d4f5dd). The pandas project relies overwhelmingly on volunteer contributors. These volunteer contributions are shepherded and augmented by some maintainers who are given time from their employers — our [institutional partners](https://github.com/pandas-dev/pandas-governance/blob/master/people.md#institutional-partners). 
The largest work item in our grant award was library maintenance, which specifically includes working with community members to address our large backlog of open issues and pull requests. + +While a “1.0.0” version might seem arbitrary or anti-climactic (given that pandas as a codebase is nearly 12 years old), we see it as a symbolic milestone celebrating the growth of our core developer team and depth of our contributor base. Few open source projects are ever truly “done” and pandas is no different. We recognize the essential role that pandas now occupies, and we intend to continue to evolve the project and adapt to the needs of the world’s data wranglers. + +## Going Forward + +Our [roadmap](https://pandas.pydata.org/pandas-docs/version/1.0.0/development/roadmap.html) contains an up-to-date listing of where we see the project heading over the next few years. +Needless to say, there's still plenty to do. + +Check out the [release notes](https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html) and visit the [installation page](https://pandas.pydata.org/pandas-docs/version/1.0.0/getting_started/install.html) for instructions on updating to pandas 1.0. 
diff --git a/web/pandas/config.yml b/web/pandas/config.yml index d943ad3833b52..23575cc123050 100644 --- a/web/pandas/config.yml +++ b/web/pandas/config.yml @@ -15,6 +15,7 @@ main: - toc - tables - fenced_code + - meta static: logo: /static/img/pandas_white.svg css: @@ -23,7 +24,7 @@ navbar: - name: "About us" target: - name: "About pandas" - target: /about/index.html + target: /about/ - name: "Project roadmap" target: /about/roadmap.html - name: "Team" @@ -39,7 +40,7 @@ navbar: - name: "Community" target: - name: "Blog" - target: /community/blog.html + target: /community/blog/ - name: "Ask a question (StackOverflow)" target: https://stackoverflow.com/questions/tagged/pandas - name: "Code of conduct" @@ -49,9 +50,11 @@ navbar: - name: "Contribute" target: /contribute.html blog: - num_posts: 8 + num_posts: 50 + posts_path: community/blog + author: "pandas team" + feed_name: "pandas blog" feed: - - https://dev.pandas.io/pandas-blog/feeds/all.atom.xml - https://wesmckinney.com/feeds/pandas.atom.xml - https://tomaugspurger.github.io/feed - https://jorisvandenbossche.github.io/feeds/pandas.atom.xml diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_13_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_13_0.png new file mode 100644 index 0000000000000..9ce2ff483f2c2 Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_13_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_18_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_18_0.png new file mode 100644 index 0000000000000..63b2c93b0573d Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_18_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_20_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_20_0.png new file mode 100644 index 0000000000000..1c7abb0434dad Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_20_0.png differ diff --git 
a/web/pandas/static/img/blog/2019-user-survey/2019_22_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_22_0.png new file mode 100644 index 0000000000000..5ef3d69b48700 Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_22_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_24_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_24_0.png new file mode 100644 index 0000000000000..1a15be05af92d Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_24_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_26_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_26_0.png new file mode 100644 index 0000000000000..4f8d9f2c439ae Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_26_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_31_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_31_0.png new file mode 100644 index 0000000000000..6c8b5f1108f79 Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_31_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_33_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_33_0.png new file mode 100644 index 0000000000000..fd490d3e7255a Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_33_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_4_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_4_0.png new file mode 100644 index 0000000000000..5276ed359badb Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_4_0.png differ diff --git a/web/pandas/static/img/blog/2019-user-survey/2019_5_0.png b/web/pandas/static/img/blog/2019-user-survey/2019_5_0.png new file mode 100644 index 0000000000000..a252e1c9b3503 Binary files /dev/null and b/web/pandas/static/img/blog/2019-user-survey/2019_5_0.png differ diff --git a/web/pandas_web.py 
b/web/pandas_web.py index 38ab78f5690e7..e62deaa8cdc7f 100755 --- a/web/pandas_web.py +++ b/web/pandas_web.py @@ -78,6 +78,47 @@ def blog_add_posts(context): """ tag_expr = re.compile("<.*?>") posts = [] + # posts from the file system + if context["blog"]["posts_path"]: + posts_path = os.path.join( + context["source_path"], *context["blog"]["posts_path"].split("/") + ) + for fname in os.listdir(posts_path): + if fname.startswith("index."): + continue + link = ( + f"/{context['blog']['posts_path']}" + f"/{os.path.splitext(fname)[0]}.html" + ) + md = markdown.Markdown( + extensions=context["main"]["markdown_extensions"] + ) + with open(os.path.join(posts_path, fname)) as f: + html = md.convert(f.read()) + title = md.Meta["title"][0] + summary = re.sub(tag_expr, "", html) + try: + body_position = summary.index(title) + len(title) + except ValueError: + raise ValueError( + f'Blog post "{fname}" should have a markdown header ' + f'corresponding to its "Title" element "{title}"' + ) + summary = " ".join(summary[body_position:].split(" ")[:30]) + posts.append( + { + "title": title, + "author": context["blog"]["author"], + "published": datetime.datetime.strptime( + md.Meta["date"][0], "%Y-%m-%d" + ), + "feed": context["blog"]["feed_name"], + "link": link, + "description": summary, + "summary": summary, + } + ) + # posts from rss feeds for feed_url in context["blog"]["feed"]: feed_data = feedparser.parse(feed_url) for entry in feed_data.entries: @@ -180,6 +221,7 @@ def get_context(config_fname: str, ignore_io_errors: bool, **kwargs): with open(config_fname) as f: context = yaml.safe_load(f) + context["source_path"] = os.path.dirname(config_fname) context["ignore_io_errors"] = ignore_io_errors context.update(kwargs)
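Reviewer note: the summary extraction that this change adds to `blog_add_posts` can be tried in isolation. The sketch below is a minimal stand-alone approximation, not the code from the diff: the helper name `summarize`, its `num_words` parameter, and the sample HTML are hypothetical, and it takes the title as an argument instead of reading it from `md.Meta`. It mirrors the same three steps: strip tags with the non-greedy regex, locate the title in the stripped text, and keep the first 30 space-separated tokens that follow.

```python
import re

# Same non-greedy tag pattern that blog_add_posts compiles as tag_expr.
TAG_EXPR = re.compile("<.*?>")


def summarize(html: str, title: str, num_words: int = 30) -> str:
    """Hypothetical helper mirroring the summary logic in the diff."""
    # Drop anything that looks like an HTML tag.
    text = re.sub(TAG_EXPR, "", html)
    try:
        # The summary starts right after the first occurrence of the title.
        body_position = text.index(title) + len(title)
    except ValueError:
        raise ValueError(
            f'Post should have a markdown header matching its "Title" element "{title}"'
        )
    # Splitting on a single space keeps newlines inside tokens, as in the diff.
    return " ".join(text[body_position:].split(" ")[:num_words])


# Made-up rendered HTML standing in for md.convert(...) output.
sample_html = (
    "<h1>pandas 1.0</h1>\n"
    "<p>Today pandas celebrates its <em>1.0.0</em> release.</p>"
)
print(summarize(sample_html, "pandas 1.0"))
```

Because the text is split on a single space rather than on arbitrary whitespace, the newline between the header and the first paragraph stays in the summary. That is harmless when the summary is rendered as HTML, but worth knowing if it is reused elsewhere.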