From 7a40073918a219f1e329e5b2895a794a7c6b21a0 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Fri, 26 May 2023 14:18:37 +0100 Subject: [PATCH 1/4] draft standard rfc --- content/blog/dataframe_standard_RFC.md | 146 +++++++++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 content/blog/dataframe_standard_RFC.md diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md new file mode 100644 index 0000000..6e43bc9 --- /dev/null +++ b/content/blog/dataframe_standard_RFC.md @@ -0,0 +1,146 @@ ++++ +date = "2023-05-25" +author = "Marco Gorelli" +title = "Want to super-charge your library by writing DataFrame-agnostic code? We'd love to hear from you" +tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"] +categories = ["Consortium", "Standardization"] +description = "An RFC for a DataFrame API Standard" +draft = true +weight = 40 ++++ + +
+<!-- figure: standard-compliant DataFrame -->
+
+Tired of getting lost in if-then statements when dealing with API differences
+between DataFrame libraries? Would you like to be able to write your code
+once, have it work with all major DataFrame libraries, and be done?
+Let's learn about an initiative which will enable you to write
+cross-DataFrame code - no special-casing nor data conversions required!
+
+## Why would I want this anyway?
+
+Say you want to write a function which selects rows of a DataFrame based
+on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given
+column, and you want it to work with any DataFrame library. How might
+you write that?
+
+### Solution 1
+
+Here's a typical solution:
+```python
+def remove_outliers(df: object, column: str):
+    if isinstance(df, pandas.DataFrame):
+        z_score = (df[column] - df[column].mean())/df[column].std()
+        return df[z_score.between(-3, 3)]
+    if isinstance(df, polars.DataFrame):
+        z_score = ((polars.col(column) - polars.col(column).mean()) / polars.col(column).std())
+        return df.filter(z_score.is_between(-3, 3))
+    if isinstance(df, some_other_library.DataFrame):
+        ...
+```
+This quickly gets unwieldy. Libraries like `cudf` and `modin` _might_ work
+in the `isinstance(df, pandas.DataFrame)` arm, but there's no guarantee -
+their APIs are similar, but subtly different. Furthermore, as new libraries
+come out, you'd have to keep updating your function to add new `if` statements.
+
+Can we do better?
+
+### Solution 2: Interchange Protocol
+
+An alternative, which wouldn't involve special-casing, could be to
+leverage the [DataFrame interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html):
+```python
+def remove_outliers(df: object, column: str) -> pd.DataFrame:
+    df_pd = pd.api.interchange.from_dataframe(df)
+    z_score = (df_pd[column] - df_pd[column].mean())/df_pd[column].std()
+    return df_pd[z_score.between(-3, 3)]
+```
+We got out of having to write if-then statements (🥳), but there are still a
+couple of issues:
+1. we had to convert to pandas: this might be expensive if your data was
+   originally stored on GPU;
+2. the return value is a `pandas.DataFrame`, rather than an object of your
+   original DataFrame library.
+
+Can we do better? Can we really have it all?
+
+### Solution 3: Introducing the DataFrame Standard
+
+Yes, we really can. To write cross-DataFrame code, we'll take these steps:
+1. enable the Standard using ``.__dataframe_standard__``. This will return
+   a Standard-compliant DataFrame;
+2. write your code, using the [DataFrame Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
+3. (optional) return a DataFrame from your original library by calling `.dataframe`.
+
+Let's see how this would look for our ``remove_outliers`` example function:
+```python
+def remove_outliers(df, column):
+    # Get a Standard-compliant DataFrame.
+    # NOTE: this has not yet been upstreamed, so won't work out-of-the-box!
+    # See 'resources' below for how to try it out.
+    df_standard = df.__dataframe_standard__()
+    # Use methods from the Standard specification.
+    col = df_standard.get_column_by_name(column)
+    z_score = (col - col.mean()) / col.std()
+    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
+    # Return the result as a DataFrame from the original library.
+    return df_standard_filtered.dataframe
+```
+This will work, as if by magic, on any DataFrame with a Standard-compliant implementation.
+But it's not magic, of course, it's the power of standardisation!
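+
+To make this concrete, here's a sketch of how you might call ``remove_outliers``
+once an implementation is available. This is hypothetical for now - as noted in
+the snippet above, ``__dataframe_standard__`` hasn't been upstreamed yet, so you'd
+need the proof-of-concept implementation from the 'Resources' section to run it:
+```python
+import pandas as pd
+import polars as pl
+
+data = {"a": [1.5, -0.3, 0.7, 2.1, 0.2], "b": [1, 2, 3, 4, 5]}
+
+# The exact same function, with no special-casing on the DataFrame library:
+pd_result = remove_outliers(pd.DataFrame(data), column="a")  # a pandas.DataFrame
+pl_result = remove_outliers(pl.DataFrame(data), column="a")  # a polars.DataFrame
+```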
+
+## Standard Philosophy - will all DataFrame libraries have the same API one day?
+
+Let's start with what this isn't: the Standard isn't an attempt to force all DataFrame
+libraries to have the same API. It also isn't a way to convert
+between DataFrames: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
+whose adoption is increasing, already does that. It also doesn't aim to standardise
+domain or industry specific functionality.
+
+Rather, it is a minimal set of essential DataFrame functionality which will work
+the same way across libraries. It will behave in a strict and predictable manner
+across DataFrame libraries. Library authours trying to write DataFrame-agnostic
+code are expected to greatly benefit from this, as are their users.
+
+## Who's this for? Do I need to learn yet another API?
+
+If you're a casual user, then probably not.
+The DataFrame Standard is currently mainly targeted towards library developers,
+who wish to support multiple DataFrame libraries. Users of non-pandas DataFrame
+would then be able to seamlessly use the DataFrame tools (e.g. visualisation,
+feature engineering, data cleaning) without having to do any expensive data
+conversions.
+
+If you're a library authour, then we'd love to hear from you. Would this be
+useful to you? We expect it to, as the demand for DataFrame-agnostic tools
+certainly seems to be there:
+- https://github.com/mwaskom/seaborn/issues/3277,
+- https://github.com/scikit-learn/scikit-learn/issues/25896
+- https://github.com/plotly/plotly.py/issues/3637
+- (many, many more...)
+
+## Are we there yet? What lies ahead?
+
+No, not yet. This is just a first draft, and a request for comments.
+
+Future plans include:
+- increasing the scope of the Standard (currently, the spec is very minimal);
+- creating implementations of the Standard for several major DataFrame libraries;
+- creating a cross-DataFrame test-suite;
+- aiming to ensure each major DataFrame library has a `__dataframe_standard__` method.
+
+## Conclusion
+
+We've introduced the DataFrame Standard, which allows you to write cross-DataFrame code.
+We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw
+what plans lie ahead - the Standard is in active development, so please watch this space!
+
+## Resources
+
+- Read more on the [official website](https://data-apis.org/dataframe-api/)
+- Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)!

From 081e122277c1c7ca52ebd817450cc8236b348b59 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Fri, 2 Jun 2023 14:33:51 +0100
Subject: [PATCH 2/4] dataframe, authours typo

---
 content/blog/dataframe_standard_RFC.md | 62 +++++++++++++-------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md
index 6e43bc9..86d37da 100644
--- a/content/blog/dataframe_standard_RFC.md
+++ b/content/blog/dataframe_standard_RFC.md
@@ -1,10 +1,10 @@
 +++
 date = "2023-05-25"
 author = "Marco Gorelli"
-title = "Want to super-charge your library by writing DataFrame-agnostic code? We'd love to hear from you"
+title = "Want to super-charge your library by writing dataframe-agnostic code? We'd love to hear from you"
 tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"]
 categories = ["Consortium", "Standardization"]
-description = "An RFC for a DataFrame API Standard"
+description = "An RFC for a dataframe API Standard"
 draft = true
 weight = 40
 +++
@@ -12,21 +12,21 @@ weight = 40
 <!-- figure: standard-compliant DataFrame -->
 Tired of getting lost in if-then statements when dealing with API differences
-between DataFrame libraries? Would you like to be able to write your code
-once, have it work with all major DataFrame libraries, and be done?
+between dataframe libraries? Would you like to be able to write your code
+once, have it work with all major dataframe libraries, and be done?
 Let's learn about an initiative which will enable you to write
-cross-DataFrame code - no special-casing nor data conversions required!
+cross-dataframe code - no special-casing nor data conversions required!
 
 ## Why would I want this anyway?
 
-Say you want to write a function which selects rows of a DataFrame based
+Say you want to write a function which selects rows of a dataframe based
 on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given
-column, and you want it to work with any DataFrame library. How might
+column, and you want it to work with any dataframe library. How might
 you write that?
 
 ### Solution 1
@@ -65,22 +65,22 @@ couple of issues:
 1. we had to convert to pandas: this might be expensive if your data was
    originally stored on GPU;
 2. the return value is a `pandas.DataFrame`, rather than an object of your
-   original DataFrame library.
+   original dataframe library.
 
 Can we do better? Can we really have it all?
 
-### Solution 3: Introducing the DataFrame Standard
+### Solution 3: Introducing the Dataframe Standard
 
-Yes, we really can. To write cross-DataFrame code, we'll take these steps:
+Yes, we really can. To write cross-dataframe code, we'll take these steps:
 1. enable the Standard using ``.__dataframe_standard__``. This will return
-   a Standard-compliant DataFrame;
-2. write your code, using the [DataFrame Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
-3. (optional) return a DataFrame from your original library by calling `.dataframe`.
+   a Standard-compliant dataframe;
+2. write your code, using the [Dataframe Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
+3. (optional) return a dataframe from your original library by calling `.dataframe`.
 
 Let's see how this would look for our ``remove_outliers`` example function:
 ```python
 def remove_outliers(df, column):
-    # Get a Standard-compliant DataFrame.
+    # Get a Standard-compliant dataframe.
     # NOTE: this has not yet been upstreamed, so won't work out-of-the-box!
     # See 'resources' below for how to try it out.
     df_standard = df.__dataframe_standard__()
     # Use methods from the Standard specification.
@@ -88,36 +88,36 @@ def remove_outliers(df, column):
     col = df_standard.get_column_by_name(column)
     z_score = (col - col.mean()) / col.std()
     df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
-    # Return the result as a DataFrame from the original library.
+    # Return the result as a dataframe from the original library.
     return df_standard_filtered.dataframe
 ```
-This will work, as if by magic, on any DataFrame with a Standard-compliant implementation.
+This will work, as if by magic, on any dataframe with a Standard-compliant implementation.
 But it's not magic, of course, it's the power of standardisation!
 
-## Standard Philosophy - will all DataFrame libraries have the same API one day?
+## Standard Philosophy - will all dataframe libraries have the same API one day?
 
-Let's start with what this isn't: the Standard isn't an attempt to force all DataFrame
+Let's start with what this isn't: the Standard isn't an attempt to force all dataframe
 libraries to have the same API. It also isn't a way to convert
-between DataFrames: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
+between dataframes: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
 whose adoption is increasing, already does that. It also doesn't aim to standardise
 domain or industry specific functionality.
 
-Rather, it is a minimal set of essential DataFrame functionality which will work
+Rather, it is a minimal set of essential dataframe functionality which will work
 the same way across libraries. It will behave in a strict and predictable manner
-across DataFrame libraries. Library authours trying to write DataFrame-agnostic
+across dataframe libraries. Library authors trying to write dataframe-agnostic
 code are expected to greatly benefit from this, as are their users.
 
 ## Who's this for? Do I need to learn yet another API?
 
 If you're a casual user, then probably not.
-The DataFrame Standard is currently mainly targeted towards library developers,
-who wish to support multiple DataFrame libraries. Users of non-pandas DataFrame
-would then be able to seamlessly use the DataFrame tools (e.g. visualisation,
+The Dataframe Standard is currently mainly targeted towards library developers,
+who wish to support multiple dataframe libraries. Users of non-pandas dataframe
+would then be able to seamlessly use the dataframe tools (e.g. visualisation,
 feature engineering, data cleaning) without having to do any expensive data
 conversions.
 
-If you're a library authour, then we'd love to hear from you. Would this be
-useful to you? We expect it to, as the demand for DataFrame-agnostic tools
+If you're a library author, then we'd love to hear from you. Would this be
+useful to you? We expect it to, as the demand for dataframe-agnostic tools
 certainly seems to be there:
 - https://github.com/mwaskom/seaborn/issues/3277,
 - https://github.com/scikit-learn/scikit-learn/issues/25896
 - https://github.com/plotly/plotly.py/issues/3637
@@ -130,13 +130,13 @@ No, not yet. This is just a first draft, and a request for comments.
 
 Future plans include:
 - increasing the scope of the Standard (currently, the spec is very minimal);
-- creating implementations of the Standard for several major DataFrame libraries;
-- creating a cross-DataFrame test-suite;
-- aiming to ensure each major DataFrame library has a `__dataframe_standard__` method.
+- creating implementations of the Standard for several major dataframe libraries;
+- creating a cross-dataframe test-suite;
+- aiming to ensure each major dataframe library has a `__dataframe_standard__` method.
 
 ## Conclusion
 
-We've introduced the DataFrame Standard, which allows you to write cross-DataFrame code.
+We've introduced the Dataframe Standard, which allows you to write cross-dataframe code.
 We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw
 what plans lie ahead - the Standard is in active development, so please watch this space!
 
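
As an aside on the philosophy section above: the conversion which the Standard deliberately
leaves out is already covered by the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html)
today. A minimal sketch, assuming reasonably recent pandas (>= 1.5) and polars releases,
both of which implement the protocol:

```python
import pandas as pd
import polars as pl

df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Any object exposing `__dataframe__` can be converted this way -
# no special-casing on the concrete dataframe type required.
df_pd = pd.api.interchange.from_dataframe(df_pl)
```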
From 2c1bdbcccdf2b224b41fb7aaed92bd7bfff13115 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Fri, 2 Jun 2023 14:53:02 +0100 Subject: [PATCH 3/4] update as per feedback --- content/blog/dataframe_standard_RFC.md | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md index 86d37da..bfe506d 100644 --- a/content/blog/dataframe_standard_RFC.md +++ b/content/blog/dataframe_standard_RFC.md @@ -5,7 +5,7 @@ title = "Want to super-charge your library by writing dataframe-agnostic code? W tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"] categories = ["Consortium", "Standardization"] description = "An RFC for a dataframe API Standard" -draft = true +draft = false weight = 40 +++ @@ -94,7 +94,7 @@ def remove_outliers(df, column): This will work, as if by magic, on any dataframe with a Standard-compliant implementation. But it's not magic, of course, it's the power of standardisation! -## Standard Philosophy - will all dataframe libraries have the same API one day? +## The Standard's philosophy - will all dataframe libraries have the same API one day? Let's start with what this isn't: the Standard isn't an attempt to force all dataframe libraries to have the same API. It also isn't a way to convert @@ -112,12 +112,12 @@ code are expected to greatly benefit from this, as are their users. If you're a casual user, then probably not. The Dataframe Standard is currently mainly targeted towards library developers, who wish to support multiple dataframe libraries. Users of non-pandas dataframe -would then be able to seamlessly use the dataframe tools (e.g. visualisation, -feature engineering, data cleaning) without having to do any expensive data -conversions. +libraries would then be able to seamlessly use the Python packages which +provide functionality for dataframes (e.g. visualisation, feature engineering, +data cleaning) without having to do any expensive data conversions. If you're a library author, then we'd love to hear from you. Would this be -useful to you? We expect it to, as the demand for dataframe-agnostic tools +useful to you? We expect it to be, as the demand for dataframe-agnostic tools certainly seems to be there: - https://github.com/mwaskom/seaborn/issues/3277, - https://github.com/scikit-learn/scikit-learn/issues/25896 @@ -126,11 +126,16 @@ certainly seems to be there: ## Are we there yet? What lies ahead? -No, not yet. This is just a first draft, and a request for comments. +This is just a first draft, based on design discussions between authors from various +dataframe libraries, and a request for comments (RFC). Our goal is to solicit input +from a wider range of potential stakeholders, and evolve the Standard throughout +the rest of 2023, resulting in a first official release towards the end of the year. 
Future plans include: -- increasing the scope of the Standard (currently, the spec is very minimal); -- creating implementations of the Standard for several major dataframe libraries; +- increasing the scope of the Standard based on real-world code from widely used + packages (currently, the spec is very minimal); +- creating implementations of the Standard for several major dataframe libraries + (initially available as a separate ``dataframe-api-compat`` package); - creating a cross-dataframe test-suite; - aiming to ensure each major dataframe library has a `__dataframe_standard__` method. @@ -142,5 +147,6 @@ what plans lie ahead - the Standard is in active development, so please watch th ## Resources -- Read more on the [official website](https://data-apis.org/dataframe-api/) +- Read more and contribute to the discussion on the + [official website](https://data-apis.org/dataframe-api/) - Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)! From 34232798ff571cd496137c1a4592277e5568300a Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Tue, 13 Jun 2023 11:58:31 +0100 Subject: [PATCH 4/4] reword resources comment --- content/blog/dataframe_standard_RFC.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md index bfe506d..b14675b 100644 --- a/content/blog/dataframe_standard_RFC.md +++ b/content/blog/dataframe_standard_RFC.md @@ -147,6 +147,5 @@ what plans lie ahead - the Standard is in active development, so please watch th ## Resources -- Read more and contribute to the discussion on the - [official website](https://data-apis.org/dataframe-api/) +- Read more on the [official website](https://data-apis.org/dataframe-api/), and contribute to the discussion on the [GitHub repo](https://github.com/data-apis/dataframe-api) - Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)!
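
To close with a concrete picture of where this is heading: once dataframe libraries expose
a `__dataframe_standard__` entry point (natively, or via the planned ``dataframe-api-compat``
package), a dataframe-consuming library could opt in with a small helper along these lines.
This is a hypothetical sketch - the entry point's name comes from the draft spec and may
still change:

```python
def to_standard_dataframe(df):
    # Prefer the Standard entry point if the object provides one; otherwise fail
    # loudly instead of silently special-casing on the concrete dataframe type.
    if hasattr(df, "__dataframe_standard__"):
        return df.__dataframe_standard__()
    raise TypeError(
        f"{type(df).__name__} does not expose __dataframe_standard__; "
        "see https://data-apis.org/dataframe-api/ for the draft specification"
    )
```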