From 7a40073918a219f1e329e5b2895a794a7c6b21a0 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Fri, 26 May 2023 14:18:37 +0100 Subject: [PATCH 1/4] draft standard rfc --- content/blog/dataframe_standard_RFC.md | 146 +++++++++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 content/blog/dataframe_standard_RFC.md diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md new file mode 100644 index 0000000..6e43bc9 --- /dev/null +++ b/content/blog/dataframe_standard_RFC.md @@ -0,0 +1,146 @@ ++++ +date = "2023-05-25" +author = "Marco Gorelli" +title = "Want to super-charge your library by writing DataFrame-agnostic code? We'd love to hear from you" +tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"] +categories = ["Consortium", "Standardization"] +description = "An RFC for a DataFrame API Standard" +draft = true +weight = 40 ++++ + +
+<!-- figure: standard-compliant DataFrame -->
+
+Tired of getting lost in if-then statements when dealing with API differences
+between DataFrame libraries? Would you like to be able to write your code
+once, have it work with all major DataFrame libraries, and be done?
+Let's learn about an initiative which will enable you to write
+cross-DataFrame code - no special-casing nor data conversions required!
+
+## Why would I want this anyway?
+
+Say you want to write a function which selects rows of a DataFrame based
+on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given
+column, and you want it to work with any DataFrame library. How might
+you write that?
+
+### Solution 1
+
+Here's a typical solution:
+```python
+def remove_outliers(df: object, column: str):
+    if isinstance(df, pandas.DataFrame):
+        z_score = (df[column] - df[column].mean())/df[column].std()
+        return df[z_score.between(-3, 3)]
+    if isinstance(df, polars.DataFrame):
+        z_score = ((polars.col(column) - polars.col(column).mean()) / polars.col(column).std())
+        return df.filter(z_score.is_between(-3, 3))
+    if isinstance(df, some_other_library.DataFrame):
+        ...
+```
+This quickly gets unwieldy. Libraries like `cudf` and `modin` _might_ work
+in the `isinstance(df, pandas.DataFrame)` arm, but there's no guarantee -
+their APIs are similar, but subtly different. Furthermore, as new libraries
+come out, you'd have to keep updating your function to add new `if` statements.
+
+Can we do better?
+
+### Solution 2: Interchange Protocol
+
+An alternative, which wouldn't involve special-casing, could be to
+leverage the [DataFrame interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html):
+```python
+def remove_outliers(df: object, column: str) -> pd.DataFrame:
+    df_pd = pd.api.interchange.from_dataframe(df)
+    z_score = (df_pd[column] - df_pd[column].mean())/df_pd[column].std()
+    return df_pd[z_score.between(-3, 3)]
+```
+We got out of having to write if-then statements (🥳), but there are still a
+couple of issues:
+1. we had to convert to pandas: this might be expensive if your data was
+   originally stored on GPU;
+2. the return value is a `pandas.DataFrame`, rather than an object of your
+   original DataFrame library.
+
+Can we do better? Can we really have it all?
+
+### Solution 3: Introducing the DataFrame Standard
+
+Yes, we really can. To write cross-DataFrame code, we'll take these steps:
+1. enable the Standard using ``.__dataframe_standard__``. This will return
+   a Standard-compliant DataFrame;
+2. write your code, using the [DataFrame Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
+3. (optional) return a DataFrame from your original library by calling `.dataframe`.
+
+Let's see how this would look for our ``remove_outliers`` example function:
+```python
+def remove_outliers(df, column):
+    # Get a Standard-compliant DataFrame.
+    # NOTE: this has not yet been upstreamed, so won't work out-of-the-box!
+    # See 'resources' below for how to try it out.
+    df_standard = df.__dataframe_standard__()
+    # Use methods from the Standard specification.
+    col = df_standard.get_column_by_name(column)
+    z_score = (col - col.mean()) / col.std()
+    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
+    # Return the result as a DataFrame from the original library.
+    return df_standard_filtered.dataframe
+```
+This will work, as if by magic, on any DataFrame with a Standard-compliant implementation.
+But it's not magic, of course, it's the power of standardisation!
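+
+To make this concrete, here's a sketch of how you might call ``remove_outliers``
+once an implementation is available. This is hypothetical for now - as noted in
+the snippet above, ``__dataframe_standard__`` hasn't been upstreamed yet, so you'd
+need the proof-of-concept implementation from the 'Resources' section to run it:
+```python
+import pandas as pd
+import polars as pl
+
+data = {"a": [1.5, -0.3, 0.7, 2.1, 0.2], "b": [1, 2, 3, 4, 5]}
+
+# The exact same function, with no special-casing on the DataFrame library:
+pd_result = remove_outliers(pd.DataFrame(data), column="a")  # a pandas.DataFrame
+pl_result = remove_outliers(pl.DataFrame(data), column="a")  # a polars.DataFrame
+```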
+
+## Standard Philosophy - will all DataFrame libraries have the same API one day?
+
+Let's start with what this isn't: the Standard isn't an attempt to force all DataFrame
+libraries to have the same API. It also isn't a way to convert
+between DataFrames: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
+whose adoption is increasing, already does that. It also doesn't aim to standardise
+domain or industry specific functionality.
+
+Rather, it is a minimal set of essential DataFrame functionality which will work
+the same way across libraries. It will behave in a strict and predictable manner
+across DataFrame libraries. Library authours trying to write DataFrame-agnostic
+code are expected to greatly benefit from this, as are their users.
+
+## Who's this for? Do I need to learn yet another API?
+
+If you're a casual user, then probably not.
+The DataFrame Standard is currently mainly targeted towards library developers,
+who wish to support multiple DataFrame libraries. Users of non-pandas DataFrame
+would then be able to seamlessly use the DataFrame tools (e.g. visualisation,
+feature engineering, data cleaning) without having to do any expensive data
+conversions.
+
+If you're a library authour, then we'd love to hear from you. Would this be
+useful to you? We expect it to, as the demand for DataFrame-agnostic tools
+certainly seems to be there:
+- https://github.com/mwaskom/seaborn/issues/3277,
+- https://github.com/scikit-learn/scikit-learn/issues/25896
+- https://github.com/plotly/plotly.py/issues/3637
+- (many, many more...)
+
+## Are we there yet? What lies ahead?
+
+No, not yet. This is just a first draft, and a request for comments.
+
+Future plans include:
+- increasing the scope of the Standard (currently, the spec is very minimal);
+- creating implementations of the Standard for several major DataFrame libraries;
+- creating a cross-DataFrame test-suite;
+- aiming to ensure each major DataFrame library has a `__dataframe_standard__` method.
+
+## Conclusion
+
+We've introduced the DataFrame Standard, which allows you to write cross-DataFrame code.
+We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw
+what plans lie ahead - the Standard is in active development, so please watch this space!
+
+## Resources
+
+- Read more on the [official website](https://data-apis.org/dataframe-api/)
+- Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)!

From 081e122277c1c7ca52ebd817450cc8236b348b59 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Fri, 2 Jun 2023 14:33:51 +0100
Subject: [PATCH 2/4] dataframe, authours typo

---
 content/blog/dataframe_standard_RFC.md | 62 +++++++++++++-------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md
index 6e43bc9..86d37da 100644
--- a/content/blog/dataframe_standard_RFC.md
+++ b/content/blog/dataframe_standard_RFC.md
@@ -1,10 +1,10 @@
 +++
 date = "2023-05-25"
 author = "Marco Gorelli"
-title = "Want to super-charge your library by writing DataFrame-agnostic code? We'd love to hear from you"
+title = "Want to super-charge your library by writing dataframe-agnostic code? We'd love to hear from you"
 tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"]
 categories = ["Consortium", "Standardization"]
-description = "An RFC for a DataFrame API Standard"
+description = "An RFC for a dataframe API Standard"
 draft = true
 weight = 40
 +++
@@ -12,21 +12,21 @@ weight = 40
 <!-- figure: standard-compliant DataFrame -->
 Tired of getting lost in if-then statements when dealing with API differences
-between DataFrame libraries? Would you like to be able to write your code
-once, have it work with all major DataFrame libraries, and be done?
+between dataframe libraries? Would you like to be able to write your code
+once, have it work with all major dataframe libraries, and be done?
 Let's learn about an initiative which will enable you to write
-cross-DataFrame code - no special-casing nor data conversions required!
+cross-dataframe code - no special-casing nor data conversions required!
 
 ## Why would I want this anyway?
 
-Say you want to write a function which selects rows of a DataFrame based
+Say you want to write a function which selects rows of a dataframe based
 on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given
-column, and you want it to work with any DataFrame library. How might
+column, and you want it to work with any dataframe library. How might
 you write that?
 
 ### Solution 1
@@ -65,22 +65,22 @@ couple of issues:
 1. we had to convert to pandas: this might be expensive if your data was
    originally stored on GPU;
 2. the return value is a `pandas.DataFrame`, rather than an object of your
-   original DataFrame library.
+   original dataframe library.
 
 Can we do better? Can we really have it all?
 
-### Solution 3: Introducing the DataFrame Standard
+### Solution 3: Introducing the Dataframe Standard
 
-Yes, we really can. To write cross-DataFrame code, we'll take these steps:
+Yes, we really can. To write cross-dataframe code, we'll take these steps:
 1. enable the Standard using ``.__dataframe_standard__``. This will return
-   a Standard-compliant DataFrame;
-2. write your code, using the [DataFrame Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
-3. (optional) return a DataFrame from your original library by calling `.dataframe`.
+   a Standard-compliant dataframe;
+2. write your code, using the [Dataframe Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
+3. (optional) return a dataframe from your original library by calling `.dataframe`.
 
 Let's see how this would look for our ``remove_outliers`` example function:
 ```python
 def remove_outliers(df, column):
-    # Get a Standard-compliant DataFrame.
+    # Get a Standard-compliant dataframe.
     # NOTE: this has not yet been upstreamed, so won't work out-of-the-box!
     # See 'resources' below for how to try it out.
     df_standard = df.__dataframe_standard__()
     # Use methods from the Standard specification.
@@ -88,36 +88,36 @@ def remove_outliers(df, column):
     col = df_standard.get_column_by_name(column)
     z_score = (col - col.mean()) / col.std()
     df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
-    # Return the result as a DataFrame from the original library.
+    # Return the result as a dataframe from the original library.
     return df_standard_filtered.dataframe
 ```
-This will work, as if by magic, on any DataFrame with a Standard-compliant implementation.
+This will work, as if by magic, on any dataframe with a Standard-compliant implementation.
 But it's not magic, of course, it's the power of standardisation!
 
-## Standard Philosophy - will all DataFrame libraries have the same API one day?
+## Standard Philosophy - will all dataframe libraries have the same API one day?
 
-Let's start with what this isn't: the Standard isn't an attempt to force all DataFrame
+Let's start with what this isn't: the Standard isn't an attempt to force all dataframe
 libraries to have the same API. It also isn't a way to convert
-between DataFrames: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
+between dataframes: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
 whose adoption is increasing, already does that. It also doesn't aim to standardise
 domain or industry specific functionality.
 
-Rather, it is a minimal set of essential DataFrame functionality which will work
+Rather, it is a minimal set of essential dataframe functionality which will work
 the same way across libraries. It will behave in a strict and predictable manner
-across DataFrame libraries. Library authours trying to write DataFrame-agnostic
+across dataframe libraries. Library authors trying to write dataframe-agnostic
 code are expected to greatly benefit from this, as are their users.
 
 ## Who's this for? Do I need to learn yet another API?
 
 If you're a casual user, then probably not.
-The DataFrame Standard is currently mainly targeted towards library developers,
-who wish to support multiple DataFrame libraries. Users of non-pandas DataFrame
-would then be able to seamlessly use the DataFrame tools (e.g. visualisation,
+The Dataframe Standard is currently mainly targeted towards library developers,
+who wish to support multiple dataframe libraries. Users of non-pandas dataframe
+would then be able to seamlessly use the dataframe tools (e.g. visualisation,
 feature engineering, data cleaning) without having to do any expensive data
 conversions.
 
-If you're a library authour, then we'd love to hear from you. Would this be
-useful to you? We expect it to, as the demand for DataFrame-agnostic tools
+If you're a library author, then we'd love to hear from you. Would this be
+useful to you? We expect it to, as the demand for dataframe-agnostic tools
 certainly seems to be there:
 - https://github.com/mwaskom/seaborn/issues/3277,
 - https://github.com/scikit-learn/scikit-learn/issues/25896
 - https://github.com/plotly/plotly.py/issues/3637
@@ -130,13 +130,13 @@ No, not yet. This is just a first draft, and a request for comments.
 
 Future plans include:
 - increasing the scope of the Standard (currently, the spec is very minimal);
-- creating implementations of the Standard for several major DataFrame libraries;
-- creating a cross-DataFrame test-suite;
-- aiming to ensure each major DataFrame library has a `__dataframe_standard__` method.
+- creating implementations of the Standard for several major dataframe libraries;
+- creating a cross-dataframe test-suite;
+- aiming to ensure each major dataframe library has a `__dataframe_standard__` method.
 
 ## Conclusion
 
-We've introduced the DataFrame Standard, which allows you to write cross-DataFrame code.
+We've introduced the Dataframe Standard, which allows you to write cross-dataframe code.
 We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw
 what plans lie ahead - the Standard is in active development, so please watch this space!
 
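
As an aside on the philosophy section above: the conversion which the Standard deliberately
leaves out is already covered by the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html)
today. A minimal sketch, assuming reasonably recent pandas (>= 1.5) and polars releases,
both of which implement the protocol:

```python
import pandas as pd
import polars as pl

df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Any object exposing `__dataframe__` can be converted this way -
# no special-casing on the concrete dataframe type required.
df_pd = pd.api.interchange.from_dataframe(df_pl)
```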
From 2c1bdbcccdf2b224b41fb7aaed92bd7bfff13115 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Fri, 2 Jun 2023 14:53:02 +0100 Subject: [PATCH 3/4] update as per feedback --- content/blog/dataframe_standard_RFC.md | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md index 86d37da..bfe506d 100644 --- a/content/blog/dataframe_standard_RFC.md +++ b/content/blog/dataframe_standard_RFC.md @@ -5,7 +5,7 @@ title = "Want to super-charge your library by writing dataframe-agnostic code? W tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"] categories = ["Consortium", "Standardization"] description = "An RFC for a dataframe API Standard" -draft = true +draft = false weight = 40 +++ @@ -94,7 +94,7 @@ def remove_outliers(df, column): This will work, as if by magic, on any dataframe with a Standard-compliant implementation. But it's not magic, of course, it's the power of standardisation! -## Standard Philosophy - will all dataframe libraries have the same API one day? +## The Standard's philosophy - will all dataframe libraries have the same API one day? Let's start with what this isn't: the Standard isn't an attempt to force all dataframe libraries to have the same API. It also isn't a way to convert @@ -112,12 +112,12 @@ code are expected to greatly benefit from this, as are their users. If you're a casual user, then probably not. The Dataframe Standard is currently mainly targeted towards library developers, who wish to support multiple dataframe libraries. Users of non-pandas dataframe -would then be able to seamlessly use the dataframe tools (e.g. visualisation, -feature engineering, data cleaning) without having to do any expensive data -conversions. +libraries would then be able to seamlessly use the Python packages which +provide functionality for dataframes (e.g. visualisation, feature engineering, +data cleaning) without having to do any expensive data conversions. If you're a library author, then we'd love to hear from you. Would this be -useful to you? We expect it to, as the demand for dataframe-agnostic tools +useful to you? We expect it to be, as the demand for dataframe-agnostic tools certainly seems to be there: - https://github.com/mwaskom/seaborn/issues/3277, - https://github.com/scikit-learn/scikit-learn/issues/25896 @@ -126,11 +126,16 @@ certainly seems to be there: ## Are we there yet? What lies ahead? -No, not yet. This is just a first draft, and a request for comments. +This is just a first draft, based on design discussions between authors from various +dataframe libraries, and a request for comments (RFC). Our goal is to solicit input +from a wider range of potential stakeholders, and evolve the Standard throughout +the rest of 2023, resulting in a first official release towards the end of the year. 
Future plans include: -- increasing the scope of the Standard (currently, the spec is very minimal); -- creating implementations of the Standard for several major dataframe libraries; +- increasing the scope of the Standard based on real-world code from widely used + packages (currently, the spec is very minimal); +- creating implementations of the Standard for several major dataframe libraries + (initially available as a separate ``dataframe-api-compat`` package); - creating a cross-dataframe test-suite; - aiming to ensure each major dataframe library has a `__dataframe_standard__` method. @@ -142,5 +147,6 @@ what plans lie ahead - the Standard is in active development, so please watch th ## Resources -- Read more on the [official website](https://data-apis.org/dataframe-api/) +- Read more and contribute to the discussion on the + [official website](https://data-apis.org/dataframe-api/) - Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)! From 34232798ff571cd496137c1a4592277e5568300a Mon Sep 17 00:00:00 2001 From: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com> Date: Tue, 13 Jun 2023 11:58:31 +0100 Subject: [PATCH 4/4] reword resources comment --- content/blog/dataframe_standard_RFC.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/blog/dataframe_standard_RFC.md b/content/blog/dataframe_standard_RFC.md index bfe506d..b14675b 100644 --- a/content/blog/dataframe_standard_RFC.md +++ b/content/blog/dataframe_standard_RFC.md @@ -147,6 +147,5 @@ what plans lie ahead - the Standard is in active development, so please watch th ## Resources -- Read more and contribute to the discussion on the - [official website](https://data-apis.org/dataframe-api/) +- Read more on the [official website](https://data-apis.org/dataframe-api/), and contribute to the discussion on the [GitHub repo](https://github.com/data-apis/dataframe-api) - Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)!
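
To close with a concrete picture of where this is heading: once dataframe libraries expose
a `__dataframe_standard__` entry point (natively, or via the planned ``dataframe-api-compat``
package), a dataframe-consuming library could opt in with a small helper along these lines.
This is a hypothetical sketch - the entry point's name comes from the draft spec and may
still change:

```python
def to_standard_dataframe(df):
    # Prefer the Standard entry point if the object provides one; otherwise fail
    # loudly instead of silently special-casing on the concrete dataframe type.
    if hasattr(df, "__dataframe_standard__"):
        return df.__dataframe_standard__()
    raise TypeError(
        f"{type(df).__name__} does not expose __dataframe_standard__; "
        "see https://data-apis.org/dataframe-api/ for the draft specification"
    )
```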