Skip to content

Add a fill_null method to dataframe and column #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 7, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions spec/API_specification/dataframe_api/column_object.py
Original file line number Diff line number Diff line change
Expand Up @@ -641,3 +641,16 @@ def fill_nan(self, value: float | 'null', /) -> Column:

"""
...

def fill_null(self, value: Scalar, /) -> Column:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the PR to discuss it but it would be really nice if for typing purposes we could type this as:

    def fill_null(self[T], value: Scalar[T], /) -> Column[T]:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gh-187 makes a big step in this direction. That said, the generics don't quite work here I think, since the T in Scalar[T] is a Python scalar while in Column[T] it's a dtype. We happened to have a good discussion about gh-187 yesterday, and striking a balance between making the spec fully type-checkable and keeping it readable. I suspect this kind of case requires overloads, which may be useful at some point but can't be rendered as html - so it divorces static typing a bit from spec writing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a way to make Scalar[T] work, and every library can define what their scalar type is corresponding to each of the standard's types. But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and every library can define what their scalar type is corresponding to each of the standard's types

Quick comment: they are, or at least include, Python builtin types, which one has no control over. So it'd be int) -> Column[Int64] or some such thing.

But that's for another issue, let's pick it up again after #187 is in (I'll update it today)

Sounds good, thanks.

"""
Fill null values with the given fill value.

Parameters
----------
value : Scalar
Value used to replace any ``null`` values in the column with.
Must be of the Python scalar type matching the dtype of the column.

"""
...
32 changes: 32 additions & 0 deletions spec/API_specification/dataframe_api/dataframe_object.py
Original file line number Diff line number Diff line change
Expand Up @@ -774,3 +774,35 @@ def fill_nan(self, value: float | 'null', /) -> DataFrame:

"""
...

def fill_null(
self, value: Scalar, /, *, column_names : list[str] | None = None
) -> DataFrame:
"""
Fill null values with the given fill value.

This method can only be used if all columns that are to be filled are
of the same dtype (e.g., all of ``Float64`` or all of string dtype).
If that is not the case, it is not possible to use a single Python
scalar type that matches the dtype of all columns to which
``fill_null`` is being applied, and hence an exception will be raised.

Parameters
----------
value : Scalar
Value used to replace any ``null`` values in the dataframe with.
Must be of the Python scalar type matching the dtype(s) of the dataframe.
column_names : list[str] | None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alternative name could be subset, to indicate you want to apply the filling operation only to a subset of the columns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently don't use either of column_names or subset. I'm happy either way, I think the former is a bit more explicit about what it refers to while the latter expresses better what is done with the provided names.

Anyone else have a preference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems fine

A list of column names for which to replace nulls with the given
scalar value. If ``None``, nulls will be replaced in all columns.

Raises
------
TypeError
If the columns of the dataframe are not all of the same kind.
KeyError
If ``column_names`` contains a column name that is not present in
the dataframe.

"""
...