Part of the discussion on missing value handling in #28095, detailed proposal at https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB.
If we go for a new NA value, we also need to decide how this value behaves in comparison operations. Consequently, we also need to decide how boolean values with missing data behave in logical operations and indexing operations.
So let's use this issue for that part of the discussion.
Some aspects of this:
- Behaviour in comparison operations: currently `np.nan` compares unequal (`value == np.nan -> False`, `values > np.nan -> False`), but we can also propagate missing values (`value == NA -> NA`, ...).
- Behaviour in logical operations: currently we always return False for `|` or `&` with missing data. But we could also use a "three-valued logic" like Julia and SQL (this has, e.g., `NA | True = True` or `NA & True = NA`).
- Behaviour in indexing: currently you cannot do boolean indexing with a boolean series with missing values (which is object dtype right now). Do we want to change this? For example, interpret it as False (not select it)
(TODO: should check how other languages do this)
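
To make the comparison and logical options above concrete, here is a minimal sketch. The first part shows the current `np.nan` comparison behaviour; the second part illustrates propagation and three-valued (Kleene) logic. Note that `NA` here is just Python's `None` used as a hypothetical stand-in for the proposed value, and `propagating_eq`, `kleene_or` and `kleene_and` are illustrative helper names, not existing pandas API.

```python
import numpy as np

# Current behaviour: any comparison with np.nan evaluates to False.
arr = np.array([1.0, np.nan, 3.0])
eq_nan = arr == np.nan   # elementwise, all False (even for the nan element)
gt_two = arr > 2         # the nan element compares False

# Hypothetical stand-in for the proposed NA scalar (an assumption for this sketch).
NA = None


def propagating_eq(a, b):
    """Equality that propagates NA instead of returning False."""
    if a is NA or b is NA:
        return NA
    return a == b


def kleene_or(a, b):
    """Three-valued OR: a known True decides the result, otherwise NA propagates."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    """Three-valued AND: a known False decides the result, otherwise NA propagates."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True
```

With these definitions, `kleene_or(NA, True)` is `True` while `kleene_and(NA, True)` stays `NA`, matching the Julia/SQL examples in the list above.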
Julia has a nice documentation page explaining how they support missing values; the above ideas largely match that.
Besides those behavioural API discussions, we also need to decide on how to approach this technically (boolean ExtensionArray with boolean numpy array + mask for missing values?) Shall we discuss that here as well, or keep that separate?
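
As a starting point for the technical side, a rough sketch of the "boolean numpy array + mask" idea could look like the following. `MaskedBool` is a hypothetical class made up for this illustration (it is not an ExtensionArray and not pandas API), and treating NA as False in `to_indexer` is only one of the options raised above, not a decision.

```python
import numpy as np


class MaskedBool:
    """Hypothetical container: boolean values plus a mask (True = missing)."""

    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype=bool)
        self.mask = np.asarray(mask, dtype=bool)

    def __or__(self, other):
        # A known True on either side decides the result.
        known_true = (self.values & ~self.mask) | (other.values & ~other.mask)
        # Otherwise the result is missing if either operand is missing.
        mask = (self.mask | other.mask) & ~known_true
        return MaskedBool(known_true, mask)

    def __and__(self, other):
        # A known False on either side decides the result.
        known_false = (~self.values & ~self.mask) | (~other.values & ~other.mask)
        mask = (self.mask | other.mask) & ~known_false
        # Masked positions get False as a placeholder value.
        values = ~known_false & ~mask
        return MaskedBool(values, mask)

    def to_indexer(self):
        # One possible indexing rule: missing entries do not select.
        return self.values & ~self.mask


# a = [NA, NA, True, False], b = [True, False, NA, NA]
a = MaskedBool([False, False, True, False], [True, True, False, False])
b = MaskedBool([True, False, False, False], [False, False, True, True])

either = a | b   # [True, NA, True, NA]
both = a & b     # [NA, False, NA, False]

data = np.array([10, 20, 30, 40])
selected = data[either.to_indexer()]  # NA treated as "not selected"
```

Keeping the values and the mask as two plain numpy arrays means the logical operations stay vectorised, at the cost of having to define a placeholder value for masked positions.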
cc @pandas-dev/pandas-core