Description
Change to str dtype behaviour for missing elements
Following comments in the discussion about how to handle missing NA scalar values in #28778, I was asked to raise my question as a separate issue.
My rather prosaic question is this: if missing str elements are given the value NA, how would I distinguish between a missing str value and the two-character string 'NA'?
I ask because NA is a common abbreviation for 'Not Applicable', 'North America' and so on, in a way that, in my experience, 'NaN' or 'Not a Number' isn't.
That is, if 'NA' were generated as the default missing value for the str dtype, especially if introduced as a change rather than as an opt-in, it risks becoming a UX issue for developers, as I (for one) would no longer know whether 'NA' is a valid or a missing data value.
For what it's worth, the current idiomatic behaviour is that missing values in a DataFrame are replaced by None:
>>> import pandas as pd
>>> array = [['No-one', 'Nadie'], ['Expects']]
>>> df = pd.DataFrame(array, columns=['En', 'Es'])
>>> df
        En     Es
0   No-one  Nadie
1  Expects   None
The types of the elements here are:
>>> [type(i) for i in df['Es']]
[<class 'str'>, <class 'NoneType'>]
Given this, my thought is that NA is not a suitable default replacement for missing str dtype elements; rather, None (of NoneType) is.
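To illustrate the distinction I'd like to keep, here is a minimal sketch. It assumes the current behaviour, in which a missing element is stored as None. The column names and data are made up for illustration and do not come from #28778.

```python
import pandas as pd

# Hypothetical data: the second row has no 'code' element at all,
# while the first row holds the literal two-character string 'NA'.
rows = [["North America", "NA"], ["Unknown"]]
df = pd.DataFrame(rows, columns=["region", "code"])

# isna() flags only the genuinely missing element.
missing_mask = df["code"].isna()

# An equality test picks out only the literal string 'NA'.
literal_na_mask = df["code"] == "NA"

print(missing_mask.tolist())     # [False, True]
print(literal_na_mask.tolist())  # [True, False]
```

The two masks do not overlap here, so a valid 'NA' string and a missing value stay distinguishable. My concern is whether that would still be obvious to a user if the missing value were displayed as NA.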