Description
This was already implemented before 2.0 in #50748, but then removed before the release in #51853, because in too many cases the option wasn't being respected.
The idea is to have a global option to let pandas know which dtype kind to use when data is created (the exact option name needs to be discussed, but I'll use `use_arrow` to illustrate):
pandas.options.mode.use_arrow = True
df = pandas.read_csv(...) # The returned DataFrame will use pyarrow dtypes
df["foo"] = 1 # The added column will use pyarrow dtypes
df = pandas.DataFrame(...) # The returned DataFrame will use pyarrow dtypes
...
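For contrast, a minimal sketch of what is possible today without a global option: since pandas 2.0 the `dtype_backend` keyword has to be passed on each I/O call. The CSV contents and variable names here are made up for illustration, and `"numpy_nullable"` is used so the snippet runs without pyarrow installed; `"pyarrow"` is the other accepted value.

```python
import io

import pandas as pd

# Made-up sample data, only for illustration.
CSV = "a,b\n1,x\n2,y\n"

# Default behavior: NumPy-backed dtypes.
default_df = pd.read_csv(io.StringIO(CSV))

# Per-call opt-in. Passing dtype_backend="pyarrow" instead would return
# pyarrow-backed dtypes, but requires pyarrow to be installed.
nullable_df = pd.read_csv(io.StringIO(CSV), dtype_backend="numpy_nullable")

print(default_df["a"].dtype)   # int64
print(nullable_df["a"].dtype)  # Int64
```

The keyword only affects that one call; columns added afterwards or frames built with the `DataFrame` constructor still get the default dtypes, which is the gap a global option would close.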
I don't think adding the option is controversial, as it has no impact on users unless set, and it was already implemented without objections in the past.
I think the implementation requires a bit of discussion, as the exact behavior to implement is not immediately obvious, at least to me. The main points I can see:
- Should we have an option to set pyarrow as the default (since those should be the types we expect people to use in the future), or a more generic option to set `dtype_backend` to `numpy|nullable|pyarrow`?
- I think at least initially it makes sense that if a user is specific about the dtype they want to use (e.g. `Series([1, 2], dtype="Int32")`) we let them do it. But could it make sense to have a second option `force_arrow` or `force_dtype_backend` so any operation that would use another dtype kind would fail? I think this could be helpful for users that only want to live in the pyarrow world, and it would also be helpful to identify undesired casts for us.
- The exact namespace (`mode` vs `future` vs others) and name of the option, which clearly will depend on the previous points
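
To make the first two points concrete, here is a rough pure-Python sketch of one way the resolution rules could work. Everything in it is hypothetical: `resolve_backend`, `backend_of`, the `force` flag, and the dtype-name heuristic are illustrative assumptions, not pandas API and not a proposed implementation.

```python
from typing import Optional

# Hypothetical set of values for a generic dtype_backend option.
BACKENDS = {"numpy", "nullable", "pyarrow"}


def backend_of(dtype: str) -> str:
    """Guess the backend of an explicit dtype string (crude heuristic)."""
    if dtype.endswith("[pyarrow]"):
        return "pyarrow"
    # Nullable extension dtypes are capitalized ("Int32", "Float64"),
    # plus the "string" and "boolean" aliases.
    if dtype[:1].isupper() or dtype in {"string", "boolean"}:
        return "nullable"
    return "numpy"


def resolve_backend(
    explicit_dtype: Optional[str],
    dtype_backend: str = "numpy",
    force: bool = False,
) -> str:
    """Return the backend to use for newly created data."""
    if dtype_backend not in BACKENDS:
        raise ValueError(f"unknown dtype_backend: {dtype_backend!r}")
    if explicit_dtype is None:
        # No explicit dtype: the global option decides.
        return dtype_backend
    requested = backend_of(explicit_dtype)
    if force and requested != dtype_backend:
        # Strict mode: anything outside the configured backend fails.
        raise TypeError(
            f"dtype {explicit_dtype!r} is not a {dtype_backend} dtype"
        )
    # Non-strict mode: an explicit dtype wins over the global option.
    return requested


print(resolve_backend(None, "pyarrow"))     # pyarrow
print(resolve_backend("Int32", "pyarrow"))  # nullable
```

With `force=True`, the second call would raise `TypeError` instead, which is the behavior that could help surface undesired casts.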