Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on
main
here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Documentation problem
I have been reviewing the docstring for read_csv
to add some clarifying edits in a PR and encountered some questions while reviewing the parameter descriptions. I found there were a few cases where the parameter types summary did not appear to align with the docstring specifications, as well as a few cases where the description of the types seems to differ from the type hints in the function signature.
I have made an attempt at describing those discrepancies here and have added a few questions in as well just to get clarification on how things are done. Once I get things figured out here, I'll plan to make any appropriate corrections to the docstring in a PR.
Parameters potentially needing edits
sep:
- type hint:
str | None | lib.NoDefault = lib.no_default
- description: str, default ','
Why is the default not just ','
rather than lib.no_default
?
delimiter:
- type hint:
str | None | lib.NoDefault = None
- description: str, default None
It appears that default None should be replaced with str, optional
Also, I noticed for most parameters if the type hint includes lib.NoDefault
then the default value is lib.no_default
but this one has None
instead. What is the reason this one is different? (And I have more questions about the no_default
flag later on.)
header:
- type hint:
int | Sequence[int] | None | Literal["infer"] = "infer"
- description: int, list of int, None, default 'infer'
I would expect 'infer' to occur in the list of options before being highlighted as the default, like so:
int, list of int, 'infer' or None, default 'infer'
names:
- type hint:
Sequence[Hashable] | None | lib.NoDefault = lib.no_default
- description: array-like, optional
Is there a reason it is described as array-like? It seems to only need to be a sequence of unique names which would be more like an ordered set, or for simplicity maybe just list-like? Or should it really be list of Hashable to be specific? Or even better Sequence of Hashable?
index_col:
- type hint:
IndexLabel | Literal[False] | None = None
- description: int, str, sequence of int / str, or False, optional, default None
it appears the default None is redundant since optional is already specified, so I assume this can be removed, is that correct? Something like:
int, str, list of int, list of str or False, optional
Also, I notice that list is used in the descriptions rather than sequence in most cases althoughsequence might be more accurate and descriptive. I'm wondering if there's a reason for that being used less often. This seems to be the only parameter that employs sequence in its description.
usecols:
- type hint:
list[HashableT] | Callable[[Hashable], bool] | None = None
- description: list-like or callable, optional
Since the order of this parameter does not affect the result, and presumably only unique values are allowed, I am wondering if it would be clearer to describe it as set-like? List-like still seems appropriate but just wondering if it could be more specific to highlight the idea that order will not be preserved.
dtype:
- type hint:
DtypeArg | None = None
- description: Type name or dict of column -> type, optional
From the docstring guide, it looks like the description of a dict should refer to its data types, not necessarily their meaning. And it says to use dict of {key : value} so would something like this be more appropriate?
type, Hashable or dict of {Hashable : type or str}, optional
engine:
- type hint:
CSVEngine | None = None
- description: {'c', 'python', 'pyarrow'}, optional
This description does not appear to specify the behavior when None
is passed. Does it default to the 'c' engine? It seems like older versions used to say this, but I don't see it explicitly described in the current version. And if so, would it then make sense to define the type hint as folllow?
CSVEngine = 'c'
converters:
- type hint:
Mapping[Hashable, Callable] | None = None
- description: dict, optional
The docstring guide suggest this description should be:
dict of {Hashable : Callable}
skiprows:
- type hint:
list[int] | int | Callable[[Hashable], bool] | None = None
- description: list-like, int or callable, optional
This could be made more specific by changing to::
list of int, int or Callable, optional
na_values:
- type hint:
Sequence[str] | Mapping[str, Sequence[str]] | None = None
- description: scalar, str, list-like, or dict, optional
The mention of scalar as a valid type is not clear to me; what would that be referring to? Also, it says that str is a valid input; however, the type hint does not include that as an option. What should be changed to resolve this?
parse_dates:
- type hint:
bool | Sequence[Hashable] | None = None
- description: bool or list of int or names or list of lists or dict, default False
This also appears to not align with the type hint which says you can only assign either a bool or a list. Also the default seems to actually be None
, yet the description says it's False
. Looking for clarification here.
infer_datetime_format:
- type hint:
bool | lib.NoDefault = lib.no_default
- description: bool, default False
It appears that the default is technically lib.no_default
but descibed to be False
. What is the reason for doing this rather than assigning its default as False
, such as how keep_date_col
is defined?
date_parser:
- type hint:
Callable | lib.NoDefault = lib.no_default
- description: function, optional
Could be updated to use a specific type, i.e.:
Callable, optional
Also, what is the reason for using lib.no_default
instead of just None
?
date_format:
- type hint:
str | None = None
- description: str or dict of column -> format, default None
The type hint does not appear to match the description here as there is no mention of dict
being a valid argument. Is the description out of date or vice versa?
If the description is accurate, then Similar to dtype
, this could be updated as:
str or dict of {Hashable : str}, optional
compression:
- type hint:
CompressionOptions = "infer"
- description: str or dict, default 'infer'
Could the dict
in the description be updated as dict of {k : v}
? If so, what would the valid types be for k and v, as I was not totally clear about it from the description.
thousands:
- type hint:
str | None = None
- description: str, optional
I assume this should be a str (length 1) like some other parameters. Should that be updated?
decimal:
- type hint:
str = "."
- description: str, default '.'
I assume this should be a str (length 1) like some other parameters. Should that be updated?
quoting:
- type hint:
int = csv.QUOTE_MINIMAL
- description: int or csv.QUOTE_* instance, default 0
It would make sense to add some context to the default here, maybe something like:
int or csv.QUOTE_* instance, default 0 meaning csv.QUOTE_MINIMAL
comment:
- type hint:
str | None = None
- description: str, optional
The description says this should be a single character so it should be changed to:
str (length 1), optional
encoding:
- type hint:
str | None = None
- description: str, optional, default "utf-8"
The default is technically None
but the description says optional, default 'utf-8'. Is there a reason the default is not just 'utf-8' instead of None?
encoding_errors:
- type hint:
str | None = "strict"
- description: str, optional, default "strict"
I am wondering why this is both optional and has a default. What is the behavior when None
is passed?
on_bad_lines:
- type hint:
str = "error"
- description: {'error', 'warn', 'skip'} or callable, default 'error'
The type hint does not mention Callable as an option as described in the docstring. Does this need to be updated?
float_precision:
- type hint:
Literal["high", "legacy"] | None = None
- description: str, optional
It looks like the description should not be the generic str
, but rather:
{'high', 'legacy'}, optional
dtype_backend:
- type hint:
DtypeBackend | lib.NoDefault = lib.no_default
- description: {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
Given that this defaults to NumPy backed DataFrames, why is lib.no_default
used rather than just `'numpy_nullable'*?
General Questions:
Here is a summary of the most common questions that came up while reviewing the parameters:
- Why are some 'optional' parameters not assigned
None
and instead uselib.no_default
(e.g.,delimiter
,index_col
,date_format
)? - Why are some values with defined default behavior not assigned a literal that would trigger that behaviour but instead use
None
orlib.no_default
as a flag (e.g.,encoding
,dtype_backend
)? - What is the appropriate way to describe a 'list-like' object? Is there a best practice for using 'list of ' vs. 'list-like' vs. 'array-like' vs. 'sequence`?
Suggested fix for documentation
My suggested fix would be to create a PR where I update each parameter description as described in the Documentation Problem section such that the descriptions align with both the docstring guide and the function signature's type hints in every case.