Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Documentation problem
I have noted the following issues of clarity in the `read_csv` docstring description of the `header` parameter:
- The behavior associated with `header=None` is not explicitly defined.
- The default behavior is described in terms of `header=0` and `header=None`, neither of which has been clearly explained at that point.
- The relationship between file line numbers (conventionally numbered from 1) and row numbers/indices (indexed from 0) is not described explicitly; it is only implied by examples such as `header=0` meaning the first line.
- The phrase "if column names are passed explicitly" is vague, as it does not say how they are passed (i.e. via the `names` parameter).
- The detailed descriptions do not follow the order given in the initial list of accepted values.
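To make the second and fourth points concrete, here is a minimal sketch of how the default `'infer'` setting interacts with `names`. The CSV text and the labels `x` and `y` are made up for illustration and are not taken from the pandas docs.

```python
import io

import pandas as pd

# Made-up two-column file whose first line is a header.
csv_text = "a,b\n1,2\n3,4\n"

# No names passed: 'infer' acts like header=0, so line 0 supplies the labels.
inferred = pd.read_csv(io.StringIO(csv_text))

# names passed: 'infer' acts like header=None, so line 0 is kept as a data row.
named = pd.read_csv(io.StringIO(csv_text), names=["x", "y"])

# names passed together with header=0: line 0 is consumed as the header
# and its labels are replaced by the given names.
replaced = pd.read_csv(io.StringIO(csv_text), names=["x", "y"], header=0)

print(inferred.columns.tolist(), len(inferred))
print(named.columns.tolist(), len(named))
print(replaced.columns.tolist(), len(replaced))
```

The `named` frame ends up with three rows because the original header line `a,b` is read as data, which is the case the current wording leaves readers to work out for themselves.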
Suggested fix for documentation
Original docstring:
header : int, list of int, None, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to `header=0` and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to `header=None`. Explicitly pass `header=0` to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if `skip_blank_lines=True`, so `header=0` denotes the first line of data rather than the first line of the file.
Issues of clarity noted in the docstring description: see the list under "Documentation problem" above.
Proposed change to address the issues:
header : int, list of int, None, default 'infer'
Index or indices corresponding to line number(s) in the CSV file that will be read as DataFrame column labels. Index 0 corresponds to the first line in the file (or the first non-blank, non-commented line if `skip_blank_lines=True`). The following arguments are valid:
- Single `int`: denotes the line index at which column labels will be read.
- List of `int`: denotes line indices at which column labels will be read as a multi-index. Note: intervening rows not specified in the list will be skipped (e.g., for `header=[0,1,3]`, the line at index 2 will be skipped).
- `None`: indicates that none of the lines in the file will be interpreted as headers and columns will instead be labelled by column index (or by values passed to the `names` parameter when provided). This is typically for files with no header. If the file has a header which the user intends to override with the `names` parameter, `header` should be assigned 0 instead of `None`.
- `'infer'` (default): behaves as `header=0` if no `names` were passed, otherwise as `header=None`.
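As a rough illustration of the list-of-`int` and `None` cases in the proposed wording, the sketch below uses a made-up five-line file. The cell values are placeholders chosen so that each line is easy to recognise in the result.

```python
import io

import pandas as pd

# Made-up file: lines 0, 1 and 3 hold header levels, line 2 is filler,
# and line 4 is the only data row.
csv_text = "a,b\nx,y\nskip,me\np,q\n1,2\n"

# List of int: lines 0, 1 and 3 become the levels of a column MultiIndex;
# line 2 is not listed, so it is skipped rather than read as data.
multi = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 3])

# None: no line is treated as a header, so columns are labelled by
# position and all five lines are read as data rows.
unlabelled = pd.read_csv(io.StringIO(csv_text), header=None)

print(multi.columns.tolist(), len(multi))
print(unlabelled.columns.tolist(), len(unlabelled))
```

With `header=None` the column labels are the integers 0 and 1, which matches the "labelled by column index" phrasing in the proposal.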
Note: this is more in line with how the `read_excel` function is described, which would also improve consistency between the two similar functions. I would also like to propose further edits to other parameter descriptions in the function, but wanted to gauge support first by focusing on a specific example.