
DOC: Unclear description for header in read_csv docstring #53673

Open
@tpaxman

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Documentation problem

I have noted the following clarity issues in the read_csv docstring description of the header parameter:

  • The behavior associated with header=None is not explicitly defined.
  • The default behavior is described in terms of header=0 and header=None, neither of which has been clearly explained at that point.
  • The relationship between file line numbers (which are conventionally numbered from 1) and row numbers/indices (which are indexed from 0) is not described explicitly (it is only alluded to through examples such as header=0 meaning the first line).
  • The description "if column names are passed explicitly" is vague because it does not say how they are passed (i.e. via the names parameter).
  • The detailed descriptions do not follow the order given in the initial list of accepted values.
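
To make the first few points concrete, here is a minimal sketch of how the header and names combinations behave as I understand them. The CSV content and the variable names are made up for illustration and are not taken from the docs.

```python
import io

import pandas as pd

# Hypothetical two-column file with a header line and two data lines.
CSV_TEXT = "a,b\n1,2\n3,4\n"

# Default ('infer') with no names: same result as header=0.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
df_zero = pd.read_csv(io.StringIO(CSV_TEXT), header=0)

# header=None: no line is treated as a header, columns are labelled 0, 1.
df_none = pd.read_csv(io.StringIO(CSV_TEXT), header=None)

# names only: 'infer' acts like header=None, so the "a,b" line becomes data.
df_names = pd.read_csv(io.StringIO(CSV_TEXT), names=["x", "y"])

# names plus header=0: the existing header line is consumed and replaced.
df_replace = pd.read_csv(io.StringIO(CSV_TEXT), names=["x", "y"], header=0)

for label, df in [
    ("default", df_default),
    ("header=0", df_zero),
    ("header=None", df_none),
    ("names only", df_names),
    ("names + header=0", df_replace),
]:
    print(label, list(df.columns), df.shape)
```

The "names only" case is the one I find most surprising from the current wording: the original header line silently ends up as the first data row.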

Suggested fix for documentation

Original docstring:

header : int, list of int, None, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
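
The last sentence of the original docstring is easier to follow with an example. The sketch below assumes a made-up file that starts with a commented line and a blank line, and uses comment="#" so that the commented line is ignored; skip_blank_lines is left at its default of True.

```python
import io

import pandas as pd

# Hypothetical file: physical line 1 is a comment, line 2 is blank,
# line 3 holds the column labels, lines 4 and 5 hold data.
CSV_TEXT = "# note\n\nx,y\n1,2\n3,4\n"

# header=0 picks the first line that is neither blank nor fully commented,
# which is physical line 3 of the file.
df0 = pd.read_csv(io.StringIO(CSV_TEXT), comment="#", header=0)

# header=1 picks the next such line ("1,2"), so those values become labels.
df1 = pd.read_csv(io.StringIO(CSV_TEXT), comment="#", header=1)

print(list(df0.columns), df0.shape)
print(list(df1.columns), df1.shape)
```

So the header index counts lines that survive comment and blank-line filtering, not physical lines in the file.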

Proposed change to address the issues:

header : int, list of int, None, default 'infer'
Index or indices corresponding to line number(s) in the CSV file that will be read as DataFrame column labels. Index 0 corresponds to the first line in the file (or the first non-blank, non-commented line if skip_blank_lines=True). The following arguments are valid:

  • Single int: denotes the line index at which column labels will be read.
  • List of int: denotes line indices at which column labels will be read as a multi-index. Note: intervening rows not specified in the list will be skipped (e.g., for header=[0,1,3], the line at index 2 will be skipped).
  • None: indicates that none of the lines in the file will be interpreted as headers and columns will instead be labelled by column index (or by values passed to the names parameter when provided). This is typically for files with no header. If the file has a header which the user intends to override with the names parameter, header should be assigned 0 instead of None.
  • 'infer' (default): behaves as header=0 if no names were passed, otherwise as header=None.
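
For the list-of-int case, a small sketch may help reviewers check the wording. The file content here is invented for illustration, and the behavior shown is what I expect from the current docstring's claim that unspecified intervening rows are skipped.

```python
import io

import pandas as pd

# Hypothetical file: lines at index 0, 1 and 3 hold column label levels,
# the line at index 2 is filler, and the line at index 4 is the only data row.
CSV_TEXT = "a,b\nc,d\nskip,skip\ne,f\n1,2\n"

# header=[0, 1, 3] builds a three-level column MultiIndex and drops index 2.
df_multi = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1, 3])

print(df_multi.columns.nlevels)
print(list(df_multi.columns))
print(df_multi.shape)
```

Note that the skipped line is neither a label level nor data, which is worth stating in the docstring because it is easy to assume it would be read as the first data row.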

Note: this is more in line with how the header parameter of read_excel is described, so the change would also improve consistency between the two similar functions. I would like to propose further edits to other parameter descriptions in read_csv, but wanted to gauge support first by focusing on one specific example.
