Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Documentation problem
The documentation for read_csv's nrows argument says:
Number of rows of file to read. Useful for reading pieces of large files.
I want to read a file using header=1, and then limit the number of rows. The documentation says nrows counts the number of rows of the file. To me that sounds like it includes the skipped row and the column header row, since pandas still reads those rows from the file. But in my testing, the nrows argument counts only data rows: it excludes the rows before the header and the column header row itself. The same holds for skiprows (skipped rows aren't counted towards nrows). A row that is entirely a comment also doesn't count towards nrows.
from io import StringIO

import pandas as pd

csv = """extra,
a,b
1,1
#comment,comment
2,2
3,3
footer,blah,yeah
"""

# header=1 makes the second line ("a,b") the column header
with StringIO(csv) as buf:
    df = pd.read_csv(buf, header=1, nrows=2, comment='#')
For nrows=2, it seems to always return 2 rows.
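As a further check, here is a minimal sketch that compares the header=1 and skiprows=1 cases side by side. It only reflects the behaviour I observed, not a documented guarantee. The helper name read_rows and the CSV_TEXT constant are made up for this illustration, and the footer line is left out to keep the sample small.

```python
from io import StringIO

import pandas as pd

# Same sample as above, minus the footer line (dropped for this illustration).
CSV_TEXT = """extra,
a,b
1,1
#comment,comment
2,2
3,3
"""


def read_rows(**kwargs):
    """Hypothetical helper: parse CSV_TEXT with '#' treated as the comment marker."""
    return pd.read_csv(StringIO(CSV_TEXT), comment="#", **kwargs)


# Case 1: the header is on the second line, so the first line is discarded.
via_header = read_rows(header=1, nrows=2)

# Case 2: the first line is skipped explicitly, then the header is inferred.
via_skiprows = read_rows(skiprows=1, nrows=2)

print(via_header)
print(len(via_header), len(via_skiprows))
```

In both cases I get exactly two data rows, (1, 1) and (2, 2), so neither the discarded first line, the header, nor the comment row appears to be counted towards nrows.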
Suggested fix for documentation
Number of rows of data to read. Useful for reading pieces of large files. Refers to the number of included data rows. The following rows are not included in the count:
- the column header
- rows before the column header, if header=1 or larger
- rows which are fully comment rows
- rows skipped with skiprows