Description
When reading Australian electricity market (NEM) data files, read_csv reads past the nrows limit for certain nrows values and consequently fails.
These market data files are four CSV tables combined into a single file, so the file has multiple header rows and a variable field count across rows.
The first set of data spans rows 1-1442.
The intent was to extract that first set with nrows=1442.
Testing several arbitrary CSV files from this data source shows well-formed CSV: 120 fields in each of rows 1 to 1442 (with a 10-field row at row 0).
# count the fields in each of the first 1442 lines of the open csvFile
lines = [len(line.strip().split(',')) for i, line in enumerate(csvFile) if i < 1442]
s = pd.Series(lines)
print(s.value_counts())
which returns:
120 1441
10 1
dtype: int64
Other Python examples that read the market data with the csv module work fine.
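As a minimal illustration (using a small synthetic stand-in for the real NEM file, since only the layout matters here), the csv module handles the variable field counts without complaint:

```python
import csv
import io

# Synthetic stand-in for the combined NEM file: a short row at the top
# followed by wider, uniform data rows (the real file has a 10-field
# row 0 and 120-field data rows). Column names are made up.
sample = io.StringIO(
    "C,HEADER,1\n"
    "I,PRICE,A,B,SETTLEMENTDATE,RRP\n"
    "D,PRICE,1,2,2014-06-29 00:30:00,35.5\n"
    "D,PRICE,3,4,2014-06-29 01:00:00,36.0\n"
)

# csv.reader simply yields each row with however many fields it has
field_counts = [len(row) for row in csv.reader(sample)]
print(field_counts)  # [3, 6, 6, 6]
```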
In the reproducible example below, the code works for nrows <= 823 but fails for any larger value.
Testing on other files suggests the 824 limit varies from file to file; sometimes it is a few rows more, sometimes a few rows less.
import requests, io, zipfile
import pandas as pd

url = 'http://www.nemweb.com.au/Reports/CURRENT/Public_Prices/PUBLIC_PRICES_201406290000_20140630040528.zip'

# get the zip archive
request = requests.get(url)

# make the archive available as a byte stream
zipdata = io.BytesIO(request.content)
thezipfile = zipfile.ZipFile(zipdata, mode='r')

# there is only one csv file per archive - read it into a pandas DataFrame
fname = thezipfile.namelist()[0]

# works for nrows <= 823
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=823)
    print(df1.head())

# reopen the file so the second read starts from the top;
# fails for nrows > 823
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=824)
    print(df1.head())
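A possible workaround, sketched on synthetic data shaped like the real file (the column names and line count are assumptions, not the real layout), is to slice off the wanted block of lines with itertools.islice before handing it to pandas, so read_csv never sees the later embedded tables:

```python
import io
import itertools
import pandas as pd

# Synthetic stand-in shaped like the combined NEM file: a short top row,
# a header row, two data rows, then the start of the next embedded table.
raw = io.StringIO(
    "C,HEADER,1\n"
    "I,PRICE,A,B,SETTLEMENTDATE,RRP\n"
    "D,PRICE,1,2,2014-06-29 00:30:00,35.5\n"
    "D,PRICE,3,4,2014-06-29 01:00:00,36.0\n"
    "I,OTHER,X,Y\n"
)

# take only the lines belonging to the first table (here 4), then parse
first_block = "".join(itertools.islice(raw, 4))
df = pd.read_csv(io.StringIO(first_block), header=1, index_col=4,
                 parse_dates=True)
print(df.shape)  # (2, 5)
```

Because pandas only ever sees a uniform fragment, no nrows limit is needed at all; the same slicing would apply to the real file with the block boundary at line 1442.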