nrows limit fails reading well formed csv files from Australian electricity market data #7626

Closed
@ChristopherShort

When reading Australian electricity market data files, read_csv reads past the nrows limit for certain values of nrows and consequently fails.

Each market data file is four CSV files concatenated into a single CSV file, so it contains multiple header rows and the number of fields varies between rows.

The first set of data is in rows 1-1442.

The intent was to extract the first set of data with nrows = 1442.

Testing several arbitrary CSV files from this data source shows that the first 1442 lines are well formed: a single 10-field row at row 0, followed by 1441 rows of 120 fields each.

# csvFile is the open CSV file handle; count the fields in each of the first 1442 lines
lines = [len(line.strip().split(',')) for i, line in enumerate(csvFile) if i < 1442]
s = pd.Series(lines)
print(s.value_counts())

returns
120 1441
10 1
dtype: int64

Other Python examples that read the market data with the csv module work fine.
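For comparison, here is a minimal sketch of the same field-count check done with the csv module. The `sample` data and the `field_counts` helper are made up for illustration (a tiny stand-in for a combined file, not real market data); the point is only that `csv.reader` combined with `itertools.islice` stops at the requested row and never looks at the wider section that follows.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Synthetic stand-in for a combined file (hypothetical data): a short
# leading row, one section with 4 fields, then a wider section that has
# its own header row.
sample = (
    "C,meta\n"
    "I,a,b,c\n"
    "D,1,2,3\n"
    "D,4,5,6\n"
    "I,a,b,c,d,e\n"
    "D,1,2,3,4,5\n"
)


def field_counts(fileobj, limit):
    """Count how many of the first `limit` rows have each field width."""
    reader = csv.reader(fileobj)
    return Counter(len(row) for row in islice(reader, limit))


# Only the first four rows are read; the 6-field section is never touched.
counts = field_counts(io.StringIO(sample), 4)
print(counts)
```

With a limit of 4 the counter holds only the 2-field and 4-field widths, which mirrors the value_counts output above on a much smaller scale.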

In the reproducible example below, the code works for nrows <= 823 but fails for any larger value.

Testing on arbitrary files suggests the threshold of 824 varies between files: sometimes it is a few rows higher, sometimes a few rows lower.

import io
import zipfile

import pandas as pd
import requests

url = 'http://www.nemweb.com.au/Reports/CURRENT/Public_Prices/PUBLIC_PRICES_201406290000_20140630040528.zip'

# get the zip archive
request = requests.get(url)

# make the archive available as a byte stream
zipdata = io.BytesIO(request.content)
thezipfile = zipfile.ZipFile(zipdata, mode='r')

# there is only one csv file per archive - read it into a pandas DataFrame
fname = thezipfile.namelist()[0]

# works for nrows <= 823
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=823)
    print(df1.head())

# fails for nrows > 823 (the file is reopened so the read starts at the first line)
with thezipfile.open(fname) as csvFile:
    df1 = pd.read_csv(csvFile, header=1, index_col=4, parse_dates=True, nrows=824)
    print(df1.head())
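One possible workaround, sketched here and not tested against the real files, is to cut the stream down to the wanted lines before the parser sees it, so the rows with a different field count never reach read_csv. The `head_buffer` helper and the sample bytes are hypothetical, and the UTF-8 default is an assumption about the file encoding.

```python
import io
from itertools import islice


def head_buffer(fileobj, n_lines, encoding="utf-8"):
    """Return a text buffer holding only the first n_lines lines of fileobj.

    Accepts either a binary stream (such as the object returned by
    ZipFile.open) or a text stream. The encoding is an assumption.
    """
    lines = islice(fileobj, n_lines)
    text = "".join(
        line.decode(encoding) if isinstance(line, bytes) else line
        for line in lines
    )
    return io.StringIO(text)


# Hypothetical binary stream standing in for the zipped CSV member:
# the last two lines belong to a second, wider section.
raw = io.BytesIO(b"C,meta\nI,a,b\nD,1,2\nD,3,4\nI,a,b,c,d\nD,1,2,3,4\n")
buf = head_buffer(raw, 4)
print(buf.getvalue())
```

In the example above, the idea would be to call pd.read_csv(head_buffer(csvFile, 1442), header=1, index_col=4, parse_dates=True) without nrows, on the assumption that the failure comes from the parser encountering the later, differently shaped rows.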
