The only change I made to my normally working reduction pipeline was to try engine="python" (I wanted to use nrows for a smaller test read, but that fails as well, so I thought maybe the python engine is currently buggy):
$ python reduction.py ~/data/planet4/2015-06-21_planet_four_classifications.csv
INFO:Starting reduction.
Traceback (most recent call last):
  File "reduction.py", line 258, in <module>
    args.test_n_rows, args.remove_duplicates)
  File "reduction.py", line 182, in main
    data = [chunk for chunk in reader]
  File "reduction.py", line 182, in <listcomp>
    data = [chunk for chunk in reader]
  File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 697, in __iter__
    yield self.read(self.chunksize)
  File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 1556, in read
    content = self._get_lines(rows)
  File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 2007, in _get_lines
    for _ in range(rows):
TypeError: 'float' object cannot be interpreted as an integer
My function call is this:
# As chunksize and nrows cannot be used together yet, I switch chunksize
# to None if I want test_n_rows for a small test database:
if test_n_rows:
    chunks = None
else:
    chunks = 1e6
# Creating the reader object with the pandas interface for csv parsing,
# doing this in chunks as it's faster. Also, later I will do a split
# into multiple processes to do this.
reader = pd.read_csv(fname, chunksize=chunks, na_values=['null'],
                     usecols=analysis_cols, nrows=test_n_rows,
                     engine='c')
Using pandas-0.16.2_58_g01995b2-py3.4
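
Judging from the last traceback frame (`for _ in range(rows)`), a likely trigger is that `1e6` in the call above is a float literal, and Python's built-in `range()` only accepts integers. The sketch below reproduces that error with the standard library alone, without pandas, and shows one possible workaround: converting the chunk size to an `int` before it is passed on. The variable names `chunks_int`, `float_accepted`, `message` and `n_iterations` are made up for illustration, and whether this is the only problem in the python engine path is an assumption, not something the traceback proves.

```python
# The chunk size as written in the function call above: 1e6 is a float.
chunks = 1e6

# range() rejects floats, which matches the final traceback frame.
try:
    range(chunks)
    float_accepted = True
except TypeError as exc:
    float_accepted = False
    message = str(exc)

# Possible workaround (an assumption, not a confirmed fix): use an integer
# chunk size, either by converting or by writing 10**6 directly.
chunks_int = int(chunks)
n_iterations = len(range(chunks_int))

print(float_accepted, chunks_int, n_iterations)
```

With an integer value such as `chunks_int`, the `chunksize=` argument no longer hands a float to `range()`, so this particular `TypeError` should not be raised.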