Description
The CParser doesn't appear to handle the compression flag in 0.10.0 when it is given a plain file handle.
vishnu@grsectoo ~/python/pandas $ cat compression.csv
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"
vishnu@grsectoo ~/python/pandas $ zcat compression.csv.gz
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00",
vishnu@grsectoo ~/python/pandas $ bzcat compression.csv.bz2
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"
I've shown my results with Python 3.2.3 and IPython 0.13.1, but the problem also manifests in Python 2.7.3.
vishnu@grsectoo ~/python/pandas $ ipython
Python 3.2.3 (default, Dec 17 2012, 23:03:08)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import pandas as pd
In [2]: import gzip
In [3]: import bz2
The raw csv file reads just fine:
In [4]: with open('compression.csv') as fh:
   ...:     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0])
   ...:
Attempting to read the gzip-compressed version through a plain file handle while passing compression='gzip' fails:
In [5]: with open('compression.csv.gz') as fh:
   ...:     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-5-3e61cc453d63> in <module>()
1 with open('compression.csv.gz') as fh:
----> 2 pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
3
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
390 buffer_lines=buffer_lines)
391
--> 392 return _read(filepath_or_buffer, kwds)
393
394 parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
199
200 # Create the parser.
--> 201 parser = TextFileReader(filepath_or_buffer, **kwds)
202
203 if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
499 self.options['has_index_names'] = kwds['has_index_names']
500
--> 501 self._make_engine(self.engine)
502
503 def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
601 def _make_engine(self, engine='c'):
602 if engine == 'c':
--> 603 self._engine = CParserWrapper(self.f, **self.options)
604 else:
605 if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
880 # #2442
881 kwds['allow_leading_cols'] = self.index_col is not False
--> 882 self._reader = _parser.TextReader(src, **kwds)
883
884 # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
However, if I pass a GzipFile handle instead, the csv reads fine:
In [6]: with gzip.GzipFile('compression.csv.gz') as fh:
   ...:     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:
Also, opening a bzip2 version while passing in compression='bz2' fails:
In [7]: with open('compression.csv.bz2') as fh:
   ...:     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-7-2736c68aa21d> in <module>()
1 with open('compression.csv.bz2') as fh:
----> 2 pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
3
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
390 buffer_lines=buffer_lines)
391
--> 392 return _read(filepath_or_buffer, kwds)
393
394 parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
199
200 # Create the parser.
--> 201 parser = TextFileReader(filepath_or_buffer, **kwds)
202
203 if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
499 self.options['has_index_names'] = kwds['has_index_names']
500
--> 501 self._make_engine(self.engine)
502
503 def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
601 def _make_engine(self, engine='c'):
602 if engine == 'c':
--> 603 self._engine = CParserWrapper(self.f, **self.options)
604 else:
605 if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
880 # #2442
881 kwds['allow_leading_cols'] = self.index_col is not False
--> 882 self._reader = _parser.TextReader(src, **kwds)
883
884 # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Passing a BZ2File handle allows the parser to read the csv:
In [9]: with bz2.BZ2File('compression.csv.bz2') as fh:
   ...:     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:
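In both working cases the handle is already decompressing before the parser sees it. A minimal sketch of that workaround, using only the standard library: `open_decompressed` is a hypothetical helper name, not a pandas function, and it only covers the two formats discussed here.

```python
import bz2
import gzip


def open_decompressed(path, compression=None):
    """Return a binary handle that yields the decompressed file contents.

    Hypothetical helper for illustration; not part of pandas.
    """
    if compression == "gzip":
        return gzip.GzipFile(path, "rb")
    if compression == "bz2":
        return bz2.BZ2File(path, "rb")
    if compression is None:
        # No compression: a plain binary handle is enough.
        return open(path, "rb")
    raise ValueError("unsupported compression: %r" % (compression,))
```

The returned handle can be used in a with block and passed to pd.read_csv in the same way as the GzipFile and BZ2File handles above.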
Version:
In [10]: pd.version.version
Out[10]: '0.10.0'
This issue also manifests in the current git master.
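To make this easier to reproduce against master, the three input files can be regenerated with the standard library alone. This is a sketch: `write_fixtures` and `CSV_TEXT` are names made up for illustration, and the same two lines from compression.csv are written to all three files.

```python
import bz2
import gzip
import os
import tempfile

# Contents of compression.csv as shown at the top of this report.
CSV_TEXT = '"Time","A","B"\n"12/24/2012 12:00:00","1.00","2.00"\n'


def write_fixtures(directory):
    """Write compression.csv plus gzip and bz2 copies; return their paths."""
    data = CSV_TEXT.encode("ascii")
    plain = os.path.join(directory, "compression.csv")
    paths = {"plain": plain, "gzip": plain + ".gz", "bz2": plain + ".bz2"}
    with open(paths["plain"], "wb") as fh:
        fh.write(data)
    with gzip.GzipFile(paths["gzip"], "wb") as fh:
        fh.write(data)
    with bz2.BZ2File(paths["bz2"], "wb") as fh:
        fh.write(data)
    return paths


if __name__ == "__main__":
    # Write the files to a fresh temporary directory and list them.
    for name, path in sorted(write_fixtures(tempfile.mkdtemp()).items()):
        print(name, path)
```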
Thanks,
Vishnu