read_csv, compression, CParser

In pandas 0.10.0, the CParser doesn't appear to honor the compression flag of read_csv when it is given an already-open file handle.

```
vishnu@grsectoo ~/python/pandas $ cat compression.csv
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"
vishnu@grsectoo ~/python/pandas $ zcat compression.csv.gz 
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00",
vishnu@grsectoo ~/python/pandas $ bzcat compression.csv.bz2 
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"
```
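
For anyone who wants to reproduce this, here is a minimal sketch that writes the three fixture files with the standard library only. It assumes all three variants hold the same two lines as the plain file shown above. The temporary directory, `CSV_TEXT`, and `write_fixtures` are names I made up for illustration; they are not part of pandas.

```python
import bz2
import gzip
import os
import tempfile

# Assumed content: the two lines of compression.csv shown above.
CSV_TEXT = '"Time","A","B"\n"12/24/2012 12:00:00","1.00","2.00"\n'


def write_fixtures(directory):
    """Write the plain, gzip, and bz2 variants and return their paths."""
    data = CSV_TEXT.encode("ascii")
    paths = {
        "plain": os.path.join(directory, "compression.csv"),
        "gzip": os.path.join(directory, "compression.csv.gz"),
        "bz2": os.path.join(directory, "compression.csv.bz2"),
    }
    with open(paths["plain"], "wb") as fh:
        fh.write(data)
    with gzip.open(paths["gzip"], "wb") as fh:
        fh.write(data)
    with bz2.BZ2File(paths["bz2"], "w") as fh:
        fh.write(data)
    return paths


# Hypothetical location; any writable directory would do.
fixture_dir = tempfile.mkdtemp()
fixture_paths = write_fixtures(fixture_dir)
```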

I've shown my results with Python 3.2.3 and IPython 0.13.1, but the problem also occurs with Python 2.7.3.

```
vishnu@grsectoo ~/python/pandas $ ipython
Python 3.2.3 (default, Dec 17 2012, 23:03:08) 
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
In [1]: import pandas as pd
In [2]: import gzip
In [3]: import bz2
```

The uncompressed CSV file reads just fine:

```
In [4]: with open('compression.csv') as fh:                                  
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0])
   ...:     
```

Attempting to read the gzip-compressed version from a plain file handle, while passing compression='gzip', fails:

```
In [5]: with open('compression.csv.gz') as fh:
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:     
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-5-3e61cc453d63> in <module>()
      1 with open('compression.csv.gz') as fh:
----> 2     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
      3 
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    390                     buffer_lines=buffer_lines)
    391 
--> 392         return _read(filepath_or_buffer, kwds)
    393 
    394     parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    199 
    200     # Create the parser.
--> 201     parser = TextFileReader(filepath_or_buffer, **kwds)
    202 
    203     if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    499             self.options['has_index_names'] = kwds['has_index_names']
    500 
--> 501         self._make_engine(self.engine)
    502 
    503     def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    601     def _make_engine(self, engine='c'):
    602         if engine == 'c':
--> 603             self._engine = CParserWrapper(self.f, **self.options)
    604         else:
    605             if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    880         # #2442
    881         kwds['allow_leading_cols'] = self.index_col is not False
--> 882         self._reader = _parser.TextReader(src, **kwds)
    883 
    884         # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
```

However, if I pass a GzipFile handle instead, the CSV is read successfully:

```
In [6]: with gzip.GzipFile('compression.csv.gz') as fh:                                  
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:     
```
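
To illustrate the difference between the two kinds of handle, here is a small standard-library sketch. It uses an in-memory buffer as a stand-in for the file on disk, which is my assumption and not what the session above did. A plain handle yields the compressed bytes, starting with the gzip magic number, while a GzipFile wrapped around the same handle yields the CSV text. This only shows what each handle returns; it does not establish where inside the parser the failure occurs.

```python
import gzip
import io

csv_bytes = b'"Time","A","B"\n"12/24/2012 12:00:00","1.00","2.00"\n'

# Stand-in for open('compression.csv.gz', 'rb'): a raw handle over gzip data.
raw = io.BytesIO(gzip.compress(csv_bytes))

# Reading the raw handle returns compressed bytes, not CSV text.
raw_head = raw.read(2)  # the two-byte gzip magic number
raw.seek(0)

# Wrapping the same handle decompresses on the fly.
with gzip.GzipFile(fileobj=raw, mode="rb") as fh:
    first_line = fh.readline()
```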

Likewise, opening the bzip2 version with a plain file handle, while passing compression='bz2', fails:

```
In [7]: with open('compression.csv.bz2') as fh:
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:     
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-7-2736c68aa21d> in <module>()
      1 with open('compression.csv.bz2') as fh:
----> 2     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
      3 
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    390                     buffer_lines=buffer_lines)
    391 
--> 392         return _read(filepath_or_buffer, kwds)
    393 
    394     parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    199 
    200     # Create the parser.
--> 201     parser = TextFileReader(filepath_or_buffer, **kwds)
    202 
    203     if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    499             self.options['has_index_names'] = kwds['has_index_names']
    500 
--> 501         self._make_engine(self.engine)
    502 
    503     def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    601     def _make_engine(self, engine='c'):
    602         if engine == 'c':
--> 603             self._engine = CParserWrapper(self.f, **self.options)
    604         else:
    605             if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    880         # #2442
    881         kwds['allow_leading_cols'] = self.index_col is not False
--> 882         self._reader = _parser.TextReader(src, **kwds)
    883 
    884         # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
```

Passing a BZ2File handle allows the parser to read the csv:

```
In [9]: with bz2.BZ2File('compression.csv.bz2') as fh:   
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:   
```
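
Until this is fixed, one possible workaround is to wrap an already-open binary handle yourself before handing it to the parser. The sketch below is only an illustration: `wrap_handle` is a hypothetical helper of mine, not a pandas function, and it assumes Python 3.3 or later, where bz2.BZ2File accepts a file object. In-memory buffers stand in for the files on disk.

```python
import bz2
import gzip
import io


def wrap_handle(raw, compression=None):
    """Hypothetical helper: return a decompressing view of a binary handle."""
    if compression is None:
        return raw
    if compression == "gzip":
        return gzip.GzipFile(fileobj=raw, mode="rb")
    if compression == "bz2":
        # Passing a file object to BZ2File needs Python 3.3 or later.
        return bz2.BZ2File(raw, "rb")
    raise ValueError("unsupported compression: %r" % (compression,))


payload = b'"Time","A","B"\n"12/24/2012 12:00:00","1.00","2.00"\n'

# In-memory stand-ins for compression.csv.gz and compression.csv.bz2.
gz_text = wrap_handle(io.BytesIO(gzip.compress(payload)), "gzip").read()
bz_text = wrap_handle(io.BytesIO(bz2.compress(payload)), "bz2").read()
```

The wrapped objects are of the same types (GzipFile, BZ2File) as the handles that read_csv accepted in the session above.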

Version:

```
In [10]: pd.version.version
Out[10]: '0.10.0'
```

This issue also manifests in the current git master.

Thanks,
Vishnu


read_csv, compression, CParser #2593
