read_csv, compression, CParser #2593

Closed
@vishnu2kmohan

Description

The CParser doesn't appear to handle the compression flag in pandas 0.10.0.

vishnu@grsectoo ~/python/pandas $ cat compression.csv
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"
vishnu@grsectoo ~/python/pandas $ zcat compression.csv.gz 
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00",
vishnu@grsectoo ~/python/pandas $ bzcat compression.csv.bz2 
"Time","A","B"
"12/24/2012 12:00:00","1.00","2.00"

I've shown my results with Python 3.2.3 and IPython 0.13.1, but the problem also manifests in Python 2.7.3.

vishnu@grsectoo ~/python/pandas $ ipython
Python 3.2.3 (default, Dec 17 2012, 23:03:08) 
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
In [1]: import pandas as pd
In [2]: import gzip
In [3]: import bz2

The raw csv file reads just fine:

In [4]: with open('compression.csv') as fh:                                  
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0])
   ...:     

Attempting to read the compressed version while passing compression='gzip' fails:

In [5]: with open('compression.csv.gz') as fh:
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:     
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-5-3e61cc453d63> in <module>()
      1 with open('compression.csv.gz') as fh:
----> 2     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
      3 
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    390                     buffer_lines=buffer_lines)
    391 
--> 392         return _read(filepath_or_buffer, kwds)
    393 
    394     parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    199 
    200     # Create the parser.
--> 201     parser = TextFileReader(filepath_or_buffer, **kwds)
    202 
    203     if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    499             self.options['has_index_names'] = kwds['has_index_names']
    500 
--> 501         self._make_engine(self.engine)
    502 
    503     def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    601     def _make_engine(self, engine='c'):
    602         if engine == 'c':
--> 603             self._engine = CParserWrapper(self.f, **self.options)
    604         else:
    605             if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    880         # #2442
    881         kwds['allow_leading_cols'] = self.index_col is not False
--> 882         self._reader = _parser.TextReader(src, **kwds)
    883 
    884         # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

However, if I pass a GzipFile handle instead, the csv is read successfully:

In [6]: with gzip.GzipFile('compression.csv.gz') as fh:                                  
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='gzip')
   ...:     

Opening a bzip2 version while passing compression='bz2' also fails:

In [7]: with open('compression.csv.bz2') as fh:
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:     
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-7-2736c68aa21d> in <module>()
      1 with open('compression.csv.bz2') as fh:
----> 2     pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
      3 
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    390                     buffer_lines=buffer_lines)
    391 
--> 392         return _read(filepath_or_buffer, kwds)
    393 
    394     parser_f.__name__ = name
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    199 
    200     # Create the parser.
--> 201     parser = TextFileReader(filepath_or_buffer, **kwds)
    202 
    203     if nrows is not None:
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    499             self.options['has_index_names'] = kwds['has_index_names']
    500 
--> 501         self._make_engine(self.engine)
    502 
    503     def _get_options_with_defaults(self, engine):
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    601     def _make_engine(self, engine='c'):
    602         if engine == 'c':
--> 603             self._engine = CParserWrapper(self.f, **self.options)
    604         else:
    605             if engine == 'python':
/usr/lib64/python3.2/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
    880         # #2442
    881         kwds['allow_leading_cols'] = self.index_col is not False
--> 882         self._reader = _parser.TextReader(src, **kwds)
    883 
    884         # XXX
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3915)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4956)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6531)()
/usr/lib64/python3.2/site-packages/pandas/_parser.cpython-32.so in pandas._parser.raise_parser_error (pandas/src/parser.c:16903)()
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

Passing a BZ2File handle allows the parser to read the csv:

In [9]: with bz2.BZ2File('compression.csv.bz2') as fh:   
    pd.read_csv(fh, sep=',', parse_dates=[0], index_col=[0], compression='bz2')
   ...:   
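
The two working cases above (In [6] and In [9]) both hand the parser a handle that already yields decompressed bytes. A minimal sketch of that workaround follows; the helper name open_decompressed is hypothetical and not part of pandas, and it only uses the standard-library gzip and bz2 modules:

```python
import bz2
import gzip


def open_decompressed(path, compression=None):
    """Hypothetical helper (not a pandas API): return a binary file
    object whose read() yields decompressed data for the given path."""
    if compression == 'gzip':
        # GzipFile decompresses transparently on read()
        return gzip.GzipFile(path, 'rb')
    if compression == 'bz2':
        # BZ2File decompresses transparently on read()
        return bz2.BZ2File(path, 'rb')
    if compression is None:
        # Uncompressed file: a plain binary handle is enough
        return open(path, 'rb')
    raise ValueError('unsupported compression: %r' % (compression,))
```

The returned handle can be used in a with block and passed to pd.read_csv in the same way as the GzipFile and BZ2File handles shown in In [6] and In [9].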

Version:

In [10]: pd.version.version
Out[10]: '0.10.0'

This issue also manifests in the current git master.

Thanks,
Vishnu
