DOC: read_csv() ignores quotes when a regex is used in sep #11989

Closed
@mfixman

When a regular expression is passed as the sep argument of read_csv, the Python parser ignores quoting in the input file.

In the following example, read_csv parses example.csv correctly when sep is a plain comma. With the regex sep = '[,]', which should produce exactly the same result, parsing fails because the commas inside the quoted field are treated as separators.

example2.csv, which doesn't contain quotes, is parsed correctly using the same code.
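
The field count in the error is consistent with a plain regex split of each line. The sketch below is an assumption about the mechanism, not a statement about what pandas does internally: it uses only the standard library to compare re.split with the quote-aware csv.reader on the quoted line of example.csv.

```python
import csv
import re

# The quoted line (line 4) of example.csv from this report.
line = '"z,x,c,v",i,o,p'

# A plain regex split knows nothing about quotechar,
# so it also cuts at the commas inside the quotes.
regex_fields = re.split(r'[,]', line)

# csv.reader honours quotechar and keeps the quoted field intact.
csv_fields = next(csv.reader([line], delimiter=',', quotechar='"'))

print(len(regex_fields), regex_fields)
print(len(csv_fields), csv_fields)
```

The regex split yields 7 fields and csv.reader yields 4, which matches "Expected 4 fields in line 4, saw 7".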

In [1]: import pandas

In [2]: !cat example.csv
a,b,c,d
q,w,e,r
a,s,d,f
"z,x,c,v",i,o,p

In [3]: pandas.read_csv('example.csv', engine = 'python', quotechar = '"', sep = ',')
Out[3]: 
         a  b  c  d
0        q  w  e  r
1        a  s  d  f
2  z,x,c,v  i  o  p

In [4]: pandas.read_csv('example.csv', engine = 'python', quotechar = '"', sep = '[,]')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5f9ad4fdcd46> in <module>()
----> 1 pandas.read_csv('example.csv', engine = 'python', quotechar = '"', sep = '[,]')

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    489                     skip_blank_lines=skip_blank_lines)
    490 
--> 491         return _read(filepath_or_buffer, kwds)
    492 
    493     parser_f.__name__ = name

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    276         return parser
    277 
--> 278     return parser.read()
    279 
    280 _parser_defaults = {

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in read(self, nrows)
    738                 raise ValueError('skip_footer not supported for iteration')
    739 
--> 740         ret = self._engine.read(nrows)
    741 
    742         if self.options.get('as_recarray'):

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in read(self, rows)
   1593             content = content[1:]
   1594 
-> 1595         alldata = self._rows_to_cols(content)
   1596         data = self._exclude_implicit_index(alldata)
   1597 

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _rows_to_cols(self, content)
   1968             msg = ('Expected %d fields in line %d, saw %d' %
   1969                    (col_len, row_num + 1, zip_len))
-> 1970             raise ValueError(msg)
   1971 
   1972         if self.usecols:

ValueError: Expected 4 fields in line 4, saw 7

In [5]: !cat example2.csv
a,b,c,d
q,w,e,r
a,s,d,f
z,x,c,v

In [6]: pandas.read_csv('example2.csv', engine = 'python', quotechar = '"', sep = '[,]')
Out[6]: 
   a  b  c  d
0  q  w  e  r
1  a  s  d  f
2  z  x  c  v

In [7]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-74-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.1
pip: 1.5.4
setuptools: 1.1.4
Cython: None
numpy: 1.10.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.1.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
