read_csv crashes if second input line is malformed #14782

Closed
@karlacio

Problem description

Added 2020/04/21:

A similar issue to the original report, but this one segfaults (taken from the comments below):

import pandas
import io

pandas.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)
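Because a segfault takes the whole interpreter down, it can be easier to check the snippet above from a child process. The following is only a sketch of that idea: `run_isolated` and `describe` are hypothetical helper names, not part of pandas, and whether the child actually segfaults depends on the installed pandas version.

```python
import signal
import subprocess
import sys

# The reproduction from this report, kept as source text so that it runs in a
# separate interpreter and cannot crash the calling process.
REPRO = r'''
import io
import pandas

pandas.read_csv(io.StringIO("\na\nb\n"), skip_blank_lines=False, header=1)
'''


def run_isolated(code, timeout=60):
    """Run `code` in a fresh Python interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(proc):
    """Summarize how the child process ended."""
    if proc.returncode == 0:
        return "exited cleanly"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        if -proc.returncode == signal.SIGSEGV:
            return "segmentation fault (SIGSEGV)"
        return "killed by signal %d" % -proc.returncode
    # A positive code is an ordinary failure, e.g. an uncaught Python exception.
    return "exited with code %d" % proc.returncode


if __name__ == "__main__":
    print(describe(run_isolated(REPRO)))
```

A positive exit code with a traceback in `proc.stderr` would point to an ordinary exception, while a negative one would point to a crash in native code.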

===

I'm using read_csv to import files containing data in 9 columns.
Some lines carry a 'time tag' made up of two columns (the first line is always a 'time tag').
Some lines are malformed in various ways.

I use read_csv to read the data and skip bad lines with just a warning, and it usually works.

Recently I found a file that crashes my program; the crash seems to be caused by a malformed second line (one that does not have 9 fields).

The code:

import pandas as pd
print("before read_csv")
d = pd.read_csv("inputfile.txt",
                header=0,
                names=(11,22,33,44,55,66,77,88,99),
                compression='infer',
                error_bad_lines=False, warn_bad_lines=True)
print("after read_csv")
print(d)

It works on input like this:

1,2
1,2,3,4,5,6,7,8,9
yunk
1,2,3,4,5,6,7,8,9

with output:

before read_csv
after read_csv
     11   22   33   44   55   66   77   88   99
0     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  yunk  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0

But it fails on input like this:

1,2
yunk
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

with output:

before read_csv
b'Skipping line 3: expected 2 fields, saw 9\nSkipping line 4: expected 2 fields, saw 9\n'
Traceback (most recent call last):
  File ".../bug_pandas.read_scv.py", line 42, in <module>
    error_bad_lines=False, warn_bad_lines=True)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:10364)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10640)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11677)
  File "pandas/parser.pyx", line 1007, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12627)
pandas.io.common.CParserError: Too many columns specified: expected 9 and found 2

Process finished with exit code 1
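One possible way to sidestep the failure, sketched here as an assumption rather than a fix for the underlying bug, is to drop every line that does not have exactly 9 fields before the text reaches read_csv. The helper name `keep_full_rows` is hypothetical, and the sketch uses only the standard `csv` module.

```python
import csv
import io

EXPECTED_FIELDS = 9  # the report says data lines have 9 columns


def keep_full_rows(text, expected_fields=EXPECTED_FIELDS):
    """Return (kept_text, skipped), where skipped lists (row_number, field_count)."""
    kept = io.StringIO()
    writer = csv.writer(kept, lineterminator="\n")
    skipped = []
    for row_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) == expected_fields:
            writer.writerow(row)
        else:
            # Time-tag lines (2 fields) and malformed lines both end up here.
            skipped.append((row_no, len(row)))
    return kept.getvalue(), skipped


# The failing input from this report.
failing_input = "1,2\nyunk\n1,2,3,4,5,6,7,8,9\n1,2,3,4,5,6,7,8,9\n"

clean_text, skipped = keep_full_rows(failing_input)
print(skipped)
print(clean_text, end="")
```

The cleaned text could then be handed to read_csv through `io.StringIO(clean_text)` with `header=None` and the same `names`. Note that this also discards the two-column time-tag lines, so it only fits when those are not needed.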

The problem arises both on OSX and on CentOS 7.

I checked whether this is already a known bug but found nothing.
Sorry if it has already been reported.

Carlo.

Output of pd.show_versions() on OSX 10.11.6

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Output of pd.show_versions() on CentOS 7

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 29.0.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Labels: Bug; IO CSV (read_csv, to_csv); Segfault (Non-Recoverable Error)
