Description
Problem description
Added 2020/04/21:
A similar issue to the OP's, but this one segfaults (taken from a comment below):
import pandas
import io
pandas.read_csv(io.StringIO("\na\nb\n"), skip_blank_lines=False, header=1)
===
I'm using read_csv to import files containing data in 9 columns.
Some lines hold a 'time tag' made up of two columns (the first line is always a 'time tag').
Some lines are malformed in various ways.
I use read_csv to read the data and skip bad lines with just a warning, and this usually works.
Recently I found a file that crashes my program; the crash seems to be caused by a malformed second line (it does not have 9 fields).
The code:
import pandas as pd
print("before read_csv")
d = pd.read_csv("inputfile.txt",
                header=0,
                names=(11,22,33,44,55,66,77,88,99),
                compression='infer',
                error_bad_lines=False, warn_bad_lines=True)
print("after read_csv")
print(d)
This works on input like this:
1,2
1,2,3,4,5,6,7,8,9
yunk
1,2,3,4,5,6,7,8,9
with output:
before read_csv
after read_csv
     11   22   33   44   55   66   77   88   99
0     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  yunk  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
But it fails on input like this:
1,2
yunk
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
with output:
before read_csv
b'Skipping line 3: expected 2 fields, saw 9\nSkipping line 4: expected 2 fields, saw 9\n'
Traceback (most recent call last):
  File ".../bug_pandas.read_scv.py", line 42, in <module>
    error_bad_lines=False, warn_bad_lines=True)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:10364)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10640)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11677)
  File "pandas/parser.pyx", line 1007, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12627)
pandas.io.common.CParserError: Too many columns specified: expected 9 and found 2
Process finished with exit code 1
The problem arises both on OSX and on CentOS 7.
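As a possible workaround while read_csv behaves this way, the lines could be sorted by field count before any parsing happens. What follows is only a minimal sketch using the standard library csv module, run on the failing sample from above. The names classify_lines, EXPECTED_FIELDS and TIME_TAG_FIELDS are hypothetical and chosen for illustration. It does not touch or explain the parser behaviour itself.

```python
import csv
import io

# Assumptions taken from the description above: data rows have 9 fields,
# 'time tag' rows have 2 fields, anything else counts as malformed.
EXPECTED_FIELDS = 9
TIME_TAG_FIELDS = 2


def classify_lines(text_stream):
    """Split CSV rows into data rows, time tags and malformed rows.

    Malformed entries are (row_number, row) pairs, with rows counted from 1.
    """
    data, time_tags, malformed = [], [], []
    for row_number, row in enumerate(csv.reader(text_stream), start=1):
        if len(row) == EXPECTED_FIELDS:
            data.append(row)
        elif len(row) == TIME_TAG_FIELDS:
            time_tags.append(row)
        else:
            malformed.append((row_number, row))
    return data, time_tags, malformed


# The input that makes read_csv fail in this report.
failing_input = "1,2\nyunk\n1,2,3,4,5,6,7,8,9\n1,2,3,4,5,6,7,8,9\n"

data, time_tags, malformed = classify_lines(io.StringIO(failing_input))
print(len(data), "data rows;", len(time_tags), "time tags; malformed:", malformed)
```

The rows collected in data are lists of strings, so they could be handed to pd.DataFrame together with the nine column names and converted to numeric types afterwards, which avoids relying on the bad-line handling of read_csv.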
I searched for an existing report of this bug but found nothing.
Sorry if it is already known.
Carlo.
Output of pd.show_versions()
on OSX 10.11.6
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Output of pd.show_versions()
on CentOS 7
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 29.0.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None