Incorrect skipping of lines with inline comments and printing warnings #16472

Closed
@pankajp

Code Sample, a copy-pastable example if possible

# Your code here
from io import StringIO
import numpy as np
import pandas as pd

test_input = u"""\
1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line
# comment"""

df = pd.read_table(StringIO(test_input), comment='#', header=None,
                   delimiter='\\s+', skiprows=0, error_bad_lines=False)

print(df)
# Expected: only lines with <= 2 fields should appear in the df; the others should be skipped with a warning
expected = pd.DataFrame([[1, 2], [5, 2], [6, 2], [7, np.nan], [8, np.nan]],
                        index=list(range(5)), columns=[0, 1])
# DataFrame.equals treats NaNs in the same position as equal, unlike ==
assert df.equals(expected)

Problem description

Only lines with at most 2 fields should appear in the df. Every other line should be skipped, and a warning for each skipped line should be printed to stderr.

Output

Skipping line 2: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 6
Skipping line 6: expected 2 fields, saw 4

   0  1
0  1  2
1  7  8

Problems:

  • Lines that are skipped for having more fields than expected and that end with an inline comment are never reported as skipped on stderr (lines 3-9)
  • Lines that end with an inline comment preceded by a space are counted as having one more field than they actually contain, so they are incorrectly skipped or incorrectly kept (lines 3, 5, 7, 9)
  • The incorrect field accounting also joined lines 7 and 8 into a single row (7, 8), which was not expected.
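
One plausible reading of the second bullet, offered only as an illustration and not checked against the pandas tokenizer: once the comment is cut off, a line such as `5 2 # 2 fields` keeps a trailing space, and a split on the whitespace delimiter then yields an empty extra field. The sketch below uses only the standard-library `re` module; the variable names are hypothetical.

```python
import re

# Cut each line at the comment character, as comment='#' is meant to do.
with_space = "5 2 # 2 fields".split("#", 1)[0]     # leaves a trailing space
without_space = "6 2# 2 fields".split("#", 1)[0]   # no trailing space

# A regex split on the delimiter keeps an empty string after a trailing match.
naive_with_space = re.split(r"\s+", with_space)
naive_without_space = re.split(r"\s+", without_space)

# str.split() with no argument ignores leading and trailing whitespace.
robust_with_space = with_space.split()

print(naive_with_space)      # three items, the last one empty
print(naive_without_space)   # two items
print(robust_with_space)     # two items
```

If something similar happens inside the parser, it would explain why only the lines with a space before `#` are miscounted.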

Expected Output

Skipping line 2: expected 2 fields, saw 3
Skipping line 3: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 3
Skipping line 9: expected 2 fields, saw 3

   0    1
0  1  2.0
1  5  2.0
2  6  2.0
3  7  NaN
4  8  NaN
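
The expected behaviour above can be written down as a small pure-Python reference sketch. It is not pandas code. It assumes that the comment is stripped before fields are counted, and that the first data row fixes the expected number of fields. The function name `parse_with_comments` and its parameters are hypothetical.

```python
import sys


def parse_with_comments(text, comment="#", stream=sys.stderr):
    """Reference sketch (not pandas): keep rows with at most `expected` fields."""
    rows, skipped = [], []
    expected = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Drop the inline comment first, then split on runs of whitespace.
        fields = line.split(comment, 1)[0].split()
        if not fields:
            continue  # blank or comment-only line
        if expected is None:
            expected = len(fields)  # assumption: first data row sets the width
        if len(fields) > expected:
            stream.write("Skipping line %d: expected %d fields, saw %d\n"
                         % (lineno, expected, len(fields)))
            skipped.append(lineno)
            continue
        # Pad short rows with NaN, as in the expected DataFrame.
        padding = [float("nan")] * (expected - len(fields))
        rows.append([float(f) for f in fields] + padding)
    return rows, skipped


test_input = """\
1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line
# comment"""

rows, skipped = parse_with_comments(test_input)
print(skipped)
print(rows)
```

With this counting rule, lines 2, 3, 4 and 9 are reported as skipped and the remaining five rows match the expected DataFrame.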

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.14-200.fc25.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.3
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: 0.2.1
