BUG: error in read_html when parsing badly-escaped HTML from an io object #17975

Closed
@erinzm

Description

Code Sample, a copy-pastable example if possible

Create test.html, with the contents:

<!doctype html>
<html>
<body>
<table>
	<tr><td>poorly-escaped cell with an & oh noes</td></tr>
</table>
</body>
</html>

Then run:

>>> import pandas as pd
>>> pd.__version__
'0.20.3'
>>> f = open('./test.html')
>>> pd.read_html(f)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    pd.read_html(f)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 743, in _parse
    raise_with_traceback(retained)
  File "/usr/lib/python3.6/site-packages/pandas/compat/__init__.py", line 344, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <_io.TextIOWrapper name='/home/liam/test.html' mode='r' encoding='UTF-8'>

Problem description

pandas tries a series of parsers on an HTML document, returning as soon as one produces a result and moving on to the next when one raises an error. This works when a path or the entire document is passed to read_html(), but when an IO object is passed, the first parser consumes it, so each subsequent parser reads from a file whose read cursor is already at EOF. The result is an inscrutable 'No text parsed from document' error.
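
The EOF behaviour can be reproduced without pandas. This is a minimal sketch that uses an io.StringIO as a stand-in for the opened file. The variable names are illustrative only and nothing here is taken from pandas itself:

```python
import io

# Stand-in for the opened test.html; a real file object behaves the same way.
handle = io.StringIO("<table><tr><td>cell</td></tr></table>")

first = handle.read()    # the first parser consumes the whole stream
second = handle.read()   # a fallback parser reading now gets an empty string

handle.seek(0)           # rewinding makes the content available again
third = handle.read()
```

Here `second` is empty because the cursor sits at EOF after the first read, which matches the situation the fallback parser is in.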

This can easily be fixed by rewinding the file with seek(0) before continuing to the next parser (will add PR shortly).
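
A rough sketch of the rewind idea, written against the standard library only. The names parse_with_fallback, strict_parser and lenient_parser are hypothetical and do not mirror the actual pandas internals or the eventual PR. They only illustrate seeking back to the start before the next parser runs:

```python
import io


def parse_with_fallback(source, parsers):
    """Try each parser in order, rewinding seekable inputs between attempts.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    last_error = None
    for parser in parsers:
        try:
            return parser(source)
        except ValueError as err:
            last_error = err
            # Rewind so the next parser reads from the start, not from EOF.
            if hasattr(source, "seekable") and source.seekable():
                source.seek(0)
    if last_error is None:
        raise ValueError("no parsers supplied")
    raise last_error


def strict_parser(handle):
    # Toy parser: reads everything, then rejects a bare ampersand.
    text = handle.read()
    if "&" in text and "&amp;" not in text:
        raise ValueError("badly escaped ampersand")
    return text


def lenient_parser(handle):
    # Toy parser: accepts anything, but fails on empty input.
    text = handle.read()
    if not text:
        raise ValueError("No text parsed from document")
    return text


doc = io.StringIO("<td>poorly-escaped cell with an & oh noes</td>")
result = parse_with_fallback(doc, [strict_parser, lenient_parser])
```

Without the seek(0) call, lenient_parser would read an empty string and raise the same "No text parsed from document" error shown in the traceback. The seekable() check is a design choice for this sketch, so that non-seekable streams are left alone instead of raising a second error.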

Expected Output

[                                       0
0  poorly-escaped cell with an & oh noes]

Output of pd.show_versions()

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: e1dabf37645f0fcabeed1d845a0ada7b32415606
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.6-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0rc1+36.ge1dabf376.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.27.2
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Error Reporting, IO HTML
