Description
Code Sample, a copy-pastable example if possible
Create test.html with the contents:
<!doctype html>
<html>
<body>
<table>
<tr><td>poorly-escaped cell with an & oh noes</td></tr>
</table>
</body>
</html>
>>> import pandas as pd
>>> pd.__version__
'0.20.3'
>>> f = open('./test.html')
>>> pd.read_html(f)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    pd.read_html(f)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 743, in _parse
    raise_with_traceback(retained)
  File "/usr/lib/python3.6/site-packages/pandas/compat/__init__.py", line 344, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <_io.TextIOWrapper name='/home/liam/test.html' mode='r' encoding='UTF-8'>
Problem description
pandas tries a series of parsers on an HTML document in turn, returning as soon as one produces a result and moving on to the next when one raises an error. This works fine when a path or an entire document string is passed to read_html(), but when a file-like object is passed, the first parser consumes it, so every subsequent parser reads from a file whose cursor is already at EOF. The result is an inscrutable 'No text parsed from document' error.
This can easily be fixed by rewinding the file with seek(0) before continuing to the next parser (will add PR shortly).
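To make the failure mode and the proposed fix concrete, here is a minimal standalone sketch. It does not use pandas internals: strict_parser, lenient_parser, parse_with_fallback and the rewind flag are all hypothetical names invented for illustration, and the real change in pandas/io/html.py may look different.

```python
import io


def strict_parser(source):
    """Hypothetical strict parser: consumes the input, then rejects an
    ampersand followed by a space (a crude stand-in for a malformed entity)."""
    text = source.read()
    if "& " in text:
        raise ValueError("malformed entity")
    return text


def lenient_parser(source):
    """Hypothetical lenient parser: accepts any non-empty input."""
    text = source.read()
    if not text:
        raise ValueError("No text parsed from document")
    return text


def parse_with_fallback(source, parsers, rewind=True):
    """Try each parser in order and return the first successful result."""
    last_error = ValueError("no parsers given")
    for parser in parsers:
        try:
            return parser(source)
        except ValueError as err:
            last_error = err
            # The failed parser has already consumed the stream, so rewind
            # seekable inputs before handing them to the next parser.
            if rewind and hasattr(source, "seekable") and source.seekable():
                source.seek(0)
    raise last_error


doc = "<td>poorly-escaped cell with an & oh noes</td>"

# With rewinding, the lenient parser sees the full document.
result = parse_with_fallback(io.StringIO(doc), [strict_parser, lenient_parser])

# Without rewinding, the lenient parser starts at EOF and reads nothing.
try:
    parse_with_fallback(io.StringIO(doc), [strict_parser, lenient_parser], rewind=False)
    error_without_rewind = None
except ValueError as err:
    error_without_rewind = str(err)
```

With rewind=False the second parser fails with the same misleading message as in the traceback above, even though the document itself is parseable. The seekable() check is there because not every file-like object can be rewound; for those, the input would have to be read into memory once up front.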
Expected Output
[ 0
0 poorly-escaped cell with an & oh noes]
Output of pd.show_versions()
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: e1dabf37645f0fcabeed1d845a0ada7b32415606
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.6-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0rc1+36.ge1dabf376.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.27.2
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None