Clean / Consolidate pandas/tests/io/test_html.py #20293

Merged (18 commits) on Mar 14, 2018

2 changes: 1 addition & 1 deletion ci/requirements-2.7_COMPAT.pip
@@ -1,4 +1,4 @@
html5lib==1.0b2
-beautifulsoup4==4.2.0
+beautifulsoup4==4.2.1
openpyxl
argparse
2 changes: 1 addition & 1 deletion ci/requirements-optional-conda.txt
@@ -1,4 +1,4 @@
-beautifulsoup4
+beautifulsoup4>=4.2.1

Contributor:

I think you need to change this:

ci/requirements-2.7_COMPAT.pip:beautifulsoup4==4.2.0

Member Author:

Figured as much. Without knowing the full impact: is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259)? If so, do we even need this?

Contributor:

You can just change it to the new minimum version, 4.2.1 (we have one other build which is pinned as well). We always like to test one or two builds at the minimum versions.

blosc
bottleneck
fastparquet
2 changes: 1 addition & 1 deletion ci/requirements-optional-pip.txt
@@ -1,6 +1,6 @@
# This file was autogenerated by scripts/convert_deps.py
# Do not modify directly
-beautifulsoup4
+beautifulsoup4>=4.2.1
blosc
bottleneck
fastparquet
9 changes: 6 additions & 3 deletions doc/source/install.rst
@@ -266,6 +266,12 @@ Optional Dependencies
* One of the following combinations of libraries is needed to use the
  top-level :func:`~pandas.read_html` function:

+  .. versionchanged:: 0.23.0
+
+  .. note::
+
+     If using BeautifulSoup4 a minimum version of 4.2.1 is required

* `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
  okay.)
* `BeautifulSoup4`_ and `lxml`_
@@ -282,9 +288,6 @@ Optional Dependencies
* You are highly encouraged to read :ref:`HTML Table Parsing gotchas <io.html.gotchas>`.
  It explains issues surrounding the installation and
  usage of the above three libraries.
-* You may need to install an older version of `BeautifulSoup4`_:
-  Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64 and 32-bit
-  Ubuntu/Debian

.. note::

16 changes: 9 additions & 7 deletions doc/source/whatsnew/v0.23.0.txt
@@ -358,13 +358,15 @@ Dependencies have increased minimum versions
We have updated our minimum supported versions of dependencies (:issue:`15184`).
If installed, we now require:

-+-----------------+-----------------+----------+
-| Package         | Minimum Version | Required |
-+=================+=================+==========+
-| python-dateutil | 2.5.0           | X        |
-+-----------------+-----------------+----------+
-| openpyxl        | 2.4.0           |          |
-+-----------------+-----------------+----------+
++-----------------+-----------------+----------+---------------+
+| Package         | Minimum Version | Required | Issue         |
++=================+=================+==========+===============+

Contributor:

nice

+| python-dateutil | 2.5.0           | X        | :issue:`15184`|
++-----------------+-----------------+----------+---------------+
+| openpyxl        | 2.4.0           |          | :issue:`15184`|
++-----------------+-----------------+----------+---------------+
+| beautifulsoup4  | 4.2.1           |          | :issue:`20082`|
++-----------------+-----------------+----------+---------------+

.. _whatsnew_0230.api_breaking.dict_insertion_order:

4 changes: 4 additions & 0 deletions pandas/compat/__init__.py
@@ -131,6 +131,9 @@ def lmap(*args, **kwargs):
    def lfilter(*args, **kwargs):
        return list(filter(*args, **kwargs))

+    from importlib import reload
+    reload = reload

else:
    # Python 2
    import re
@@ -184,6 +187,7 @@ def get_range_parameters(data):
    lmap = builtins.map
    lfilter = builtins.filter

+    reload = builtins.reload

if PY2:
    def iteritems(obj, **kw):
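
The compat shim above has no caller visible in this view; presumably (an assumption, not confirmed by this diff) it exists so the test suite can re-import pandas.io.html after monkeypatching optional dependencies, along these lines:

```python
import pandas.io.html
from pandas import compat

# Re-execute the module so its import-time feature flags (_HAS_LXML,
# _HAS_BS4, ...) are recomputed under the patched environment
compat.reload(pandas.io.html)
```
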
61 changes: 28 additions & 33 deletions pandas/io/html.py
@@ -14,8 +14,7 @@

from pandas.core.dtypes.common import is_list_like
from pandas.errors import EmptyDataError
-from pandas.io.common import (_is_url, urlopen,
-                              parse_url, _validate_header_arg)
+from pandas.io.common import _is_url, urlopen, _validate_header_arg
from pandas.io.parsers import TextParser
from pandas.compat import (lrange, lmap, u, string_types, iteritems,
                           raise_with_traceback, binary_type)
@@ -554,8 +553,7 @@ def _parse_td(self, row):
        return row.xpath('.//td|.//th')

    def _parse_tr(self, table):
-        expr = './/tr[normalize-space()]'

Member Author:

It wasn't clear to me why normalize-space() was added here. It is inconsistent with how bs4 parses tr elements and was actually causing a failure in test_computer_sales_page.

Should the need to strip leading/trailing whitespace come back up, I think it would be better handled in the base class than implemented only here for lxml.
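
As a minimal sketch of what the predicate was doing (the fragment is illustrative):

```python
from lxml.html import fromstring

# One whitespace-only row alongside a real row
table = fromstring("<table><tr>   </tr><tr><td>x</td></tr></table>")

print(len(table.xpath(".//tr")))                     # 2: every row
print(len(table.xpath(".//tr[normalize-space()]")))  # 1: the blank row is dropped
```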

-        return table.xpath(expr)
+        return table.xpath('.//tr')

def _parse_tables(self, doc, match, kwargs):
pattern = match.pattern
@@ -606,18 +604,20 @@ def _build_doc(self):
"""
from lxml.html import parse, fromstring, HTMLParser
from lxml.etree import XMLSyntaxError

-        parser = HTMLParser(recover=False, encoding=self.encoding)
+        parser = HTMLParser(recover=True, encoding=self.encoding)

Member Author:

This is a change in behavior, giving lxml a little more power to work through malformed HTML. It may or may not be acceptable (see other comments).
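
To illustrate, a rough sketch of the flag's effect; exactly which inputs trip the strict parser depends on libxml2, so treat the sample markup as an assumption:

```python
from lxml.etree import XMLSyntaxError
from lxml.html import HTMLParser, fromstring

bad = "<table><tr><td>cell</i></table>"  # stray closing </i>

# recover=True: libxml2 patches over the error and still returns a tree
tree = fromstring(bad, parser=HTMLParser(recover=True))

# recover=False: the same input typically raises instead of parsing
try:
    fromstring(bad, parser=HTMLParser(recover=False))
except XMLSyntaxError as err:
    print("strict parser rejected it:", err)
```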


        try:
-            # try to parse the input in the simplest way
-            r = parse(self.io, parser=parser)

+            if _is_url(self.io):
+                with urlopen(self.io) as f:

Member Author:

This conditional is required for Py27 compat: in Python 3 you can simply pass the result of urlopen on self.io directly as an argument to parse (i.e., without explicitly using the context manager).
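
A sketch of the Python 3 shape being described (the URL is hypothetical); on Python 2, pandas' compat urlopen is a context-manager wrapper rather than a bare file-like response, which is presumably why the explicit with block is needed there:

```python
from urllib.request import urlopen  # Python 3 only

from lxml.html import HTMLParser, parse

url = "https://example.com/banklist.html"  # hypothetical URL

# The Py3 response object is file-like, so it can go straight into parse()
doc = parse(urlopen(url), parser=HTMLParser(recover=True)).getroot()
```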

+                    r = parse(f, parser=parser)
+            else:
+                # try to parse the input in the simplest way
+                r = parse(self.io, parser=parser)
            try:
                r = r.getroot()
            except AttributeError:
                pass
-        except (UnicodeDecodeError, IOError):
+        except (UnicodeDecodeError, IOError) as e:
            # if the input is a blob of html goop
            if not _is_url(self.io):
                r = fromstring(self.io, parser=parser)
Expand All @@ -627,17 +627,7 @@ def _build_doc(self):
                except AttributeError:
                    pass
            else:
-                # not a url
-                scheme = parse_url(self.io).scheme
-                if scheme not in _valid_schemes:
-                    # lxml can't parse it
-                    msg = (('{invalid!r} is not a valid url scheme, valid '

Member Author:

Rather than creating a custom error message here, I re-raise. This makes the behavior consistent between lxml and bs4, allowing the test_bad_url_protocol and test_invalid_url tests to pass.

-                            'schemes are {valid}')
-                           .format(invalid=scheme, valid=_valid_schemes))
-                    raise ValueError(msg)
-                else:
-                    # something else happened: maybe a faulty connection
-                    raise
+                raise e
        else:
            if not hasattr(r, 'text_content'):
                raise XMLSyntaxError("no text parsed from document", 0, 0, 0)
@@ -657,12 +647,21 @@ def _parse_raw_thead(self, table):
        thead = table.xpath(expr)
        res = []
        if thead:
-            trs = self._parse_tr(thead[0])
-            for tr in trs:
-                cols = [_remove_whitespace(x.text_content()) for x in
-                        self._parse_td(tr)]
+            # Grab any directly descending table headers first
+            ths = thead[0].xpath('./th')

Member Author:

Because _parse_td with this parser doesn't really differentiate between td and th elements, it was incorrectly parsing headers for files like spam.html, where td and th elements are intermixed in the header. Hence, to make the parsing more robust and pass the tests, I added an initial search for th elements before falling back to the existing behavior.

Even with that, I'd argue it's confusing that _parse_td is implemented to return both td and th elements; it should be refactored to delineate the two more clearly, but I am trying to minimize behavior changes in this PR.
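
A sketch of the structure the new first pass targets; the markup is a made-up reduction of spam.html, and libxml2 leaving the header cells as direct children of thead (rather than synthesizing a tr) is an assumption based on the comment above:

```python
from lxml.html import fromstring

doc = fromstring(
    "<table><thead><th>Nutrient</th><th>Value</th></thead>"
    "<tbody><tr><td>Protein</td><td>13g</td></tr></tbody></table>"
)
thead = doc.xpath(".//thead")[0]

print([th.text for th in thead.xpath("./th")])  # direct headers: the new first pass
print(thead.xpath(".//tr"))                     # empty: the old row walk finds nothing
```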

+            if ths:
+                cols = [_remove_whitespace(x.text_content()) for x in ths]
                if any(col != '' for col in cols):
                    res.append(cols)
+            else:
+                trs = self._parse_tr(thead[0])
+
+                for tr in trs:
+                    cols = [_remove_whitespace(x.text_content()) for x in
+                            self._parse_td(tr)]
+
+                    if any(col != '' for col in cols):
+                        res.append(cols)
        return res

    def _parse_raw_tfoot(self, table):
@@ -739,14 +738,10 @@ def _parser_dispatch(flavor):
            raise ImportError(
                "BeautifulSoup4 (bs4) not found, please install it")
        import bs4
-        if LooseVersion(bs4.__version__) == LooseVersion('4.2.0'):
-            raise ValueError("You're using a version"
-                             " of BeautifulSoup4 (4.2.0) that has been"
-                             " known to cause problems on certain"
-                             " operating systems such as Debian. "
-                             "Please install a version of"
-                             " BeautifulSoup4 != 4.2.0, both earlier"
-                             " and later releases will work.")
+        if LooseVersion(bs4.__version__) <= LooseVersion('4.2.0'):
+            raise ValueError("A minimum version of BeautifulSoup 4.2.1 "
+                             "is required")
+
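
One subtlety: the old message promised that releases earlier than 4.2.0 would keep working, while the new gate rejects them as well, consistent with the documented minimum of 4.2.1. A quick check of the comparison (versions illustrative):

```python
from distutils.version import LooseVersion

print(LooseVersion("4.0.2") <= LooseVersion("4.2.0"))  # True: now rejected too
print(LooseVersion("4.2.0") <= LooseVersion("4.2.0"))  # True: still rejected
print(LooseVersion("4.2.1") <= LooseVersion("4.2.0"))  # False: accepted
```
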
    else:
        if not _HAS_LXML:
            raise ImportError("lxml not found, please install it")
1 change: 1 addition & 0 deletions pandas/tests/io/data/banklist.html
@@ -340,6 +340,7 @@ <h1 class="page_title">Failed Bank List</h1>
<td class="closing">April 19, 2013</td>
<td class="updated">April 23, 2013</td>
</tr>
+<tr>

Member Author:

I'm not sure if this element was intentionally removed from the test file, but the two parsers did take different paths when parsing it: bs4 would "insert" the leading tr tag, while lxml would remove the trailing tag.

On one hand this challenges whether we really want to give lxml this much power in parsing, since most browsers (at least Safari and Chrome on my machine) matched the bs4 behavior. On the other hand, I'm not sure it's generalizable to say that the magical insertion of this or similar elements by bs4 would always be desired; perhaps it's just a risk the user accepts when parsing malformed HTML?
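
One way to see the divergence described here; the snippet is a hypothetical reduction of the fixture, and the exact repaired trees depend on parser versions:

```python
from bs4 import BeautifulSoup  # assumes html5lib is also installed
from lxml.html import HTMLParser, fromstring, tostring

# Cells followed by a closing </tr> that was never opened
bad = "<table><td>Gold Canyon Bank</td><td>AZ</td></tr></table>"

# html5lib follows the HTML5 spec and wraps the stray cells in an implied row
print(BeautifulSoup(bad, "html5lib").find("table"))

# lxml's recovering parser repairs the markup differently, dropping the
# unmatched closing tag rather than synthesizing an opener
print(tostring(fromstring(bad, parser=HTMLParser(recover=True))))
```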

<td class="institution"><a href="goldcanyon.html">Gold Canyon Bank</a></td>
<td class="city">Gold Canyon</td>
<td class="state">AZ</td>