-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Clean / Consolidate pandas/tests/io/test_html.py #20293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -554,8 +553,7 @@ def _parse_td(self, row): | |||
return row.xpath('.//td|.//th') | |||
|
|||
def _parse_tr(self, table): | |||
expr = './/tr[normalize-space()]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't clear to me why normalize-space()
was added here. It is inconsistent with how bs4 parses tr
elements and was actually causing a failure in test_computer_sales_page
.
Should the need to strip trailing / leading whitespace come back up I think it would be better done in the base class than only implementing here in lxml
r = parse(self.io, parser=parser) | ||
|
||
if _is_url(self.io): | ||
with urlopen(self.io) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditional is required for Py27 compat - in Python3 you can simply provide a call to urlopen on self.io directly as an argument to parse (i.e. without explicitly using the context manager)
scheme = parse_url(self.io).scheme | ||
if scheme not in _valid_schemes: | ||
# lxml can't parse it | ||
msg = (('{invalid!r} is not a valid url scheme, valid ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating a custom error message here I re-raised. This makes the behavior consistent between lxml and bs4, allowing the test_bad_url_protocol
and test_invalid_url
tests to pass
@@ -340,6 +340,7 @@ <h1 class="page_title">Failed Bank List</h1> | |||
<td class="closing">April 19, 2013</td> | |||
<td class="updated">April 23, 2013</td> | |||
</tr> | |||
<tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this element was intentionally removed from the test, but the two parsers did take a different path to parsing. bs4 would "insert" the leading tr
tag while lxml would remove the trailing tag.
On one hand this challenges if we really want to give lxml as much power in parsing here since most browsers (OK at least Safari and Chrome on my computer) matched the bs4 behavior, but on the other hand I'm not sure if it's generalizable to say that the magical insertion of this or like elements by bs4 would always be desired, and perhaps its just a risk that the user accepts when parsing malformed HTML?
@@ -606,18 +604,20 @@ def _build_doc(self): | |||
""" | |||
from lxml.html import parse, fromstring, HTMLParser | |||
from lxml.etree import XMLSyntaxError | |||
|
|||
parser = HTMLParser(recover=False, encoding=self.encoding) | |||
parser = HTMLParser(recover=True, encoding=self.encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior giving lxml a little more power to work through malformed HTML. May or may not be acceptable (see other comments)
cols = [_remove_whitespace(x.text_content()) for x in | ||
self._parse_td(tr)] | ||
# Grab any directly descending table headers first | ||
ths = thead[0].xpath('./th') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because _parse_td
with this parser doesn't really differentiate between td
and th
elements it was incorrectly parsing headers for things like spam.html
where td
and th
elements are intermixed in the header. Hence to make the parsing more robust and pass the tests, I added an initial search for th
elements before falling back to the existing behavior.
Even with that I'd argue it's confusing that _parse_td
is implemented to return td
and th
elements and should be refactored to more clearly delineate, but I am trying to minimize behavior change with this PR
pandas/tests/io/test_html.py
Outdated
def setup_class(cls): | ||
_skip_if_no('lxml') | ||
|
||
@pytest.mark.xfail | ||
def test_data_fail(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes I made this no longer fails but instead parses. If we are comfortable with that this should probably be removed, though I put as xfail
for visibility atm
|
||
assert self.read_html(bad) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this test to remove a seek back to the start of the file that lxml
was fine with. By definition of unseekable I'm not sure how that seek
call was allowed...
pandas/tests/io/test_html.py
Outdated
@@ -82,33 +44,56 @@ def assert_framelist_equal(list1, list2, *args, **kwargs): | |||
assert not frame_i.empty, 'frames are both empty' | |||
|
|||
|
|||
def _missing_bs4(): | |||
bs4 = td.safe_import('bs4') | |||
if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that would be ok requiring a later version of bs4 to avoid some of the older issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the min version. FWIW in ci/requirements-2.7_COMPAT.pip
the requirement is hardcoded at 4.2.0
and ci/requirements-2.7_LOCALE.pip
has 4.2.1
. I'm not sure the backstory to those but I assume that will cause conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be rooted back in #4259?
Hello @WillAyd! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 13, 2018 at 23:39 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20293 +/- ##
=========================================
Coverage ? 91.74%
=========================================
Files ? 150
Lines ? 49154
Branches ? 0
=========================================
Hits ? 45097
Misses ? 4057
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note indicating the new min for bs4, may need to update install.rst, does this have equiv tests to before (lots of remove code is good!)
pandas/tests/io/test_html.py
Outdated
_skip_if_none_of(('bs4', 'html5lib')) | ||
def _missing_bs4(): | ||
bs4 = td.safe_import('bs4') | ||
if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4.2.1
Missed your question before but the way this worked is that I combined all of the lxml and bs4 test classes into one, deleting tests where duplicated and also moving in some of the top level tests that weren't being tested for both (ex: The following were deleted:
|
+-----------------+-----------------+----------+ | ||
+-----------------+-----------------+----------+---------------+ | ||
| Package | Minimum Version | Required | Issue | | ||
+=================+=================+==========+===============+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
@@ -1,4 +1,4 @@ | |||
beautifulsoup4 | |||
beautifulsoup4>=4.2.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to change this:
ci/requirements-2.7_COMPAT.pip:beautifulsoup4==4.2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured as such. Without knowing the full impact, is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259). If so do we even need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just change it to the new minimum version 4.2.1 (we have 1 other build which is pinned as well). always like to test 1 or 2 builds for the min.
CircleCI failure was a timeout - don't believe it's related to this change |
thanks note I saw a single skip on my local
|
|
Yep there was one test that was testing the failure message when trying to read an unseekable IO object twice. lxml actually didn't have any problem reading it twice, so I placed an imperative skip for that parser in the test |
This is an extremely aggressive change to get all of the test cases unified between the LXML and BS4 parsers. On the plus side both parsers can share the same set of tests and functionality, but on the downside it gives lxml a little more power than it had previously, where it would quickly fall back to bs4 for malformed sites.
Review / criticism appreciated