Skip to content

Clean / Consolidate pandas/tests/io/test_html.py #20293

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Mar 14, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Mar 12, 2018

This is an extremely aggressive change to get all of the test cases unified between the LXML and BS4 parsers. On the plus side both parsers can share the same set of tests and functionality, but on the downside it gives lxml a little more power than it had previously, where it would quickly fall back to bs4 for malformed sites.

Review / criticism appreciated

@@ -554,8 +553,7 @@ def _parse_td(self, row):
return row.xpath('.//td|.//th')

def _parse_tr(self, table):
expr = './/tr[normalize-space()]'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't clear to me why normalize-space() was added here. It is inconsistent with how bs4 parses tr elements and was actually causing a failure in test_computer_sales_page.

Should the need to strip trailing / leading whitespace come back up I think it would be better done in the base class than only implementing here in lxml

r = parse(self.io, parser=parser)

if _is_url(self.io):
with urlopen(self.io) as f:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional is required for Py27 compat - in Python3 you can simply provide a call to urlopen on self.io directly as an argument to parse (i.e. without explicitly using the context manager)

scheme = parse_url(self.io).scheme
if scheme not in _valid_schemes:
# lxml can't parse it
msg = (('{invalid!r} is not a valid url scheme, valid '
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a custom error message here I re-raised. This makes the behavior consistent between lxml and bs4, allowing the test_bad_url_protocol and test_invalid_url tests to pass

@@ -340,6 +340,7 @@ <h1 class="page_title">Failed Bank List</h1>
<td class="closing">April 19, 2013</td>
<td class="updated">April 23, 2013</td>
</tr>
<tr>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this element was intentionally removed from the test, but the two parsers did take a different path to parsing. bs4 would "insert" the leading tr tag while lxml would remove the trailing tag.

On one hand this challenges if we really want to give lxml as much power in parsing here since most browsers (OK at least Safari and Chrome on my computer) matched the bs4 behavior, but on the other hand I'm not sure if it's generalizable to say that the magical insertion of this or like elements by bs4 would always be desired, and perhaps its just a risk that the user accepts when parsing malformed HTML?

@@ -606,18 +604,20 @@ def _build_doc(self):
"""
from lxml.html import parse, fromstring, HTMLParser
from lxml.etree import XMLSyntaxError

parser = HTMLParser(recover=False, encoding=self.encoding)
parser = HTMLParser(recover=True, encoding=self.encoding)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in behavior giving lxml a little more power to work through malformed HTML. May or may not be acceptable (see other comments)

cols = [_remove_whitespace(x.text_content()) for x in
self._parse_td(tr)]
# Grab any directly descending table headers first
ths = thead[0].xpath('./th')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because _parse_td with this parser doesn't really differentiate between td and th elements it was incorrectly parsing headers for things like spam.html where td and th elements are intermixed in the header. Hence to make the parsing more robust and pass the tests, I added an initial search for th elements before falling back to the existing behavior.

Even with that I'd argue it's confusing that _parse_td is implemented to return td and th elements and should be refactored to more clearly delineate, but I am trying to minimize behavior change with this PR

def setup_class(cls):
_skip_if_no('lxml')

@pytest.mark.xfail
def test_data_fail(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes I made this no longer fails but instead parses. If we are comfortable with that this should probably be removed, though I put as xfail for visibility atm


assert self.read_html(bad)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this test to remove a seek back to the start of the file that lxml was fine with. By definition of unseekable I'm not sure how that seek call was allowed...

@@ -82,33 +44,56 @@ def assert_framelist_equal(list1, list2, *args, **kwargs):
assert not frame_i.empty, 'frames are both empty'


def _missing_bs4():
bs4 = td.safe_import('bs4')
if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that would be ok requiring a later version of bs4 to avoid some of the older issues

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the min version. FWIW in ci/requirements-2.7_COMPAT.pip the requirement is hardcoded at 4.2.0 and ci/requirements-2.7_LOCALE.pip has 4.2.1. I'm not sure the backstory to those but I assume that will cause conflict

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be rooted back in #4259?

@jreback jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Mar 12, 2018
@pep8speaks
Copy link

pep8speaks commented Mar 13, 2018

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 13, 2018 at 23:39 Hours UTC

@codecov
Copy link

codecov bot commented Mar 13, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@3783ccc). Click here to learn what that means.
The diff coverage is 81.81%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20293   +/-   ##
=========================================
  Coverage          ?   91.74%           
=========================================
  Files             ?      150           
  Lines             ?    49154           
  Branches          ?        0           
=========================================
  Hits              ?    45097           
  Misses            ?     4057           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.13% <81.81%> (?)
#single 41.9% <13.63%> (?)
Impacted Files Coverage Δ
pandas/compat/__init__.py 57.74% <66.66%> (ø)
pandas/io/html.py 88.79% <84.21%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3783ccc...50d072d. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note indicating the new min for bs4, may need to update install.rst, does this have equiv tests to before (lots of remove code is good!)

_skip_if_none_of(('bs4', 'html5lib'))
def _missing_bs4():
bs4 = td.safe_import('bs4')
if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4.2.1

@WillAyd
Copy link
Member Author

WillAyd commented Mar 13, 2018

Missed your question before but the way this worked is that I combined all of the lxml and bs4 test classes into one, deleting tests where duplicated and also moving in some of the top level tests that weren't being tested for both (ex: test_importcheck_thread_safety)

The following were deleted:

  • test_data_fail as the changes to the lxml parser to make it more robust didn't trigger this failure
  • test_bs4_finds_tablesas it wasn't something that could be shared, and also wasn't testing the parser
  • test_lxml_finds_tables similar reason as above

+-----------------+-----------------+----------+
+-----------------+-----------------+----------+---------------+
| Package | Minimum Version | Required | Issue |
+=================+=================+==========+===============+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -1,4 +1,4 @@
beautifulsoup4
beautifulsoup4>=4.2.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to change this:

ci/requirements-2.7_COMPAT.pip:beautifulsoup4==4.2.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured as such. Without knowing the full impact, is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259). If so do we even need this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just change it to the new minimum version 4.2.1 (we have 1 other build which is pinned as well). always like to test 1 or 2 builds for the min.

@WillAyd
Copy link
Member Author

WillAyd commented Mar 14, 2018

CircleCI failure was a timeout - don't believe it's related to this change

@jreback jreback added this to the 0.23.0 milestone Mar 14, 2018
@jreback jreback added the Testing pandas testing functions or related to the test suite label Mar 14, 2018
@jreback jreback merged commit cabc05f into pandas-dev:master Mar 14, 2018
@jreback
Copy link
Contributor

jreback commented Mar 14, 2018

thanks

note I saw a single skip on my local

pandas/tests/io/test_html.py::TestReadHtml::test_parse_failure_unseekable[lxml] SKIPPED                                                                                                                                            [ 93%]

@jreback
Copy link
Contributor

jreback commented Mar 14, 2018

SKIP [1] /Users/jreback/pandas/pandas/tests/io/test_html.py:842: Not applicable for lxml

@WillAyd WillAyd deleted the cln-html-tests branch March 14, 2018 15:49
@WillAyd
Copy link
Member Author

WillAyd commented Mar 14, 2018

Yep there was one test that was testing the failure message when trying to read an unseekable IO object twice. lxml actually didn't have any problem reading it twice, so I placed an imperative skip for that parser in the test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants