-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Clean / Consolidate pandas/tests/io/test_html.py #20293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
7230b69
9cb215e
8f0ce4d
476c19a
e8f356f
478601c
d407464
2360224
3d56d8b
a93a5a3
29904d1
f488fc8
41b77e1
d44b164
8cc21b9
6cff662
e6943b1
50d072d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
html5lib==1.0b2 | ||
beautifulsoup4==4.2.0 | ||
beautifulsoup4==4.2.1 | ||
openpyxl | ||
argparse |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
beautifulsoup4 | ||
beautifulsoup4>=4.2.1 | ||
blosc | ||
bottleneck | ||
fastparquet | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -358,13 +358,15 @@ Dependencies have increased minimum versions | |
We have updated our minimum supported versions of dependencies (:issue:`15184`). | ||
If installed, we now require: | ||
|
||
+-----------------+-----------------+----------+ | ||
| Package | Minimum Version | Required | | ||
+=================+=================+==========+ | ||
| python-dateutil | 2.5.0 | X | | ||
+-----------------+-----------------+----------+ | ||
| openpyxl | 2.4.0 | | | ||
+-----------------+-----------------+----------+ | ||
+-----------------+-----------------+----------+---------------+ | ||
| Package | Minimum Version | Required | Issue | | ||
+=================+=================+==========+===============+ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nice |
||
| python-dateutil | 2.5.0 | X | :issue:`15184`| | ||
+-----------------+-----------------+----------+---------------+ | ||
| openpyxl | 2.4.0 | | :issue:`15184`| | ||
+-----------------+-----------------+----------+---------------+ | ||
| beautifulsoup4 | 4.2.1 | | :issue:`20082`| | ||
+-----------------+-----------------+----------+---------------+ | ||
|
||
.. _whatsnew_0230.api_breaking.dict_insertion_order: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,8 +14,7 @@ | |
|
||
from pandas.core.dtypes.common import is_list_like | ||
from pandas.errors import EmptyDataError | ||
from pandas.io.common import (_is_url, urlopen, | ||
parse_url, _validate_header_arg) | ||
from pandas.io.common import _is_url, urlopen, _validate_header_arg | ||
from pandas.io.parsers import TextParser | ||
from pandas.compat import (lrange, lmap, u, string_types, iteritems, | ||
raise_with_traceback, binary_type) | ||
|
@@ -554,8 +553,7 @@ def _parse_td(self, row): | |
return row.xpath('.//td|.//th') | ||
|
||
def _parse_tr(self, table): | ||
expr = './/tr[normalize-space()]' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It wasn't clear to me why Should the need to strip trailing / leading whitespace come back up I think it would be better done in the base class than only implementing here in lxml |
||
return table.xpath(expr) | ||
return table.xpath('.//tr') | ||
|
||
def _parse_tables(self, doc, match, kwargs): | ||
pattern = match.pattern | ||
|
@@ -606,18 +604,20 @@ def _build_doc(self): | |
""" | ||
from lxml.html import parse, fromstring, HTMLParser | ||
from lxml.etree import XMLSyntaxError | ||
|
||
parser = HTMLParser(recover=False, encoding=self.encoding) | ||
parser = HTMLParser(recover=True, encoding=self.encoding) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a change in behavior giving lxml a little more power to work through malformed HTML. May or may not be acceptable (see other comments) |
||
|
||
try: | ||
# try to parse the input in the simplest way | ||
r = parse(self.io, parser=parser) | ||
|
||
if _is_url(self.io): | ||
with urlopen(self.io) as f: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This conditional is required for Py27 compat - in Python3 you can simply provide a call to urlopen on self.io directly as an argument to parse (i.e. without explicitly using the context manager) |
||
r = parse(f, parser=parser) | ||
else: | ||
# try to parse the input in the simplest way | ||
r = parse(self.io, parser=parser) | ||
try: | ||
r = r.getroot() | ||
except AttributeError: | ||
pass | ||
except (UnicodeDecodeError, IOError): | ||
except (UnicodeDecodeError, IOError) as e: | ||
# if the input is a blob of html goop | ||
if not _is_url(self.io): | ||
r = fromstring(self.io, parser=parser) | ||
|
@@ -627,17 +627,7 @@ def _build_doc(self): | |
except AttributeError: | ||
pass | ||
else: | ||
# not a url | ||
scheme = parse_url(self.io).scheme | ||
if scheme not in _valid_schemes: | ||
# lxml can't parse it | ||
msg = (('{invalid!r} is not a valid url scheme, valid ' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Rather than creating a custom error message here I re-raised. This makes the behavior consistent between lxml and bs4, allowing the |
||
'schemes are {valid}') | ||
.format(invalid=scheme, valid=_valid_schemes)) | ||
raise ValueError(msg) | ||
else: | ||
# something else happened: maybe a faulty connection | ||
raise | ||
raise e | ||
else: | ||
if not hasattr(r, 'text_content'): | ||
raise XMLSyntaxError("no text parsed from document", 0, 0, 0) | ||
|
@@ -657,12 +647,21 @@ def _parse_raw_thead(self, table): | |
thead = table.xpath(expr) | ||
res = [] | ||
if thead: | ||
trs = self._parse_tr(thead[0]) | ||
for tr in trs: | ||
cols = [_remove_whitespace(x.text_content()) for x in | ||
self._parse_td(tr)] | ||
# Grab any directly descending table headers first | ||
ths = thead[0].xpath('./th') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because Even with that I'd argue it's confusing that |
||
if ths: | ||
cols = [_remove_whitespace(x.text_content()) for x in ths] | ||
if any(col != '' for col in cols): | ||
res.append(cols) | ||
else: | ||
trs = self._parse_tr(thead[0]) | ||
|
||
for tr in trs: | ||
cols = [_remove_whitespace(x.text_content()) for x in | ||
self._parse_td(tr)] | ||
|
||
if any(col != '' for col in cols): | ||
res.append(cols) | ||
return res | ||
|
||
def _parse_raw_tfoot(self, table): | ||
|
@@ -739,14 +738,10 @@ def _parser_dispatch(flavor): | |
raise ImportError( | ||
"BeautifulSoup4 (bs4) not found, please install it") | ||
import bs4 | ||
if LooseVersion(bs4.__version__) == LooseVersion('4.2.0'): | ||
raise ValueError("You're using a version" | ||
" of BeautifulSoup4 (4.2.0) that has been" | ||
" known to cause problems on certain" | ||
" operating systems such as Debian. " | ||
"Please install a version of" | ||
" BeautifulSoup4 != 4.2.0, both earlier" | ||
" and later releases will work.") | ||
if LooseVersion(bs4.__version__) <= LooseVersion('4.2.0'): | ||
raise ValueError("A minimum version of BeautifulSoup 4.2.1 " | ||
"is required") | ||
|
||
else: | ||
if not _HAS_LXML: | ||
raise ImportError("lxml not found, please install it") | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -340,6 +340,7 @@ <h1 class="page_title">Failed Bank List</h1> | |
<td class="closing">April 19, 2013</td> | ||
<td class="updated">April 23, 2013</td> | ||
</tr> | ||
<tr> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if this element was intentionally removed from the test, but the two parsers did take a different path to parsing. bs4 would "insert" the leading On one hand this challenges if we really want to give lxml as much power in parsing here since most browsers (OK at least Safari and Chrome on my computer) matched the bs4 behavior, but on the other hand I'm not sure if it's generalizable to say that the magical insertion of this or like elements by bs4 would always be desired, and perhaps its just a risk that the user accepts when parsing malformed HTML? |
||
<td class="institution"><a href="goldcanyon.html">Gold Canyon Bank</a></td> | ||
<td class="city">Gold Canyon</td> | ||
<td class="state">AZ</td> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to change this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured as such. Without knowing the full impact, is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259). If so do we even need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just change it to the new minimum version 4.2.1 (we have 1 other build which is pinned as well). always like to test 1 or 2 builds for the min.