ENH: Add support for storage_options in pd.read_html #52620
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
@mroeschke do you know who I could ping for a review?
Could you merge in main and resolve the merge conflict?
@mroeschke done
pandas/io/common.py
@@ -267,6 +267,16 @@ def urlopen(*args, **kwargs):
    """
    import urllib.request

    if "storage_options" in kwargs and kwargs["storage_options"] is not None:
This shouldn't be needed. This is already handled in _get_filepath_or_buffer
Right, I missed that. I've reverted the implementation of urlopen() to its original form, and I've used common.get_handle() in html.py instead.
pandas/tests/io/test_user_agent.py
@@ -241,6 +255,7 @@ def responder(request):
    [
        (CSVUserAgentResponder, pd.read_csv, None),
        (JSONUserAgentResponder, pd.read_json, None),
        (HTMLUserAgentResponder, HTMLUserAgentResponder.read_html_first_df, None),
Instead of HTMLUserAgentResponder.read_html_first_df, could you specify a lambda function of read_html?
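A minimal sketch of what such a lambda could look like. Because read_html returns a list of tables while read_csv and read_json return a single frame, the lambda unwraps the first element. Here fake_read_html is a hypothetical stand-in for pd.read_html so the snippet runs without pandas; the URL and header values are also assumptions.

```python
# Hypothetical stand-in for pd.read_html: like the real function,
# it returns a list (one entry per table) rather than a single object.
def fake_read_html(url, storage_options=None):
    return [{"url": url, "headers": storage_options}]


# Forward every argument unchanged and unwrap the first table, so the
# callable has the same shape as pd.read_csv or pd.read_json in the test.
read_first = lambda *args, **kwargs: fake_read_html(*args, **kwargs)[0]

result = read_first(
    "http://localhost:8080", storage_options={"User-Agent": "test"}
)
```

In the parametrized test, the same lambda would presumably wrap pd.read_html in place of the stand-in.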
Looks like some of the CI checks are still failing
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
Signed-off-by: serhatgktp <efkan@ibm.com>
Change expected type for 'obj' param in _read()
Hi @mroeschke - I've attempted to address the failing tests and I've merged into main. Could you reopen this PR so that I can see the CI output? Thanks!
Signed-off-by: serhatgktp <efkan@ibm.com>
@mroeschke All should be good now. Please let me know if anything seems off, thanks!
Thanks for sticking with this @serhatgktp. Nice work |
The argument has been added to pandas 2.1.0 in pandas-dev/pandas#52620. https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.read_html.html
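For readers landing here, a hedged usage sketch of the new keyword. It assumes pandas 2.1.0 or later plus an HTML parser backend (lxml, or bs4 with html5lib); make_storage_options and read_tables are illustrative helper names, and the user-agent string is an arbitrary example.

```python
def make_storage_options(user_agent):
    """Build the headers dict that read_html sends with the HTTP request."""
    return {"User-Agent": user_agent}


def read_tables(url, user_agent="my-agent/1.0"):
    # pandas is imported here so the helper above can be used without it.
    import pandas as pd  # assumes pandas >= 2.1.0

    # read_html returns a list of DataFrames, one per table it finds.
    return pd.read_html(url, storage_options=make_storage_options(user_agent))


options = make_storage_options("my-agent/1.0")
```

Calling read_tables with a URL that rejects the default Python user agent is the typical case where the extra header matters.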
storage_options in pandas.read_html #49944
doc/source/whatsnew/v2.1.0.rst file if fixing a bug or adding a new feature.
Added support for the storage_options keyword for pandas.read_html, so that users may pass headers alongside the HTTP request when using pandas.read_html to read from a URL.
The solution involves two parts:
- Adding storage_options as an optional parameter to pandas.read_html and propagating it down to the method(s) where an HTTP request is made (get_handle and urlopen in pandas.io.common)
- Adding support for storage_options in the function that makes an HTTP request and returns the response
1) Propagating storage_options from read_html down to the HTTP request
The sequence of functions leading to the HTTP request is as follows:
storage_options must be added as an optional argument to pandas.read_html, and should be passed down to each function (or object instance) until urlopen or get_handle is reached.
2) Adding support for storage_options in the HTTP request functions
There exist two types of parsers that are used by pandas.read_html: _BeautifulSoupHtml5LibFrameParser and _LxmlFrameParser. Both of these classes implement the abstract methods of their parent class _HtmlFrameParser.
The two parsers implement _build_doc differently such that _BeautifulSoupHtml5LibFrameParser uses get_handle to make the GET request whereas _LxmlFrameParser uses urlopen.
get_handle, as it is defined in pandas.io.common, already supports the storage_options keyword, and thus doesn't require any modifying. urlopen, however, does not expect or handle the storage_options keyword. Therefore, we must add logic such that if urlopen is called with storage_options, then the contents of storage_options should be passed as headers in the outgoing HTTP request. Otherwise, urlopen should follow the exact same logic it normally would.
For a more detailed problem analysis and solution description, please see this notion document.
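The header-passing idea in part 2 can be sketched with the standard library alone. This is not the pandas implementation (the review thread notes the urlopen change was reverted in favour of get_handle); build_request and urlopen_with_headers are hypothetical names, and the URL is a placeholder.

```python
import urllib.request


def build_request(url, storage_options=None):
    """Return a Request whose headers come from storage_options, if given."""
    headers = dict(storage_options) if storage_options else {}
    return urllib.request.Request(url, headers=headers)


def urlopen_with_headers(url, storage_options=None, **kwargs):
    """Open url, sending storage_options as HTTP headers when provided."""
    if storage_options is None:
        # No headers requested: behave exactly like plain urlopen.
        return urllib.request.urlopen(url, **kwargs)
    return urllib.request.urlopen(build_request(url, storage_options), **kwargs)


# Building the request performs no network access, so it is safe to inspect.
req = build_request(
    "http://example.com/table.html", {"User-Agent": "my-agent/1.0"}
)
```

Note that urllib.request.Request normalizes header names with str.capitalize, so the header above is stored under the key "User-agent".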