Skip to content

ENH: Add support for storage_options in pd.read_html #52620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 18, 2023
Merged

ENH: Add support for storage_options in pd.read_html #52620

merged 15 commits into from
Jul 18, 2023

Conversation

serhatgktp
Copy link
Contributor

@serhatgktp serhatgktp commented Apr 12, 2023

Added support for the storage_options keyword for pandas.read_html , so that users may pass headers alongside the HTTP request when using pandas.read_html to read from a URL.

The solution involves two parts:

  1. Adding storage_options as an optional parameter to pandas.read_html and propagating it down to the method(s) where an HTTP request is made (get_handle and urlopen in pandas.io.common)
  2. If necessary, add support for storage_options in the function that makes an HTTP request and returns the response

1) Propagating storage_options from read_html down to the HTTP request

The sequence of functions leading to the HTTP request are as follows:

1. read_html                                    (pandas.io.html)
2. _parse                                       (pandas.io.html)
3. parser(args)         # initialize parser     (pandas.io.html)
4. parse_tables                                 (pandas.io.html)
5. _parse_tables                                (pandas.io.html)
6. _build_doc                                   (pandas.io.html)
  i)  urlopen                                   (pandas.io.common)
  ii) get_handle                                (pandas.io.common)

storage_options must be added as an optional argument to pandas.read_html, and should be passed down to each function (or object instance) until urlopen or get_handle is reached.

2) Adding support for storage_options in the HTTP request functions

There exist two types of parsers that are used by pandas.read_html: _BeautifulSoupHtml5LibFrameParser and _LxmlFrameParser. Both of these classes implement the abstract methods of their parent class _HtmlFrameParser.

The two parsers implement _build_doc differently such that _BeautifulSoupHtml5LibFrameParser uses get_handle to make the GET request whereas _LxmlFrameParser uses urlopen.

get_handle, as it is defined in pandas.io.common, already supports the storage_options keyword, and thus doesn't require any modifying.

urlopen, however, does not expect or handle the storage_options keyword. Therefore, we must add logic such that if urlopen is called with storage_options, then the contents of storage_options should be passed as headers in the outgoing HTTP request. Otherwise, urlopen should follow the exact same logic it normally would.

For a more detailed problem analysis and solution description, please see this notion document.

@mroeschke mroeschke added IO Network Local or Cloud (AWS, GCS, etc.) IO Issues IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Apr 12, 2023
@serhatgktp serhatgktp changed the title Add support for storage_options in pd.read_html ENH: Add support for storage_options in pd.read_html Apr 14, 2023
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 15, 2023
@serhatgktp
Copy link
Contributor Author

@mroeschke do you know who I could ping for a review

@mroeschke
Copy link
Member

Could you merge in main and resolve the merge conflict?

@serhatgktp
Copy link
Contributor Author

@mroeschke done

@@ -267,6 +267,16 @@ def urlopen(*args, **kwargs):
"""
import urllib.request

if "storage_options" in kwargs and kwargs["storage_options"] is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed. This is already handled in _get_filepath_or_buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I missed that. I've reverted the implementation of urlopen() to its original form, and I've used common.get_handle() in html.py instead.

@@ -241,6 +255,7 @@ def responder(request):
[
(CSVUserAgentResponder, pd.read_csv, None),
(JSONUserAgentResponder, pd.read_json, None),
(HTMLUserAgentResponder, HTMLUserAgentResponder.read_html_first_df, None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of HTMLUserAgentResponder.read_html_first_df, could you specify a lambda function of read_html?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

Tests are passing upon running pytest pandas\tests\io\test_user_agent.py from the root directory.

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some of the CI checks are still failing

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jul 11, 2023
@serhatgktp
Copy link
Contributor Author

serhatgktp commented Jul 18, 2023

Hi @mroeschke - I've attempted to address the failing tests and I've merged into main. Could you reopen this PR so that I can see the CI output? Thanks!

@mroeschke mroeschke reopened this Jul 18, 2023
serhatgktp added 2 commits July 18, 2023 15:51
@serhatgktp
Copy link
Contributor Author

@mroeschke All should be good now. Please let me know if anything seems off, thanks!

@mroeschke mroeschke removed the Stale label Jul 18, 2023
@mroeschke mroeschke added this to the 2.1 milestone Jul 18, 2023
@mroeschke mroeschke merged commit 44c4168 into pandas-dev:main Jul 18, 2023
@mroeschke
Copy link
Member

Thanks for sticking with this @serhatgktp. Nice work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap IO Network Local or Cloud (AWS, GCS, etc.) IO Issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: support storage_options in pandas.read_html
2 participants