Description
Currently Pandas makes HTTP requests using "Python-urllib/3.8" as the User-Agent.
This prevents downloading some resources and static files from various places.
What if Pandas made requests with a "Pandas/1.1.0" User-Agent instead?
There should also be a possibility to add custom headers (auth, CSRF tokens, API versions, and so on).
Use Case:
I am writing a book on Pandas:
I published data in CSV and JSON to use in code listings:
You can access those resources via browser, curl, or even requests, but not using Pandas.
The only change you'd need to make is to set the User-Agent.
This is due to readthedocs.io blocking the "Python-urllib/3.8" User Agent for whatever reason.
The same problem affects many other places where you can get data (not only readthedocs.io).
Currently I get those resources with requests and then pass response.text to one of:
pd.read_csv
pd.read_json
pd.read_html
Unfortunately this makes even the simplest code listings... quite complex (due to having to explain the requests library and why I'm doing it this way).
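For reference, the current workaround looks roughly like this (the URL below is a placeholder, not one of the actual book resources; only the requests + StringIO pattern matters):

```python
import io

import pandas as pd
import requests

# Placeholder URL -- substitute the actual resource.
url = "https://example.readthedocs.io/data.csv"

# Fetch with a custom User-Agent so the server does not block the request,
# then hand the body to pandas through a file-like object.
response = requests.get(url, headers={"User-Agent": "Pandas/1.1.0"})
response.raise_for_status()
df = pd.read_csv(io.StringIO(response.text))
```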
Pandas uses urllib.request.urlopen, which does not allow setting http_headers:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L146
However, urllib.request.urlopen can take a urllib.request.Request object as an argument, and urllib.request.Request allows setting custom http_headers:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
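So the building blocks are already in the standard library; a minimal sketch with a placeholder URL:

```python
import urllib.request

# Placeholder URL -- the point is only the Request object with headers.
url = "https://example.readthedocs.io/data.csv"

# Building a Request explicitly lets us override the default
# "Python-urllib/3.x" User-Agent before calling urlopen.
request = urllib.request.Request(url, headers={"User-Agent": "Pandas/1.1.0"})
with urllib.request.urlopen(request) as response:
    body = response.read()
```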
The possibility to add custom http_headers should be available in the pd.read_csv, pd.read_json, and pd.read_html functions.
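Usage could then look something like this (a sketch only: http_headers is the proposed parameter name, not an existing pandas option, and a list of (header, value) pairs is just one possible shape for it):

```python
import pandas as pd

# Hypothetical usage of the proposed parameter -- http_headers does not
# exist in pandas today, and the URL/token values are placeholders.
df = pd.read_csv(
    "https://example.readthedocs.io/data.csv",
    http_headers=[
        ("User-Agent", "Pandas/1.1.0"),
        ("Authorization", "Bearer <token>"),
    ],
)
```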
From what I see, the read_* call stack is three to four functions deep.
There are only 6 references in 4 files to the urlopen(*args, **kwargs) function, so the change shouldn't be too hard to implement.
The http_headers parameter can be Optional[List], which would be fully backward compatible and would not require any changes to other code.
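A rough sketch of what such a wrapper could look like (urlopen_with_headers is a hypothetical name, not the actual pandas code; the real wrapper in pandas/io/common.py may differ):

```python
from typing import List, Optional, Tuple
import urllib.request

# Hypothetical sketch only -- not the actual pandas implementation.
def urlopen_with_headers(
    url: str,
    http_headers: Optional[List[Tuple[str, str]]] = None,
    **kwargs,
):
    """Open *url*, optionally sending custom HTTP headers."""
    if http_headers:
        request = urllib.request.Request(url, headers=dict(http_headers))
    else:
        request = urllib.request.Request(url)
    return urllib.request.urlopen(request, **kwargs)
```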