Description
xref #16716
Code Sample, a copy-pastable example if possible
# some shared hosting web server named 'apex'
uapex_no_port = 'http://handsome-equator.000webhostapp.com/no_auth/aaa.csv'
uapex_with_port = 'http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv'
# nginx web server
u_nginx_no_port = 'http://pandastest.mooo.com/aaa.csv'
u_nginx_with_port = 'http://pandastest.mooo.com:80/aaa.csv'
import pandas as pd
from urllib.request import urlopen, Request
from urllib.parse import urlparse
import requests
for url in [
    uapex_no_port,      # always succeeds
    u_nginx_no_port,    # always succeeds
    u_nginx_with_port,  # always succeeds
    uapex_with_port,    # succeeds with requests; fails with urlopen/pandas
]:
    print(url)
    txt = requests.get(url)  # always succeeds
    try:
        req1 = Request(url)
        txt = urlopen(req1).read()  # fails on apex with explicit port (uapex_with_port)
        df = pd.read_csv(url)  # fails on apex with explicit port (uapex_with_port)
    except Exception as ex:
        print('FAIL {} -- {}'.format(url, str(ex)))  # reports an HTTP 404 error

    # urlopen fails on apex but requests succeeds because nginx can handle,
    # but apex cannot handle, the HTTP GET header 'Host': '<fqdn>:<port>'
    # as set by urlopen.
    # Requests does not put the port number in the Host header: 'Host': '<fqdn>',
    # so Requests works with all URLs.
    # Specifying the port number in the Host header is allowed by the HTTP RFC,
    # but I don't know how prevalent this issue is beyond apex.
    p = urlparse(url)
    req2 = Request(url)
    req2.add_header('Host', p.hostname)
    txt = urlopen(req2).read()  # always succeeds
Problem description
The problem is that with at least one web server, urlopen, and therefore pandas.read_csv, fails when a URL of the form http://<fqdn>:<port> is specified, even if the port is the default port 80. However, if the python-requests library is used instead of urlopen, the same URL works. The difference is that requests sets the header Host: <fqdn>, whereas urlopen sets it to Host: <fqdn>:<port>. While urlopen is still adhering to the HTTP RFC, requests, Firefox, Chrome, IE, and curl all work with all of these URLs, so a pandas user might well wonder why pandas returns a 404 error.
How big of an issue is this? I don't know, so I cannot immediately recommend that this be fixed. But we should watch out for similar issues in the future and then consider either modifying the Host header or using the requests library.
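For illustration, the "modify the Host header" option could look roughly like the sketch below. This is not existing pandas code: the helper names host_header and build_request and the DEFAULT_PORTS table are hypothetical, and the sketch assumes that dropping the port only when it equals the scheme's default is the desired behaviour.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Assumption: only these two schemes matter for this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header(url):
    """Return a Host header value that omits the port when it is the default.

    Hypothetical helper; IPv6 literals are not handled here.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return "{}:{}".format(host, port)


def build_request(url):
    """Build a urllib Request whose Host header is set explicitly."""
    req = Request(url)
    # urlopen only fills in its own Host header when none is already set,
    # so this value is the one that gets sent.
    req.add_header("Host", host_header(url))
    return req
```

The resulting request can be passed to urlopen in place of the bare URL, and the bytes it returns can be wrapped in io.BytesIO before being handed to pd.read_csv. A non-default port such as 8080 is kept in the header, since the server needs it there.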
Expected Output
A dataframe should be read.