BUG: read_csv fails some http servers if port number is specified #17019

Closed
@skynss

Description

xref #16716

Code Sample, a copy-pastable example if possible

# some shared hosting web server named 'apex'
uapex_no_port = 'http://handsome-equator.000webhostapp.com/no_auth/aaa.csv'
uapex_with_port = 'http://handsome-equator.000webhostapp.com:80/no_auth/aaa.csv'

# nginx web server
u_nginx_no_port = 'http://pandastest.mooo.com/aaa.csv'
u_nginx_with_port = 'http://pandastest.mooo.com:80/aaa.csv'

import pandas as pd
from urllib.request import urlopen, Request
from urllib.parse import urlparse
import requests

for url in [
    uapex_no_port,      # always succeeds
    u_nginx_no_port,    # always succeeds
    u_nginx_with_port,  # always succeeds
    uapex_with_port,    # succeeds with requests, fails with urlopen/pandas
]:
    print(url)
    txt = requests.get(url)  # always succeeds
    try:
        req1 = Request(url)
        txt = urlopen(req1).read()  # fails on apex with explicit port (uapex_with_port)
        df = pd.read_csv(url)  # fails on apex with explicit port (uapex_with_port)
    except Exception as ex:
        print('FAIL {} -- {}'.format(url, str(ex)))  # reports an HTTP 404 error
    # urlopen fails on apex while requests succeeds because urlopen sends the
    # request header 'Host': '<fqdn>:<port>'. nginx can handle that; apex cannot.
    # requests does not put the port number in the Host header ('Host': '<fqdn>'),
    # so requests works with all of the URLs.
    # Specifying the port number in the Host header is allowed by the HTTP RFC,
    # but I don't know how prevalent this issue is beyond apex.
    p = urlparse(url)
    req2 = Request(url)
    req2.add_header('Host', p.hostname)  # override Host with the bare hostname
    txt = urlopen(req2).read()  # always succeeds

Problem description

The problem is that, with at least one web server, urlopen (and therefore pandas.read_csv) fails when the URL has the form http://<fqdn>:<port>, even if the port is the default port 80. If the python-requests library is used instead of urlopen, the same URL works. The difference is that requests sets the header Host: <fqdn>, whereas urlopen sets it to Host: <fqdn>:<port>. urlopen is still adhering to the HTTP RFC, but requests, Firefox, Chrome, IE and curl all work with all of the URLs, so a pandas user may well wonder why pandas returns a 404. How big an issue is this? I don't know, so I cannot immediately recommend that it be fixed. But we should watch out for similar issues in the future and then consider either modifying the Host header or using the requests library.
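
The "modify the Host header" option could be sketched roughly as below. This is only an illustration, not anything pandas does today: `host_header_for` and `build_request` are hypothetical helper names, and the default-port table is an assumption that covers only http and https. Unlike the sample above, it keeps a non-default port in the header.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Assumed defaults; only these two schemes are considered in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_for(url):
    """Return a Host header value that omits the port when it is the scheme default."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = "[" + host + "]"
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return "{}:{}".format(host, port)


def build_request(url):
    """Build a urllib Request whose Host header drops a redundant default port."""
    req = Request(url)
    req.add_header("Host", host_header_for(url))
    return req
```

Since read_csv accepts file-like objects, something like `pd.read_csv(urlopen(build_request(url)))` might then serve as a user-side workaround. I have only reasoned about this against the apex case described above, not tested it against other servers.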

Expected Output

A dataframe should be read.

Output of pd.show_versions()

pandas: 0.20.2

Labels: Enhancement, IO CSV, IO Network
