ENH read_excel error when accessing AWS S3 URL #11447

Closed
@kingishb

Summary: read_excel cannot read a file from S3 using the s3:// URL syntax that read_csv already accepts. read_excel should support S3 URLs in the same manner as read_csv.

read_excel fails with the following error:

>>> import pandas as pd
>>> df = pd.read_excel("s3://my-bucket/my_file.xlsx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 163, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 206, in __init__
    self.book = xlrd.open_workbook(io)
  File "/usr/local/lib/python2.6/site-packages/xlrd/__init__.py", line 394, in open_workbook
    f = open(filename, "rb")
IOError: [Errno 2] No such file or directory: 's3://my-bucket/my_file.xlsx'
>>> 
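
The last frame of the traceback suggests where this goes wrong: the URL string appears to be handed straight to the built-in open(), which only understands local paths. Here is a minimal sketch that reproduces just that step, without pandas or xlrd. The helper name try_open_as_local_path is made up for illustration, and on Python 3 the error shows up as FileNotFoundError (a subclass of OSError, of which IOError is an alias) rather than the Python 2 IOError shown above.

```python
def try_open_as_local_path(path):
    """Mimic the open(filename, "rb") call from the traceback.

    Returns the exception that was raised, or None if the open succeeded.
    """
    try:
        with open(path, "rb"):
            return None
    except (IOError, OSError) as exc:
        return exc


# The S3 URL is treated as a relative local path, so no such file is found.
err = try_open_as_local_path("s3://my-bucket/my_file.xlsx")
print(type(err).__name__, err)
```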

read_csv, on the other hand, successfully reads a CSV file from the same S3 bucket using the same URL syntax:

>>> import pandas as pd
>>> df = pd.read_csv("s3://my-bucket/my_file.csv")
>>> len(df.index)
1187
>>>

For the record, read_csv can also fetch the xlsx file from S3, but it fails with a parse error when it tries to tokenize the data:

>>> import pandas as pd
>>> df = pd.read_csv("s3://my-bucket/my_file.xlsx")
Exception pandas.parser.CParserError: CParserError('Error tokenizing data. C error: Expected 9 fields in line 210, saw 10\n',) in 'pandas.parser.TextReader._tokenize_rows' ignored
>>> 

read_excel successfully reads and parses a local copy of the xlsx file:

>>> import pandas as pd
>>> df = pd.read_excel("my_file.xlsx")
>>> len(df.index)
221
>>> 
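
Since the local read works, one possible workaround until read_excel understands s3:// URLs is to download the object first and pass its bytes to read_excel as a file-like object. The sketch below is not part of pandas: split_s3_url and read_excel_from_s3 are hypothetical helper names, and it assumes Python 3, an installed boto3 package, and AWS credentials that are already configured.

```python
import io
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key).

    Note: keys containing '?' or '#' would need extra handling, because
    urlparse treats those characters as query/fragment separators.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: %r" % (url,))
    return parsed.netloc, parsed.path.lstrip("/")


def read_excel_from_s3(url, **kwargs):
    """Hypothetical helper: fetch the object, then parse it from memory."""
    # Imported lazily so split_s3_url works without these packages.
    import boto3  # assumption: installed, with AWS credentials configured
    import pandas as pd

    bucket, key = split_s3_url(url)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    # read_excel accepts a file-like object, so wrap the bytes in BytesIO.
    return pd.read_excel(io.BytesIO(body), **kwargs)
```

With network access and credentials in place, a call such as read_excel_from_s3("s3://my-bucket/my_file.xlsx") would be expected to return the same DataFrame as the local read above.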

Pandas version string and dependencies:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.14.48-33.39.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.4
pip: 6.1.1
setuptools: 12.2
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
>>> 

Labels: IO Data, IO Excel
