Skip to content

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.18.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -169,3 +169,4 @@ Bug Fixes
- Bug in ``pivot_table`` when ``margins=True`` and ``dropna=True`` where nulls still contributed to margin count (:issue:`12577`)
- Bug in ``Series.name`` when ``name`` attribute can be a hashable type (:issue:`12610`)
- Bug in ``.describe()`` resets categorical columns information (:issue:`11558`)
- ``read_excel`` now accepts path objects (e.g. ``pathlib.Path``, ``py.path.local``) for the file path, in line with other ``read_*`` functions (:issue:`12655`)
31 changes: 17 additions & 14 deletions pandas/io/excel.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
from pandas.core.frame import DataFrame
from pandas.io.parsers import TextParser
from pandas.io.common import (_is_url, _urlopen, _validate_header_arg,
get_filepath_or_buffer, _is_s3_url)
get_filepath_or_buffer)
from pandas.tseries.period import Period
from pandas import json
from pandas.compat import (map, zip, reduce, range, lrange, u, add_metaclass,
Expand Down Expand Up @@ -82,7 +82,8 @@ def read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0,

Parameters
----------
io : string, file-like object, pandas ExcelFile, or xlrd workbook.
io : string, path object (pathlib.Path or py._path.local.LocalPath),
file-like object, pandas ExcelFile, or xlrd workbook.
The string could be a URL. Valid URL schemes include http, ftp, s3,
and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/workbook.xlsx
Expand Down Expand Up @@ -184,8 +185,9 @@ class ExcelFile(object):

Parameters
----------
io : string, file-like object or xlrd workbook
If a string, expected to be a path to xls or xlsx file
io : string, path object (pathlib.Path or py._path.local.LocalPath),
file-like object or xlrd workbook
If a string or path object, expected to be a path to xls or xlsx file
engine: string, default None
If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd
Expand All @@ -207,21 +209,22 @@ def __init__(self, io, **kwds):
if engine is not None and engine != 'xlrd':
raise ValueError("Unknown engine: %s" % engine)

if isinstance(io, compat.string_types):
if _is_s3_url(io):
buffer, _, _ = get_filepath_or_buffer(io)
self.book = xlrd.open_workbook(file_contents=buffer.read())
elif _is_url(io):
data = _urlopen(io).read()
self.book = xlrd.open_workbook(file_contents=data)
else:
self.book = xlrd.open_workbook(io)
elif engine == 'xlrd' and isinstance(io, xlrd.Book):
# If io is a url, want to keep the data as bytes so can't pass
# to get_filepath_or_buffer()
if _is_url(io):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. can we do this internally in get_filepath_or_buffer? (looks like it should just work)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like other cases where get_filepath_or_buffer() is used on URLs, getting the data back as StringIO is OK, so at the moment it grabs the data and converts to unicode automatically. We don't want that here, so I guess we would need a flag to tell it not to convert or something?

Couldn't figure out how to do it myself, so left it as a special case in ExcelFile.__init__.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should return an encoded stream as BytesIO if needed (or StringIO) otherwise. These are hard cases and I believe they all work. If you cann't find a place where you have a failure don't make this a special case (as it would then need special tests and so on).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't special case URLs here, I get failures in URL related tests, get_filepath_or_buffer() returns the data as StringIO and reading from urls fails with errors like:

Traceback (most recent call last):
  File "/home/me/src/pandas/reading_tests.py", line 8, in <module>
    url_table = pd.read_excel(url)
  File "/home/me/src/pandas/pandas/io/excel.py", line 170, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/me/src/pandas/pandas/io/excel.py", line 225, in __init__
    self.book = xlrd.open_workbook(file_contents=data)
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 1226, in getbof
    opcode = self.get2bytes()
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 631, in get2bytes
    return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked closely at the changes in this PR (will followup later), but I'm partway through refactoring our S3 handling to use s3fs, so I might be changing this soon anyway. My branch is here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@onesandzeroes ok. well you did clean things up a bit. pls rebase / squash. ping when green. @TomAugspurger can address if necessary later.

io = _urlopen(io)
# Deal with S3 urls, path objects, etc. Will convert them to
# buffer or path string
io, _, _ = get_filepath_or_buffer(io)

if engine == 'xlrd' and isinstance(io, xlrd.Book):
self.book = io
elif not isinstance(io, xlrd.Book) and hasattr(io, "read"):
# N.B. xlrd.Book has a read attribute too
data = io.read()
self.book = xlrd.open_workbook(file_contents=data)
elif isinstance(io, compat.string_types):
self.book = xlrd.open_workbook(io)
else:
raise ValueError('Must explicitly set engine if not passing in'
' buffer or path for io.')
Expand Down
27 changes: 27 additions & 0 deletions pandas/io/tests/test_excel.py
Original file line number Diff line number Diff line change
Expand Up @@ -528,6 +528,33 @@ def test_read_from_file_url(self):

tm.assert_frame_equal(url_table, local_table)

def test_read_from_pathlib_path(self):
tm._skip_if_no_pathlib()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these all look good

from pathlib import Path

str_path = os.path.join(self.dirpath, 'test1' + self.ext)
expected = read_excel(str_path, 'Sheet1', index_col=0)

path_obj = Path(self.dirpath, 'test1' + self.ext)
actual = read_excel(path_obj, 'Sheet1', index_col=0)

tm.assert_frame_equal(expected, actual)

def test_read_from_py_localpath(self):
tm._skip_if_no_localpath()

from py.path import local as LocalPath

str_path = os.path.join(self.dirpath, 'test1' + self.ext)
expected = read_excel(str_path, 'Sheet1', index_col=0)

abs_dir = os.path.abspath(self.dirpath)
path_obj = LocalPath(abs_dir).join('test1' + self.ext)
actual = read_excel(path_obj, 'Sheet1', index_col=0)

tm.assert_frame_equal(expected, actual)

def test_reader_closes_file(self):

pth = os.path.join(self.dirpath, 'test1' + self.ext)
Expand Down