Class to read OpenDocument Tables #25427

Merged
merged 63 commits into from
Jul 3, 2019
Changes from 49 commits
479e639
Class to read OpenDocument Tables
detrout Feb 27, 2019
8be4b67
Remove unneeded assignments
detrout Feb 28, 2019
77d9033
Rename filepath_or_stream to filepath_or_buffer
detrout Feb 28, 2019
47b2ffb
Use compat.string_types instead of str
detrout Feb 28, 2019
0fa2ac9
Use pd as name as pandas
detrout Feb 28, 2019
e6e2365
Use single underscore for private functions
detrout Feb 28, 2019
1bbf284
Return an unparsed sheet.
detrout Feb 28, 2019
d5c7ec0
Move ODFReader get_sheet exception testing code to its own function
detrout Feb 28, 2019
691f1e9
Append _raises to end of function name that tests exceptions
detrout Feb 28, 2019
93c2b66
Remove test docstrings that include no useful information
detrout Feb 28, 2019
394c4bd
Indicate likely minimum version.
detrout Feb 28, 2019
b149d84
Convert notes about some OpenDocument tests to comments
detrout Apr 5, 2019
19587b3
Add note about new OpenDocument functionality to whatsnew
detrout Apr 5, 2019
60a5bc1
Sort imports correctly
detrout Apr 7, 2019
1fef008
Use str instead of compat.string_types
detrout Apr 8, 2019
7148995
Remove leading underscore from ODFParser
detrout May 14, 2019
5db1a0b
Remove obsolete class (object)
detrout May 15, 2019
83c0243
Improve docstring text
detrout Jun 14, 2019
735e2b4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 28, 2019
8302fd7
Added test_odf
WillAyd Jun 28, 2019
d0df3bd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 29, 2019
47597c9
Class naming consistency
WillAyd Jun 29, 2019
9e1799a
Whatsnew linting
WillAyd Jun 29, 2019
d5c60ab
Added optional dependency load
WillAyd Jun 29, 2019
39cfecf
typo
WillAyd Jun 29, 2019
8a9a66c
Updated inheritance to use excel reader interface
WillAyd Jun 29, 2019
fd7663f
Added ods test files
WillAyd Jun 29, 2019
3bcc1b7
Updated tests
WillAyd Jun 29, 2019
15e69eb
convert_float handling
WillAyd Jun 29, 2019
65615cd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
9584753
Fixed missing value handling
WillAyd Jun 30, 2019
9dc34f4
Fixed error handling
WillAyd Jun 30, 2019
5e32f6d
Fixed bool handling
WillAyd Jun 30, 2019
6360c07
Skip missing file on master
WillAyd Jun 30, 2019
4227268
datetime compat
WillAyd Jun 30, 2019
80607b0
fixed row repeat
WillAyd Jun 30, 2019
43f7160
multiindex handling
WillAyd Jun 30, 2019
4da0445
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
cbbc653
Handled horizontally merged cells
WillAyd Jun 30, 2019
1227216
Converted to pytest idiom
WillAyd Jun 30, 2019
696ed5d
Test idiom cleanup
WillAyd Jun 30, 2019
49fff9f
Removed duplicative test files
WillAyd Jun 30, 2019
7b08304
Raised NotImplemented for vertical merging
WillAyd Jun 30, 2019
4d97d84
Table attribute access simplification
WillAyd Jun 30, 2019
59cdf0b
Typing and func cleanups
WillAyd Jun 30, 2019
98d3ca7
lint and isort
WillAyd Jun 30, 2019
fb48d8d
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
6576af9
typing fixup
WillAyd Jun 30, 2019
4dc1b51
Skip ods files for xlrd
WillAyd Jun 30, 2019
8ce45b4
Removed one-off tests
WillAyd Jul 1, 2019
f9f88b0
Handled defusedxml warnings
WillAyd Jul 1, 2019
3e0d758
Updated assert_warnings funcs to allow DeprecationWarnings
WillAyd Jul 1, 2019
ff28993
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 1, 2019
7396ad6
Updated to config_init.py
WillAyd Jul 2, 2019
5a440a4
Updated whatsnew
WillAyd Jul 2, 2019
250a3d3
Updated io.rst
WillAyd Jul 2, 2019
d7e7d05
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 2, 2019
93adedb
Refactored to simplify
WillAyd Jul 2, 2019
62a37e7
Removed unnecessary test
WillAyd Jul 2, 2019
13fb76f
lint fixup
WillAyd Jul 2, 2019
fb6c5ee
mypy error
WillAyd Jul 2, 2019
5c839f4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 2, 2019
4026fc1
Doc updates
WillAyd Jul 2, 2019
1 change: 1 addition & 0 deletions ci/deps/travis-36-cov.yaml
@@ -15,6 +15,7 @@ dependencies:
   - nomkl
   - numexpr
   - numpy=1.15.*
+  - odfpy
   - openpyxl
   - pandas-gbq
   # https://github.com/pydata/pandas-gbq/issues/271
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -160,6 +160,7 @@ Other enhancements
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module`` is a library implementing the pandas plotting API (:issue:`14130`)
 - :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
 - :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
+- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify engine='odf' to enable. (:issue:`9070`)

Review comment (Contributor):

Can you put engine='odf' in double backticks?


.. _whatsnew_0250.api_breaking:

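For readers skimming this diff, the whatsnew entry above comes down to passing `engine="odf"` to `read_excel`. The following is a minimal sketch, assuming pandas 0.25 or later and odfpy are installed; the helper name `read_ods_table` and the file name in the usage note are hypothetical and not part of this PR.

```python
def read_ods_table(path, sheet_name=0, **kwargs):
    """Read one sheet of an OpenDocument file into a DataFrame.

    Hypothetical convenience wrapper; it only forwards to pandas.
    """
    # Imported lazily so this module can be imported without pandas.
    import pandas as pd

    # engine="odf" selects the OpenDocument reader added by this PR.
    return pd.read_excel(path, sheet_name=sheet_name, engine="odf", **kwargs)
```

A call such as `read_ods_table("budget.ods", sheet_name="Sheet1")` would then go through the new reader; the file name is a placeholder.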
1 change: 1 addition & 0 deletions pandas/compat/_optional.py
@@ -13,6 +13,7 @@
     "lxml.etree": "3.8.0",
     "matplotlib": "2.2.2",
     "numexpr": "2.6.2",
+    "odfpy": "1.3.0",
     "openpyxl": "2.4.8",
     "pandas_gbq": "0.8.0",
     "pyarrow": "0.9.0",
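The table above records minimum versions for optional dependencies, which the reader later requests through `import_optional_dependency("odf")`. As a rough illustration of the general pattern (not pandas's actual implementation), an optional import can be wrapped so that a missing package produces an actionable message. The function name `import_optional` and the message wording are assumptions made for this sketch.

```python
import importlib


def import_optional(name, extra=""):
    """Import ``name`` or raise ImportError with an install hint.

    Simplified stand-in for an optional-dependency helper; it does not
    perform any version check.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        msg = (
            "Missing optional dependency '{name}'. {extra} "
            "Use pip or conda to install {name}."
        ).format(name=name, extra=extra)
        # Re-raise with the friendlier message, hiding the original traceback.
        raise ImportError(msg) from None
```

Note that in this diff the table is keyed by the distribution name `odfpy`, while the importable module is `odf`.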
4 changes: 3 additions & 1 deletion pandas/io/excel/_base.py
@@ -768,12 +768,14 @@ class ExcelFile:
         Acceptable values are None or ``xlrd``.
     """

-    from pandas.io.excel._xlrd import _XlrdReader
+    from pandas.io.excel._odfreader import _ODFReader
     from pandas.io.excel._openpyxl import _OpenpyxlReader
+    from pandas.io.excel._xlrd import _XlrdReader

     _engines = {
         'xlrd': _XlrdReader,
         'openpyxl': _OpenpyxlReader,
+        'odf': _ODFReader,
     }

     def __init__(self, io, engine=None):
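The `_engines` mapping above is a small registry: an engine name selects a reader class. The sketch below illustrates that dispatch idea in isolation. The class names, the default engine, and the error message are assumptions for illustration and are not copied from `ExcelFile`.

```python
class BaseReader:
    """Hypothetical reader base class; stores the input it was given."""

    def __init__(self, io):
        self.io = io


class XlrdReader(BaseReader):
    pass


class OdfReader(BaseReader):
    pass


# Registry mapping engine names to reader classes.
ENGINES = {"xlrd": XlrdReader, "odf": OdfReader}


def make_reader(io, engine=None):
    """Look up the reader class for ``engine`` and instantiate it."""
    if engine is None:
        engine = "xlrd"  # assumed default for this sketch
    if engine not in ENGINES:
        raise ValueError("Unknown engine: {}".format(engine))
    return ENGINES[engine](io)
```

Adding a new format then only needs a new class and one more dictionary entry, which is what the `'odf': _ODFReader` line does in the diff.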
192 changes: 192 additions & 0 deletions pandas/io/excel/_odfreader.py
@@ -0,0 +1,192 @@
from typing import Dict, List

from pandas.compat._optional import import_optional_dependency

import pandas as pd
from pandas._typing import FilePathOrBuffer, Scalar

from pandas.io.excel._base import _BaseExcelReader


class _ODFReader(_BaseExcelReader):
    """Read tables out of OpenDocument formatted files

    Parameters
    ----------
    filepath_or_buffer: string, path to be parsed or
        an open readable stream.
    """
    def __init__(self, filepath_or_buffer: FilePathOrBuffer):
        import_optional_dependency("odf")
        super().__init__(filepath_or_buffer)

    @property
    def _workbook_class(self):
        from odf.opendocument import OpenDocument
        return OpenDocument

    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
        from odf.opendocument import load
        return load(filepath_or_buffer)

    @property
    def sheet_names(self) -> List[str]:
        """Return a list of sheet names present in the document"""
        from odf.table import Table

        tables = self.book.getElementsByType(Table)
        return [t.getAttribute("name") for t in tables]

    def get_sheet_by_index(self, index: int):
        from odf.table import Table
        tables = self.book.getElementsByType(Table)
        return tables[index]

    def get_sheet_by_name(self, name: str):
        from odf.table import Table

        tables = self.book.getElementsByType(Table)

        for table in tables:
            if table.getAttribute("name") == name:
                return table

        raise ValueError("sheet {name} not found".format(name=name))

    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        """Parse an ODF Table into a list of lists
        """
        from odf.table import TableCell, TableRow

        sheet_rows = sheet.getElementsByType(TableRow)
        table = []  # type: List[List[Scalar]]
        empty_rows = 0
        max_row_len = 0
        row_spans = {}  # type: Dict[int, int]

        for i, sheet_row in enumerate(sheet_rows):
            sheet_cells = sheet_row.getElementsByType(TableCell)
            empty_cells = 0
            table_row = []  # type: List[Scalar]

            for j, sheet_cell in enumerate(sheet_cells):
                # Handle vertically merged cells; only works with first column
                if row_spans.get(j, 0) > 1:
                    table_row.append('')
                    row_spans[j] = row_spans[j] - 1

                value = self._get_cell_value(sheet_cell, convert_float)
                column_repeat = self._get_column_repeat(sheet_cell)
                column_span = self._get_column_span(sheet_cell)
                row_span = self._get_row_span(sheet_cell)

                if row_span > 1:
                    if j > 0:
                        raise NotImplementedError(
                            "The odf reader only supports vertical cell "
                            "merging in the initial column")
                    else:
                        row_spans[j] = row_span

                if len(sheet_cell.childNodes) == 0:
                    empty_cells += column_repeat
                else:
                    if empty_cells > 0:
                        table_row.extend([''] * empty_cells)
                        empty_cells = 0
                    table_row.extend([value] * column_repeat)

                # horizontally merged cells should only show first value
                if column_span > 1:
                    table_row.extend([''] * (column_span - 1))

            if max_row_len < len(table_row):
                max_row_len = len(table_row)

            row_repeat = self._get_row_repeat(sheet_row)
            if self._is_empty_row(sheet_row):
                empty_rows += row_repeat
            else:
                if empty_rows > 0:
                    # add blank rows to our table
                    table.extend([['']] * empty_rows)
                    empty_rows = 0
                for _ in range(row_repeat):
                    table.append(table_row)

        # Make our table square
        for row in table:
            if len(row) < max_row_len:
                row.extend([''] * (max_row_len - len(row)))

        return table

    def _get_row_repeat(self, row) -> int:
        """Return number of times this row was repeated
        Repeating an empty row appeared to be a common way
        of representing sparse rows in the table.
        """
        from odf.namespaces import TABLENS

        return int(row.attributes.get((TABLENS, 'number-rows-repeated'), 1))

    def _get_column_repeat(self, cell) -> int:
        from odf.namespaces import TABLENS
        return int(cell.attributes.get(
            (TABLENS, 'number-columns-repeated'), 1))

    def _get_row_span(self, cell) -> int:
        """For handling cells merged vertically."""
        from odf.namespaces import TABLENS
        return int(cell.attributes.get((TABLENS, 'number-rows-spanned'), 1))

    def _get_column_span(self, cell) -> int:
        """For handling cells merged horizontally."""
        from odf.namespaces import TABLENS
        return int(cell.attributes.get((TABLENS, 'number-columns-spanned'), 1))

    def _is_empty_row(self, row) -> bool:
        """Helper function to find empty rows
        """
        for column in row.childNodes:
            if len(column.childNodes) > 0:
                return False

        return True

    def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
        from odf.namespaces import OFFICENS
        cell_type = cell.attributes.get((OFFICENS, 'value-type'))
        if cell_type == 'boolean':
            if str(cell) == "TRUE":
                return True
            return False
        if cell_type is None:
            return ''  # compat with xlrd
        elif cell_type == 'float':
            # GH5394
            cell_value = float(cell.attributes.get((OFFICENS, 'value')))

            if cell_value == 0. and str(cell) != cell_value:  # NA handling
                return str(cell)

            if convert_float:
                val = int(cell_value)
                if val == cell_value:
                    return val
            return cell_value
        elif cell_type == 'percentage':
            cell_value = cell.attributes.get((OFFICENS, 'value'))
            return float(cell_value)
        elif cell_type == 'string':
            return str(cell)
        elif cell_type == 'currency':
            cell_value = cell.attributes.get((OFFICENS, 'value'))
            return float(cell_value)
        elif cell_type == 'date':
            cell_value = cell.attributes.get((OFFICENS, 'date-value'))
            return pd.to_datetime(cell_value)
        elif cell_type == 'time':
            return pd.to_datetime(str(cell)).time()
        else:
            raise ValueError('Unrecognized type {}'.format(cell_type))
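The type dispatch in `_get_cell_value` can be hard to follow inside the diff, so here is a standalone sketch of the same idea that works on plain values instead of odfpy cell objects. The function name `convert_cell` and its three-argument shape (value type, raw `office:value` string, displayed text) are assumptions for illustration; date, time, and the NA special case are deliberately left out.

```python
def convert_cell(value_type, raw, text, convert_float=True):
    """Convert one cell, given its ODF value type, to a Python scalar."""
    if value_type is None:
        return ""  # empty cell
    if value_type == "boolean":
        return text == "TRUE"
    if value_type == "float":
        number = float(raw)
        # With convert_float, integral floats such as 3.0 become the int 3.
        if convert_float and number.is_integer():
            return int(number)
        return number
    if value_type in ("percentage", "currency"):
        return float(raw)
    if value_type == "string":
        return text
    raise ValueError("Unrecognized type {}".format(value_type))
```

For example, `convert_cell("float", "3.0", "3")` gives the integer `3`, while passing `convert_float=False` keeps it as `3.0`.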
Binary file added pandas/tests/io/data/blank-row-repeat.ods
Binary file not shown.
Binary file added pandas/tests/io/data/blank.ods
Binary file not shown.
Binary file added pandas/tests/io/data/blank_with_header.ods
Binary file not shown.
Binary file added pandas/tests/io/data/invalid_value_type.ods
Binary file not shown.
Binary file added pandas/tests/io/data/lowerdiagonal.ods
Binary file not shown.
Binary file added pandas/tests/io/data/raising_repeats.ods
Binary file not shown.
Binary file added pandas/tests/io/data/runlengthencoding.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test1.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test2.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test3.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test4.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test5.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_converters.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_index_name_pre17.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_multisheet.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_squeeze.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_types.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testdateoverflow.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testdtype.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testmultiindex.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testskiprows.ods
Binary file not shown.
Binary file added pandas/tests/io/data/times_1900.ods
Binary file not shown.
Binary file added pandas/tests/io/data/times_1904.ods
Binary file not shown.
Binary file added pandas/tests/io/data/writertable.odt
Binary file not shown.
2 changes: 1 addition & 1 deletion pandas/tests/io/excel/conftest.py
@@ -30,7 +30,7 @@ def df_ref():
     return df_ref


-@pytest.fixture(params=['.xls', '.xlsx', '.xlsm'])
+@pytest.fixture(params=['.xls', '.xlsx', '.xlsm', '.ods'])
 def read_ext(request):
     """
     Valid extensions for reading Excel files.
76 changes: 76 additions & 0 deletions pandas/tests/io/excel/test_odf.py
@@ -0,0 +1,76 @@
import functools

import numpy as np
import pytest

import pandas as pd
import pandas.util.testing as tm

pytest.importorskip("odf")


@pytest.fixture(autouse=True)
def cd_and_set_engine(monkeypatch, datapath):
    func = functools.partial(pd.read_excel, engine="odf")
    monkeypatch.setattr(pd, 'read_excel', func)
    monkeypatch.chdir(datapath("io", "data"))


def test_read_invalid_types_raises():
    # the invalid_value_type.ods file required manual editing
    # of the included content.xml file
    with pytest.raises(ValueError,
                       match="Unrecognized type awesome_new_type"):
        pd.read_excel("invalid_value_type.ods", header=None)


def test_read_lower_diagonal():
    # Make sure we can parse:
    # 1
    # 2 3
    # 4 5 6
    # 7 8 9 10

    sheet = pd.read_excel("lowerdiagonal.ods", 'Sheet1',
                          index_col=None, header=None)

    assert sheet.shape == (4, 4)

Review comment (Member):

Might change this and subsequent tests below to use tm.assert_frame_equal in another iteration or PR



def test_read_writer_table():
    # Also test reading tables from a text OpenDocument file
    # (.odt)
    index = pd.Index(["Row 1", "Row 2", "Row 3"], name="Header")
    expected = pd.DataFrame([
        [1, np.nan, 7],
        [2, np.nan, 8],
        [3, np.nan, 9],
    ], index=index, columns=["Column 1", "Unnamed: 2", "Column 3"])

    result = pd.read_excel("writertable.odt", 'Table1', index_col=0)

    tm.assert_frame_equal(result, expected)


def test_blank_row_repeat():
    table = pd.read_excel("blank-row-repeat.ods", 'Value')

    assert table.shape == (14, 2)
    assert table['value'][7] == 9.0
    assert pd.isnull(table['value'][8])
    assert not pd.isnull(table['value'][11])


def test_runlengthencoding():
    sheet = pd.read_excel("runlengthencoding.ods", 'Sheet1', header=None)
    assert sheet.shape == (5, 3)
    # check by column, not by row.
    assert list(sheet[0]) == [1.0, 1.0, 2.0, 2.0, 2.0]
    assert list(sheet[1]) == [1.0, 2.0, 2.0, 2.0, 2.0]
    assert list(sheet[2]) == [1.0, 2.0, 2.0, 2.0, 2.0]


def test_raises_repeated_rows_not_in_col_0():
    with pytest.raises(NotImplementedError,
                       match="merging in the initial column"):
        pd.read_excel("raising_repeats.ods")
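The `test_runlengthencoding` and `test_blank_row_repeat` cases exercise how the reader expands repeated cells and rows. The sketch below isolates the cell-level part of that logic on plain Python data: empty runs are counted and only emitted once a later non-empty cell appears, so trailing empties are dropped, and short rows are padded afterwards. The names `expand_row` and `pad_rows`, and the `(value, repeat)` input shape, are assumptions for illustration rather than the reader's API.

```python
def expand_row(cells):
    """Expand run-length encoded cells into a flat row.

    ``cells`` is a list of ``(value, repeat)`` pairs, where a value of
    ``None`` stands for an empty cell.
    """
    row = []
    pending_empty = 0
    for value, repeat in cells:
        if value is None:
            # Defer empties; they only count if real data follows.
            pending_empty += repeat
        else:
            row.extend([""] * pending_empty)
            pending_empty = 0
            row.extend([value] * repeat)
    return row


def pad_rows(rows):
    """Right-pad every row with '' so that all rows have equal length."""
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]
```

Deferring the empty runs matters because spreadsheet files often end a row with one empty cell repeated many hundreds of times; emitting those eagerly would inflate every row.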
13 changes: 13 additions & 0 deletions pandas/tests/io/excel/test_readers.py
@@ -36,6 +36,7 @@ def ignore_xlrd_time_clock_warning():
     pytest.param('xlrd', marks=td.skip_if_no('xlrd')),
     pytest.param('openpyxl', marks=td.skip_if_no('openpyxl')),
     pytest.param(None, marks=td.skip_if_no('xlrd')),
+    pytest.param("odf", marks=td.skip_if_no("odf")),
 ])
 def engine(request):
     """
@@ -53,6 +54,11 @@ def cd_and_set_engine(self, engine, datapath, monkeypatch, read_ext):
         """
         if engine == 'openpyxl' and read_ext == '.xls':
             pytest.skip()
+        if engine == 'odf' and read_ext != '.ods':
+            pytest.skip()
+        if read_ext == ".ods" and engine != "odf":
+            pytest.skip()
+
         func = partial(pd.read_excel, engine=engine)
         monkeypatch.chdir(datapath("io", "data"))
         monkeypatch.setattr(pd, 'read_excel', func)
@@ -439,6 +445,9 @@ def test_bad_engine_raises(self, read_ext):

     @tm.network
     def test_read_from_http_url(self, read_ext):
+        if read_ext == '.ods':  # TODO: remove once on master

Review comment (Member):

This test only works once the file is available on master, so we have to merge first and then try again.

Review comment (Member):

Seems like this code is still hanging around; I'll aim to address it when I tackle #29439.

+            pytest.skip()
+
         url = ('https://raw.github.com/pandas-dev/pandas/master/'
                'pandas/tests/io/data/test1' + read_ext)
         url_table = pd.read_excel(url)
@@ -736,6 +745,10 @@ def cd_and_set_engine(self, engine, datapath, monkeypatch, read_ext):
         """
         Change directory and set engine for ExcelFile objects.
         """
+        if engine == 'odf' and read_ext != '.ods':
+            pytest.skip()
+        if read_ext == ".ods" and engine != "odf":
+            pytest.skip()
         if engine == 'openpyxl' and read_ext == '.xls':
             pytest.skip()

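The skip conditions added to both `cd_and_set_engine` fixtures pair the `odf` engine with `.ods` files only, in both directions. As a sketch of the rule in isolation (the function name `should_skip` is hypothetical, and only the engine/extension combinations visible in these hunks are covered):

```python
def should_skip(engine, read_ext):
    """Return True when an engine/extension pair should not be tested."""
    # odf reads only .ods, and .ods is read only by odf: skip when exactly
    # one of the two conditions holds.
    odf_mismatch = (engine == "odf") != (read_ext == ".ods")
    # openpyxl cannot read the legacy .xls format.
    openpyxl_xls = engine == "openpyxl" and read_ext == ".xls"
    return odf_mismatch or openpyxl_xls
```

Writing the two `odf` checks as a single inequality makes the symmetry explicit, at the cost of being a little less obvious than the two separate `if` statements used in the diff.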
6 changes: 6 additions & 0 deletions pandas/tests/io/excel/test_xlrd.py
@@ -10,6 +10,12 @@
 xlwt = pytest.importorskip("xlwt")


+@pytest.fixture(autouse=True)
+def skip_ods_files(read_ext):
+    if read_ext == ".ods":
+        pytest.skip("Not valid for xlrd")
+
+
 def test_read_xlrd_book(read_ext, frame):
     df = frame
