Commit 5a1d614

docs: update Table docs
1 parent 1cfcee7 commit 5a1d614

3 files changed

+215
-1
lines changed

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ User Guide
7474
user/install
7575
user/quickstart
7676
user/documents
77+
user/tables
7778
user/text
7879
user/sections
7980
user/hdrftr

docs/user/tables.rst

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
1+
.. _tables:
2+
3+
Working with Tables
4+
===================
5+
6+
Word provides sophisticated capabilities to create tables. As usual, this power comes with
7+
additional conceptual complexity.
8+
9+
This complexity becomes most apparent when *reading* tables, in particular from documents drawn from
10+
the wild where there is limited or no prior knowledge as to what the tables might contain or how
11+
they might be structured.
12+
13+
These are some of the important concepts you'll need to understand.
14+
15+
16+
Concept: Simple (uniform) tables
17+
--------------------------------
18+
19+
::
20+
    +---+---+---+
    | a | b | c |
    +---+---+---+
    | d | e | f |
    +---+---+---+
    | g | h | i |
    +---+---+---+
28+
29+
The basic concept of a table is intuitive enough. You have *rows* and *columns*, and at each (row,
30+
column) position is a different *cell*. It can be described as a *grid* or a *matrix*. Let's call
31+
this concept a *uniform table*. A relational database table and a Pandas dataframe are both examples
32+
of a uniform table.
33+
34+
The following invariants apply to uniform tables:
35+
36+
* Each row has the same number of cells, one for each column.
37+
* Each column has the same number of cells, one for each row.
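
As a rough illustration of these invariants, the sketch below checks whether a plain list-of-lists qualifies as a uniform table. The helper name ``is_uniform`` is hypothetical and is not part of `python-docx`; for a list of lists, equal row lengths also guarantee equal column lengths.

```python
from typing import Sequence


def is_uniform(rows: Sequence[Sequence[object]]) -> bool:
    """Return True when every row has the same number of cells.

    Hypothetical helper for illustration only, not a python-docx API.
    """
    if not rows:
        return True  # an empty table is trivially uniform
    width = len(rows[0])
    return all(len(row) == width for row in rows)


uniform = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
ragged = [["a", "b"], ["c", "d", "e"], ["f", "g", "h"]]

print(is_uniform(uniform))
print(is_uniform(ragged))
```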
38+
39+
40+
Complication 1: Merged Cells
41+
----------------------------
42+
43+
::
44+
    +---+---+---+    +---+---+---+
    |   a   | b |    |   | b | c |
    +---+---+---+    + a +---+---+
    | c | d | e |    |   | d | e |
    +---+---+---+    +---+---+---+
    | f | g | h |    | f | g | h |
    +---+---+---+    +---+---+---+
52+
53+
While very suitable for data processing, a uniform table lacks expressive power desirable for
54+
tables intended for a human reader.
55+
56+
Perhaps the most important characteristic a uniform table lacks is *merged cells*. It is very common
57+
to want to group multiple cells into one, for example to form a column-group heading or provide the
58+
same value for a sequence of cells rather than repeat it for each cell. These make a rendered table
59+
more *readable* by reducing the cognitive load on the human reader and make certain relationships
60+
explicit that might easily be missed otherwise.
61+
62+
Unfortunately, accommodating merged cells breaks both the invariants of a uniform table:
63+
64+
* Each row can have a different number of cells.
65+
* Each column can have a different number of cells.
66+
67+
This challenges reading table contents programmatically. One might naturally want to read the table
68+
into a uniform matrix data structure like a 3 x 3 "2D array" (list of lists perhaps), but this is
69+
not directly possible when the table is not known to be uniform.
70+
71+
72+
Concept: The layout grid
73+
------------------------
74+
75+
::
76+
    + - + - + - +
    |   |   |   |
    + - + - + - +
    |   |   |   |
    + - + - + - +
    |   |   |   |
    + - + - + - +
84+
85+
In Word, each table has a *layout grid*.
86+
87+
- The layout grid is *uniform*. There is a layout position for every (layout-row, layout-column)
88+
pair.
89+
- The layout grid itself is not visible. However it is represented and referenced by certain
90+
elements and attributes within the table XML.
91+
- Each table cell is located at a layout-grid position; i.e. the top-left corner of each cell is the
92+
top-left corner of a layout-grid cell.
93+
- Each table cell occupies one or more whole layout-grid cells. A merged cell will occupy multiple
94+
layout-grid cells. No table cell can occupy a partial layout-grid cell.
95+
- Another way of saying this is that every vertical boundary (left and right) of a cell aligns with
96+
a layout-grid vertical boundary, likewise for horizontal boundaries. But not all layout-grid
97+
boundaries need be occupied by a cell boundary of the table.
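
The relationship between cells and layout-grid positions can be sketched in plain Python. This is only a conceptual model, not how `python-docx` or Word stores a table: ``PlacedCell`` and ``occupancy`` are hypothetical names, and each cell is described by its top-left grid position plus the number of whole layout-grid rows and columns it occupies.

```python
from typing import List, NamedTuple, Optional


class PlacedCell(NamedTuple):
    """Hypothetical stand-in for a table cell positioned on the layout grid."""

    row: int  # layout-grid row of the cell's top-left corner
    col: int  # layout-grid column of the cell's top-left corner
    row_span: int  # number of whole layout-grid rows occupied
    col_span: int  # number of whole layout-grid columns occupied
    text: str


def occupancy(
    n_rows: int, n_cols: int, cells: List[PlacedCell]
) -> List[List[Optional[str]]]:
    """Map every layout-grid position to the text of the cell occupying it.

    Positions not covered by any cell stay `None`.
    """
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                if grid[r][c] is not None:
                    raise ValueError(f"layout position ({r}, {c}) is occupied twice")
                grid[r][c] = cell.text
    return grid


# A 3 x 3 layout grid whose first row starts with a cell merged across two columns.
cells = [
    PlacedCell(0, 0, 1, 2, "a"), PlacedCell(0, 2, 1, 1, "b"),
    PlacedCell(1, 0, 1, 1, "c"), PlacedCell(1, 1, 1, 1, "d"), PlacedCell(1, 2, 1, 1, "e"),
    PlacedCell(2, 0, 1, 1, "f"), PlacedCell(2, 1, 1, 1, "g"), PlacedCell(2, 2, 1, 1, "h"),
]
grid = occupancy(3, 3, cells)
```

In this model a merged cell simply covers more than one grid position, while the grid itself stays uniform.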
98+
99+
100+
Complication 2: Omitted Cells
101+
-----------------------------
102+
103+
::
104+
        +---+---+    +---+---+---+
        | a | b |    | a | b | c |
    +---+---+---+    +---+---+---+
    | c | d |            | d |
    +---+---+        +---+---+---+
        | e |        | e | f | g |
        +---+        +---+---+---+
112+
113+
Word is unusual in that it allows cells to be omitted from the beginning or end (but not the middle)
114+
of a row. A typical practical example is a table with both a row of column headings and a column of
115+
row headings, but no top-left cell (position 0, 0), such as this XOR truth table.
116+
117+
::
118+
        +---+---+
        | T | F |
    +---+---+---+
    | T | F | T |
    +---+---+---+
    | F | T | F |
    +---+---+---+
126+
127+
In `python-docx`, omitted cells in a |_Row| object are represented by the ``.grid_cols_before`` and
128+
``.grid_cols_after`` properties. In the example above, for the first row, ``.grid_cols_before``
129+
would equal ``1`` and ``.grid_cols_after`` would equal ``0``.
130+
131+
Note that omitted cells are not just "empty" cells. They represent layout-grid positions that are
132+
unoccupied by a cell and they cannot be represented by a |_Cell| object. This distinction becomes
133+
important when trying to produce a uniform representation (e.g. a 2D array) for an arbitrary Word
134+
table.
135+
136+
137+
Concept: `python-docx` approximates uniform tables by default
138+
-------------------------------------------------------------
139+
140+
To accurately represent an arbitrary table would require a complex graph data structure. Navigating
141+
this data structure would be at least as complex as navigating the `python-docx` object graph for a
142+
table. When extracting content from a collection of arbitrary Word files, such as for indexing the
143+
document, it is common to choose a simpler data structure and *approximate* the table in that
144+
structure.
145+
146+
Reflecting on how a relational table or dataframe represents tabular information, a straightforward
147+
approximation would simply repeat merged-cell values for each layout-grid cell occupied by the
148+
merged cell::
149+
150+
    +---+---+---+        +---+---+---+
    |   a   | b |   ->   | a | a | b |
    +---+---+---+        +---+---+---+
    |   | d | e |   ->   | c | d | e |
    + c +---+---+        +---+---+---+
    |   | f | g |   ->   | c | f | g |
    +---+---+---+        +---+---+---+
158+
159+
This is what ``_Row.cells`` does by default. Conceptually::
160+
    >>> [tuple(c.text for c in r.cells) for r in table.rows]
    [
        ("a", "a", "b"),
        ("c", "d", "e"),
        ("c", "f", "g"),
    ]
167+
168+
Note this only produces a uniform "matrix" of cells when there are no omitted cells. Dealing with
169+
omitted cells requires a more sophisticated approach when maintaining column integrity is required::
170+
    #         +---+---+
    #         | a | b |
    #     +---+---+---+
    #     | c | d |
    #     +---+---+
    #         | e |
    #         +---+

    from typing import Iterator

    from docx.table import _Row

    def iter_row_cell_texts(row: _Row) -> Iterator[str]:
        # -- one empty string for each layout-grid position omitted at the row start --
        for _ in range(row.grid_cols_before):
            yield ""
        for c in row.cells:
            yield c.text
        # -- one empty string for each layout-grid position omitted at the row end --
        for _ in range(row.grid_cols_after):
            yield ""

    >>> [tuple(iter_row_cell_texts(r)) for r in table.rows]
    [
        ("", "a", "b"),
        ("c", "d", ""),
        ("", "e", ""),
    ]
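
Once every row has been padded this way, the result is rectangular and can be read by column as well as by row. The sketch below shows this without needing a Word file: ``FakeCell`` and ``FakeRow`` are hypothetical stand-ins that mimic only the ``.text``, ``.cells``, ``.grid_cols_before`` and ``.grid_cols_after`` attributes described above, and ``row_texts`` is an illustrative helper, not part of `python-docx`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeCell:
    """Hypothetical stand-in exposing only `.text`."""

    text: str


@dataclass
class FakeRow:
    """Hypothetical stand-in exposing the row attributes used in this section."""

    cells: List[FakeCell]
    grid_cols_before: int = 0
    grid_cols_after: int = 0


def row_texts(row) -> Tuple[str, ...]:
    """Cell texts for `row`, padded with "" for omitted layout-grid positions."""
    return (
        ("",) * row.grid_cols_before
        + tuple(c.text for c in row.cells)
        + ("",) * row.grid_cols_after
    )


# The same three-row table as in the example above.
rows = [
    FakeRow([FakeCell("a"), FakeCell("b")], grid_cols_before=1),
    FakeRow([FakeCell("c"), FakeCell("d")], grid_cols_after=1),
    FakeRow([FakeCell("e")], grid_cols_before=1, grid_cols_after=1),
]

matrix = [row_texts(r) for r in rows]
# Transposing is only meaningful because every padded row has the same length.
columns = list(zip(*matrix))
```

Without the padding, ``zip`` would silently truncate to the shortest row and pair values from different columns.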
193+
194+
195+
Complication 3: Tables are Recursive
196+
------------------------------------
197+
198+
Further complicating table processing is their recursive nature. In Word, as in HTML, a table cell
199+
can itself include one or more tables.
200+
201+
These can be detected using ``_Cell.tables`` or ``_Cell.iter_inner_content()``. The latter preserves
202+
the document order of the table with respect to paragraphs also in the cell.
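
A depth-first walk over nested tables might look like the sketch below. To keep it runnable without a document, it uses hypothetical stand-in classes (``FakeTable``, ``FakeRow``, ``FakeCell``) that expose only ``.rows``, ``.cells`` and ``.tables``; ``iter_tables`` is an illustrative helper, not a `python-docx` function.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeCell:
    """Hypothetical stand-in for a cell that may contain nested tables."""

    text: str = ""
    tables: List["FakeTable"] = field(default_factory=list)


@dataclass
class FakeRow:
    cells: List[FakeCell]


@dataclass
class FakeTable:
    rows: List[FakeRow]


def iter_tables(table, depth: int = 0) -> Iterator[Tuple[int, object]]:
    """Yield (depth, table) for `table` and every table nested inside its cells."""
    yield depth, table
    for row in table.rows:
        for cell in row.cells:
            for inner in cell.tables:
                yield from iter_tables(inner, depth + 1)


inner = FakeTable([FakeRow([FakeCell("x")])])
outer = FakeTable([FakeRow([FakeCell("a"), FakeCell("b", tables=[inner])])])

depths = [depth for depth, _ in iter_tables(outer)]
```

Because a merged cell is repeated in ``_Row.cells`` for each layout-grid position it occupies, a walk like this over real tables could visit a nested table more than once; de-duplication is left out of this sketch.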

src/docx/table.py

Lines changed: 12 additions & 1 deletion
@@ -382,7 +382,18 @@ def __init__(self, tr: CT_Row, parent: TableParent):
382382

383383
@property
384384
def cells(self) -> tuple[_Cell, ...]:
385-
"""Sequence of |_Cell| instances corresponding to cells in this row."""
385+
"""Sequence of |_Cell| instances corresponding to cells in this row.
386+
387+
Note that Word allows table rows to start later than the first column and end before the
388+
last column.
389+
390+
- Only cells actually present are included in the return value.
391+
- This implies the length of this cell sequence may differ between rows of the same table.
392+
- If you are reading the cells from each row to form a rectangular "matrix" data structure
393+
of the table cell values, you will need to account for empty leading and/or trailing
394+
layout-grid positions using `.grid_cols_before` and `.grid_cols_after`.
395+
396+
"""
386397
return tuple(self.table.row_cells(self._index))
387398

388399
@property
