Description
No code, just a question for proper behavior for rowspan/colspan with read_html
of an HTML table into a DataFrame. (I'm not asking what currently happens with read_html
now. I'm asking what should happen.)
Below is a simple HTML table that uses both colspan (a,e,i,o,u) and rowspan (Fruit, schwa, honk, and the rightmost 0 in the table). It renders identically on each of {Chrome, Firefox, Safari}. With these renderers, both rowspans and colspans are basically rendered midway through the span, either vertically (rowspan) or horizontally (colspan).
Now, let's say we wanted to import this into pandas with read_html
. It seems to me the behavior should be different for a pandas DataFrame than for a renderer:
- The header should have a MultiIndex, where the first column is
Fruit
and the second column is a combination ofa
andLong
, etc. We don't "fill" a rowspan (the first column shouldn't be twoFruit
s), but we do "fill" a colspan (a
would appear in the 2nd and 3rd columns). - The body should "fill" a rowspan or colspan with the provided values. So the rightmost column, instead of having one zero and two blanks on the three rows, should have a zero for each of the three rows. One would think a span in a DataFrame context within a body would mean "fill in the value for each cell in the span".
If this was the case, it would imply that we treat rowspans differently in header and body.
- If we "fill" a rowspan in a header, then we just repeat the header value in the MultiIndex output, which doesn't seem like what we want.
- If we don't "fill" a rowspan in the body, we leave some cells in the DataFrame blank, which also seems misguided.
I put the DataFrame that I think we want below. It incorporates different behavior for rowspan for header and body. One thing I don't know, though: If I don't "fill" the rowspan name for rowspan > 1, what do I put instead? None
? empty string? False
? What does the input to TextParser
look like when some column names are "taller" than others?
Thoughts? @chris-b1? (relevant to #17054)
a e i o u
Fruit Long Short Long Short Long Short Long Short Long Short schwa honk
0 Apple 0 1 0 0 0 0 0 0 0 0 1 0
1 Banana 0 3 0 0 0 0 0 0 0 0 0 0
2 Kiwi 0 0 2 0 0 0 0 0 0 0 0 0
<table>
<thead>
<tr>
<th rowspan=2>Fruit</th>
<th colspan=2>a</th>
<th colspan=2>e</th>
<th colspan=2>i</th>
<th colspan=2>o</th>
<th colspan=2>u</th>
<th rowspan=2>schwa</th>
<th rowspan=2>honk</th>
</tr>
<tr>
<th>Long</th>
<th>Short</th>
<th>Long</th>
<th>Short</th>
<th>Long</th>
<th>Short</th>
<th>Long</th>
<th>Short</th>
<th>Long</th>
<th>Short</th>
</tr>
</thead>
<tbody>
<tr>
<td>Apple</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td rowspan=3>0</td>
</tr>
<tr>
<td>Banana</td>
<td>0</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Kiwi</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>