
Q: correct behavior for read_html with rowspan/colspan for DataFrames? #17073

Closed
@jowens


No code, just a question about the proper behavior for rowspan/colspan when read_html reads an HTML table into a DataFrame. (I'm not asking what read_html currently does. I'm asking what should happen.)

Below is a simple HTML table that uses both colspan (a, e, i, o, u) and rowspan (Fruit, schwa, honk, and the rightmost 0 in the table). It renders identically on each of {Chrome, Firefox, Safari}. In these renderers, the content of a spanning cell is drawn once, centered within the span, either vertically (rowspan) or horizontally (colspan).

[Image: fruit_html, the table as rendered in a browser]

Now, let's say we wanted to import this into pandas with read_html. It seems to me the behavior should be different for a pandas DataFrame than for a renderer:

  • The header should have a MultiIndex, where the first column is Fruit and the second column is a combination of a and Long, etc. We don't "fill" a rowspan (the first column shouldn't be two Fruits), but we do "fill" a colspan (a would appear in the 2nd and 3rd columns).
  • The body should "fill" a rowspan or colspan with the provided values. So the rightmost column, instead of having one zero and two blanks on the three rows, should have a zero for each of the three rows. One would think a span in a DataFrame context within a body would mean "fill in the value for each cell in the span".
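The "fill" behavior proposed for the body can be sketched in plain Python. This is only an illustration of the idea, not what read_html does: `expand_spans` is a hypothetical helper, and the `(text, rowspan, colspan)` cell tuples are an assumed input format. The sample rows are a trimmed version of the table (Fruit, schwa, honk columns only).

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a dense grid.

    Each cell's text is copied into every grid position the cell covers.
    """
    grid = []
    carry = {}  # column index -> (text, rows still to fill)
    for cells in rows:
        out = []
        queue = list(cells)
        col = 0
        while queue or col in carry:
            if col in carry:
                # This slot is covered by a rowspan from an earlier row.
                text, remaining = carry.pop(col)
                out.append(text)
                if remaining > 1:
                    carry[col] = (text, remaining - 1)
                col += 1
                continue
            text, rowspan, colspan = queue.pop(0)
            for _ in range(colspan):
                out.append(text)  # fill the colspan horizontally
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)  # fill the rows below
                col += 1
        grid.append(out)
    return grid


# Trimmed body: Fruit, schwa, honk (honk is one cell with rowspan=3).
body = [
    [("Apple", 1, 1), ("1", 1, 1), ("0", 3, 1)],
    [("Banana", 1, 1), ("0", 1, 1)],
    [("Kiwi", 1, 1), ("0", 1, 1)],
]
filled = expand_spans(body)
```

With this input, every row of `filled` has three entries, so the rightmost column holds a zero in each row rather than one zero and two blanks.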

If this were the case, it would imply that we treat rowspans differently in header and body.

  • If we "fill" a rowspan in a header, then we just repeat the header value in the MultiIndex output, which doesn't seem like what we want.
  • If we don't "fill" a rowspan in the body, we leave some cells in the DataFrame blank, which also seems misguided.

I put the DataFrame that I think we want below. It incorporates different behavior for rowspan for header and body. One thing I don't know, though: If I don't "fill" the rowspan name for rowspan > 1, what do I put instead? None? empty string? False? What does the input to TextParser look like when some column names are "taller" than others?
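To make the placeholder question concrete, here is a rough sketch of the header treatment: colspans are filled, while a rowspan keeps its label in only one level and gets a placeholder in the others. `header_tuples` is a hypothetical helper, the `(text, rowspan, colspan)` tuples are an assumed input format, and both the empty-string placeholder and the choice of which level carries the label are assumptions, not settled answers.

```python
def header_tuples(rows, placeholder="", label_last=True):
    """Turn header rows of (text, rowspan, colspan) cells into column tuples.

    Colspans are filled. A rowspan keeps its label in one level only
    (the last level it covers if label_last, else the first) and uses
    `placeholder` in the other levels.
    """
    n_rows = len(rows)
    grid = {}  # (row, col) -> label
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots claimed by an earlier rowspan
                c += 1
            span = min(rowspan, n_rows - r)  # ignore rows past the header
            for dc in range(colspan):
                for dr in range(span):
                    keep = dr == (span - 1 if label_last else 0)
                    grid[(r + dr, c + dc)] = text if keep else placeholder
            c += colspan
    n_cols = 1 + max(col for _, col in grid)
    return [
        tuple(grid.get((r, c), placeholder) for r in range(n_rows))
        for c in range(n_cols)
    ]


# Trimmed header: Fruit, a (Long/Short), schwa.
header = [
    [("Fruit", 2, 1), ("a", 1, 2), ("schwa", 2, 1)],
    [("Long", 1, 1), ("Short", 1, 1)],
]
columns = header_tuples(header)
```

The resulting tuples could be handed to `pandas.MultiIndex.from_tuples` to build the column index. Passing `label_last=False` puts Fruit in the top level instead, and swapping `placeholder` for `None` shows the other option raised above.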

Thoughts? @chris-b1? (relevant to #17054)

             a          e          i          o          u               
    Fruit Long Short Long Short Long Short Long Short Long Short schwa honk
0   Apple    0     1    0     0    0     0    0     0    0     0     1    0
1  Banana    0     3    0     0    0     0    0     0    0     0     0    0
2    Kiwi    0     0    2     0    0     0    0     0    0     0     0    0

HTML source of the table:

<table>
  <thead>
    <tr>
      <th rowspan=2>Fruit</th>
      <th colspan=2>a</th>
      <th colspan=2>e</th>
      <th colspan=2>i</th>
      <th colspan=2>o</th>
      <th colspan=2>u</th>
      <th rowspan=2>schwa</th>
      <th rowspan=2>honk</th>
    </tr>
    <tr>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
      <th>Long</th>
      <th>Short</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Apple</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td rowspan=3>0</td>
    </tr>
    <tr>
      <td>Banana</td>
      <td>0</td>
      <td>3</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Kiwi</td>
      <td>0</td>
      <td>0</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
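
For anyone who wants to experiment with the table above, the span information can be pulled out with the standard library's `html.parser`. This is a minimal sketch, not how read_html parses tables: `SpanCellParser` is a hypothetical class, it does not distinguish thead from tbody, and it assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class SpanCellParser(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            a = dict(attrs)
            # A missing span attribute means the cell covers one row/column.
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text, rowspan, colspan = self._cell
            self.rows[-1].append((text.strip(), rowspan, colspan))
            self._cell = None


snippet = (
    "<table><tr><th rowspan=2>Fruit</th><th colspan=2>a</th></tr>"
    "<tr><th>Long</th><th>Short</th></tr></table>"
)
parser = SpanCellParser()
parser.feed(snippet)
cells = parser.rows
```

Feeding the full table instead of `snippet` gives two header rows followed by three body rows, each cell tagged with its spans.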

Labels: Enhancement, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)
