detect beginning of new row. if last row is incomplete, add empty cells. #564
base: develop
Can you say how the documents were created and how one with this characteristic can be reproduced?
Hi Scanny, that's a good question. I currently only have a document that was generated in MS Word by our customer. Multiple people were involved in creating it, so it's not easy to find out exactly how they produced it, but I will give it a try. I have modified the original document to obtain a minimal version demonstrating the problem. The table has a gridCol definition of 7 columns. In this table, the cells of the first row sum up to 7 columns; the cells of the second and third rows each sum up to only 6 columns.
I talked to the customer once again. They're copy-pasting a lot of cells and rows from tables in other documents. But we are still not able to reproduce such a document from scratch.
I'd just like to mention that I found this patch helpful. I'm having a similar problem with unwieldy tables that have been manually edited over years. The original document is from 2012, so it has acquired some cruft over the years. Some of the tables have different row lengths (for aesthetics, I believe), or cell widths that have been adjusted independently of the column (which has the same effect on the script, from what I can tell). The expected behaviour is for the library to produce essentially an m x n dataframe that corresponds as closely as possible to the original table. The most straightforward way to do this is the one suggested by @0-173, where the shortened row has empty cells added to the end. Instead, every time the script reaches one of these inconsistencies, it fills any missing cells in the shortened row with cells from the next row (removing them from the next row, and thereby offsetting the data in the rest of the dataframe). Essentially, its current behaviour is to make an m x n table, then fill it in with whichever cells happen to exist (going LTR, TTB), regardless of which location they were originally in. I've got it working now by monkey-patching my script with a version of this patch, so thank you.
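To make the difference between the two behaviours concrete, here is a minimal sketch in plain Python. It does not use python-docx at all; `flat_fill` and `pad_rows` are hypothetical helper names, and a table is modelled simply as a list of rows of cell texts. It only illustrates the wrap-around described above versus padding each short row, and is not the code from the patch.

```python
def flat_fill(rows, n_cols):
    """Model of the wrap-around: pour every cell into an n_cols-wide
    grid, left to right then top to bottom, ignoring row boundaries."""
    flat = [cell for row in rows for cell in row]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def pad_rows(rows, n_cols, filler=""):
    """Model of the padding approach: keep each row where it is and
    append empty cells to any row shorter than n_cols."""
    return [row + [filler] * (n_cols - len(row)) for row in rows]


# A 3-column table whose second row is one cell short.
rows = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]

wrapped = flat_fill(rows, 3)  # "c1" is pulled up into the second row
padded = pad_rows(rows, 3)    # second row gets one empty trailing cell
```

With the flat fill, every row after the short one is shifted; with padding, each cell stays in the row it came from.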
This is happening to me after converting doc to docx using Microsoft Office Migration Planning Manager. Resulting docx looks the same as the source doc visually, but internally it exhibits this issue.
Ah, that's a great datapoint @blackdaemon, it makes sense that the document was operated on by something other than Word itself. Thanks for mentioning that :)
@scanny Is this going to get merged in and be released? I have come across this same issue with a bunch of files and my only solution at the moment would be to delete the extraneous column manually. Getting this merged would be a huge help!
I wonder if this issue is related to the gridAfter and gridBefore properties. In the table I'm working with, I have noticed that a row has fewer tcPr elements when its trPr also defines a non-zero gridAfter element. If either gridAfter or gridBefore is unspecified, zero is the default. It seems to be the case that the table's number of columns equals, for each row: For more details, see #881
It's actually really easy to produce a table with a mismatched column count in Word. On any table you can right click a cell and (I don't have Word in English so I'm guessing the names) insert -> insert cells... -> shift cells to the right. It will insert a single cell and shift the remaining cells of the row to the right. Or you can do the exact opposite and delete a single cell with right click -> delete cells... -> shift cells to the left. Both will produce a table with an inconsistent number of columns, and accessing the table by row and column indexes will be unreliable. The patch from 0-173 fixes this issue. By the way, issue #861 is probably the same.
Try #881 (comment)
I found that some tables in documents created with MS Word have inconsistent grid_span attributes in a row (their sum does not match the tblGrid's gridCol element count).
When the ._cells property of such a table is accessed, it does not consider the row/column definition in the XML; it just counts the cells and assumes that the cell count per row and all grid_span attributes are consistent with the tblGrid specification.
In my case there were 7 columns in the table, but one row contained only grid_span attributes of 1 + 1 + 2 + 2 = 6. The next row and all succeeding rows were partly wrapped around and therefore broken.
This merge request fixes the issue by detecting a row's start (see #232) and then filling any incomplete row with empty cells if necessary. A unit test is included.
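As a rough way to check whether a given table is affected, the sketch below compares each row's summed gridSpan values with the number of gridCol elements. It works on the raw WordprocessingML using only the standard library rather than python-docx, `row_widths` and `short_rows` are hypothetical names, and the sample XML is hand-built to mirror the 7-column case described above. It deliberately ignores gridBefore and gridAfter, so treat it as a diagnostic under those assumptions, not as the logic of this patch.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def row_widths(tbl):
    """Return (gridCol count, list of summed gridSpan values per row)."""
    n_cols = len(tbl.findall("w:tblGrid/w:gridCol", NS))
    widths = []
    for tr in tbl.findall("w:tr", NS):
        width = 0
        for tc in tr.findall("w:tc", NS):
            span = tc.find("w:tcPr/w:gridSpan", NS)
            # A cell without an explicit gridSpan occupies one grid column.
            width += int(span.get(f"{{{W}}}val")) if span is not None else 1
        widths.append(width)
    return n_cols, widths


def short_rows(tbl):
    """Return (row index, missing column count) for each incomplete row."""
    n_cols, widths = row_widths(tbl)
    return [(i, n_cols - w) for i, w in enumerate(widths) if w < n_cols]


def _tc(span=1):
    """Build a minimal w:tc element string (illustration only)."""
    pr = f'<w:tcPr><w:gridSpan w:val="{span}"/></w:tcPr>' if span > 1 else ""
    return f"<w:tc>{pr}<w:p/></w:tc>"


# Hand-built sample: 7 grid columns, one full row, one row of 1 + 1 + 2 + 2.
SAMPLE = "".join([
    f'<w:tbl xmlns:w="{W}">',
    "<w:tblGrid>", "<w:gridCol/>" * 7, "</w:tblGrid>",
    "<w:tr>", _tc() * 7, "</w:tr>",
    "<w:tr>", _tc(), _tc(), _tc(2), _tc(2), "</w:tr>",
    "</w:tbl>",
])

tbl = ET.fromstring(SAMPLE)
incomplete = short_rows(tbl)  # rows that would need empty cells appended
```

Each entry in `incomplete` gives the row index and how many empty cells would have to be appended to reach the gridCol count.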