-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Added multicolumn/multirow support for latex #14184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/frame.py
Outdated
@@ -1613,7 +1613,7 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True, | |||
index=True, na_rep='NaN', formatters=None, float_format=None, | |||
sparsify=None, index_names=True, bold_rows=True, | |||
column_format=None, longtable=None, escape=None, | |||
encoding=None, decimal='.'): | |||
encoding=None, decimal='.',multicol=None,multirow=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need spaces after the ,
s here (check git diff upstream/master | flake8 --diff
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
Should have started using this sooner, quite nice.
You'll need to rebase your branch on master. Just post here if you need help with the git commands. |
Oh my, what a mess... |
You'll want to start by
This will start applying changes, but you'll have some merge conflicts. You'll need to edit the files (see
Side-issue, typically you'll develop on a feature branch, no master. So before you make any changes you would We can also jump into the chat room if that's easier. |
pandas/formats/format.py
Outdated
.format(ncol, coltext.strip())) | ||
else: | ||
row2.append(coltext) | ||
for c in crow[ilevels:]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the multicolumn feature, long names in higher levels break the layout severely:
compared to:
Thus, i would propose this feature to be enabled by default.
This means, that empty columns won't be printed, which may alter the code of the line with the row-names.
Not the result though, as latex ignores these trailing empty parts anyways.
Should i rather set the default to False and maintain compatibility in code, or fix the rendered output by rewriting fixing tests?
Or has somebody an idea to satisfy both?
Never mind, fixed.
Codecov Report
@@ Coverage Diff @@
## master #14184 +/- ##
==========================================
- Coverage 91.04% 91.03% -0.02%
==========================================
Files 136 136
Lines 49088 49152 +64
==========================================
+ Hits 44694 44744 +50
- Misses 4394 4408 +14
Continue to review full report at Codecov.
|
9c81088
to
69f05c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other comments:
- Can we use
multicolumn
instead ofmulticol
as keyword name? I know it is a bit shorter, but since latex also usesmulticolumn
I would use that - Now you center the multi-columns. I think left aligning as the other column may be better as a default.
Is there a way to specify this? - You added a bunch of code lines in the
write_result
function. Can you see whether it makes sense to refactor part of it into a separate function to have smaller more specific functions?
pandas/core/frame.py
Outdated
|
||
.. versionadded:: 0.18.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you leave this one (I think it belonged to the decimal option)
pandas/core/frame.py
Outdated
|
||
.. versionadded:: 0.18.0 | ||
.. versionadded:: 0.19.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.19 ->0.20
pandas/formats/format.py
Outdated
ncol = 1 | ||
coltext = '' | ||
|
||
def appendCol(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's an internal function, so not that important, but let's be consistent in PEP 8 naming, so appendCol -> append_col
Thanks, i've incorporated your renaming suggestions. |
lgtm. @jorisvandenbossche @TomAugspurger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments.
As I also said in my previous comment, you added a bunch of code lines in the write_result function. Can you see whether it makes sense to put some of that in separate helper functions, or can you also add some code comments to make it a bit easier to follow what happens?
pandas/core/frame.py
Outdated
Use \multirow to enhance MultiIndex rows. Requires adding a | ||
\usepackage{multirow} to your LaTeX preamble. | ||
|
||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this for the other keywords as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ..versionadded:: 0.20.0
part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes (so like you did!)
pandas/core/frame.py
Outdated
@@ -1644,12 +1645,31 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True, | |||
|
|||
.. versionadded:: 0.18.0 | |||
|
|||
multicolumn : boolean, default will be read from the config module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put "default will be read from the config module" in the explanation, and here just "default True"
pandas/formats/format.py
Outdated
@@ -873,6 +886,9 @@ def get_col_type(dtype): | |||
compat.string_types): # pragma: no cover | |||
raise AssertionError('column_format must be str or unicode, not %s' | |||
% type(column_format)) | |||
multicolumn_format = self.multicolumn_format | |||
if multicolumn_format is None: | |||
multicolumn_format = get_option("display.latex.multicolumn_format") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this already done in DataFrame.to_latex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is.
I was just unsure whether this function can ever be called from another context, in which case it could fail with the default value for multicolumn_format
.
Both the encoding
(in DataFrameFormatter.to_latex) and column_format
(write_result) argument are checked for being None
.
Is there a better way to do this?
If this is kept, i should probably also add a typecheck for multicolumn_format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is enough that it is only checked in DataFrame.to_latex
. A user should in principle never call the Formatter.to_latex directly
(it seems that we are a bit inconsistent here, as eg longtable and escate are only checked in DataFrame.to_latex)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove the check :)
nlevels = self.frame.columns.nlevels | ||
ilevels = self.frame.index.nlevels | ||
clevels = self.frame.columns.nlevels | ||
nlevels = clevels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to have both nlevels and clevels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No
Kinda.
I used clevels to keep from fiddling with the handling of the index-names-line.
On the compiled side there's no difference if i just use the nlevels, but the resulting code differs, and i'd rather not change the behaviour in parts that don't need to be modified.
pandas/tests/formats/test_format.py
Outdated
self.assertEqual(result, expected) | ||
|
||
df.index = df.T.index | ||
result = df.T.to_latex(multirow=True,multicolumn=True,multicolumn_format='c') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: spaces after comma. But it is strange that Travis does not fail because of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix it in the next iteration. Missed it because for the test, all pep8 checks are disabled at the file-level (line 4).
pandas/tests/formats/test_format.py
Outdated
\cline{1-7} | ||
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\ | ||
& 1 & 5 & 6 & 7 & 8 & 9 \\ | ||
\cline{1-7} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the cline?
Can you maybe show an example of how this looks like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely felt it necessary:
The multirows are centered, especially on deeper hierarchies it's rather confusing without them:
It's not as bad without multirows there the items are aligned on the top of the data-block.
I guess in this regard the multirow switch is more of a taste-decision akin to multicolum_format, less so than the multicolumn switch. Thus i had it default to False :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question for the default: would we rather want to have it top-aligned instead of center? (here it may look better centered, but once you don't have exactly the same number of rows in each subpart, then I think top-aligned will look better)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I defaulted it to False.
The multirow feature is specifically for centered labels, top-alignment can be made without it (as it is done already in the code).
Rather, should the cline be untied from the multirow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I defaulted it to False.
Ah, OK, I understand now.
Can you clarify this in the docstring explanation of multirow
? (the fact that it gives centered labels instead of top-aligned, which is done by default)
bef053f
to
840321a
Compare
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -108,6 +108,9 @@ Other enhancements | |||
|
|||
- ``.select_dtypes()`` now allows `datetimetz` to generically select datetimes with tz (:issue:`14910`) | |||
|
|||
- The ``.to_latex()`` method will now accept ``multicolumn`` and ``multirow`` arguments to use the accompanying LaTeX enhancements | |||
|
|||
- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (issue:`14714`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this is an error from rebasing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes...should have really stuck to the instructions...
pandas/formats/format.py
Outdated
@@ -873,6 +886,9 @@ def get_col_type(dtype): | |||
compat.string_types): # pragma: no cover | |||
raise AssertionError('column_format must be str or unicode, not %s' | |||
% type(column_format)) | |||
multicolumn_format = self.multicolumn_format | |||
if multicolumn_format is None: | |||
multicolumn_format = get_option("display.latex.multicolumn_format") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is enough that it is only checked in DataFrame.to_latex
. A user should in principle never call the Formatter.to_latex directly
(it seems that we are a bit inconsistent here, as eg longtable and escate are only checked in DataFrame.to_latex)
pandas/tests/formats/test_format.py
Outdated
\cline{1-7} | ||
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\ | ||
& 1 & 5 & 6 & 7 & 8 & 9 \\ | ||
\cline{1-7} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question for the default: would we rather want to have it top-aligned instead of center? (here it may look better centered, but once you don't have exactly the same number of rows in each subpart, then I think top-aligned will look better)
status of this? |
Is there anything I should still improve? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it look so long to review again! Added some comments regarding docs, will take a final look at the code this evening.
Additional comment:
- Can you add the new options to the overview table of available options in the docs (options.rst, it is a bit duplication of content, but that is how it is done right now)
pandas/core/config_init.py
Outdated
: bool | ||
This specifies if the to_latex method of a Dataframe uses multicolumns | ||
to pretty-print MultiIndex columns. | ||
method. Valid values: False,True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this 'method. ' here is a left-over from copying it from somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes it is. Copied it from longtable or escape which are both still ill-formatted in the current master. I'll fix them too while i'm at it.
pandas/core/config_init.py
Outdated
: bool | ||
This specifies if the to_latex method of a Dataframe uses multirows | ||
to pretty-print MultiIndex rows. | ||
method. Valid values: False,True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
pandas/core/frame.py
Outdated
|
||
multicolumn_format : str, default 'l' | ||
default will be read from the config module | ||
The alignment for multicolumns, similar to column_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single backticks around column_format (`column_format`
)
pandas/core/frame.py
Outdated
@@ -1606,12 +1607,37 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True, | |||
|
|||
.. versionadded:: 0.18.0 | |||
|
|||
multicolumn : boolean, default True | |||
default will be read from the config module | |||
Use \multicolumn to enhance MultiIndex columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use a proper sentence (Capital, punctuation) for the first line of 'default ..' (the thing is, here this looks OK in plain text, but in the html docs, the line breaks are not preserved, so will look line one long sentence without '.' in the middle)
Also, would start with the actual explanation of the keyword, and put "Default will be ..." at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the same for the other ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is again also true for longtable and escape, so i'll do the same to them.
pandas/tests/formats/test_format.py
Outdated
\cline{1-7} | ||
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\ | ||
& 1 & 5 & 6 & 7 & 8 & 9 \\ | ||
\cline{1-7} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I defaulted it to False.
Ah, OK, I understand now.
Can you clarify this in the docstring explanation of multirow
? (the fact that it gives centered labels instead of top-aligned, which is done by default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc updates!
I looked in some more detail at the code, and added a few comments on how I think it can be made a bit more maintainable (it is now, as I mentioned earlier, a big chunk of code added in the for loop).
(to be clear, it is only about some restructuring, not fundamental concerns!)
Travis is doing a bit strange at the moment, but based on the other ones, it seems one of your tests is failing.
pandas/tests/formats/test_format.py
Outdated
('c2',1):dict((x,x+5) for x in range(5)), | ||
('c3',0):dict((x,x) for x in range(5)) | ||
}) | ||
result = df.to_latex(multicolumn=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for multicolumn=False as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, can you add test cases with and without index/column level names? (I think now all are without?) Or is this already tested somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, seems like all tests have unnamed indices. Will do.
@@ -906,8 +927,59 @@ def get_col_type(dtype): | |||
if x else '{}') for x in row] | |||
else: | |||
crow = [x if x else '{}' for x in row] | |||
if i < clevels and self.fmt.header and self.multicolumn: | |||
# sum up columns to multicolumns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the content contained in this if statement in a separate method on the class? Eg call it something like _format_multicolumn(..)
? Then you can give it a docstring with some explanation what exactly happens. And that will make the big for loop here a bit more comprehensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are class-methods preferred over nested-functions?
pandas/formats/format.py
Outdated
if coltext: # write last column name | ||
append_col() | ||
crow = row2 | ||
if i >= nlevels and self.fmt.index and self.multirow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this only happen when ilevels > 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, yeah. I'll add this check.
Just for curiosity, is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex? This thought never crossed my mind when i wrote this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex?
What do you mean exactly?
pandas/formats/format.py
Outdated
crow = row2 | ||
if i >= nlevels and self.fmt.index and self.multirow: | ||
# perform look-ahead to determine which rows can be joined | ||
for j in range(ilevels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do the same here? (Put this in a method)
This one is shorter, so less needed, but then it is a bit more consistent
pandas/formats/format.py
Outdated
for cl in clinebuf: | ||
cl[0] -= 1 | ||
if cl[0] == 0: | ||
buf.write('\cline{{{0:d}-{1:d}}}\n'.format(cl[1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this part very hard to understand? (not the buf.write
itself, but the lines above regarding clinebuf, and also the line below this). Can you see if you can simplify it?
Or maybe try to explain it in a few words to me, then maybe I see another way or what would be a good comment so I would understand it more easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the above block, i iterate over the index levels and check if a multirow should be used. If i find a multirow, i save the number of rows it spans, and it's level (indented by 1) in clinebuf (list, because multiple multirows can occur at the same time). When writing this line and the following, i decrease the first entry in the buffered pairs. If this value reaches zero, it marks the point where the multirow ends, and i place the cline. When all levels have been checked, i flush the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would be more clear to just save the target-row (as i've done with the target-col :D ) and drop the list-entry when cl[0] matches the loopcounter.
Thanks for the feedback, i hope the changes go in the direction you intended. The test that failed previously was caused by the to_latex_multiindex test expecting no multicolumns, obviously. This has been fixed. |
Yes, there are currently some issues with travis, nothing to do with this PR. |
pandas/formats/format.py
Outdated
# sum up columns to multicolumns | ||
crow = self._format_multicolumn(crow, ilevels) | ||
if i >= nlevels and self.fmt.index and self.multirow and\ | ||
ilevels > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small style issue: in pandas we try to not use \
, but rather use (...)
for line continuation
pandas/formats/format.py
Outdated
if coltext: # write last column name | ||
append_col() | ||
crow = row2 | ||
if i >= nlevels and self.fmt.index and self.multirow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex?
What do you mean exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think that makes it much clearer (certainly with the explanation in the docstrings)
Some small follow-up comments
pandas/formats/format.py
Outdated
for cl in self.clinebuf: | ||
if cl[0] == i: | ||
buf.write('\cline{{{0:d}-{1:d}}}\n'.format(cl[1], l)) | ||
self.clinebuf = [x for x in self.clinebuf if x[0] != i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this already clearer, only the rewriting of clinebuf
on the line above I don't fully understand (is that just to not have to iterate over all clinebuf elements that are not needed anymore because of previous lines? so not essential to have the code working? No problem with having it, just to be sure I understand?)
Maybe add a small comment above it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as soon as they're written they get deleted. Not necessary with the new mechanism I guess, I didn't like polluting the list with past entries.
pandas/formats/format.py
Outdated
self.clinebuf.append([i + nrow - 1, j + 1]) | ||
return row | ||
|
||
def _print_cline(self, buf, i, l): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l -> ncols ? (as more self-explanatory name, if ncols is correct of course)
21063df
to
d2ae82d
Compare
- [X] closes pandas-dev#13508 - [X] tests added / passed - [X] passes `git diff upstream/master | flake8 --diff` - [X] whatsnew entry Print names of MultiIndex columns. Added "multicolumn" and "multirow" flags to to_latex which trigger the corresponding feature. "multicolumn_format" is used to select alignment. Multirow adds clines to visually separate sections.
@hein09 Thanks a lot! |
closes pandas-dev#13508 Print names of MultiIndex columns. Added "multicolumn" and "multirow" flags to to_latex which trigger the corresponding feature. "multicolumn_format" is used to select alignment. Multirow adds clines to visually separate sections.
git diff upstream/master | flake8 --diff
Print names of MultiIndex columns.
Added "multicol" and "multirow" flags to to_latex
which trigger the corresponding feature.
Multirow adds clines to visually separate sections.