Skip to content

ENH: Added multicolumn/multirow support for latex #14184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 3, 2017

Conversation

sgsaenger
Copy link

@sgsaenger sgsaenger commented Sep 8, 2016

Print names of MultiIndex columns.
Added "multicol" and "multirow" flags to to_latex
which trigger the corresponding feature.
Multirow adds clines to visually separate sections.

@@ -1613,7 +1613,7 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True,
index=True, na_rep='NaN', formatters=None, float_format=None,
sparsify=None, index_names=True, bold_rows=True,
column_format=None, longtable=None, escape=None,
encoding=None, decimal='.'):
encoding=None, decimal='.',multicol=None,multirow=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need spaces after the ,s here (check git diff upstream/master | flake8 --diff)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!
Should have started using this sooner, quite nice.

@TomAugspurger
Copy link
Contributor

You'll need to rebase your branch on master. Just post here if you need help with the git commands.

@sgsaenger
Copy link
Author

Oh my, what a mess...
I did git rebase upstream/master as in the documentation, which seemed to work correctly.
Before pushing, though, git performed an automatic merge of my local and remote branch.
Any way to clean this up? Thanks for helping!

@TomAugspurger
Copy link
Contributor

You'll want to start by

git fetch upstream
git rebase -i upstream/master

This will start applying changes, but you'll have some merge conflicts. You'll need to edit the files (see git status for "Unmerged Paths") and look for the conflict markers git auto inserts (<<<<<<<, ======== and >>>>>>>). Fixup those files to look like how the should (make sure to remove the markers too), and the files you just edited, and then

git rebase --continue

Side-issue, typically you'll develop on a feature branch, no master. So before you make any changes you would git checkout -b <branch-name>. Not a big deal now though.

We can also jump into the chat room if that's easier.

.format(ncol, coltext.strip()))
else:
row2.append(coltext)
for c in crow[ilevels:]:
Copy link
Author

@sgsaenger sgsaenger Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the multicolumn feature, long names in higher levels break the layout severely:
nocolmultirow
compared to:
multicolmultirow
Thus, i would propose this feature to be enabled by default.

This means, that empty columns won't be printed, which may alter the code of the line with the row-names.
Not the result though, as latex ignores these trailing empty parts anyways.
Should i rather set the default to False and maintain compatibility in code, or fix the rendered output by rewriting fixing tests?
Or has somebody an idea to satisfy both?

Never mind, fixed.

@codecov-io
Copy link

codecov-io commented Sep 28, 2016

Codecov Report

Merging #14184 into master will decrease coverage by -0.02%.
The diff coverage is 98.55%.

@@            Coverage Diff             @@
##           master   #14184      +/-   ##
==========================================
- Coverage   91.04%   91.03%   -0.02%     
==========================================
  Files         136      136              
  Lines       49088    49152      +64     
==========================================
+ Hits        44694    44744      +50     
- Misses       4394     4408      +14
Impacted Files Coverage Δ
pandas/core/config_init.py 95.27% <100%> (+0.23%)
pandas/core/frame.py 97.83% <100%> (-0.1%)
pandas/formats/format.py 95.4% <98.24%> (+0.09%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
pandas/util/testing.py 81.87% <0%> (-0.19%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 211ecd5...ce0a0c9. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other comments:

  • Can we use multicolumn instead of multicol as keyword name? I know it is a bit shorter, but since latex also uses multicolumn I would use that
  • Now you center the multi-columns. I think left aligning as the other column may be better as a default.
    Is there a way to specify this?
  • You added a bunch of code lines in the write_result function. Can you see whether it makes sense to refactor part of it into a separate function to have smaller more specific functions?


.. versionadded:: 0.18.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave this one (I think it belonged to the decimal option)


.. versionadded:: 0.18.0
.. versionadded:: 0.19.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.19 ->0.20

ncol = 1
coltext = ''

def appendCol():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an internal function, so not that important, but let's be consistent in PEP 8 naming, so appendCol -> append_col

@sgsaenger
Copy link
Author

Thanks, i've incorporated your renaming suggestions.
Also added an additional multicolumn_format parameter where one can specify the alignment similar to the existing column_format.
Unfortunately, I don't really see how one could separate the different parts of this routine without rewriting it from scratch, even before the routine was basically "take the result of _to_str_columns and completely change it".

@jreback
Copy link
Contributor

jreback commented Dec 5, 2016

lgtm. @jorisvandenbossche @TomAugspurger
(conflict in the whatsnew), but can fix on merge if all other good.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

As I also said in my previous comment, you added a bunch of code lines in the write_result function. Can you see whether it makes sense to put some of that in separate helper functions, or can you also add some code comments to make it a bit easier to follow what happens?

Use \multirow to enhance MultiIndex rows. Requires adding a
\usepackage{multirow} to your LaTeX preamble.

.. versionadded:: 0.20.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this for the other keywords as well

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ..versionadded:: 0.20.0 part?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (so like you did!)

@@ -1644,12 +1645,31 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True,

.. versionadded:: 0.18.0

multicolumn : boolean, default will be read from the config module
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put "default will be read from the config module" in the explanation, and here just "default True"

@@ -873,6 +886,9 @@ def get_col_type(dtype):
compat.string_types): # pragma: no cover
raise AssertionError('column_format must be str or unicode, not %s'
% type(column_format))
multicolumn_format = self.multicolumn_format
if multicolumn_format is None:
multicolumn_format = get_option("display.latex.multicolumn_format")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this already done in DataFrame.to_latex?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is.
I was just unsure whether this function can ever be called from another context, in which case it could fail with the default value for multicolumn_format.
Both the encoding (in DataFrameFormatter.to_latex) and column_format (write_result) argument are checked for being None.
Is there a better way to do this?
If this is kept, i should probably also add a typecheck for multicolumn_format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is enough that it is only checked in DataFrame.to_latex. A user should in principle never call the Formatter.to_latex directly

(it seems that we are a bit inconsistent here, as eg longtable and escate are only checked in DataFrame.to_latex)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove the check :)

nlevels = self.frame.columns.nlevels
ilevels = self.frame.index.nlevels
clevels = self.frame.columns.nlevels
nlevels = clevels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to have both nlevels and clevels

Copy link
Author

@sgsaenger sgsaenger Dec 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No
Kinda.
I used clevels to keep from fiddling with the handling of the index-names-line.
On the compiled side there's no difference if i just use the nlevels, but the resulting code differs, and i'd rather not change the behaviour in parts that don't need to be modified.

self.assertEqual(result, expected)

df.index = df.T.index
result = df.T.to_latex(multirow=True,multicolumn=True,multicolumn_format='c')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: spaces after comma. But it is strange that Travis does not fail because of this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix it in the next iteration. Missed it because for the test, all pep8 checks are disabled at the file-level (line 4).

\cline{1-7}
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\
& 1 & 5 & 6 & 7 & 8 & 9 \\
\cline{1-7}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the cline?
Can you maybe show an example of how this looks like?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely felt it necessary:
temp
The multirows are centered, especially on deeper hierarchies it's rather confusing without them:
temp
It's not as bad without multirows there the items are aligned on the top of the data-block.
I guess in this regard the multirow switch is more of a taste-decision akin to multicolum_format, less so than the multicolumn switch. Thus i had it default to False :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question for the default: would we rather want to have it top-aligned instead of center? (here it may look better centered, but once you don't have exactly the same number of rows in each subpart, then I think top-aligned will look better)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I defaulted it to False.
The multirow feature is specifically for centered labels, top-alignment can be made without it (as it is done already in the code).
Rather, should the cline be untied from the multirow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I defaulted it to False.

Ah, OK, I understand now.
Can you clarify this in the docstring explanation of multirow? (the fact that it gives centered labels instead of top-aligned, which is done by default)

@sgsaenger sgsaenger force-pushed the master branch 2 times, most recently from bef053f to 840321a Compare December 21, 2016 00:40
@@ -108,6 +108,9 @@ Other enhancements

- ``.select_dtypes()`` now allows `datetimetz` to generically select datetimes with tz (:issue:`14910`)

- The ``.to_latex()`` method will now accept ``multicolumn`` and ``multirow`` arguments to use the accompanying LaTeX enhancements

- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (issue:`14714`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is an error from rebasing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes...should have really stuck to the instructions...

@@ -873,6 +886,9 @@ def get_col_type(dtype):
compat.string_types): # pragma: no cover
raise AssertionError('column_format must be str or unicode, not %s'
% type(column_format))
multicolumn_format = self.multicolumn_format
if multicolumn_format is None:
multicolumn_format = get_option("display.latex.multicolumn_format")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is enough that it is only checked in DataFrame.to_latex. A user should in principle never call the Formatter.to_latex directly

(it seems that we are a bit inconsistent here, as eg longtable and escate are only checked in DataFrame.to_latex)

\cline{1-7}
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\
& 1 & 5 & 6 & 7 & 8 & 9 \\
\cline{1-7}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question for the default: would we rather want to have it top-aligned instead of center? (here it may look better centered, but once you don't have exactly the same number of rows in each subpart, then I think top-aligned will look better)

@jreback
Copy link
Contributor

jreback commented Jan 21, 2017

status of this?

@sgsaenger
Copy link
Author

Is there anything I should still improve?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it look so long to review again! Added some comments regarding docs, will take a final look at the code this evening.

Additional comment:

  • Can you add the new options to the overview table of available options in the docs (options.rst, it is a bit duplication of content, but that is how it is done right now)

: bool
This specifies if the to_latex method of a Dataframe uses multicolumns
to pretty-print MultiIndex columns.
method. Valid values: False,True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this 'method. ' here is a left-over from copying it from somewhere else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes it is. Copied it from longtable or escape which are both still ill-formatted in the current master. I'll fix them too while i'm at it.

: bool
This specifies if the to_latex method of a Dataframe uses multirows
to pretty-print MultiIndex rows.
method. Valid values: False,True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here


multicolumn_format : str, default 'l'
default will be read from the config module
The alignment for multicolumns, similar to column_format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single backticks around column_format (`column_format`)

@@ -1606,12 +1607,37 @@ def to_latex(self, buf=None, columns=None, col_space=None, header=True,

.. versionadded:: 0.18.0

multicolumn : boolean, default True
default will be read from the config module
Use \multicolumn to enhance MultiIndex columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use a proper sentence (Capital, punctuation) for the first line of 'default ..' (the thing is, here this looks OK in plain text, but in the html docs, the line breaks are not preserved, so will look line one long sentence without '.' in the middle)

Also, would start with the actual explanation of the keyword, and put "Default will be ..." at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the same for the other ones

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again also true for longtable and escape, so i'll do the same to them.

\cline{1-7}
\multirow{2}{*}{c2} & 0 & 0 & 1 & 2 & 3 & 4 \\
& 1 & 5 & 6 & 7 & 8 & 9 \\
\cline{1-7}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I defaulted it to False.

Ah, OK, I understand now.
Can you clarify this in the docstring explanation of multirow? (the fact that it gives centered labels instead of top-aligned, which is done by default)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc updates!

I looked in some more detail at the code, and added a few comments on how I think it can be made a bit more maintainable (it is now, as I mentioned earlier, a big chunk of code added in the for loop).
(to be clear, it is only about some restructuring, not fundamental concerns!)

Travis is doing a bit strange at the moment, but based on the other ones, it seems one of your tests is failing.

('c2',1):dict((x,x+5) for x in range(5)),
('c3',0):dict((x,x) for x in range(5))
})
result = df.to_latex(multicolumn=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for multicolumn=False as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, can you add test cases with and without index/column level names? (I think now all are without?) Or is this already tested somewhere else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, seems like all tests have unnamed indices. Will do.

@@ -906,8 +927,59 @@ def get_col_type(dtype):
if x else '{}') for x in row]
else:
crow = [x if x else '{}' for x in row]
if i < clevels and self.fmt.header and self.multicolumn:
# sum up columns to multicolumns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the content contained in this if statement in a separate method on the class? Eg call it something like _format_multicolumn(..) ? Then you can give it a docstring with some explanation what exactly happens. And that will make the big for loop here a bit more comprehensible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are class-methods preferred over nested-functions?

if coltext: # write last column name
append_col()
crow = row2
if i >= nlevels and self.fmt.index and self.multirow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this only happen when ilevels > 1 ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, yeah. I'll add this check.
Just for curiosity, is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex? This thought never crossed my mind when i wrote this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex?

What do you mean exactly?

crow = row2
if i >= nlevels and self.fmt.index and self.multirow:
# perform look-ahead to determine which rows can be joined
for j in range(ilevels):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the same here? (Put this in a method)
This one is shorter, so less needed, but then it is a bit more consistent

for cl in clinebuf:
cl[0] -= 1
if cl[0] == 0:
buf.write('\cline{{{0:d}-{1:d}}}\n'.format(cl[1],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this part very hard to understand? (not the buf.write itself, but the lines above regarding clinebuf, and also the line below this). Can you see if you can simplify it?
Or maybe try to explain it in a few words to me, then maybe I see another way or what would be a good comment so I would understand it more easily.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above block, i iterate over the index levels and check if a multirow should be used. If i find a multirow, i save the number of rows it spans, and it's level (indented by 1) in clinebuf (list, because multiple multirows can occur at the same time). When writing this line and the following, i decrease the first entry in the buffered pairs. If this value reaches zero, it marks the point where the multirow ends, and i place the cline. When all levels have been checked, i flush the buffer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it would be more clear to just save the target-row (as i've done with the target-col :D ) and drop the list-entry when cl[0] matches the loopcounter.

@sgsaenger
Copy link
Author

Thanks for the feedback, i hope the changes go in the direction you intended.

The test that failed previously was caused by the to_latex_multiindex test expecting no multicolumns, obviously. This has been fixed.
I don't know what causes the new failures, looks more like a technical issue to me, locally it just fails by trying to load pyqt4 and 5 simultaneously, everything i've touched works as inteded ;)

@jorisvandenbossche
Copy link
Member

I don't know what causes the new failures, looks more like a technical issue to me,

Yes, there are currently some issues with travis, nothing to do with this PR.

# sum up columns to multicolumns
crow = self._format_multicolumn(crow, ilevels)
if i >= nlevels and self.fmt.index and self.multirow and\
ilevels > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small style issue: in pandas we try to not use \, but rather use (...) for line continuation

if coltext: # write last column name
append_col()
crow = row2
if i >= nlevels and self.fmt.index and self.multirow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to create some clashes in a single index, e.g. via hiding parts of a MultiIndex?

What do you mean exactly?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think that makes it much clearer (certainly with the explanation in the docstrings)

Some small follow-up comments

for cl in self.clinebuf:
if cl[0] == i:
buf.write('\cline{{{0:d}-{1:d}}}\n'.format(cl[1], l))
self.clinebuf = [x for x in self.clinebuf if x[0] != i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this already clearer, only the rewriting of clinebuf on the line above I don't fully understand (is that just to not have to iterate over all clinebuf elements that are not needed anymore because of previous lines? so not essential to have the code working? No problem with having it, just to be sure I understand?)
Maybe add a small comment above it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as soon as they're written they get deleted. Not necessary with the new mechanism I guess, I didn't like polluting the list with past entries.

self.clinebuf.append([i + nrow - 1, j + 1])
return row

def _print_cline(self, buf, i, l):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

l -> ncols ? (as more self-explanatory name, if ncols is correct of course)

@sgsaenger sgsaenger force-pushed the master branch 2 times, most recently from 21063df to d2ae82d Compare March 3, 2017 07:54
sgsaenger and others added 2 commits March 3, 2017 08:55
- [X] closes pandas-dev#13508
- [X] tests added / passed
- [X] passes `git diff upstream/master | flake8 --diff`
- [X] whatsnew entry

Print names of MultiIndex columns.
Added "multicolumn" and "multirow" flags to to_latex
which trigger the corresponding feature.
"multicolumn_format" is used to select alignment.
Multirow adds clines to visually separate sections.
@jorisvandenbossche jorisvandenbossche merged commit 24a2155 into pandas-dev:master Mar 3, 2017
@jorisvandenbossche
Copy link
Member

@hein09 Thanks a lot!
(the travis failure was a pep8 issue, fixed that before merging)

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Mar 3, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#13508

Print names of MultiIndex columns.
Added "multicolumn" and "multirow" flags to to_latex
which trigger the corresponding feature.
"multicolumn_format" is used to select alignment.
Multirow adds clines to visually separate sections.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use \multirow \multicolumn in latex output
5 participants