Description
There appears to be an issue with Excel writers when DataFrame column names are duplicated. This issue that was initially reported on StackOverflow.
For example consider the following program:
import pandas as pd
from pandas import DataFrame
df = DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
df.columns = ['A', 'B', 'B'] # !!!
df.to_csv('output.csv')
df.to_excel('output.xlsx')
Note the duplicated column name. The df
for this looks like this:
>>> df
A B B
0 1 2 3
1 1 2 3
2 1 2 3
The corresponding output of the CSV is as expected:
$ cat output.csv
,A,B,B
0,1,2,3
1,1,2,3
2,1,2,3
However, the output of the any of the Excel writers is incorrect:
The issue appears to be in pandas/core/format.py
. The output data is gathered based on column names, as shown below, which causes issues with duplicate names.
def _format_regular_rows(self):
...
for colidx, colname in enumerate(self.columns):
series = self.df[colname]
...
I initially thought that this might be the correct behaviour and that column names shouldn't be duplicated but given that the output is different to the csv writer it looks like a bug.
I'll write a test case but I'm not sure of the best way to fix the issue.