Description
Is your feature request related to a problem?
As far as I can tell there is no way to write a DataFrame to a specific (named) sheet in an existing Excel file, creating the sheet if it doesn't exist or overwriting if it does. The closest thing to this that exists is the openpyxl append mode, but if you try to write to sheet_which_already_exists
with this, the data will be written to sheet_which_already_exists1
. Using write mode overwrites the entire book.
Being able to overwrite existing sheets would be useful for populating "raw data" sheets in existing reports/publications on a regular basis, with pivot tables etc updating automatically.
This seems to be a very common issue:
https://stackoverflow.com/questions/62618680/overwrite-an-excel-sheet-with-pandas-dataframe-without-affecting-other-sheets
https://stackoverflow.com/questions/47602533/append-a-pandas-dataframe-to-an-existing-excel-table?noredirect=1&lq=1
https://stackoverflow.com/questions/34744863/python-how-to-use-excelwriter-to-write-into-an-existing-worksheet?noredirect=1&lq=1
#28653
#29990
Describe the solution you'd like
There are a few potential behaviours someone might want when they write to an existing sheet in append mode:
- the current behaviour, which creates a new sheet with a different name
- to delete the current contents of the sheet and write their data
- to write the data without deleting anything, for example to add columns or rows to an existing table, or a new table in a different position
Originally I considered adding a flag like overwrite=True
to pd.ExcelWriter
but I don't think this would be sufficient to cover all the potential behaviours, so would suggest an enum along the lines of the if_exists
option in df.to_sql
, with the following options:
- new_sheet: Create a new sheet with a different name.
- overwrite_sheet: Delete the contents of the sheet, then write to it.
- overwrite_cells: Write directly to the named sheet without deleting the previous contents.
This option would only be relevant when using mode="a"
with the openpyxl engine.
API breaking implications
The new_sheet
option would replicate the current behaviour, so this could be made the default.
Additional context
Demonstration of current behaviour:
import pandas as pd
from openpyxl import load_workbook
import io
df1 = pd.DataFrame({
"col1": ["apple", "banana", "carrot"],
"col2": [1, 2, 3],
})
df2 = pd.DataFrame({"col1": [5]})
excel = io.BytesIO()
with pd.ExcelWriter(excel, engine="openpyxl", mode="w") as writer:
# write two sheets
df1.to_excel(writer, sheet_name="sheet1", index=False)
df1.to_excel(writer, sheet_name="sheet2", index=False)
with pd.ExcelWriter(excel, engine="openpyxl", mode="a") as writer:
# try to overwrite one of them
df2.to_excel(writer, sheet_name="sheet1", index=False)
wb = load_workbook(excel)
# sheet1, sheet2, sheet11
print(wb.sheetnames)
Why does it work this way? At the moment, when the openpyxl ExcelWriter
is initialised in append mode it loads the workbook, but does not update self.sheets
to reflect the sheets in the loaded file.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L54-L58
This seems unintentional and means that when write_cells
is called with a sheet name that already exists in the file, it is not found in self.sheets
:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L414-L419
so we follow the else branch, and when assigning to wks.title
openpyxl suffixes the sheet name to avoid a naming conflict.